Garbage in, garbage out
Welcome to Runtime! Today: Why the discovery that a leading AI dataset contained references to child sexual abuse material should give everyone pause, Cisco plunges further into the cloud-native community, and the latest moves in enterprise tech.
(Was this email forwarded to you? Sign up here to get Runtime each week.)
Know your data
The rush to embrace generative AI tools has forced users to trust that providers are doing everything they can to ensure the data used to train those tools was sourced ethically and legally. However, the lack of transparency around how those tools were created — even the "open source" ones — makes it extremely difficult to verify whether that trust is warranted.
The nightmare scenario of the generative AI era came to pass this week, after Stanford researchers discovered that the widely used LAION-5B image dataset contained thousands of references to child sexual abuse material (CSAM). That dataset was used to train the popular Stable Diffusion image generator and was also available through Hugging Face, which counts enterprise tech stalwarts like AMD, AWS, Google, Nvidia, IBM, Intel, Qualcomm, and Salesforce among its investors.
- LAION-5B is an open-source dataset that contains links to billions of images used to train AI models, and Stanford Internet Observatory found more than 3,000 links to images of CSAM within that dataset.
- “If you have downloaded that full dataset for whatever purpose, for training a model for research purposes, then yes, you absolutely have CSAM, unless you took some extraordinary measures to stop it,” said Stanford's David Thiel, lead author of the study, as reported by 404 Media.
- Foundation model providers do take several steps to prevent CSAM data from surfacing as output in their models, but Stanford's research suggests that it is still being used to train those models in ways we don't really understand.
- As 404 Media's Samantha Cole put it, "The finding highlights the danger of largely indiscriminate scraping of the internet for the purposes of generative artificial intelligence."
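Filtering material like this out of a scraped dataset typically relies on matching content against blocklists of known-bad hashes (Stanford's researchers used perceptual-hashing tools such as PhotoDNA, checked against lists maintained by child-safety organizations). A minimal sketch of the idea, with entirely hypothetical data and a cryptographic hash standing in for the perceptual hashes real systems use:

```python
import hashlib

# Hypothetical blocklist of hashes of known-bad images. Production
# systems use perceptual hashes (e.g., PhotoDNA) so that resized or
# re-encoded copies still match; SHA-256 here is only an illustration.
BLOCKLIST = {
    hashlib.sha256(b"known-bad-image-bytes").hexdigest(),
}

def filter_dataset(records):
    """Drop any record whose image bytes hash to a blocklisted value."""
    clean = []
    for rec in records:
        digest = hashlib.sha256(rec["image_bytes"]).hexdigest()
        if digest not in BLOCKLIST:
            clean.append(rec)
    return clean

# Toy dataset: one harmless record, one that matches the blocklist.
records = [
    {"url": "https://example.com/a.jpg", "image_bytes": b"harmless-image"},
    {"url": "https://example.com/b.jpg", "image_bytes": b"known-bad-image-bytes"},
]
print(len(filter_dataset(records)))  # prints 1
```

Exact-hash matching like this misses anything not already in a blocklist, which is part of why material can slip into a multi-billion-image scrape in the first place.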
CSAM is closely monitored by law enforcement agencies around the world. It's hard to believe that less-prominent illegal or questionable material isn't part of the datasets training generative AI models, and the companies developing those models have little incentive to tell us if it is.
- Abeba Birhane, an AI researcher who identified previous problems with the LAION datasets, suggested on X that Stanford's research should raise questions about "datasets locked in corp labs like those in OpenAI, Meta, & Google. you can be sure, those closed datasets — rarely examined by independent auditors — are much worse than the open LAION dataset."
- That's an important point, given that critics of open-source AI could seize upon Stanford's research to argue against providing open datasets and open models that can be accessed and used by anyone.
- However, that same openness allowed Stanford's researchers to find and highlight the problem; they don't have nearly as much visibility into the proprietary models at the heart of the generative AI boom.
Thankfully, enterprise tech companies appear to be much more interested in the text-generation and coding-assistant properties of generative AI models, as opposed to generating sexualized images for their corporate marketing materials.
- But this incident should prompt those companies to ask their vendors more questions about the data used to train the tools they're using for those tasks.
- And it should spur them to use more of their own corporate data, rather than data scraped from the nether regions of the internet, to train their own models.
- "Garbage in, garbage out" is an old computer science maxim that reminds us that any computing result is only as good as the data that it was fed.
- As the AI hype cycle dies down and the real work begins, clean datasets could become far more important than foundation model performance.
Thanks to everyone who has supported Runtime in our first year! As you make your plans for 2024, please consider sponsoring Runtime and getting your message in front of the more than 20,000 enterprise tech industry leaders and decision makers that receive this newsletter each week. We also plan to roll out several new products next year, including special reports, sponsored content, and events, both virtual and live. If you're interested in learning more, contact us here.
Cisco goes deeper into observability
When you think of the "cloud native" generation of companies, Cisco is pretty much the antithesis of that term. However, it continues to buy younger software companies to stay relevant, snapping up Cloud Native Computing Foundation darling Isovalent Thursday for an undisclosed sum.
Isovalent, which had raised $69 million (nice) in funding according to Crunchbase, plays an important role across two interesting observability-related open-source projects. It is a major steward of eBPF, which lets developers run sandboxed programs inside the Linux kernel, and of Cilium, a project built atop eBPF to incorporate modern observability practices that has "graduated" from the CNCF's incubation process.
It also sells an enterprise version of Cilium for companies that want to incorporate that technology into their tech stacks, and that could be an interesting fit alongside Cisco's acquisition of Splunk earlier this year. "In a cloud world, there’s still boxes in there somewhere, but it’s abstracted under layers and layers of software. And so eBPF and Cilium provide that visibility for (the) cloud world,” Cisco's Tom Gillis told TechCrunch.
Marie Myers is the new CFO of HPE, after serving in a similar role at HP for the last several years.
The Runtime roundup
Anthropic is in talks to raise $750 million in new funding that could value the AI foundation model company upwards of $18 billion, according to The Information.
Snowflake is jumping into the GPUs-as-a-service business with Snowpark, which it said "provides developers with elastic, on-demand compute powered with GPUs for all types of custom LLM app development and advanced use cases."
AI-related breakthroughs dominated the list of Quanta Magazine's top discoveries in computer science during 2023, but there's more to life (and the list) than AI.
Thanks for reading — Runtime is off for the holidays, barring any seismic and unexpected CEO departures — see you in January!