Why CoreWeave thinks software wins AI business


Welcome to Runtime! Today: CoreWeave's Chen Goldberg explains how the company has worked to promise its customers a more efficient AI infrastructure service, CoreWeave's biggest benefactor keeps raking in the cash, and the latest enterprise moves.

(Was this email forwarded to you? Sign up here to get Runtime each week.)


Clusterpluck

By now, CoreWeave's story is pretty familiar to anyone paying attention to enterprise tech: It executed a well-timed pivot from cryptocurrency mining to AI model training thanks to a pile of GPUs and some major help from Nvidia, and raised $1.5 billion in an IPO earlier this year. While much of the attention surrounding the company has focused on its ability to secure a steady supply of Nvidia's latest and greatest chips, its long-term success rides on improving its margins by using software to help customers run those chips as efficiently as possible.

Last year CoreWeave hired Chen Goldberg, a Google veteran who helped build Google Cloud's infrastructure and played a leading role in the development of Kubernetes, which is now the second-most widely used open-source project in the world. As senior vice president of engineering, Goldberg is responsible for making sure CoreWeave can compete for AI business with the hyperscalers, who have decades of experience running complex workloads at scale.

  • "Fundamentally, our belief, and what we're seeing in the market, is that AI workloads require a different type of infrastructure," Goldberg said in a recent interview. "At every layer of the stack, we've made intentional decisions that are not an easy retrofit."
  • Traditional cloud computing infrastructure was based around the virtual machine, which made total sense at the time since so many applications had already been built around VMware's groundbreaking hypervisor and IT departments were familiar with the architecture.
  • However, that's not necessarily the most efficient way to approach modern AI workloads, and CoreWeave does not use a hypervisor.
  • "The industry felt okay with low double-digit utilization. However, when the cluster is so expensive and there is scarcity of resources, you want to achieve better utilization by moving up the stack," she said; a quick sketch of how that utilization number gets measured appears just below.
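
To put "low double-digit utilization" in concrete terms, here is a toy sketch (not CoreWeave tooling, and covering a single node rather than a whole fleet) that samples per-GPU utilization using NVIDIA's NVML Python bindings from the nvidia-ml-py package; averaged across thousands of GPUs, this is roughly the figure operators are watching.

```python
# Toy illustration only: sample per-GPU utilization on one node via NVML
# (pip install nvidia-ml-py). In a real cluster these readings would be
# aggregated across every node to get the fleet-wide utilization figure.
import pynvml

def average_gpu_utilization() -> float:
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        readings = [
            pynvml.nvmlDeviceGetUtilizationRates(
                pynvml.nvmlDeviceGetHandleByIndex(i)
            ).gpu
            for i in range(count)
        ]
        return sum(readings) / max(len(readings), 1)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print(f"Average GPU utilization on this node: {average_gpu_utilization():.1f}%")
```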

CoreWeave's answer to that is called Mission Control, which Goldberg said played a big role in her decision to join the company. "We expect things to break and fail, and we make sure that we solve for that as quickly as possible," she said.

  • Computing infrastructure — whether it's on premises or in the cloud, run by a big vendor or a small one — is bound to fail from time to time given the complexity involved, and what separates the infrastructure leaders from the laggards is the ability to minimize those disruptions and to respond as quickly as possible when they inevitably occur.
  • Infrastructure failures are an even bigger problem for companies training AI models, because an interruption in the training process might force them to start the whole expensive process over again.
  • Mission Control gives CoreWeave the ability to detect when a node in a cluster is failing and improve utilization by swapping in a healthy one without having to wait for the customer to notice a problem, Goldberg said; a rough sketch of that detect-and-replace pattern appears after this list.
  • CoreWeave also developed a service called SUNK, or Slurm on Kubernetes, which lets customers run Slurm, the tried-and-true cluster resource-management technology developed for high-performance computing, on Kubernetes containers, which are much more nimble and efficient than the bare-metal or virtual-machine computing instances Slurm previously required (yep, that's a Futurama reference).
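
For readers who want to picture that detect-and-replace pattern, here is a minimal, generic sketch written against the official Kubernetes Python client. It is an assumption-laden illustration, not CoreWeave's Mission Control: it only finds nodes whose Ready condition has gone bad and cordons them, while the service Goldberg describes also swaps healthy capacity in automatically.

```python
# A minimal, generic sketch (NOT CoreWeave's Mission Control) using the
# official Kubernetes Python client: find nodes whose Ready condition is no
# longer "True" and cordon them so the scheduler stops placing new work there.
# A production system would go further, e.g. draining the node and bringing a
# healthy replacement into the pool, as Goldberg describes.
from kubernetes import client, config

def cordon_unhealthy_nodes() -> None:
    config.load_kube_config()  # assumes kubeconfig credentials are available
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )
        if ready != "True" and not node.spec.unschedulable:
            # Equivalent of `kubectl cordon <node>`: mark it unschedulable.
            v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
            print(f"Cordoned unhealthy node: {node.metadata.name}")

if __name__ == "__main__":
    cordon_unhealthy_nodes()
```

The point is less the specific API calls than the philosophy Goldberg describes: assume things will break, and automate the response so recovery doesn't wait on a human.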

If CoreWeave wants to generate enough revenue to fund its expansion plans without going further into debt, it's going to need to pull more customers away from the Big Three cloud providers. Goldberg thinks the company's software tools will help achieve that goal by "[empowering] developers and users to make informed decisions with all the data they need," rather than presenting its infrastructure layer as a "black box" that is difficult or impossible to understand from the outside.

  • She pointed to the acquisition earlier this month of Weights & Biases, which built an application-development platform that CoreWeave said earlier this year "acts as the system of record for training and fine-tuning AI models and developing AI applications with confidence," as a way of improving transparency with developers.
  • But given Goldberg's background, that prompted a natural follow-up question: would CoreWeave consider open-sourcing a version of Mission Control, which would allow developers to really understand how the technology works?
  • Unsurprisingly, Goldberg wasn't ready to commit to such a move, but said "that's definitely something that will be interesting to think about."
  • Hardware will always be the central component of AI workloads, but everybody has Nvidia's chips; software tools, on the other hand, are long-term points of differentiation.

Speaking of chips

Nvidia continues to be the big winner of the generative AI boom, although the growth comparisons get a little tougher with each year that passes since the debut of ChatGPT. The company reported first-quarter revenue of $44 billion on Wednesday, up 69% from the same quarter last year and ahead of Wall Street's lofty expectations.

Sales to data-center customers rose 73% year-over-year, as early glitches with the rollout of its Blackwell chips appeared to smooth out over the last three months. Data-center customers accounted for an astounding $39.1 billion of overall revenue, and the company said "large cloud providers" made up about half of that data-center total, according to CNBC.

However, CEO Jensen Huang devoted a substantial portion of his prepared remarks to financial analysts after the results were released to explaining how much higher the company's revenue would have been if not for new restrictions on selling its H20 chip — which was designed specifically to comply with previous export restrictions — to Chinese customers. "China's AI moves on with or without U.S. chips," he said, as transcribed by Seeking Alpha. "Export controls should strengthen U.S. platforms, not drive half of the world's AI talent to rivals."


Enterprise moves

Vijay Kumar is the new executive vice president and chief product officer at Rimini Street, joining the enterprise software support company after seven years at Genesys Cloud.

Marc Boroditsky is the new chief revenue officer at Nebius, following similar roles at Cloudflare and Twilio.


The Runtime roundup

Salesforce reported revenue and profit that beat Wall Street expectations and raised its guidance for the full year, but investors sent its stock down more than 3% on Thursday on concerns about its staying power and ability to successfully integrate Informatica, according to CNBC.

Dell also beat the Street and raised guidance, citing "unprecedented demand" for servers designed for AI applications and a new supercomputing deal with Lawrence Berkeley National Laboratory’s National Energy Research Scientific Computing Center.


Thanks for reading — Runtime is off for the weekend, see you Tuesday!
