
All gas, no brakes = crashes

Welcome to Runtime! Today: how a gold rush and the inherent weirdness of GPU computing have caused reliability problems for AI services, OpenAI and Microsoft finalize their new partnership deal, and the latest funding rounds in enterprise tech.

(Please forward this email to a friend or colleague! And if it was forwarded to you, sign up here to get Runtime each week.)


Economies of scale

Cloud computing would not have become a nearly $100 billion market without the development of an industry-wide culture focused on reliability. Right now AI infrastructure providers like OpenAI and Anthropic find themselves in need of a similar breakthrough, according to insiders who spoke to Runtime in recent weeks.

A series of incidents at Anthropic in August and early September only highlighted what startups and application developers had been talking about for months: AI reliability falls short of what most businesses expect from their cloud providers, even after last week's massive AWS outage.

  • "It is what it is," Zencoder co-founder and CEO Andrew Filev told Runtime. "People are ready to live with it today because, first, they have no choice, and second, because of the productivity benefit" they get when everything works as it should, he said.
  • Engineering leaders at OpenAI and Anthropic are aware of the uptime issues, which — like most things involving massive distributed computing systems — stem from complex and fast-moving computer science and logistical problems.
  • However, the biggest issue might be that AI infrastructure teams are, as the old techie saying goes, building the plane while flying it.
  • "The biggest kind of all-encompassing challenge here is the rate at which things are growing, not the aggregate scale, or anything fundamental that cannot be overcome with good engineering and good systems engineering," said Venkat Venkataramani, vice president of app infra at OpenAI.

According to data from The Uptime Institute, the Big Three cloud providers averaged around 99.97% uptime in 2024, which means they were only down for about two and a half hours over the course of the entire year. By contrast, over the last year OpenAI and Anthropic have both struggled to stay above 99% availability, which at that pace would mean their services would go dark for more than three and a half days over the course of a year.
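The gap between those two figures is easier to feel than to eyeball, so here is the back-of-the-envelope math as a short Python sketch (illustrative only; the availability numbers are the ones cited above):

```python
# Convert an availability percentage into yearly downtime.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year

def yearly_downtime_hours(availability: float) -> float:
    """Hours of downtime per year at a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR

big_three = yearly_downtime_hours(0.9997)      # ~2.6 hours per year
ai_providers = yearly_downtime_hours(0.99)     # ~87.6 hours, i.e. ~3.65 days
```

Every extra "nine" of availability cuts the downtime budget by a factor of ten, which is why the difference between 99% and 99.97% is the difference between minutes-long blips and multi-day dark spells.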

  • In practice, it means customers endure a lot of short but frustrating outages; during one week in early August OpenAI reported problems with ChatGPT every single business day.
  • Anthropic went through its own bad stretch from August into September, stumbling through a series of problems related to three infrastructure configuration issues and a hard outage on September 10th that took down several AI coding services that rely upon its APIs.
  • What we now consider traditional cloud infrastructure services were built around the scale-out principle of system design, which linked millions of relatively cheap servers built around Intel and AMD's x86 CPUs that could run basically any customer workload.
  • But GPUs behave differently than CPUs, and individual large-language models have to be deployed in very specific ways on custom hardware, said Todd Underwood, head of reliability at Anthropic.

However, both AI providers and app developers agreed that these reliability challenges are not insurmountable, and that clever engineering will go a long way toward making AI inference as reliable as traditional compute workloads. "We just have to invest a little bit more heavily into making sure we can very quickly and dynamically reroute when we're getting errors on certain endpoints," said Matan Grinberg, co-founder and CEO of AI coding tool Factory.

  • OpenAI recently introduced a new dashboard for customers that lets them track uptime and service disruptions without having to wait for details from the company, which could allow them to redirect their application toward a new model API when a problem is detected, Venkataramani said.
  • Anthropic is also working to improve the evaluations it uses to assess model performance, which it called out in its September postmortem as a problem that prevented the company from understanding how model performance was degrading, Underwood said.
  • If companies actually do put AI agents at the heart of their customer-service applications or invoice-processing systems, reliability will become much more important.
  • "To some extent, even just maintaining a minimum level of availability in the middle of this growth is something," Underwood said. "But I think there's a bunch of work that we need to do that is just some, like, regular engineering work in this weird, technically complex context."

Read the full story on Runtime here.


The real public benefit was the friends we made along the way

OpenAI announced Tuesday that it has completed its transformation from a very complex interlocking system of non-profit and for-profit organizations into a new structure, which is also a complex interlocking system of non-profit and for-profit organizations. The non-profit OpenAI Foundation now has a 26% equity stake, valued at $130 billion, in the for-profit corporation, which is now known as OpenAI Group PBC, a public benefit corporation.

As OpenAI's primary investor, Microsoft needed to sign off on any changes to the previous structure, which guaranteed it access to OpenAI's intellectual property and a share of its revenue. As part of the deal, Microsoft's 27% equity stake allows it to continue to access OpenAI's models through 2032 and its research through 2030 unless "an independent expert panel" declares OpenAI has achieved AGI, or artificial general intelligence.

Of course, nobody really knows what AGI means, but under the old deal OpenAI could have simply declared it had reached AGI at basically any moment and closed off Microsoft's access to its models. Either way, OpenAI agreed to spend $250 billion on Azure services over an indefinite period of time, which is more than three times as much revenue as all of Azure generated during Microsoft's last fiscal year.


Enterprise funding

Crusoe raised $1.375 billion in Series E funding, which values the upstart AI cloud provider at over $10 billion.

Mercor scored $350 million in Series C funding, which also values the data labeling company at more than $10 billion.

Chainguard landed $280 million in new funding to expand its libraries of trusted software packages for enterprise open-source developers.

Uniphore raised $260 million in Series F funding for its AI development platform, which allows companies to build agents and tune models.

Fireworks AI scored $250 million in Series C funding as it builds out an AI inference platform.

Sublime Security landed $150 million in Series C funding for its email security software, which uses AI agents to detect and respond to email-based attacks.


The Runtime roundup

Nvidia will invest $1 billion in Nokia in hopes of jump-starting GPU adoption in telecom networking, which is traditionally one of the slowest sectors when it comes to embracing new enterprise technologies.

Amazon announced plans Tuesday to cut its corporate workforce by 14,000 employees, which is also expected to include employees at AWS.


Thanks for reading — see you Thursday!
