Anthropic is still struggling with Claude reliability

Welcome to Runtime! Today: A series of Claude outages over the last few weeks illustrates that Anthropic's reliability challenge is not getting easier, Iran plans to target major enterprise tech installations in the Middle East, and the latest enterprise moves.

More like Nopus

Demand for Anthropic's Claude AI models surged over the last six months as developers (and investors) started to realize that the promise of AI agents was coming true. But those models are only useful to enterprise tech when they are available, and months after Anthropic acknowledged it had a lot of work to do to improve the reliability of its services, those services are still falling down at an alarming rate.

Anthropic acknowledged "elevated errors" on several services Wednesday and Thursday that prevented developers from logging into Claude Code and using its Sonnet models. And this week's troubles follow a more serious incident last week in which several versions of Claude, including its flagship Opus models, went down for several hours.

Anthropic said Wednesday's issues occurred after "our primary application database experienced severely degraded I/O performance following a routine maintenance operation, causing slow or failed requests on Claude.ai and preventing new or refreshed sign-ins for Claude Code and the Console."
Wilson, a bot that uses Claude to generate summaries of discussions on the ClaudeAI subreddit, put it this way: "The overwhelming consensus is frustration and anger over another outage that's killing productivity, especially for paying Pro and business users (emphasis in original). The hamsters powering the servers have clearly gone on strike again."
Harshith Vaddiparthy, head of growth for AI startup JustPaid, noted on X Thursday that Claude's performance advantage means nothing when it's down: "If Claude outages push people back to ChatGPT twice a week, benchmark scores are secondary. Reliability is the product now."

When Runtime last covered generative AI's reliability issues back in October, Anthropic was well aware that this was a critical problem for the company. After a bad stretch of outages in August and September, it hired infrastructure veteran Rahul Patil as its new chief technology officer and said it was working to get a better understanding of how model inference fails.

Some novel technical issues have made scaling AI inference trickier than scaling cloud computing services was more than a decade ago, and that was really hard.
For example, Anthropic is running AI workloads across three different types of chips — Nvidia's GPUs, Google Cloud's TPUs, and AWS's Trainium — and each model has unique deployment challenges, former executive Todd Underwood told Runtime last year.
But the real issue seems to be that Anthropic's infrastructure is unable to keep up with the surge in demand for Opus and Sonnet after Claude Code started to really take off late last year, and after its battle with the Pentagon sparked a 55% jump in downloads of the Claude mobile app last week.
And it's not alone among AI providers: OpenAI has had its fair share of problems (ChatGPT had barely "two nines" over the last quarter), and this week GitHub announced that it was accelerating its migration to Microsoft Azure after a series of recent outages it blamed on soaring demand.
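
For readers who don't think in "nines," here is a rough back-of-the-envelope sketch of what those availability levels mean in hours of downtime. The 90-day quarter and the helper name are assumptions made for illustration, not figures reported by any of the companies above.

```python
# Illustrative arithmetic only: the 90-day quarter and this helper's name
# are assumptions, not data from Anthropic, OpenAI, or GitHub.
def downtime_hours(availability_pct: float, period_days: float = 90) -> float:
    """Hours of downtime implied by an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_days * 24


for label, pct in [("two nines", 99.0), ("three nines", 99.9), ("four nines", 99.99)]:
    hours = downtime_hours(pct)
    print(f"{label} ({pct}%): about {hours:.1f} hours down per quarter")
```

Under those assumptions, "two nines" works out to roughly 21.6 hours of downtime in a quarter, compared with about two hours at three nines.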

Ever since ChatGPT dropped in November 2022, AI companies and hyperscalers have been predicting that the majority of AI workloads would shift to inference over time. That feels like it should have been enough time to anticipate some of these problems and plan accordingly. Now that real enterprise demand for AI-powered applications and agents is starting to arrive, Anthropic engineers are under a lot of pressure to prove that AI can be reliable for use at enterprise scale.

Alex Palcuie, a member of Anthropic's technical staff, called his experience scaling Claude "the most Type II fun I've had in my career" while trying to recruit engineers on X back in February.
"For those unfamiliar with the terms, Type I fun is enjoyable in the moment. Type II fun involves some suffering while it's happening, but when you look back, you feel a deep sense of accomplishment," he wrote. "Working on reliability engineering for Claude is very much the latter."

Race condition

As the War That Wasn't A War closes out its second week in the Middle East amid "the largest supply disruption in the history of the global oil market," Iran announced Wednesday that it now considers the tech operations of U.S. companies in the region fair game for its missiles. While it's hard to tell how serious a threat that really is, AWS is still dealing with the repercussions of drone strikes on its data centers in the United Arab Emirates, and tensions do not appear to be cooling.

"As the scope of the regional war expands to infrastructure war, the scope of Iran’s legitimate targets expands," Iranian state media reported Wednesday, according to Al Jazeera. Companies specifically listed include Google, Microsoft, Palantir, IBM, Nvidia, and Oracle, as well as chip-making facilities in Israel.

Microsoft and Google Cloud operate data centers in Israel, Qatar, the UAE, and Saudi Arabia, and while Iran's missile strikes across the region dwindled this week, there were still several attacks. Modern hyperscaler data centers employ impressive physical security measures on the ground, but they were not designed to withstand attacks from the air.

Enterprise moves

Ed Jennings is the new president and CEO of Darktrace, the third new CEO at the U.K. cybersecurity company in the last 18 months.

Abby Kearns is the new CEO of ActiveState, joining the open-source software catalog company after product leadership roles at Alembic and Puppet.

Valerie Henderson is the new CEO of Caylent, a promotion from her previous role as president and chief revenue officer of the AWS partner.

Kevin Brown is the new chief operating officer at Expereo, joining the networking company after serving as COO of NCC Group.

Abhishek “Abhi” Mathur is the new chief technology and product officer at ServiceTitan, joining the SaaS-for-tradespeople company after serving in product leadership roles at Figma, Meta, and Microsoft.

Chad Gerhardstein, Danielle Holbrook Dunn, and Uri Zelmanovich are the new chief risk and strategy officer, chief transformation officer, and chief financial officer, respectively, at Trulioo.

The Runtime roundup

Google Cloud completed its $32 billion acquisition of Wiz, which will give the third-place infrastructure cloud provider some unique insights into the security challenges that Wiz customers are having on rival clouds.

Atlassian announced plans to lay off 10% of employees, or about 1,600 people, in a push to refocus the venerable developer tools company around AI.

Microsoft might snap up some of the excess data center capacity at the Texas site where Oracle chose not to expand production, according to The Information.

Nvidia plans to spend $26 billion worth of its generative AI spoils on developing open-source models over the next five years, according to Wired.

Thanks for reading — see you Saturday!