Meet the AI data center
Welcome to Runtime! Today: how AI has caused one of the biggest inflection points in data-center architecture in decades, why hospitals might have to soon adopt certain security practices to get federal assistance, and the latest funding rounds in enterprise tech.
(Was this email forwarded to you? Sign up here to get Runtime each week.)
Scale up, out, and sideways
Today's data centers evolved to handle the explosion of web and mobile apps that dominated the first two decades of the 21st century. Entering 2024, data-center operators are rethinking their design, layout, and equipment choices thanks to the unique demands that AI apps place on tech infrastructure.
We chatted with several data-center experts about those changes, and here are some of the components of modern data centers that are rapidly changing amid the AI boom, as explained by the people responsible for keeping up with those changes.
Chips are the most fundamental part of any computer, and the reordering of this market thanks to AI is old news in 2024.
But the chips needed to train AI models use more electricity than their traditional computing cousins, and that is having several follow-on effects across the data center.
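To put that power gap in rough numbers, here is a back-of-the-envelope sketch of rack power density. Every wattage and count below is an illustrative assumption for the sketch, not a figure from anyone quoted here.

```python
# Back-of-the-envelope rack power math with illustrative numbers.
# All of these wattages and counts are assumptions, chosen only to
# show the order-of-magnitude difference between racks.

GPU_WATTS = 700          # assumed per-accelerator draw under load
GPUS_PER_SERVER = 8      # assumed dense AI training server
SERVERS_PER_RACK = 4     # assumed AI rack layout
CPU_SERVER_WATTS = 800   # assumed traditional two-socket server
CPU_SERVERS_PER_RACK = 10

ai_rack_kw = GPU_WATTS * GPUS_PER_SERVER * SERVERS_PER_RACK / 1000
cpu_rack_kw = CPU_SERVER_WATTS * CPU_SERVERS_PER_RACK / 1000

print(f"AI rack: ~{ai_rack_kw:.1f} kW, traditional rack: ~{cpu_rack_kw:.1f} kW")
```

Even with conservative assumptions, an AI rack draws several times what a traditional rack does, which is why power distribution and cooling both have to change around it.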
"Power is by far the most constrained resource in any data center," said Prasad Kalyanaraman, vice president of infrastructure services for AWS.
AWS is redesigning the electrical systems that run above the server racks in its data centers to accommodate increased demand for electricity across all of its racks.
One of the trickiest parts of operating a modern data center is finding the best way to mitigate the heat coming off thousands of very powerful computers.
Data-center operators are in various stages of shifting to liquid cooling, which uses a closed-circuit loop of chilled liquid to absorb the heat directly off the chip and cool that liquid down outside the server rack.
For now, Microsoft is using what it called a "sidekick" to cool its newest Maia 100 AI chips in data centers that were designed around air cooling, which is most of them.
AWS has been able to rely on air cooling during its 2023 AI buildout, but Kalyanaraman expects that by the end of this year and into next, it will need to incorporate liquid cooling across its AI servers.
Liquid cooling also allows data-center operators to cluster AI servers much more closely together than they could if they had to cool those servers with air, said Tiffany Osias, vice president of global colocation services at Equinix. That density helps address another crucial bottleneck in the AI data center: networking.
Marvell customers are increasingly interested in supplementing traditional copper networking equipment with faster optical connections that can help minimize the time it takes to train models on extremely expensive AI servers.
"What happens in large AI clusters is that connectivity becomes more important to keep all the GPUs up and running properly," said Radha Nagarajan, senior vice president and chief technology officer for Marvell's optical and cloud connectivity group.
"In those workloads, latency matters, and specifically what matters is the instance to instance latency, (or) the latency between the different racks within a data center that is part of the same cluster," Kalyanaraman said.
InfiniBand promises the high-performance, low-latency connections needed for AI training, but it is best suited for rack-to-rack interconnects rather than the longer links that cloud providers like AWS have to maintain across availability zones, which can sit several miles apart within a cloud region.
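The distance point can be made concrete with a rough propagation-delay estimate. The ~200,000 km/s figure for light in optical fiber (about two-thirds of c) is a standard approximation, and the distances below are illustrative, not numbers from the article.

```python
# Rough one-way propagation delay over optical fiber, to show why a
# multi-mile link between availability zones behaves differently
# from an in-rack hop. Light in glass fiber travels at roughly
# two-thirds the speed of light in vacuum.

C_FIBER_KM_PER_S = 200_000  # approximate speed of light in fiber

def one_way_us(distance_km: float) -> float:
    """Propagation delay in microseconds over fiber of the given length."""
    return distance_km / C_FIBER_KM_PER_S * 1e6

print(f"within a rack (2 m):  {one_way_us(0.002):.3f} us")
print(f"across zones (16 km): {one_way_us(16):.1f} us")
```

Physics alone puts thousands of times more latency on the cross-zone link before any switching or protocol overhead, which is why the rack-scale and region-scale networks get engineered differently.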
That means traditional Ethernet is unlikely to disappear from modern data centers, in part because many AI experts think the current obsession with training models will soon start to fade.
Over the next couple of years, AI researchers and cloud providers expect AI inference to become a much more important part of the enterprise AI stack.
"Inference is where the money happens for companies," Equinix's Osias said. "Inference is where they gain competitive advantage."
"AI inference on a network is going to be the sweet spot for many businesses: private data stays close to wherever users physically are, while still being extremely cost-effective to run because it's nearby," said Matthew Prince, CEO of Cloudflare, in a press release last year.
AWS's Kalyanaraman thinks a lot of companies will still want to use cloud providers for inference because of the stability and resiliency they offer.
"Training workloads have the ability to checkpoint and then restart from a certain checkpoint if there's any kind of failure. You can't do that with inference," he said.
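Kalyanaraman's checkpoint-and-restart distinction can be sketched in a few lines. Everything here (the file name, the loop, the stand-in loss) is hypothetical, and real training jobs use framework checkpointing such as `torch.save` rather than pickle; this only shows the pattern.

```python
# Minimal sketch of the checkpoint/restart pattern described above:
# a training loop periodically saves its state so that a failed job
# can resume from the last checkpoint instead of from step zero.

import os
import pickle

CKPT = "train_state.pkl"  # hypothetical checkpoint file

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "loss": None}

def save_checkpoint(state):
    with open(CKPT, "wb") as f:
        pickle.dump(state, f)

def train(total_steps=10, ckpt_every=3, crash_at=None):
    state = load_checkpoint()
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for real work
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state)  # a restart resumes from here
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
    return state

try:
    train(crash_at=5)   # first attempt dies at step 5
except RuntimeError:
    pass
final = train()          # restart resumes from the step-3 checkpoint
if os.path.exists(CKPT):
    os.remove(CKPT)      # tidy up the demo file
```

An inference request, by contrast, holds no resumable intermediate state: if the server fails mid-request, that work is simply lost, which is why the stability and resiliency of a cloud provider matter more there.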
The Feds are fed up
Cyberattacks are especially devastating when they impact real-world critical services like health care, but should the federal government force hospitals to follow its cybersecurity blueprint?
The Messenger reported Tuesday that the Biden administration is considering a proposal to require that hospitals use multifactor authentication and commit to patching vulnerabilities in a timely fashion if they want to be eligible for federal Medicare and Medicaid funds. A senior administration official told The Messenger that such actions “really do shut the door to most of our cyber incidents,” and that they expect the new regulations to be imposed before the end of the year.
That senior administration official (DC sourcing is always amusing) isn't wrong, but many of the medical organizations affected by such an order would likely need help implementing those practices by the end of the year, and losing federal funding could be devastating. The healthcare industry is likely to fight the plan; it argued last year that many cyber incidents hit hospitals through third-party suppliers.
ExtraHop raised $100 million in new funding from existing investors and said it had reached $200 million in annual recurring revenue for its security incident-response software.
Aqua Security added $60 million to a Series E round first raised in 2021 and maintained its valuation above $1 billion.
The Runtime roundup
HPE acquired Juniper Networks for around $14 billion, hoping to expand its networking business at a time when, as seen above, data-center operators are redesigning their networking strategies around AI workloads.
Microsoft and the Pacific Northwest National Laboratory announced the discovery of a new material that could help produce better lithium-ion batteries while promoting Microsoft's huge investments in AI.
Around 1.3 million Fidelity National Financial customers had data stolen during a cyberattack last November that shut down the mortgage company for a week.
Thanks for reading — see you Thursday!
Editor's note: This newsletter was updated on Wed. Jan 10th to correct the positioning of the electrical systems in AWS data centers.