Today on Runtime: Google Cloud just recovered from one of the worst outages in recent memory; Tailscale's Avery Pennarun on making the easy things easy; and the quote of the week.
Two weeks: that's how long some Google Cloud customers in France were affected by one of the worst outages at a major cloud provider in several years, and the incident might have exposed a weakness in how Google Cloud designs its cloud regions.
On April 25th, water began leaking into one of Google's data centers in what it calls the europe-west9 region, located in Paris and launched just last year.
- Water, as you can imagine, has a negative effect on some of the most powerful computing equipment in the world, and a fire broke out in the battery room of that data center.
- "Subsequently, Google experienced an infrastructure failure that affected our europe-west-9 Cloud region, impacting multiple Google Cloud Services," Google said on its status page for the region earlier this week.
- The entire Paris region was down for more than a day, and parts of the region were down for more than two weeks before Google sounded the all-clear on Wednesday.
- And for some customers, that outage might be permanent: "Some instances located in the impacted portions of the datacenter remain unavailable," Google said Wednesday.
Outages are going to happen to every cloud provider, but modern cloud regions are supposed to be designed around availability zones, which protect an entire region from going down if an incident occurs in one building.
- "Google Cloud intends to offer a minimum of three availability zones (physically and logically distinct zones) in every general-purpose region," according to its documentation, and Paris indeed launched with three availability zones.
- The minute the fire broke out, the zone (europe-west9-a) that contained the affected data center was definitely going to be down for some time, but it is surprising that customers using some of Google's core services — including compute and its managed Kubernetes service — were affected even if they had designed their apps to run across multiple availability zones in that region.
- It's not unprecedented: So many AWS services run out of its notoriously creaky us-east-1 region that an outage in that region in late 2021 caused problems worldwide.
- But that is the oldest collection of data center buildings in AWS's arsenal and the default starting point for countless applications, whereas Google's Paris region is less than a year old.
It's clear this incident was a wake-up call for Google, even if it went unnoticed by much of the world thanks to its geographic isolation.
- But one likely question on the mind of customers: how many other Google Cloud regions could fail if an incident occurs in just a single building in that region?
- Google Cloud declined to comment beyond a brief statement: "We will publish a full incident report with more details when the issue is fully resolved."
- It took things further in its last status page update, promising change: "The Full Incident Report will detail the changes we are making to eliminate the global, regional, and zonal service impact, as well as the overall improvements we will be making based on the RCA (root-cause analysis)."
Reliability might be the most important competitive differentiator in cloud infrastructure services over the next decade, given that the Big Three offer more or less the same number of services and no longer have to explain the benefits of cloud computing to customers.
- And the threats to those data centers will only increase as climate change leads to more intense weather in places previously considered safe.
Everything clicked for Avery Pennarun and his Tailscale co-founders when they sat down to list all the things that were annoying about some of the everyday tasks required to operate a tech organization, like building a dashboard.
"Almost every engineer spends most of their time doing things that don't scale," Pennarun said in a recent interview. But over the last decade, countless companies have convinced themselves that they need infrastructure designed for places like Amazon, Netflix, or Google to run their businesses, which turns relatively simple tasks like building a dashboard to monitor an application into an enormous project.
Tailscale grew out of a desire to simplify life for the 99% of companies that don't need to run what they run at Google, where Pennarun and his co-founders used to work. It's an authentication management service that runs on servers and personal devices inside a company and makes sure the right people have easy access to servers inside a corporate network.
"There's a saying in programming: make the easy things easy and the hard things possible," Pennarun said. "There's lots of products out there that make the hard things possible. But surprisingly, there's very few products that make the easy things easy."
That might be good advice for enterprise tech entrepreneurs: Tailscale has raised $150 million in funding since it was founded in 2019 and just hit the 100-employee milestone. The dirty secret of enterprise tech is that most companies don't need infrastructure that makes hard things possible to thrive on the internet; you are not Google.
Quote of the week
“We shouldn’t regulate AI until we see some meaningful harm that is actually happening, not imaginary scenarios” — Michael Schwarz, chief economist at Microsoft, inadvertently making the case for regulating AI sooner rather than later at the World Economic Forum last week.
The Runtime roundup
Toyota blamed a "cloud misconfiguration" for a data leak involving more than 2 million cars sold in Japan over the last ten years, but it wasn't clear if the error was Toyota's own or caused by a third-party provider.
AWS won't charge for moving data into and out of a new version of its Aurora database, but don't worry about its margins; it will charge you more for using that version than the standard Aurora database.
Thanks for reading — see you Tuesday!