Watching the detectives

Welcome to Runtime! Today: why observability isn't necessarily a magic bullet for reliability, Red Hat fires back at critics of its new plans for CentOS Stream, and this week in enterprise startup funding.

Leave a trace

Over the last two decades, enterprise tech has learned countless lessons about building reliable infrastructure for web applications. Many of those lessons could only be learned by going to the instant replay: first with monitoring tools, and later with observability tools that reported on what went right and what went wrong.

But as the complexity involved in operating and maintaining applications has skyrocketed, relying on those tools to diagnose problems isn't enough. That was one of the main themes of Monitorama this year, a three-day conference in Portland dedicated to helping the engineers who have to keep the modern world up and running understand how to think about solving problems.

Businesses that want to take full advantage of monitoring and observability tools need to do their homework first, according to Adriana Villela, a developer advocate at Lightstep and member of the OpenTelemetry project.

There are three key signals involved in observability — metrics, logs, and traces — but traces are the most important leg of that stool when trying to get a full picture of system health.
"Traces are key because they tell us the thing that is happening overall from start to finish of your request," she said, and businesses should increase the weight they put on traces compared to metrics and logs when conducting analysis.
Villela also urged attendees to instrument their code, which involves writing new code specifically to track application performance rather than relying on logs for an after-action report.
Developers might balk at the extra work required, but companies need to make code instrumentation a default part of the development process if they want to get the most out of observability tools, she said.
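
To make the idea of instrumentation concrete, here is a minimal sketch of what "writing new code specifically to track application performance" can look like. It uses only the Python standard library and is not the OpenTelemetry API; the `span` helper, the `FINISHED_SPANS` list and the `handle_checkout` function are all hypothetical names invented for illustration. It shows how nested spans that share one trace ID can describe a request from start to finish.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory collector. A real setup would export spans to a
# tracing backend instead of appending them to a list.
FINISHED_SPANS = []
_open_spans = []  # stack of spans currently in progress (not thread-safe)


@contextmanager
def span(name):
    """Time a block of work and link it to the enclosing span, if any."""
    parent = _open_spans[-1] if _open_spans else None
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "start": time.perf_counter(),
    }
    _open_spans.append(record)
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - record["start"]
        _open_spans.pop()
        FINISHED_SPANS.append(record)


def handle_checkout():
    """Hypothetical request handler instrumented with one root span and two children."""
    with span("checkout"):
        with span("load_cart"):
            time.sleep(0.01)  # stand-in for a database call
        with span("charge_card"):
            time.sleep(0.02)  # stand-in for a payment API call


if __name__ == "__main__":
    handle_checkout()
    for s in FINISHED_SPANS:
        print(s["name"], s["parent_id"], round(s["duration_s"], 3))
```

Each child span records its parent's ID, so the three records can be reassembled into a single timeline for the request, which is the kind of start-to-finish view Villela described. A log line written after the fact would not carry that structure.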

But companies also need to be careful when evaluating data produced by those tools, said Jack Neely, observability architect at Palo Alto Networks.

It's easy to fall back on measuring application performance using Google's famous "four golden signals," but there is actually a fifth; none of the first four really matters unless the customer experience is satisfactory, he said.
And, as one attendee pointed out, relying on observability tools means trusting their output, even though those tools are written by flawed human beings and can misfire at any point, like any piece of software. The raw data might actually lead to a different conclusion.
"Who will monitor the monitors?" Neely joked in response, acknowledging the problem.
He advised setting up separate infrastructure running the open-source Prometheus tool to monitor the performance of those tools, in order to really understand what's happening.
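
For readers unfamiliar with the four golden signals, they are latency, traffic, errors and saturation. The sketch below is one rough way to compute them from a window of request records; it is an illustration only, not Neely's method or anything a specific tool does. The `Request` type, the `golden_signals` function and the choice of CPU utilization as the saturation measure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def golden_signals(requests, window_s, cpu_utilization):
    """Summarize one observation window into the four golden signals.

    cpu_utilization is an assumed stand-in for saturation; a real system
    might use queue depth, memory pressure, or something else entirely.
    """
    if not requests:
        raise ValueError("need at least one request in the window")
    return {
        "latency_p95_ms": nearest_rank([r.latency_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_s,
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        "saturation": cpu_utilization,
    }


if __name__ == "__main__":
    # Hypothetical sample: 20 requests over 10 seconds, one of them failed.
    sample = [Request(latency_ms=10.0 * i, ok=(i != 7)) for i in range(1, 21)]
    print(golden_signals(sample, window_s=10, cpu_utilization=0.7))
```

Nothing in this summary says whether customers were actually happy, which is the gap Neely was pointing at: all four numbers can look healthy while the experience is not.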

And, in an ironic twist, having tools that help solve easy infrastructure problems could be making the problems that do sneak through much worse, said Dylan Ratcliffe, founder and CEO of Overmind.

"By improving the understanding of our systems, we're making outages more complex," he said.
In other words, the low-hanging fruit has been picked and what remains are the "unknown unknowns," he said, or the looming infrastructure problems you'll never be able to foresee.
Understanding the root cause of an outage is of course important. But corporations being what they are, postmortem reports produced after outages often lead to new internal processes for deploying software. Those processes slow down deployment, which makes each deployment larger, which in turn increases the severity and complexity of any mistake.

But don't be fooled by vendors rushing to attach themselves to the observability movement by promising a magic fix, said Paige Cruz, senior developer advocate at Chronosphere.

One of the most overused ideas in enterprise software over the last few years is that any one tool can provide "a single pane of glass," sometimes also expressed as "a single source of truth," which sounds really attractive to managers overwhelmed by "tool sprawl," she said.
But purpose-built tools are actually useful to different people working in different job functions, as long as employees feel they're getting value out of those tools, she said.
"The ability to go from a services point of view to a system point of view down into a specific service interaction, that ability to zoom in and out of the system, that is what you need your suite of monitoring and observability tools to provide for you," she said.

A MESSAGE FROM HASHICORP

Operational cloud maturity is the key to helping enterprises get the most from multi-cloud, slash costs, and maximize ROI with respect to speed, risk, and efficiency. Highly mature organizations are less likely to waste money on avoidable cloud spending, have an easier time dealing with cloud security issues, and better cope with the ongoing shortage of cloud skills. See the third annual State of Cloud Strategy Survey, commissioned by HashiCorp and conducted by Forrester Consulting.

Seeing red

Red Hat's Mike McGrath came out firing Monday in response to criticism of the company's decision to limit the ability of other companies and organizations to redistribute clones of Red Hat Enterprise Linux.

After repositioning CentOS in 2020 as essentially a beta version of upcoming RHEL releases, the company continued to publish the RHEL source code previously used to create versions of CentOS on a public site, where anyone could take it and build their own operating system that would be compatible with RHEL. Several free clones were created around that code and widely adopted by enterprises, but Red Hat announced last week that it would only release that code to current customers, who don't appear to be allowed to redistribute it.

"Simply rebuilding code, without adding value or changing it in any way, represents a real threat to open source companies everywhere. This is a real threat to open source, and one that has the potential to revert open source back into a hobbyist- and hackers-only activity," McGrath wrote Monday, acknowledging the torrent of criticism that followed the decision from some open-source developers and from companies that rely on the clones.

For their part, clone makers such as Rocky Linux and AlmaLinux vowed to continue their efforts, but Red Hat's move is yet another example of the friction between traditional open-source practices and the corporate need to increase revenue.

Enterprise funding

Redpanda raised $100 million in Series C funding for its version of an Apache Kafka-based streaming data service.

Cyera also raised $100 million, but in Series B funding, as it looks to ramp up sales of its cloud data security service.

Warp landed $50 million in Series B funding from Sequoia to expand its modern take on the old-fashioned command-line terminal.

Faros AI scored $20 million in Series A funding for a tool built by ex-Salesforce engineers that hopes to streamline the software development process.

Acryl Data raised $21 million in Series A funding to build out its observability tool for ensuring data quality.

The Runtime roundup

Google Cloud's Kelsey Hightower announced his retirement, saying "I hope to spend the rest of my life learning how to live." Check out my 2020 profile of Hightower to learn about the remarkable life he's already had.

IBM acquired Apptio for $4.6 billion from Vista Equity Partners, betting that its cloud cost-management services can help boost revenue.

AWS announced plans to invest $7.8 billion in Ohio to expand its data center capacity in the Buckeye State, home to the second of its US-East regions.

AWS also jumped into the application integration game with AppFabric, a new service designed to help SaaS apps talk to each other.

HashiCorp acquired BluBucket, a small startup working on code security, and plans to integrate its technology into Vault.

Microsoft Outlook users in North America who prefer the web version had trouble accessing the service throughout most of Tuesday.

New Relic laid off around 10% of the company, a move that follows earlier cuts as the company struggles to transition into the observability era.