Databricks CEO Ali Ghodsi's goal of a universal data format is close. But new efforts are aiming higher

Format wars tend to produce a winner; think VHS over Betamax, Blu-ray over HD DVD, or LTE over WiMAX. A clear winner makes it easier for late adopters to know they're picking the right horse, but creates a real problem for early adopters that aligned themselves with the losing format.

Another option is to forge compatibility between dueling formats, which is exactly what the cloud data management industry is currently trying to do with Delta Lake — created by Databricks — and Iceberg, which counts support from Databricks' main rival Snowflake as well as a parade of other companies that have endorsed the format. Backers of Hudi, a third, similar format, also see a path forward that could allow users of all three open-source formats to use whatever data query engine they like without having to go through the expensive and time-consuming process of converting their data to work with a new tool.

Last year at his company's Data & AI Summit, Databricks CEO Ali Ghodsi devoted several minutes of his post-keynote press conference to evangelizing the idea of a "USB-C format" for data, which would give end users much more flexibility and allow companies that bet on Delta Lake in the early days of the data lake movement to keep pace with the industry momentum behind Iceberg. Key to that idea was Databricks' acquisition of Tabular, which was founded by the creators of Iceberg. "Our ulterior motive with [the deal] is to bring the formats closer so that we can get interoperability between these two formats, Apache Iceberg and Delta Lake," he said at the time.

In a recent interview with Runtime, Ghodsi reiterated that goal. "I think I'm way more bullish on this project now, on this sort of grand unification of formats, than I was when we even did the acquisition," he said. This project, however, is an extremely complicated undertaking.

All of these formats were developed for different use cases using different specifications, and achieving a lowest-common-denominator format might actually hurt the performance of certain types of applications. And data engineers are increasingly adopting catalogs and other emerging technologies that allow companies to work with their data without having to worry about formats.

"I think we're trying to force a standard top-down," said Vinoth Chandar, founder and CEO of Onehouse and creator of the Hudi format. "It's like, 'okay, everybody should be on Iceberg,' but for what?"

Table stakes

Delta Lake, Iceberg, and Hudi are all commonly built on top of Parquet, an open-source "column-oriented data file format" maintained by the Apache Software Foundation. The three are known as table formats, which means they operate on top of file formats, adding context about the content of files to make it easier for data query engines inside products like Databricks and Snowflake to read from and write to data warehouses and data lakes.
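To make "adding context about the content of files" concrete, here is a minimal Python sketch of the general idea. It is a toy model, not the Delta Lake, Iceberg, or Hudi specification: the `DataFile` and `TableMetadata` names and the per-file min/max statistics are assumptions made for illustration. The point is only that a metadata layer can tell a query engine which files are worth opening.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """One data file (for example a Parquet file) plus stats kept by the table layer."""
    path: str
    min_id: int
    max_id: int


@dataclass
class TableMetadata:
    """Hypothetical table-format metadata: a schema and the list of data files."""
    schema: dict
    files: list = field(default_factory=list)

    def add_file(self, data_file: DataFile) -> None:
        self.files.append(data_file)

    def files_for_id(self, wanted_id: int) -> list:
        # Use the per-file min/max stats to skip files that cannot hold the id.
        return [f.path for f in self.files if f.min_id <= wanted_id <= f.max_id]


table = TableMetadata(schema={"id": "int", "name": "string"})
table.add_file(DataFile("part-0.parquet", min_id=1, max_id=100))
table.add_file(DataFile("part-1.parquet", min_id=101, max_id=200))

# Only the second file can contain id 150, so only it needs to be read.
print(table.files_for_id(150))
```

In this sketch the engine never opens `part-0.parquet` to answer the lookup; the metadata alone rules it out.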

Delta Lake is considered the most widely used open format given that it has been the default format on Databricks for years. Snowflake's platform is used by a large number of companies to manage their data, but Snowflake customers were required to use a proprietary format until last year and the company only announced full support for Iceberg last month.

However, enterprise tech has really thrown its weight behind Iceberg in the last year over concerns that Delta Lake is too tightly controlled by Databricks. AWS and Microsoft sponsored the 2025 Iceberg Summit in April, which also featured presentations from Apple, Autodesk, Google Cloud, IBM, and Tencent alongside Databricks and Snowflake.

Delta Lake and Iceberg are more similar than they are different, and the data community has made great strides toward bringing the formats closer together over the last several years. And the latest version of Iceberg, which has yet to be released but is almost fully baked, addresses a lot of previous incompatibility issues, Ghodsi said.

But one thorny problem remains: when it comes to writing data to the storage layer, Delta Lake and Iceberg use incompatible techniques that were designed to achieve different goals.

Delta Lake prioritizes writing data as quickly as possible to storage, which is great when used with massive data sets such as the ones needed to train AI models, but reading a Delta Lake table takes longer because the data isn't as organized. Iceberg, on the other hand, is slower when writing data to storage because it takes some additional steps to organize that data, but that makes it a much faster reader of that data when conducting queries needed for business intelligence reports.
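The general trade-off described here, cheap writes versus cheap reads, can be sketched in a few lines of Python. This is an illustration of the trade-off only and does not reflect how either format is actually implemented; the `AppendOnlyStore` and `SortedStore` classes are hypothetical.

```python
import bisect


class AppendOnlyStore:
    """Write-optimized toy store: writes just append, reads scan every row."""

    def __init__(self) -> None:
        self.rows = []

    def write(self, key: int) -> None:
        self.rows.append(key)  # cheapest possible write, no organizing

    def contains(self, key: int) -> bool:
        return any(row == key for row in self.rows)  # linear scan


class SortedStore:
    """Read-optimized toy store: writes keep rows sorted, reads binary-search."""

    def __init__(self) -> None:
        self.rows = []

    def write(self, key: int) -> None:
        bisect.insort(self.rows, key)  # extra work at write time to keep order

    def contains(self, key: int) -> bool:
        i = bisect.bisect_left(self.rows, key)  # binary search
        return i < len(self.rows) and self.rows[i] == key


fast_writer, fast_reader = AppendOnlyStore(), SortedStore()
for key in [42, 7, 19, 3]:
    fast_writer.write(key)
    fast_reader.write(key)

print(fast_writer.rows)  # rows in arrival order
print(fast_reader.rows)  # rows kept sorted
```

Both stores answer the same questions; they differ only in whether the organizing work is paid when data is written or when it is read.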

"Customers don't want to choose one format; they need fast writes and need fast reads," said Chris Child, vice president of product at Snowflake, in a recent interview. "Our view is that the more we can standardize, the better it becomes."

Fighting the last war?

While Snowflake's decision to go all-in on Iceberg created a data-format standoff between the two bitter rivals, data vendors and consumers are increasingly looking beyond formats and eyeing data catalogs and other tools that make it easier to work with data across multiple formats and locations.

A data catalog "really is like a catalog in the simplest sense of, 'what are the things that are available and how do I know that?'" Child said. "The other thing that it provides that is important is locking, and [tracking], 'who is writing to this table right now,' so it allows you to enforce ACID compliance like a traditional database."
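The two jobs Child describes, listing what is available and tracking who is writing to a table, can be sketched as a toy in-memory catalog. This is not the API of Polaris, Unity Catalog, or any real product; `ToyCatalog` and its method names are hypothetical, the storage path is made up, and the sketch is single-threaded for simplicity.

```python
class ToyCatalog:
    """Hypothetical catalog: knows which tables exist and who holds each write lock."""

    def __init__(self) -> None:
        self._tables = {}   # table name -> storage location
        self._writers = {}  # table name -> current writer

    def register(self, name: str, location: str) -> None:
        self._tables[name] = location

    def list_tables(self) -> list:
        # "What are the things that are available?"
        return sorted(self._tables)

    def acquire_write(self, name: str, writer: str) -> bool:
        if name not in self._tables:
            raise KeyError(name)
        if name in self._writers:
            return False  # someone else is already writing to this table
        self._writers[name] = writer
        return True

    def current_writer(self, name: str):
        # "Who is writing to this table right now?"
        return self._writers.get(name)

    def release_write(self, name: str, writer: str) -> None:
        # Only the writer that holds the lock can release it.
        if self._writers.get(name) == writer:
            del self._writers[name]


catalog = ToyCatalog()
catalog.register("sales", "s3://example-bucket/sales")  # made-up location
first = catalog.acquire_write("sales", "etl-job")
second = catalog.acquire_write("sales", "bi-tool")
print(catalog.list_tables(), first, second, catalog.current_writer("sales"))
```

A real catalog would need to make the lock check and update atomic across many clients; the sketch only shows the bookkeeping.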

Snowflake announced the Polaris Catalog at its Snowflake Summit event in June 2024, and Databricks followed by open-sourcing its Unity Catalog the following week. Both companies are expected to discuss their catalog products in greater detail over the next several weeks at the 2025 editions of their conferences.

While catalogs are the answer to questions about format incompatibility for a lot of customers, some worry the catalog just creates a new form of vendor lock-in at a different layer of the data stack.

"[Query] engines and catalogs are very close together," Onehouse's Chandar said. "If you're within those walled gardens, you typically use the catalog that is native within that ecosystem," which means Snowflake/Iceberg shops will probably use Polaris and Databricks customers will stick with Unity, he said.

A startup called Nextdata thinks data engineers are ready for a new abstraction layer that sits above the catalog. Founded by Zhamak Dehghani, who invented the concept of the data mesh, Nextdata "provides this domain-centric, domain-oriented, standardized way of discovering, governing and managing that data, regardless of what compute [engine] and storage [format] that data happens to be in," she said in a recent interview.

Data vendors have tried to sell customers on the tantalizing notion that they can use one unified pool of data for all their needs, but "people have accepted that's not going to be true," Dehghani said. While achieving that harmony would solve a lot of problems, chasing that elusive goal is frustrating, she said.

"We built a layer on top, and we say, 'well, how about you have your cake and eat it too.' Put your data in any storage [system] you want, in any format you want, and throw any compute you want at it," Dehghani said.

And while Ghodsi remains committed to working out the remaining compatibility issues between Delta Lake and Iceberg, he acknowledged that efforts like Tabular's support for the REST catalog protocol, which allows different catalogs to talk to each other, could have a more lasting impact.

"One thing that's really interesting here is that more and more capabilities going forward can be put on this catalog interface," he said. "So in other words, this sort of war on formats — do you put the bit this way, or do you store the invariant this way — maybe it's not needed. Maybe all that's needed is you talk to the catalog, you tell it, 'I want this data', and then it gives it back to you in the format you want it. And how did it actually store it behind that catalog? You don't need to know, as long as it's serving you what you want."

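The idea in Ghodsi's quote, asking a catalog for data and getting it back in whatever format the caller wants, can be sketched as follows. This is a loose illustration and not the Iceberg REST catalog protocol: `FormatAgnosticCatalog` is hypothetical, and JSON and CSV stand in for real table formats.

```python
import csv
import io
import json


class FormatAgnosticCatalog:
    """Hypothetical catalog that hides how rows are stored behind a read interface."""

    def __init__(self) -> None:
        self._tables = {}  # table name -> list of row dicts (an internal detail)

    def store(self, name: str, rows: list) -> None:
        self._tables[name] = list(rows)

    def read(self, name: str, fmt: str) -> str:
        rows = self._tables[name]
        if fmt == "json":
            return json.dumps(rows)
        if fmt == "csv":
            buf = io.StringIO()
            fieldnames = list(rows[0].keys()) if rows else []
            writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
            writer.writeheader()
            writer.writerows(rows)
            return buf.getvalue()
        raise ValueError(f"unsupported format: {fmt}")


catalog_api = FormatAgnosticCatalog()
catalog_api.store("users", [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])

print(catalog_api.read("users", "json"))
print(catalog_api.read("users", "csv"))
```

The caller names a table and a format and never learns how the rows are kept behind the interface, which is the property the quote describes.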

Tom Krazit
