Article updated on June 28, 2022 to reflect new Delta Lake open source announcement and other updates.
Article updated on April 13, 2023 with a first round of revisions, including refreshed GitHub stats in the main summary chart and newer pie charts of contribution percentages. Engine support was also updated, with a deeper update to follow.
Engine support updated on August 7, 2023; the main changes are to Apache Beam and Redshift among the engines being tracked.
What Is a Table Format?
Table formats allow us to interact with data lakes as easily as we interact with databases, using our favorite tools and languages. A table format allows us to abstract different data files as a singular dataset, a table.
Data in a data lake is often spread across many files. We can engineer and analyze this data using R, Python, Scala, and Java with tools like Spark and Flink. Being able to define a group of these files as a single dataset, a table, makes analyzing them much easier (versus manually grouping files, or analyzing one file at a time). On top of that, SQL depends on the idea of a table, and SQL is probably the most accessible language for conducting analytics.
Hive: A First-Generation Table Format
The original table format was Apache Hive. In Hive, a table is defined as all the files in one or more particular directories. While this enabled SQL expressions and other analytics to be run on a data lake, it couldn’t effectively scale to the volumes and complexity of analytics needed to meet today’s needs. Other table formats were developed to provide the scalability required.
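Hive’s directory-as-table idea can be sketched in a few lines of plain Python. This is a toy illustration with made-up file names, not Hive’s implementation: the table is simply whatever files happen to live under its directory tree.

```python
from pathlib import Path
import tempfile

def hive_table_files(table_dir: Path) -> list[Path]:
    """A Hive-style table is simply every data file under its directory tree."""
    return sorted(p for p in table_dir.rglob("*") if p.is_file())

# Build a toy table directory: two partition directories, each with one data file.
root = Path(tempfile.mkdtemp())
for part in ("year=2021", "year=2022"):
    d = root / part
    d.mkdir()
    (d / "part-00000.parquet").write_bytes(b"")

files = hive_table_files(root)
print([str(f.relative_to(root)) for f in files])
```

Note the limitation this illustrates: the table’s contents are defined only by a directory listing, so any concurrent writer or partial failure that changes the directory changes the “table” with no transactional guarantees.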
The Next Generation
Introducing: Apache Iceberg, Apache Hudi, and Databricks Delta Lake. All three take a similar approach of leveraging metadata to handle the heavy lifting. Metadata structures are used to define:
What is the table?
What is the table’s schema?
How is the table partitioned?
What data files make up the table?
While starting from a similar premise, each format has many differences, which may make one table format more compelling than another when it comes to enabling analytics on your data lake.
In this article we will compare these three formats across the features they aim to provide, the compatible tooling, and community contributions that ensure they are good formats to invest in long term.
[Chart: high-level comparison of the three formats, including schema evolution (detailed in a later chart) and governance. Apache Iceberg and Apache Hudi are Apache top-level projects with diverse PMCs, while Delta Lake is a Linux Foundation project with an all-Databricks TSC.]
This article will primarily focus on comparing open-source table formats that enable you to run analytics using open architecture on your data lake using different engines and tools so we will be focusing on the open-source version of Delta Lake. Open architectures help minimize costs, avoid vendor lock-in, and ensure the latest and best-in-breed tools can always be available on your data.
Keep in mind Databricks has its own proprietary fork of Delta Lake, which has features only available on the Databricks platform. This is also true of Spark - Databricks-managed Spark clusters run a proprietary fork of Spark with features only available to Databricks customers. These proprietary forks aren’t open to enable other engines and tools to take full advantage of them, so they are not this article's focus.
One of the benefits of moving away from Hive’s directory-based approach is that it opens a new possibility of having ACID (Atomicity, Consistency, Isolation, Durability) guarantees on more types of transactions, such as inserts, deletes, and updates. In addition to ACID functionality, next-generation table formats enable these operations to run concurrently.
Apache Iceberg’s approach is to define the table through three categories of metadata. These categories are:
“metadata files” that define the table
“manifest lists” that define a snapshot of the table
“manifests” that define groups of data files that may be part of one or more snapshots
Query optimization and all of Iceberg’s features are enabled by the data in these three layers of metadata.
Through the metadata tree (i.e., metadata files, manifest lists, and manifests), Iceberg provides snapshot isolation and ACID support. When a query is run, Iceberg will use the latest snapshot unless otherwise stated. Writes to any given table create a new snapshot, which does not affect concurrent queries. Concurrent writes are handled through optimistic concurrency (whoever writes the new snapshot first, does so, and other writes are reattempted).
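The snapshot isolation and optimistic concurrency described above can be sketched as a toy model in plain Python. This is illustrative only; the names and structures are not Iceberg’s actual internals:

```python
import itertools

class Table:
    """Toy model of snapshot-based optimistic concurrency (conceptual, not Iceberg's code)."""
    def __init__(self):
        self.snapshots = {0: frozenset()}   # snapshot_id -> set of data files
        self.current_id = 0
        self._ids = itertools.count(1)

    def read(self):
        # Readers always see a consistent snapshot; concurrent writes don't affect them.
        return self.current_id, self.snapshots[self.current_id]

    def commit(self, base_id, added_files):
        # Optimistic concurrency: the commit succeeds only if the table
        # hasn't moved past the snapshot the writer started from.
        if base_id != self.current_id:
            return False                     # caller must rebase and retry
        new_id = next(self._ids)
        self.snapshots[new_id] = self.snapshots[base_id] | frozenset(added_files)
        self.current_id = new_id
        return True

t = Table()
base, _ = t.read()
assert t.commit(base, {"a.parquet"})         # writer 1 wins the race
assert not t.commit(base, {"b.parquet"})     # writer 2 started from a stale snapshot
base2, _ = t.read()
assert t.commit(base2, {"b.parquet"})        # writer 2 retries against the new snapshot
```

The key property: a failed commit loses nothing but the commit attempt itself; the writer simply retries against the newer snapshot.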
Beyond the typical creates, inserts, and merges, row-level updates and deletes are also possible with Apache Iceberg. All of these transactions are possible using SQL commands.
Apache Hudi’s approach is to group all transactions into different types of actions that occur along a timeline. Hudi uses a directory-based approach with timestamped data files and log files that track changes to the records in those data files. Hudi gives you the option to enable a metadata table for query optimization (the metadata table is on by default starting in version 0.11.0). This table tracks a list of files that can be used for query planning instead of file operations, avoiding a potential bottleneck for large datasets.
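The timeline idea can be sketched in plain Python. This is a toy model with made-up instant timestamps and action names, not Hudi’s actual API:

```python
from bisect import bisect_right

# Toy model of a Hudi-style timeline: an ordered list of (timestamp, action)
# instants. The timestamps and action names here are illustrative.
timeline = [
    ("20230801090000", "commit"),
    ("20230801100000", "deltacommit"),
    ("20230801110000", "clean"),
    ("20230801120000", "commit"),
]

def instants_as_of(ts: str):
    """Return every instant completed at or before ts (a point-in-time view)."""
    stamps = [t for t, _ in timeline]
    return timeline[: bisect_right(stamps, ts)]

view = instants_as_of("20230801103000")
print(view)   # only the first two instants are visible at that point in time
```

Because every action is an instant on the timeline, querying a table “as of” a timestamp reduces to replaying only the instants at or before that point.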
Partitions allow for more efficient queries that don’t scan the full depth of a table every time. Partitions are an important concept when you are organizing the data to be queried effectively. Often, the partitioning scheme of a table will need to change over time. With Hive, changing partitioning schemes is a very heavy operation. If data was partitioned by year and we wanted to change it to be partitioned by month, it would require a rewrite of the entire table. More efficient partitioning is needed for managing data at scale.
Partition evolution allows us to update the partition scheme of a table without having to rewrite all the previous data.
Apache Iceberg is currently the only table format with partition evolution support. Partitions are tracked based on the partition column and the transform on the column (like transforming a timestamp into a day or year).
This is different from typical approaches, which rely on the values of a particular column and often require making new columns just for partitioning. With the traditional way, pre-Iceberg, data consumers would need to know to filter by the partition column to get the benefits of the partition (a query that includes a filter on a timestamp column but not on the partition column derived from that timestamp would result in a full table scan). This is a huge barrier to enabling broad usage of any underlying system. Iceberg enables great functionality for getting maximum value from partitions and delivering performance even for non-expert users.
Since Iceberg partitions track a transform on a particular column, that transform can evolve as the need arises. Query filtering based on the transformed column will benefit from the partitioning regardless of which transform is used on any portion of the data.
For example, a timestamp column can be partitioned by year then easily switched to month going forward with an ALTER TABLE statement.
When the data is filtered by the timestamp column, the query is able to leverage the partitioning of both portions of the data (i.e., the portion partitioned by year and the portion partitioned by month). As shown above, these operations are handled via SQL. Partition evolution gives Iceberg two major benefits over other table formats:
A rewrite of the table is not required to change how data is partitioned
A query can be optimized by all partition schemes (data partitioned by different schemes will be planned separately to maximize performance)
Note: Not having to create additional partition columns that require explicit filtering to benefit from is a special Iceberg feature called Hidden Partitioning.
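The planning behavior described above can be sketched in plain Python. This is a conceptual toy, not Iceberg’s implementation; the transforms, file names, and partition values are made up:

```python
from datetime import date

# Toy model of Iceberg-style partition evolution: each data file is written
# under a partition *spec* (a transform on a source column), and old files
# keep the spec they were written with.
def year_transform(ts: date):  return ts.year
def month_transform(ts: date): return (ts.year, ts.month)

specs = {0: year_transform, 1: month_transform}

# (file, spec_id, partition value): f1/f2 predate the evolution to monthly.
files = [
    ("f1.parquet", 0, 2021),
    ("f2.parquet", 0, 2022),
    ("f3.parquet", 1, (2022, 6)),
]

def plan(ts: date):
    """Plan a scan for rows at timestamp ts: each file is pruned using the
    transform it was written with, so both layouts benefit from partitioning."""
    keep = []
    for name, spec_id, value in files:
        if specs[spec_id](ts) == value:
            keep.append(name)
    return keep

print(plan(date(2022, 6, 15)))   # matches the year=2022 file and the (2022, 6) file
```

Because the consumer filters on the source timestamp column, not on a partition column, the query is pruned correctly across both the year-partitioned and month-partitioned portions of the data.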
Hudi does not support partition evolution or hidden partitioning. A result similar to hidden partitioning can be achieved with the data-skipping feature (currently only supported for tables in read-optimized mode).
Delta Lake does not support partition evolution. It can achieve something similar to hidden partitioning with its generated columns feature, which is currently in public preview for Databricks Delta Lake and still awaiting full support in OSS Delta Lake.
As data evolves over time, so does a table’s schema: columns may need to be renamed, types changed, columns added, and so forth. All three table formats support different levels of schema evolution. The chart below details the types of updates you can make to your table’s schema.
The ability to evolve a table’s schema is a key feature. If you can’t make necessary evolutions, your only option is to rewrite the table, which can be an expensive and time-consuming operation.
Time travel allows us to query a table at its previous states. A common use case is to test updated machine learning algorithms on the same data used in previous model tests. Comparing models against the same data is required to properly understand the changes to a model.
In general, all formats enable time travel through “snapshots.” Each snapshot contains the files associated with it. Periodically, you’ll want to clean up older, unneeded snapshots to prevent unnecessary storage costs. Each table format has different tools for maintaining snapshots, and once a snapshot is removed you can no longer time-travel to that snapshot.
Every time an update is made to an Iceberg table, a snapshot is created. You can specify a snapshot-id or timestamp and query the data as it was at that point with Apache Iceberg. To maintain Apache Iceberg tables you’ll want to periodically expire snapshots using the expireSnapshots procedure to reduce the number of files stored (for instance, you may want to expire all snapshots older than the current year). Once a snapshot is expired you can’t time-travel back to it.
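Snapshot-based time travel and expiration can be sketched as a toy model in plain Python. The `as_of` and `expire_snapshots` names are made up for illustration; this is not the actual expireSnapshots procedure:

```python
# Toy model of snapshot-based time travel and expiration (conceptual sketch).
snapshots = {  # snapshot_id -> (commit timestamp, files in the snapshot)
    1: ("2022-01-10", {"a.parquet"}),
    2: ("2022-03-05", {"a.parquet", "b.parquet"}),
    3: ("2022-06-20", {"a.parquet", "b.parquet", "c.parquet"}),
}

def as_of(ts: str):
    """Time travel: read the latest snapshot committed at or before ts."""
    eligible = [(t, sid) for sid, (t, _) in snapshots.items() if t <= ts]
    if not eligible:
        raise LookupError(f"no snapshot at or before {ts}")
    _, sid = max(eligible)
    return snapshots[sid][1]

def expire_snapshots(older_than: str):
    """Expire snapshots older than a cutoff; they can no longer be queried."""
    current = max(snapshots)   # never expire the current snapshot
    for sid in [s for s, (t, _) in snapshots.items() if t < older_than and s != current]:
        del snapshots[sid]

assert as_of("2022-04-01") == {"a.parquet", "b.parquet"}
expire_snapshots("2022-05-01")   # drops snapshots 1 and 2
```

After expiration, the same `as_of` query fails: once a snapshot is gone, so is the ability to time-travel to it.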
The Hudi table format revolves around a table timeline, enabling you to query previous points along the timeline. To maintain Hudi tables use the Hoodie Cleaner application. Once you have cleaned up commits you will no longer be able to time travel to them.
Each Delta file represents the changes of the table from the previous Delta file, so you can target a particular Delta file or checkpoint to query earlier states of the table. By default, Delta Lake retains the last 30 days of a table’s history; this retention period is adjustable via the table’s data retention settings. Use the vacuum utility to clean up data files from expired snapshots. With Delta Lake, you can’t time travel to points whose log files have been deleted without a checkpoint to reference.
Depending on which logs are cleaned up, you may disable time travel to a bundle of snapshots. For example, say you have logs 1-30, with a checkpoint created at log 15. Vacuuming log 1 will disable time travel to logs 1-14, since there is no earlier checkpoint to rebuild the table from.
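The log/checkpoint reachability rule in that example can be sketched in plain Python. This is a toy model; Delta’s actual log replay is more involved:

```python
# Toy model of the example above: logs 1-30 with a checkpoint at log 15.
# Reconstructing version N needs the nearest checkpoint at or before N plus
# every log after it, so vacuuming log 1 breaks time travel to versions 1-14.
logs = set(range(1, 31))
checkpoints = {15}

def can_time_travel(version: int) -> bool:
    # Start from the nearest checkpoint at or before the requested version
    # (or from version 0, an empty table), then replay every later log.
    base = max((c for c in checkpoints if c <= version), default=0)
    needed = set(range(base + 1, version + 1))
    return needed <= logs

assert can_time_travel(14)       # all logs 1-14 are still present
logs.discard(1)                  # "vacuum" log 1
assert not can_time_travel(14)   # versions 1-14 now unreachable...
assert can_time_travel(20)       # ...but the checkpoint at 15 still covers 16-30
```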
Governance and Community Contributions
Having an open source license and a strong open source community enables table format projects to evolve, improve at greater speeds, and continue to be maintained for the long term. The chart below compares the open source community support for the three formats as of 3/28/22.
[Chart: open source community comparison covering Apache governance, open source license, and date of first open source commit for each project.]
Many projects are created out of a need at a particular company. Apache Iceberg came out of Netflix, Hudi out of Uber, and Delta Lake out of Databricks.
There are many different types of open-source licensing, including the popular Apache license. Before becoming an Apache Project, reporting, governance, technical, branding, and community standards must be met. The Apache Project status assures that there is a fair governing body behind a project and that the commercial influences of any particular company aren’t steering it.
Comparing Recent Iceberg, Hudi, and Delta Lake Contributions in GitHub
In the chart above we see the summary of current GitHub stats over a 30-day period, which illustrates the current moment of contributions to a particular project. When choosing an open-source project to build your data architecture around you want strong contribution momentum to ensure the project's long-term support. Let’s look at several other metrics relating to the activity in each project’s GitHub repository and discuss why they matter.
Pull-requests are actual code from contributors being offered to add a feature or fix a bug. This is probably the strongest signal of community engagement as developers contribute their code to the project. Here we look at merged pull requests instead of closed pull requests as these represent code that has actually been added to the main code base (closed pull requests aren’t necessarily code added to the code base).
Stars are one way to show support for a project. However, while they can demonstrate interest, they don’t signify a track record of community contributions to the project like pull requests do. For that reason, community contributions are a more important metric than stars when you’re assessing the longevity of an open-source project as the basis for your data architecture.
Commits are changes to the repository. An actively growing project should have frequent and voluminous commits in its history to show continued development.
A diverse community of developers from different companies is a sign that a project will not be dominated by the interests of any particular company. Below are some charts showing the proportion of contributions each table format has from contributors at different companies. The info is based on data pulled from the GitHub API. Extra efforts were made to identify the company of any contributors who made 10 or more contributions but didn’t have their company listed on their GitHub profile.
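The aggregation behind such charts can be sketched as follows. The contributor names, companies, and counts below are made up for illustration; the real analysis pulled contribution data from the GitHub API:

```python
from collections import Counter

# Toy sketch of turning contributor -> company attributions into percentage
# shares. The drop_unknown flag mirrors the later recalculation that removed
# contributors whose company could not be researched.
contributions = [
    ("alice", 40), ("bob", 25), ("carol", 20), ("dave", 15),
]
company_of = {"alice": "CompanyA", "bob": "CompanyA", "carol": "CompanyB", "dave": None}

def share_by_company(contribs, companies, drop_unknown=False):
    totals = Counter()
    for user, n in contribs:
        company = companies.get(user) or "unknown"
        if drop_unknown and company == "unknown":
            continue
        totals[company] += n
    grand = sum(totals.values())
    return {c: round(100 * n / grand, 1) for c, n in totals.items()}

print(share_by_company(contributions, company_of))
print(share_by_company(contributions, company_of, drop_unknown=True))
```

Note how dropping unknowns changes the shares: removing unattributable contributors inflates the percentage of every company that remains, which is why the two sets of charts differ.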
[Note: This info is based on contributions to each project’s core repository on GitHub, measuring contributions as issues/pull requests and commits in the GitHub repository. Activity or code merges that occur in other upstream or private repositories are not factored in since there is no visibility into that activity. Delta Lake boasts that 6,400 developers have contributed to it, but this article only reflects what is independently verifiable through open-source repository activity.]
Attributable Repository Contributors
[UPDATE] On January 13th, 2023 the numbers for the apache/iceberg and delta-io/delta repositories were recalculated using the same methodology as above. The charts below represent the results; in this run, all contributors whose company could not be researched were removed, so these charts only account for contributors whose contributions could be attributed to a particular company, versus the charts above, which included "unknown" contributors. If you want to recalculate these values, you can find my scripts and methodology in this repository.
Greater release frequency is a sign of active development. I recommend this article by AWS’s Gary Stafford for charts regarding release frequency. Read the full article for many other interesting observations and visualizations.
When you’re looking at an open source project, two things matter quite a bit:
Number of community contributions
Whether the project is community governed
Community contributions matter because they can signal whether the project will be sustainable for the long haul. Community governed matters because when one particular party has too much control of the governance it can result in unintentional prioritization of issues and pull requests towards that party’s particular interests.
The Risk of Smaller Communities
When you are architecting your data lake for the long term it’s imperative to choose a table format that is open and community governed. For example, when it came to file formats, Apache Parquet became the industry standard because it was open, Apache governed, and community driven, allowing adopters to benefit from those attributes.
Apache top-level projects require community maintenance and are quite democratized in their evolution. Of the three table formats, Delta Lake is the only non-Apache project. Looking at Delta Lake, we can observe things like:
It is Databricks employees who respond to the vast majority of issues. Generally, community-run projects should have several members of the community, from several companies, responding to issues. When one company is responsible for the majority of a project’s activity, the project can be at risk if anything happens to the company. For example, see these three recent issues (one, two and three).
Four of the five most recent pull requests are from Databricks employees (most recent being PR #1010 at the time of writing)
The majority of the issues that make it to the roadmap are issues initiated by Databricks employees
Eventually, one of these table formats will become the industry standard. If history is any indicator, the winner will have a robust feature set, community governance model, active community, and an open source license. Looking at the activity in Delta Lake’s development, it’s hard to argue that it is community driven. And when one company controls the project’s fate, it’s hard to argue that it is an open standard, regardless of the visibility of the codebase.
A table format wouldn’t be useful if the tools data professionals used didn’t work with it. In this section, we’ll discuss some of the more popular tools for analyzing and engineering data on your data lake and their support for different table formats. It’s important not only to be able to read data, but also to be able to write data so that data engineers and consumers can use their preferred tools.
The Tale of Two Sparks
One important distinction to note is that there are two versions of Spark. There is the open source Apache Spark, which has a robust community and is used widely in the industry. Then there is Databricks Spark, the Databricks-maintained fork optimized for the Databricks platform. This distinction also exists with Delta Lake: there is an open source version and a version that is tailored to the Databricks platform, and the features between them aren’t always identical (for example SHOW CREATE TABLE is supported with Databricks proprietary Spark/Delta but not with open source Spark/Delta at time of writing).
In the chart below, we consider write support available if multiple clusters using a particular engine can safely read and write to the table format.
Typically, Parquet’s binary columnar file format is the prime choice for storing data for analytics. However, there are situations where you may want your table format to use other file formats like AVRO or ORC. Below is a chart that shows which file formats each table format supports for a table’s data files.
Questions to Ask Before Adopting a Table Format
Likely one of these three next-generation formats will displace Hive as an industry standard for representing tables on the data lake. When you choose which format to adopt for the long haul make sure to ask yourself questions like:
Which format has the most robust version of the features I need?
Which format enables me to take advantage of most of its features using SQL so it’s accessible to my data consumers?
Which format has the momentum with engine support and community support?
Which format will give me access to the most robust version-control tools?
These questions should help you future-proof your data lake and inject it with the cutting-edge features newer table formats provide.