
by Alex Merced, Developer Advocate at Dremio

Article updated May 23, 2022 to reflect new support for Delta Lake multi-cluster writes on S3.

Article updated on May 12, 2022 to reflect additional tooling support and updates from the newly released Hudi 0.11.0.

Article updated on June 7, 2022 to reflect a new Flink support bug fix for Delta Lake OSS, along with an updated calculation of contributions that better reflects each top contributor's employer at the time of their commits.

Article updated on June 28, 2022 to reflect new Delta Lake open source announcement and other updates.

What Is a Table Format?

Table formats allow us to interact with data lakes as easily as we interact with databases, using our favorite tools and languages. A table format allows us to abstract different data files as a singular dataset, a table.

Data in a data lake can often be stretched across several files. We can engineer and analyze this data using R, Python, Scala and Java using tools like Spark and Flink. Being able to define groups of these files as a single dataset, such as a table, makes analyzing them much easier (versus manually grouping files, or analyzing one file at a time). On top of that, SQL depends on the idea of a table and SQL is probably the most accessible language for conducting analytics.

Hive: A First-Generation Table Format

The original table format was Apache Hive. In Hive, a table is defined as all the files in one or more particular directories. While this enabled SQL expressions and other analytics to be run on a data lake, it couldn’t effectively scale to the data volumes and analytics complexity of today’s workloads. Other table formats were developed to provide the scalability required.

The Next Generation

Introducing: Apache Iceberg, Apache Hudi, and Databricks Delta Lake. All three take a similar approach of leveraging metadata to handle the heavy lifting. Metadata structures are used to define:

  • What is the table?
  • What is the table’s schema?
  • How is the table partitioned?
  • What data files make up the table?

While starting from a similar premise, each format has many differences, which may make one table format more compelling than another when it comes to enabling analytics on your data lake.

In this article we will compare these three formats across the features they aim to provide, the tooling they are compatible with, and the community contributions that ensure they are good formats to invest in for the long term.

Feature | Iceberg | Hudi | Delta Lake
ACID Transactions | Yes | Yes | Yes
Partition Evolution | Yes | No | No
Schema Evolution (later chart with more detail) | Yes | Partial | Limited
Time-Travel | Yes | Yes | Yes
Project Governance | Apache Project with a diverse PMC (top-level project) | Apache Project with a diverse PMC (top-level project) | Linux Foundation Project with an all-Databricks TSC
Community Contributions (stats as of 3/28/22) | 240 contributors, 2241 merged PRs, 275 open PRs | 252 contributors, 2880 merged PRs, 160 open PRs | 145 contributors, 16 merged PRs, 43 open PRs
Tool Read Compatibility | Apache Hive, Dremio Sonar, Apache Flink, Apache Spark, Presto, Trino, Athena, Snowflake, Databricks Spark, Apache Impala, Apache Drill | Apache Hive, Apache Flink, Apache Spark, Presto, Trino, Athena, Databricks Spark, Redshift, Apache Impala, BigQuery | Apache Hive, Dremio Sonar, Apache Flink, Databricks Spark, Apache Spark, Databricks SQL Analytics, Trino, Presto, Snowflake, Redshift, Apache Beam, Athena
Tool Write Compatibility | Apache Hive, Dremio Sonar, Apache Flink, Apache Spark, Trino, Athena, Databricks Spark, Debezium | Apache Flink, Apache Spark, Databricks Spark, Debezium, Kafka Connect | OSS Delta Lake: Trino, Apache Spark, Databricks Spark, Apache Flink, Debezium; Databricks Delta Lake: Databricks Spark, Kafka Connect
File Format Support | Parquet, ORC, AVRO | Parquet, ORC | Parquet

* Note Regarding Delta Lake and Spark

This article will primarily focus on comparing open source table formats that enable you to run analytics using open architecture on your data lake using different engines and tools, so we will be focusing on the open source version of Delta Lake. Open architectures help minimize costs, avoid vendor lock-in, and make sure the latest and best-in-breed tools can always be available for use on your data.

Keep in mind Databricks has its own proprietary fork of Delta Lake, which has features only available on the Databricks platform. This is also true of Spark - Databricks-managed Spark clusters run a proprietary fork of Spark with features only available to Databricks customers. These proprietary forks aren’t open to enable other engines and tools to take full advantage of them, so are not the focus of this article.

ACID Transactions

One of the benefits of moving away from Hive’s directory-based approach is that it opens a new possibility of having ACID (Atomicity, Consistency, Isolation, Durability) guarantees on more types of transactions, such as inserts, deletes, and updates. In addition to ACID functionality, next-generation table formats enable these operations to run concurrently.

Apache Iceberg

Apache Iceberg’s approach is to define the table through three categories of metadata. These categories are:

  • “metadata files” that define the table
  • “manifest lists” that define a snapshot of the table
  • “manifests” that define groups of data files that may be part of one or more snapshots

Query optimization and all of Iceberg’s features are enabled by the data in these three layers of metadata.

Through the metadata tree (i.e., metadata files, manifest lists, and manifests), Iceberg provides snapshot isolation and ACID support. When a query is run, Iceberg will use the latest snapshot unless otherwise stated. Writes to any given table create a new snapshot, which does not affect concurrent queries. Concurrent writes are handled through optimistic concurrency (whoever writes the new snapshot first, does so, and other writes are reattempted).

Beyond the typical creates, inserts, and merges, row-level updates and deletes are also possible with Apache Iceberg. All of these transactions are possible using SQL commands.
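For illustration, here is a minimal sketch of these row-level operations using Spark SQL with the Iceberg runtime and SQL extensions (the catalog, table, and column names are hypothetical):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog named "demo"
# and the Iceberg SQL extensions enabled.
spark = SparkSession.builder.appName("iceberg-dml").getOrCreate()

# Row-level update: only the affected data files are rewritten in a new snapshot.
spark.sql("UPDATE demo.db.orders SET status = 'shipped' WHERE order_id = 1001")

# Row-level delete: concurrent readers keep seeing the previous snapshot.
spark.sql("DELETE FROM demo.db.orders WHERE status = 'cancelled'")

# Upsert with MERGE INTO, combining inserts and updates in one atomic commit.
spark.sql("""
    MERGE INTO demo.db.orders AS t
    USING demo.db.order_updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```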

Apache Hudi

Apache Hudi’s approach is to group all transactions into different types of actions that occur along a timeline. Hudi uses a directory-based approach with data files that are timestamped and log files that track changes to the records in those data files. Hudi gives you the option to enable a metadata table for query optimization (the metadata table is on by default starting in version 0.11.0). This table tracks the list of files that can be used for query planning instead of file listing operations, avoiding a potential bottleneck for large datasets.

Apache Hudi also has atomic transactions and SQL support for CREATE TABLE, INSERT, UPDATE, DELETE, and queries.
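As a rough sketch, a Hudi upsert in PySpark with the metadata table setting spelled out explicitly looks like the following (the table name, key fields, and path are hypothetical; the config keys follow the Hudi 0.11.x documentation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-write").getOrCreate()
df = spark.createDataFrame(
    [(1001, "shipped", "2022-03-28 10:00:00")], ["order_id", "status", "ts"]
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # Metadata table used for file listings during query planning
    # (on by default since 0.11.0, shown here only for clarity).
    "hoodie.metadata.enable": "true",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/warehouse/orders")
)
```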

Delta Lake

Delta Lake’s approach is to track metadata in two types of files: 

  1. Delta Logs sequentially track changes to the table. 
  2. Checkpoints summarize all changes to the table up to that point minus transactions that cancel each other out.

Delta Lake also supports ACID transactions and includes SQL support for creates, inserts, merges, updates, and deletes.
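For illustration, here is a hedged sketch of what that two-file structure looks like on storage for a hypothetical table path: each numbered JSON file is a Delta log (one commit), and the periodic Parquet checkpoint summarizes table state up to that commit.

```python
import os

# Hypothetical local table path; on S3/ADLS you would list objects instead.
delta_log = "/data/warehouse/orders/_delta_log"
for name in sorted(os.listdir(delta_log)):
    print(name)

# Roughly what you would expect to see:
#   00000000000000000000.json               <- first commit (Delta log)
#   00000000000000000001.json
#   ...
#   00000000000000000010.checkpoint.parquet <- checkpoint summarizing commits 0-10
#   _last_checkpoint                         <- pointer to the latest checkpoint
```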

Partition Evolution

Partitions allow for more efficient queries that don’t scan the full depth of a table every time. Partitions are an important concept when you are organizing the data to be queried effectively. Often, the partitioning scheme of a table will need to change over time. With Hive, changing partitioning schemes is a very heavy operation.  If data was partitioned by year and we wanted to change it to be partitioned by month, it would require a rewrite of the entire table. More efficient partitioning is needed for managing data at scale. 

Partition evolution allows us to update the partition scheme of a table without having to rewrite all the previous data.

Apache Iceberg

Apache Iceberg is currently the only table format with partition evolution support. Partitions are tracked based on the partition column and the transform on the column (like transforming a timestamp into a day or year). 

This is different from typical approaches, which rely on the values of a particular column and often require making new columns just for partitioning. With the traditional way, pre-Iceberg, data consumers would need to know to filter by the partition column to get the benefits of the partition (a query that includes a filter on a timestamp column but not on the partition column derived from that timestamp would result in a full table scan). This is a huge barrier to enabling broad usage of any underlying system. Iceberg enables great functionality for getting maximum value from partitions and delivering performance even for non-expert users. 

Since Iceberg partitions track a transform on a particular column, that transform can evolve as the need arises. Query filtering based on the transformed column will benefit from the partitioning regardless of which transform is used on any portion of the data.

For example, a timestamp column can be partitioned by year then easily switched to month going forward with an ALTER TABLE statement.
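A minimal sketch of that statement with Spark SQL and the Iceberg SQL extensions (the catalog, table, and column names are hypothetical):

```python
# Switch the partition spec from yearly to monthly going forward; existing data
# files keep their year-based layout and are not rewritten.
spark.sql("""
    ALTER TABLE demo.db.events
    REPLACE PARTITION FIELD years(event_ts) WITH months(event_ts)
""")
```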

When the data is filtered by the timestamp column, the query is able to leverage the partitioning of both portions of the data (i.e., the portion partitioned by year and the portion partitioned by month). As shown above, these operations are handled via SQL. Partition evolution gives Iceberg two major benefits over other table formats:

  • A rewrite of the table is not required to change how data is partitioned 
  • A query can be optimized by all partition schemes (data partitioned by different schemes will be planned separately to maximize performance)

Note: Not having to create additional partition columns that require explicit filtering to benefit from is a special Iceberg feature called Hidden Partitioning.

Apache Hudi

Hudi does not support partition evolution or hidden partitioning. A similar result to hidden partitioning can be achieved with the data skipping feature (currently only supported for tables in read-optimized mode).

Delta Lake

Delta Lake does not support partition evolution. It can achieve something similar to hidden partitioning with its generated columns feature, which is currently in public preview for Databricks Delta Lake and still awaiting full support in OSS Delta Lake.
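For reference, a hedged sketch of a generated column using the Databricks Delta Lake SQL syntax (the table and column names are hypothetical, and as noted above OSS Delta Lake support was still pending at the time of writing):

```python
spark.sql("""
    CREATE TABLE events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        -- Derived column used only for partitioning.
        event_date DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")
# Queries filtering on event_ts can then prune partitions without the reader
# having to filter on event_date explicitly.
```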

Schema Evolution

As data evolves over time, so does a table’s schema: columns may need to be renamed, types changed, columns added, and so forth. All three table formats support different levels of schema evolution. The chart below details the types of updates you can make to your table’s schema.

The ability to evolve a table’s schema is a key feature. If you can’t make necessary evolutions, your only option is to rewrite the table, which can be an expensive and time-consuming operation.

Support for Schema Evolution: Iceberg | Hudi | Delta Lake

Feature | Iceberg | Hudi | Delta Lake
Add a new column | Yes | Yes | Yes
Drop a column | Yes | Yes* | Yes**
Rename a column | Yes | Yes* | Yes
Update a column | Yes | Yes | Yes
Reorder columns | Yes | Yes* | Yes

* Only supported in Spark. ** Coming in Delta Lake 2.0.
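As an example of what these operations look like in practice, here is a minimal sketch using Spark SQL against an Iceberg table (the table and column names are hypothetical; Hudi and Delta Lake have equivalent statements, subject to the caveats in the chart above):

```python
spark.sql("ALTER TABLE demo.db.orders ADD COLUMN discount DOUBLE")
spark.sql("ALTER TABLE demo.db.orders RENAME COLUMN status TO order_status")
spark.sql("ALTER TABLE demo.db.orders ALTER COLUMN order_id TYPE BIGINT")  # widen int -> bigint
spark.sql("ALTER TABLE demo.db.orders DROP COLUMN legacy_flag")
```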

Time-Travel

Time travel allows us to query a table at its previous states. A common use case is to test updated machine learning algorithms on the same data used in previous model tests. Comparing models against the same data is required to properly understand the changes to a model. 

In general, all formats enable time travel through “snapshots.” Each snapshot contains the files associated with it. Periodically, you’ll want to clean up older, unneeded snapshots to prevent unnecessary storage costs. Each table format has different tools for maintaining snapshots, and once a snapshot is removed you can no longer time-travel to that snapshot.

Apache Iceberg

Every time an update is made to an Iceberg table, a snapshot is created. You can specify a snapshot-id or timestamp and query the data as it was with Apache Iceberg. To maintain Apache Iceberg tables you’ll want to periodically expire snapshots using the expireSnapshots procedure to reduce the number of files stored (for instance, you may want to expire all snapshots older than the current year.). Once a snapshot is expired you can’t time-travel back to it.
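A hedged sketch of both steps in PySpark (the catalog, table name, snapshot ID, and cutoff are hypothetical):

```python
# Read the table as of a specific snapshot; "as-of-timestamp" (milliseconds since
# epoch) can be used instead of "snapshot-id".
old_orders = (
    spark.read.format("iceberg")
    .option("snapshot-id", 8109624017635739113)
    .load("demo.db.orders")
)

# Expire snapshots older than a cutoff using the Spark procedure that wraps
# expireSnapshots; expired snapshots can no longer be queried.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2022-01-01 00:00:00'
    )
""")
```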

Apache Hudi

The Hudi table format revolves around a table timeline, enabling you to query previous points along the timeline. To maintain Hudi tables use the Hoodie Cleaner application. Once you have cleaned up commits you will no longer be able to time travel to them.
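A rough sketch of a Hudi time travel read in PySpark (the path and instant are hypothetical; the "as.of.instant" option assumes Hudi 0.9 or later):

```python
orders_as_of = (
    spark.read.format("hudi")
    .option("as.of.instant", "2022-03-28 00:00:00")
    .load("s3://my-bucket/warehouse/orders")
)

# Cleaning, by contrast, is typically run as a separate utility, e.g. the
# org.apache.hudi.utilities.HoodieCleaner class submitted via spark-submit.
```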

Delta Lake

Each Delta file represents the changes of the table from the previous Delta file, so you can target a particular Delta file or checkpoint to query earlier states of the table. By default, Delta Lake maintains the last 30 days of history in the table’s adjustable data retention settings. Use the vacuum utility to clean up data files from expired snapshots. With Delta Lake, you can’t time travel to points whose log files have been deleted without a checkpoint to reference.

Depending on which logs are cleaned up, you may disable time travel to a bundle of snapshots. For example, say you have logs 1-30, with a checkpoint created at log 15. Vacuuming log 1 will disable time travel to logs 1-14, since there is no earlier checkpoint to rebuild the table from.
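A hedged sketch of Delta Lake time travel and cleanup in PySpark (the table path, version number, and retention window are hypothetical):

```python
# Read an earlier version of the table; "timestampAsOf" can be used instead.
old_orders = (
    spark.read.format("delta")
    .option("versionAsOf", 12)
    .load("/data/warehouse/orders")
)

# Remove data files that fall outside the retained history; once vacuumed, those
# versions can no longer be time traveled to.
spark.sql("VACUUM delta.`/data/warehouse/orders` RETAIN 168 HOURS")
```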

Governance and Community Contributions

Having an open source license and a strong open source community enables table format projects to evolve, improve at greater speeds, and continue to be maintained for the long term. The chart below compares the open source community support for the three formats as of 3/28/22.

Feature | Iceberg | Hudi | Delta Lake
Apache Governed Project | Yes | Yes | No
Open Source License | Apache | Apache | Apache
First Open Source Commit | 12/13/2017 | 12/16/2016 | 4/12/2019

Many projects are created out of a need at a particular company. Apache Iceberg came out of Netflix, Hudi came out of Uber, and Delta Lake came out of Databricks. 

There are many different types of open source licensing, including the popular Apache license. Before becoming an Apache Project, a project must meet several reporting, governance, technical, branding, and community standards. Apache governance gives assurances that there is a fair governing body behind a project and that it isn’t being steered by the commercial influences of any particular company.

Comparing Recent Iceberg, Hudi, and Delta Lake Contributions in GitHub

In the chart above we see a summary of current GitHub stats over a 30-day time period, which gives a snapshot of the current contribution activity for each project. When choosing an open-source project to build your data architecture around, you want strong contribution momentum to ensure the project’s long-term support. Let’s look at several other metrics relating to the activity in each project’s GitHub repository and discuss why they matter.

Overall Comparison of the Three Repositories

Repository Pull-requests:

Pull-requests are actual code from contributors being offered to add a feature or fix a bug. This is probably the strongest signal of community engagement as developers contribute their code to the project. Here we look at merged pull requests instead of closed pull requests as these represent code that has actually been added to the main code base (closed pull requests aren’t necessarily code added to the code base).

Current Number of Pull Requests
Historical Trend of Repository Pull Requests

GitHub Stars:

Stars are one way to show support for a project. However, while they can demonstrate interest, they don’t signify a track record of community contributions to the project like pull requests do. For that reason, community contributions are a more important metric than stars when you’re assessing the longevity of an open-source project as the basis for your data architecture.

Current Number of Github Stars
Historical Trend of Github Stars

Repository Commits

Commits are changes to the repository. An actively growing project should have frequent and voluminous commits in its history to show continued development.

Current Repository Commits
Historical Trend of Repository Commits

Repository Contributors

A diverse community of developers from different companies is a sign that a project will not be dominated by the interests of any particular company. Below are some charts showing the proportion of contributions each table format has from contributors at different companies. The info is based on data pulled from the GitHub API. Extra efforts were made to identify the company of any contributors who made 10 or more contributions but didn’t have their company listed on their GitHub profile.

[Note: This info is based on contributions to each project’s core repository on GitHub, measuring contributions which are issues/pull requests and commits in the GitHub repository. Activity or code merges that occur in other upstream or private repositories are not factored in since there is no visibility into that activity. Delta Lake boasts that 6,400 developers have contributed to Delta Lake, but this article only reflects what is independently verifiable through open-source repository activity.]
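For transparency, the contributor counts come from the kind of query sketched below against the GitHub REST API (the aggregation here is illustrative rather than the exact methodology used for the charts, and a real run should use an authenticated token to avoid rate limits):

```python
import requests

REPOS = ["apache/iceberg", "apache/hudi", "delta-io/delta"]

for repo in REPOS:
    contributors = []
    page = 1
    while True:
        # Paginate through the contributors endpoint, 100 results at a time.
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/contributors",
            params={"per_page": 100, "page": page, "anon": "true"},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        contributors.extend(batch)
        page += 1
    print(f"{repo}: {len(contributors)} contributors")
```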

Release Frequency

Greater release frequency is a sign of active development. I recommend this article from AWS’s Gary Stafford for charts regarding release frequency. Read the full article for many other interesting observations and visualizations.

When you’re looking at an open source project, two things matter quite a bit:

  • Number of community contributions
  • Whether the project is community governed

Community contributions matter because they can signal whether the project will be sustainable for the long haul. Community governed matters because when one particular party has too much control of the governance it can result in unintentional prioritization of issues and pull requests towards that party’s particular interests.

The Risk of Smaller Communities

When you are architecting your data lake for the long term it’s imperative to choose a table format that is open and community governed. For example, when it came to file formats, Apache Parquet became the industry standard because it was open, Apache governed, and community driven, allowing adopters to benefit from those attributes. 

Apache top-level projects require community maintenance and are quite democratized in their evolution. Of the three table formats, Delta Lake is the only non-Apache project. Looking at Delta Lake, we can observe things like:

  • It is Databricks employees who respond to the vast majority of issues. Generally, community-run projects should have several members of the community, across several organizations, responding to issues. When one company is responsible for the majority of a project’s activity, the project can be at risk if anything happens to the company. For example, see these three recent issues (one, two and three).
  • Four of the five most recent pull requests are from Databricks employees (the most recent being PR #1010 at the time of writing)
  • The majority of the issues that make it to the roadmap are issues initiated by Databricks employees

[Note: At the 2022 Data+AI summit Databricks announced they will be open-sourcing all formerly proprietary parts of Delta Lake.]

Eventually, one of these table formats will become the industry standard. If history is any indicator, the winner will have a robust feature set, community governance model, active community, and an open source license. Looking at the activity in Delta Lake’s development, it’s hard to argue that it is community driven. And when one company controls the project’s fate, it’s hard to argue that it is an open standard, regardless of the visibility of the codebase.

Tool Compatibility

A table format wouldn’t be useful if the tools data professionals used didn’t work with it. In this section, we’ll discuss some of the more popular tools for analyzing and engineering data on your data lake and their support for different table formats. It’s important not only to be able to read data, but also to be able to write data so that data engineers and consumers can use their preferred tools.

The Tale of Two Sparks

One important distinction to note is that there are two versions of Spark. There is the open source Apache Spark, which has a robust community and is used widely in the industry. Then there is Databricks Spark, the Databricks-maintained fork optimized for the Databricks platform. This distinction also exists with Delta Lake: there is an open source version and a version that is tailored to the Databricks platform, and the features between them aren’t always identical (for example, SHOW CREATE TABLE is supported with Databricks proprietary Spark/Delta but not with open source Spark/Delta at the time of writing).

In the chart below, we consider write support available if multiple clusters using a particular engine can safely read and write to the table format.

Tool | Iceberg | Hudi | Delta Lake
Apache Flink | Read & Write | Read & Write | Read & Write
Apache Spark | Read & Write | Read & Write | Read & Write
Databricks Spark | Read & Write | Read & Write | Read & Write
Dremio Sonar | Read & Write | No support | Read
Presto | Read & Write | Read | Read
Trino | Read & Write | Read | Read & Write
Athena | Read & Write | Read | Read
Snowflake External Tables | Read | No support | Read
Apache Impala | Read & Write | Read & Write | No support
Apache Beam | No support | No support | Read
Apache Drill | Read | No support | No support
Redshift Spectrum | No support | Read | Read
BigQuery External Tables | No support | Read | No support
Debezium | Write | Write | Write
Kafka Connect | No support | Write | Write

File Format Support

Typically, Parquet’s binary columnar file format is the prime choice for storing data for analytics. However, there are situations where you may want your table format to use other file formats like AVRO or ORC. Below is a chart that shows which file formats each table format allows for the data files that make up a table.

File Format | Iceberg | Hudi | Delta Lake
Parquet | Yes | Yes | Yes
AVRO | Yes | No | No
ORC | Yes | Yes | No

Questions to Ask Before Adopting a Table Format

Likely one of these three next-generation formats will displace Hive as an industry standard for representing tables on the data lake. When you choose which format to adopt for the long haul, make sure to ask yourself questions like:

  • Which format has the most robust version of the features I need?
  • Which format enables me to take advantage of most of its features using SQL so it’s accessible to my data consumers?
  • Which format has the momentum with engine support and community support?
  • Which format will give me access to the most robust version-control tools?

These questions should help you future-proof your data lake and inject it with the cutting-edge features newer table formats provide.

Check out these follow-up comparison posts: