Why Should I Care About Table Formats Like Apache Iceberg?

Apache Iceberg has come up in many recent discussions about turning data lakes into data lakehouses and reducing the data warehouse footprint. Much of this is a response to the issues with modern data warehouses, including:

To better understand how Apache Iceberg can be the solution, let’s first examine the problem.

Understanding the Status Quo

In traditional architecture, you follow a particular pattern:

When following this pattern you run into many of the problems with data warehouses mentioned earlier:

You can address these challenges by running all your data warehouse workloads directly on your data lake to minimize the data warehouse footprint. 

For this to happen, you need an abstraction layer that lets tools treat the sea of files stored in the data lake as traditional tables. Data warehouses do the same thing, but there it is hidden beneath proprietary layers.

To solve this problem, the original table format for data lake 1.0 was created in the Apache Hive project, but it came with several challenges:

Because of these challenges, the data lake could not fulfill the goal of replacing the data warehouse.

A better solution was needed to enable these workloads to be run on the data lake.

Two Birds, One Stone

In the status quo, you get different benefits from your data lake and data warehouse. The data lakehouse provides the benefits of both worlds:

Data lakehouse architecture mitigates the cons of both data lakes and data warehouses and combines the pros:

So you want to implement a data lakehouse, but the Hive table format doesn’t allow you to achieve the promise of a data lakehouse. This is where Apache Iceberg becomes the key piece in your data architecture.

Apache Iceberg is a table format that addresses the challenges with Hive tables. It provides capabilities like ACID transactions, time travel, and table evolution, which bring warehouse-like features to the data lakehouse. Once a data lakehouse is both possible and practical, you can begin to eliminate many of the problems with the traditional approach.
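To make the idea of snapshot-based commits and time travel concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's actual metadata layout or API: the ToyTable and Snapshot classes and the file names are hypothetical and exist only for illustration. The assumption it illustrates is that a table is a list of immutable snapshots, each naming the data files that make up the table at that point.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of the data files that form the table at one point in time."""
    snapshot_id: int
    data_files: tuple


@dataclass
class ToyTable:
    """Toy stand-in for table-format metadata (hypothetical, not Iceberg's real layout)."""
    snapshots: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        """Build a new snapshot from the current one and publish it in a single step."""
        current = self.snapshots[-1].data_files if self.snapshots else ()
        files = tuple(f for f in current if f not in removed) + tuple(added)
        snap = Snapshot(len(self.snapshots) + 1, files)
        # One append publishes the whole change, so a reader sees all of it or none of it.
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id=None):
        """List the files to read for the latest snapshot, or for an older one (time travel)."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            return list(self.snapshots[-1].data_files)
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return list(snap.data_files)
        raise KeyError(snapshot_id)


table = ToyTable()
table.commit(added=["orders-0001.parquet"])
table.commit(added=["orders-0002.parquet"])
# Rewrite: drop the first file and add a new one in the same commit.
table.commit(added=["orders-0003.parquet"], removed=["orders-0001.parquet"])

latest = table.scan()                    # files in the current table state
as_of_first = table.scan(snapshot_id=1)  # files as of the first commit
```

Because old snapshots are never modified, reading the table "as of" an earlier commit only requires choosing a different snapshot. A real table format also tracks schemas, partitioning, and file statistics, which this sketch leaves out.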

The Tip of the Iceberg

Apache Iceberg isn’t the only table format competing to be the cornerstone of the modern data lakehouse, but it offers several distinctive advantages:

Project Nessie is an open-source project that provides catalog versioning. It brings Git-like capabilities to Apache Iceberg tables, enabling patterns like isolating ETL work on a branch and multi-table transactions (committing changes to multiple tables simultaneously).
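The Git-like pattern can be sketched in a few lines of plain Python. This is a toy model and not Project Nessie's API: the ToyCatalog class, its method names, and the table names are all hypothetical. The assumption is that a catalog branch is simply a mapping from table name to the snapshot that table currently points at.

```python
class ToyCatalog:
    """Toy versioned catalog: each branch maps table names to a snapshot id.

    Hypothetical illustration only; this is not Project Nessie's API.
    """

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's table pointers.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, updates):
        # Apply pointer updates for one or more tables in a single step.
        new_state = dict(self.branches[branch])
        new_state.update(updates)
        self.branches[branch] = new_state

    def merge(self, source, target="main"):
        # Publish every table pointer from the source branch together.
        # A real system would detect conflicts; this sketch simply overwrites.
        self.commit(target, self.branches[source])


catalog = ToyCatalog()
catalog.commit("main", {"orders": 1, "customers": 1})

# Isolate ETL work on its own branch; readers of "main" are unaffected.
catalog.create_branch("etl")
catalog.commit("etl", {"orders": 2, "customers": 2})
main_during_etl = dict(catalog.branches["main"])

# Merging moves both tables forward at once (a multi-table transaction).
catalog.merge("etl")
main_after_merge = dict(catalog.branches["main"])
```

In this sketch, queries against "main" keep seeing the old snapshots of both tables until the merge, at which point both pointers move forward together.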

Conclusion

Reducing your data warehouse footprint with an Apache Iceberg-based data lakehouse will open up your data to best-of-breed tools, reduce redundant storage and compute costs, and enable cutting-edge features like partition evolution and catalog branching to enhance your data architecture.

In the past, the Hive table format did not go far enough to make this a reality, but today Apache Iceberg offers robust features and performance for querying and manipulating your data on the lake. 

Now is the time to turn your data lake into a data lakehouse and start seeing the time to insight shrink along with your data warehouse costs.
