What is Apache Iceberg?
Background on Data Within Data Lake Storage
Data lakes are large repositories that store all structured and
unstructured data at any scale. They simplify data management
by centralizing data and enabling all applications across an
organization to interact with a shared data repository for processing,
analytics and reporting, a significant improvement over traditional
architectures that rely on numerous isolated and siloed systems.
Traditionally, data lakes were associated with the Apache Hadoop
Distributed File System (HDFS). Today, however, organizations increasingly
utilize object storage systems such as Amazon S3 or Microsoft Azure Data
Lake Storage (ADLS). These cloud data lakes provide organizations with
additional opportunities to simplify data management by being accessible
everywhere to all applications as needed.
Individual datasets within data lakes are often organized as collections of
files within directory structures, with multiple files in one directory
representing a single table. The benefit of this approach is that data is
highly accessible and flexible. However, several concepts provided by
traditional databases and data warehouses are not addressed by directories
of files alone and require additional tooling to define. These include:
- The schema of a dataset, including its columns and data types
- Which files comprise the dataset and how they are organized (e.g., into partitions)
- How different applications coordinate changes to the dataset, including both changes to the dataset's definition and changes to its data
To address these, organizations utilize several industry-standard systems
to further organize data within their data lake storage.
How Metadata Catalogs Help Organize Data Lakes
To better organize data within data lakes, organizations use metadata
catalogs, which define the tables within data lake storage. By using
catalogs, all applications across an organization share a common definition
and view of data within the data lake, which helps ensure consistent
processing and results.
Catalogs are used to define:
- What datasets exist in the data lake
- Where different datasets are located within the data lake
- How datasets are structured in terms of columns, names, data types, etc.
Hive Metastore (HMS) and AWS Glue Data Catalog are the most popular data
lake catalogs and are broadly used throughout the industry. Both contain
the schema, table structure and data location for datasets within data lake
storage. In doing so, catalogs provide a structure similar to that of
relational databases, in that they are deployed on top of file storage and
shared across multiple applications. This ensures consistent results
between different applications and simplifies data management.
Why Catalogs Are Not Enough
Although catalogs provide a shared definition of the dataset structure
within data lake storage, they do not coordinate data changes or schema
evolution between applications in a transactionally consistent manner.
Consider a large dataset comprising hundreds of thousands of files.
Catalogs such as Hive and AWS Glue contain the structure of the dataset,
including column names and data types, as well as partitions organized in
directories. However, they do not define which data files are present and
part of the dataset. As a result, applications must rely on reading file
metadata in data lake storage to identify which files are part of a dataset
at any given time.
As long as the dataset is static and does not change, different
applications can operate on a consistent view of the dataset. However,
challenges arise when one application writes to and modifies the dataset
and those changes must be coordinated with another application that reads
from the same dataset. For example, if an ETL process updates the dataset
by adding and removing several files from storage, another application that
reads the dataset may process a partial or inconsistent view of the dataset
and generate incorrect results. This occurs when some, but not all, of the
required file additions and removals have completed at the time of the
read.
Without automatic coordination of data changes between applications in the
data lake, organizations need to build complicated pipelines or staging
areas, which can be brittle and difficult to manage manually.
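To make this failure mode concrete, here is a minimal sketch of a reader
that depends on directory listings. The storage path is hypothetical, and
the behavior assumes plain Parquet files with no table format on top.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("directory-scan").getOrCreate()

# The reader discovers files by listing the directory at query time
# (the path is hypothetical). If an ETL job is adding and deleting files
# under this prefix at the same moment, the listing can capture a mix of
# old and new files -- a partial, inconsistent view of the table.
df = spark.read.parquet("s3://data-lake/warehouse/sales/orders/")
df.groupBy("region").count().show()
```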
The Apache Iceberg Open Table Format
Apache Iceberg
is a new table format that solves these challenges and is rapidly becoming
an industry standard for managing data in data lakes. Iceberg introduces
new capabilities that enable multiple applications to work together on the
same data in a transactionally consistent manner and defines additional
information on the state of datasets as they evolve and change over time.
The Iceberg table format offers capabilities and functionality similar to
SQL tables in traditional databases, but in a fully open and accessible
manner such that multiple engines (Dremio, Spark, etc.) can operate on the
same dataset. Iceberg provides many features, illustrated in the sketch
after this list, such as:
- Transactional consistency between multiple applications, where files can be added, removed or modified atomically, with full read isolation and multiple concurrent writes
- Full schema evolution to track changes to a table over time
- Time travel to query historical data and verify changes between updates
- Partition layout and evolution, enabling updates to partition schemes as queries and data volumes change without relying on hidden partitions or physical directories
- Rollback to prior versions to quickly correct issues and return tables to a known good state
- Advanced planning and filtering capabilities for high performance on large data volumes
Iceberg achieves these capabilities for a table via metadata files
(manifests) tracked through point-in-time snapshots, maintaining all deltas
as a table is updated over time. Each snapshot provides a complete
description of the table’s schema, partition and file information and
offers full isolation and consistency. Additionally, Iceberg intelligently
organizes snapshot metadata in a hierarchical structure. This enables fast
and efficient changes to tables without redefining all dataset files,
ensuring optimal performance when working at data lake scale.
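This snapshot and manifest metadata can be inspected directly through
Iceberg's metadata tables. Continuing the session and the hypothetical
demo.sales.orders table from the sketch above:

```python
# Each row of the snapshots metadata table is a point-in-time snapshot:
# its ID, parent, commit timestamp, and the operation that produced it.
spark.sql("SELECT committed_at, snapshot_id, parent_id, operation "
          "FROM demo.sales.orders.snapshots").show(truncate=False)

# The manifests table exposes the hierarchical metadata layer that lets
# Iceberg plan scans without listing directories in object storage.
spark.sql("SELECT path, added_data_files_count, existing_data_files_count "
          "FROM demo.sales.orders.manifests").show(truncate=False)
```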
With Iceberg, the full history is maintained within the Iceberg table
format itself, without storage system dependencies. This enables an open
architecture and the flexibility to change systems over time without
disruption to users or existing workloads. Since historical states are
immutable and their lineage is clear, users can query prior states at any
Iceberg snapshot, or at any historical point in time, for consistent
results, comparison or rollback to correct issues.
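For example, time travel and rollback might look like the following,
continuing the same session. This is again a sketch: the snapshot IDs and
timestamp are placeholders, and the VERSION AS OF / TIMESTAMP AS OF syntax
assumes Spark 3.3 or later.

```python
# Time travel: query the table as it existed at an earlier snapshot or time.
# The snapshot ID and timestamp below are placeholders.
spark.sql("SELECT count(*) FROM demo.sales.orders "
          "VERSION AS OF 4348291547921839899").show()
spark.sql("SELECT count(*) FROM demo.sales.orders "
          "TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

# Rollback: restore the table to a known good snapshot via a stored
# procedure (requires the Iceberg SQL extensions configured earlier).
spark.sql("CALL demo.system.rollback_to_snapshot('sales.orders', "
          "4348291547921839899)")
```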
As an Apache project, Iceberg is 100% open source and not dependent on any
individual tools or data lake engines. It was created by Netflix and Apple,
and is deployed in production by the largest technology companies and
proven at scale on the world’s largest workloads and environments. Iceberg
supports common industry-standard file formats, including Parquet, ORC and
Avro, and is supported by major data lake engines including Dremio, Spark,
Hive and Presto.
How Iceberg Differs from Traditional Catalogs and Databases
There are significant differences when comparing Iceberg not just to data
lake catalogs such as the Hive Metastore, but to relational databases and
enterprise data warehouses as well. Unlike the Hive Metastore, where
changes are made through Hive, with Iceberg all applications are equal
participants and multiple tools can update tables directly and
concurrently. Additionally, Iceberg describes the complete history of
tables, including schema and data changes. The Hive Metastore describes
only a dataset’s current schema, with no historical information and no time
travel across data changes.
Relational databases and enterprise data warehouses offer similar
capabilities such as atomic transactions and time travel. However, they do
so through a closed, vertically integrated and proprietary system where all
access must go through and be processed by the database. Routing all access
through a single system simplifies concurrency management and updates but
also limits flexibility and increases cost. Iceberg, on the other hand,
enables all applications to directly operate on tables within data lake
storage. Doing so not only lowers cost by taking advantage of data lake
architectures but also significantly increases flexibility and agility
since all applications can work on datasets in place without migrating data
between multiple separate and closed systems.
Benefits of Using Iceberg
By defining an efficient open table format for data lake tables that is
transactionally consistent with point-in-time snapshot isolation, Iceberg
enables numerous benefits for organizations, including:
- Multiple independent applications can process the same dataset in place, simultaneously and with consistent results
- Updates to very large, data lake-scale tables are efficiently processed and communicated between systems
- ETL pipelines are significantly simplified by operating on data in place in the data lake instead of moving data between multiple independent systems, resulting in pipelines that are more reliable and more resilient to change
- Improved data management as datasets change and evolve over time
- Increased data reliability, with faster identification and resolution of issues when they arise
By using Iceberg, organizations can realize the full potential and benefits
of migrating to a data lake architecture. Capabilities and features
available with traditional databases can be provided to users, but within
an open and flexible data lake environment. Data engineers can simplify
data pipelines and realize cost savings while providing end users with
increased access to data and better performance.
Alternatives to Iceberg
Similar to how there are multiple file formats such as Parquet, ORC, Avro
and JSON, there are alternatives to Iceberg that offer somewhat similar
capabilities and benefits. The most popular are the Delta Lake project
developed by Databricks and Hive ACID tables.
| | Hive ACID Tables | Delta Lake | Apache Iceberg |
| --- | --- | --- | --- |
| Who Is Driving? | Hive | Databricks | Netflix, Apple, and other community members |
| File Formats | ORC | Parquet | Parquet, ORC, Avro |
| Transactional | Yes | Yes | Yes |
| Engine-Specific | Yes (only Hive can update) | Yes (only Spark can update) | No (any engine can update) |
| Fully Open Source | Yes | No (performance and cloud support are not OSS) | Yes |
From a technical perspective, Delta Lake offers some of the same
functionality and capabilities as Iceberg, but there are significant
differences. Unlike Iceberg, which is an independently governed Apache
project contributed to by numerous companies throughout the industry, Delta
Lake is sponsored by and controlled solely by Databricks. Additionally,
Delta Lake is not fully open source: write operations are only available
through Spark, and performance optimizations that accelerate reads have not
been open sourced.
Furthermore, Iceberg now has a very significant contributor community,
including Netflix, Apple, Airbnb, Stripe, Expedia and Dremio.
Unlike Delta Lake, Hive ACID tables are fully open source. Like Delta Lake,
however, Hive update operations can only be performed through a single data
lake engine and the vast majority of development is performed by a single
vendor (Cloudera). Hive ACID tables depend on the ORC file format and are
less flexible in terms of file formats supported.
Although Delta Lake and Hive ACID tables have been around longer than
Iceberg, Iceberg is quickly gaining adoption and additional features as
more companies contribute to the format.
Advantages of Iceberg Over Other Formats
Although there are multiple table format options available, Iceberg shares
key attributes with previous open source projects that became de facto
industry standards such as Apache Parquet. In particular, Iceberg is
entirely independent from a governance standpoint and is not locked or tied
to any specific engine or tool. As a result, Iceberg can be developed and
contributed to by many different organizations across multiple industries,
increasing adoption and growth. Additionally, Iceberg has several
advantages, including:
- All applications have equal access, and processing is not dependent on or tied to any specific engine, providing organizations with the flexibility to customize their data lake as needed
- Performance optimizations based on best practices, which enable fast and cost-efficient access to data
- Fully storage-system agnostic with no file system dependencies, offering flexibility when choosing and migrating storage systems as required
- Multiple successful production deployments with tens of petabytes and millions of partitions
- 100% open source and independently governed
As a result of these advantages, Iceberg is rapidly gaining adoption and
helping organizations simplify management and take full advantage of their
data lake.