Background on Data Within Data Lake Storage
Data lakes are large repositories that store all structured and unstructured data at any scale. They are used to simplify data management by centralizing data and enabling all applications throughout an organization to interact on a shared data repository for all processing, analytics and reporting, significantly improving upon traditional architectures that rely on numerous isolated and siloed systems.
Traditionally, data lakes were associated with the Apache Hadoop Distributed File System (HDFS). Today, however, organizations increasingly utilize object storage systems such as Amazon S3 or Microsoft Azure Data Lake Storage (ADLS). These cloud data lakes provide organizations with additional opportunities to simplify data management by being accessible everywhere to all applications as needed.
Individual datasets within data lakes are often organized as collections of files within directory structures, often with multiple files in one directory representing a single table. The benefits of this approach are that data is highly accessible and flexible. However, several concepts provided by traditional databases and data warehouses are not addressed solely by directories of files and require additional tooling to define. This includes:
- What is the schema of a dataset, including columns and data types
- Which files comprise the dataset and how are they organized (e.g., partitions)
- How different applications coordinate changes to the dataset, including both changes to the definition of the dataset and changes to data
To address these, organizations utilize several industry-standard systems to further organize data within their data lake storage.
How Metadata Catalogs Help Organize Data Lakes
To better organize data within data lakes, organizations use metadata catalogs, which define the tables within data lake storage. By using catalogs, all applications across an organization share a common definition and view of data within the data lake, which is helpful for processing and producing consistent results.
Catalogs are used to define:
- What datasets exist in the data lake
- Where different datasets are located within the data lake
- How datasets are structured in terms of columns, names, data types, etc.
Hive Metastore (HMS) and AWS Glue Data Catalog are the most popular data lake catalogs and are broadly used throughout the industry. Both Hive and AWS Glue contain the schema, table structure and data location for datasets within data lake storage. In doing so, catalogs provide a similar structure as relational databases in that they are deployed on top of file storage and shared across multiple applications. This ensures consistent results between different applications and simplifies data management.
Why Catalogs Are Not Enough
Although catalogs provide a shared definition of the dataset structure within data lake storage, they do not coordinate data changes or schema evolution between applications in a transactionally consistent manner.
Consider a large dataset comprised of hundreds of thousands of files. Catalogs such as Hive and AWS Glue contain the structure of the dataset, including column names and data types, as well as partitions organized in directories. However, they do not define which data files are present and part of the dataset. As a result, applications must rely on reading file metadata in data lake storage to identify which files are part of a dataset at any given time.
As long as the dataset is static and does not change, different applications can operate on a consistent view of the dataset. However, challenges are created when one application writes to and modifies the dataset and those changes need to be coordinated with another application that reads from the same dataset. For example, if an ETL process updates the dataset by adding and removing several files from storage, another application that reads the dataset may process a partial or inconsistent view of the dataset and generate incorrect results. This occurs when some files have been added or removed from storage, but not all required changes were completed.
Without automatic coordination of data changes between applications in the data lake, organizations need to create complicated pipelines or staging areas which can be brittle and difficult to manage manually.
The Apache Iceberg Open Table Format
Apache Iceberg is a new table format that solves these challenges and is rapidly becoming an industry standard for managing data in data lakes. Iceberg introduces new capabilities that enable multiple applications to work together on the same data in a transactionally consistent manner and defines additional information on the state of datasets as they evolve and change over time.
The Iceberg table format has similar capabilities and functionality as SQL tables in traditional databases but in a fully open and accessible manner such that multiple engines (Dremio, Spark, etc.) can operate on the same dataset. Iceberg provides many features such as:
- Transactional consistency between multiple applications where files can be added, removed or modified atomically, with full read isolation and multiple concurrent writes
- Full schema evolution to track changes to a table over time
- Time travel to query historical data and verify changes between updates
- Partition layout and evolution enabling updates to partition schemes as queries and data volumes change without relying on hidden partitions or physical directories
- Rollback to prior versions to quickly correct issues and return tables to a known good state
- Advanced planning and filtering capabilities for high performance on large data volumes
Iceberg achieves these capabilities for a table via metadata files (aka manifests) tracked through point-in-time snapshots by maintaining all deltas as a table is updated over time. Each snapshot provides a complete description of the table’s schema, partition and file information and offers full isolation and consistency. Additionally, Iceberg intelligently organizes snapshot metadata in a hierarchical structure. This enables fast and efficient changes to tables without redefining all dataset files, thus ensuring optimal performance when working at data lake scale.
With Iceberg, the full history is maintained within the Iceberg table format and without storage system dependencies. This enables an open architecture and the flexibility to change systems over time without disruption to users or existing workloads. Since the historical state is immutable and history lineage is clear, users can query prior states at any Iceberg snapshot or any historical point in time for consistent results, comparison or rollback to correct issues.
As an Apache project, Iceberg is 100% open source and not dependent on any individual tools or data lake engines. It was created by Netflix and Apple, and is deployed in production by the largest technology companies and proven at scale on the world’s largest workloads and environments. Iceberg supports common industry-standard file formats, including Parquet, ORC and Avro, and is supported by major data lake engines including Dremio, Spark, Hive and Presto.
How Iceberg Differs from Traditional Catalogs and Databases
There are significant differences when comparing Iceberg not just to data lake catalogs such as the Hive Metastore, but to relational databases and enterprise data warehouses as well. Unlike the Hive Metastore where changes are made through Hive, with Iceberg all applications are equal participants and multiple tools can update tables directly and concurrently. Additionally, Iceberg describes the complete history of tables, including schema and data changes. The Hive Metastore only describes a dataset’s current schema without historical information or data changes with time travel.
Relational databases and enterprise data warehouses offer similar capabilities such as atomic transactions and time travel. However, they do so through a closed, vertically integrated and proprietary system where all access must go through and be processed by the database. Routing all access through a single system simplifies concurrency management and updates but also limits flexibility and increases cost. Iceberg, on the other hand, enables all applications to directly operate on tables within data lake storage. Doing so not only lowers cost by taking advantage of data lake architectures but also significantly increases flexibility and agility since all applications can work on datasets in place without migrating data between multiple separate and closed systems.
Benefits of Using Iceberg
By defining an efficient open table format for data lake tables that is transactionally consistent with point-in-time snapshot isolation, Iceberg enables numerous benefits for organizations, including:
- Multiple independent applications can process the same dataset in place simultaneously and with consistent results
- Updates to very large data lake-scale tables are efficiently processed and communicated between systems
- ETL pipelines are significantly simplified by operating on data in place in the data lake instead of moving data between multiple independent systems, resulting in pipelines that are more reliable and less resistant to change
- Improved data management as datasets change and evolve over time
- Increased data reliability and identification and resolution of issues when they arise
By using Iceberg, organizations can realize the full potential and benefits of migrating to a data lake architecture. Similar capabilities and features available with traditional databases can be provided to users but within an open and flexible data lake environment. Data engineers can simplify data pipelines and realize cost savings while providing increased access to data and performance to end users.
Alternatives to Iceberg
Similar to how there are multiple file formats such as Parquet, ORC, Avro and JSON, there are alternatives to Iceberg that offer somewhat similar capabilities and benefits. The most popular are the Delta Lake project developed by Databricks and Hive ACID tables.
|Hive ACID Tables||Delta Lake||Apache Iceberg|
|Who Is Driving?||Hive||Databricks||Netflix, Apple, and other community members|
|File Formats||ORC||Parquet||Parquet, ORC, Avro|
|Engine-Specific||Yes (only Hive can update)||Yes(only Spark can update)||No (Any)|
|Fully Open Source||Yes||No (performance and cloud support are not OSS)||Yes|
From a technical perspective, Delta Lake offers some common functionality and capabilities as Iceberg, but there are significant differences. Unlike Iceberg, which is an independently governed Apache project and contributed to by numerous companies throughout the industry, Delta Lake is sponsored by and controlled solely by Databricks. Additionally, Delta Lake is not fully open source; write operations are only available through Spark and performance optimizations to accelerate reads have not been open sourced. Furthermore, Iceberg now has a very significant contributor community, including Netflix, Apple, Airbnb, Stripe, Expedia and Dremio.
Unlike Delta Lake, Hive ACID tables are fully open source. Like Delta Lake, however, Hive update operations can only be performed through a single data lake engine and the vast majority of development is performed by a single vendor (Cloudera). Hive ACID tables depend on the ORC file format and are less flexible in terms of file formats supported.
Although Delta Lake and Hive ACID tables have been around longer than Iceberg, Iceberg is quickly gaining adoption and additional features as more companies contribute to the format.
Advantages of Iceberg Over Other Formats
Although there are multiple table format options available, Iceberg shares key attributes with previous open source projects that became de facto industry standards such as Apache Parquet. In particular, Iceberg is entirely independent from a governance standpoint and is not locked or tied to any specific engine or tool. As a result, Iceberg can be developed and contributed to by many different organizations across multiple industries, increasing adoption and growth. Additionally, Iceberg has several advantages, including:
- All applications have equal access and processing is not dependent on or tied to any specific engine, providing organizations with the flexibility to customize their data lake as needed
- Performance optimizations based on best practices which enable fast and cost-efficient access to data
- Fully storage-system agnostic with no file system dependencies, offering flexibility when choosing and migrating storage systems as required
- Multiple successful production deployments with tens of petabytes and millions of partitions
- 100% open source and independently governed
As a result of these advantages, Iceberg is rapidly gaining adoption and helping organizations simplify management and take full advantage of their data lake.