Note: You can find this and many other great Apache Iceberg instructional videos and articles at our Apache Iceberg 101 article.
Apache Iceberg is an open table format that enables robust, affordable, and quick analytics on the data lakehouse and is poised to change the data industry in ways we can only begin to imagine. Check out this webinar recording to learn about the features and architecture of Apache Iceberg. In this article, we get hands-on with Apache Iceberg to see many of its features and utilities available from Spark.
Apache Iceberg has a tiered metadata structure, which is key to how Iceberg provides high-speed queries for both reads and writes. The following summarizes that structure to show how this works. If you are already familiar with Iceberg’s architecture, feel free to skip ahead to “Getting Hands-On with Apache Iceberg.”
Data Layer
Starting from the bottom of Iceberg’s structure, the data layer holds the actual data in the table and is made up of two types of files:
Data files – Store the data in file formats such as Parquet or ORC.
Delete files – Track records that still exist in the data files but that should be considered deleted.
Metadata Layer
Apache Iceberg uses three tiers of metadata files which cover three different scopes:
Manifest files – Each manifest tracks a subset of the data files in a snapshot, along with per-file metadata used for further pruning.
Manifest lists – Each manifest list defines a snapshot of the table, listing all the manifest files that make up that snapshot along with metadata about those manifests for pruning.
Metadata files – Define the table and track manifest lists, current and previous snapshots, schemas, and partition schemes.
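Putting these tiers together, the files for a single Iceberg table laid out on storage look roughly like this (the table name matches the one created later in this walkthrough; the file names are illustrative, as real names include UUIDs and sequence numbers):
warehouse/db/my_table/
  data/
    00000-0-xxxx.parquet        <- data files (and delete files, for row-level deletes)
  metadata/
    v1.metadata.json            <- metadata file: schemas, partition specs, snapshot list
    snap-12345-1-xxxx.avro      <- manifest list: one per snapshot
    xxxx-m0.avro                <- manifest file: tracks a subset of the data files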
The Catalog
The catalog tracks a reference/pointer to the current metadata file. This is usually a store that can provide transactional guarantees, such as a relational database (Postgres, etc.) or a metastore (Hive, Project Nessie, Glue).
Getting Hands-On with Apache Iceberg
So let’s take Apache Iceberg for a test drive together. To keep things as simple as possible, we use this self-contained Docker container which has everything you need:
docker pull alexmerced/spark3-3-iceberg0-14
We then open up the container in interactive mode.
docker run --name iceberg-sandbox -it alexmerced/spark3-3-iceberg0-14
An Iceberg-enabled engine is the easiest way to use Iceberg. For this example, we use Spark 3. You can see how to use Apache Iceberg with different engines and platforms on the Apache Iceberg website. This Docker container comes with a simple command to open up SparkSQL configured for Iceberg:
iceberg-init
This command is a custom alias for the following command that opens the SparkSQL shell with Apache Iceberg ready to go:
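(The exact invocation baked into the image may differ; the package version, catalog name, and warehouse path below are assumptions chosen to match the rest of this walkthrough.)
# note: package version, catalog name, and warehouse path are assumptions for this walkthrough
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hadoop \
  --conf spark.sql.catalog.iceberg.warehouse=$PWD/warehouse
This configures a catalog named iceberg backed by a file-based Hadoop catalog whose tables live under the warehouse folder, which is why the tables below are referenced as iceberg.db.<table> and why a warehouse directory shows up on disk later.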
The --packages flag instructs Spark to use the Apache Iceberg package; make sure it’s the right package for the version of Spark and the version of Iceberg you plan on using. (You can also drop the appropriate Iceberg JAR into Spark’s jars folder instead.)
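The UPDATE below assumes the my_table table already exists and holds a record for Steve. If you are starting from a fresh shell, a minimal setup could look like this (the age value is an assumption for illustration):
-- schema matches the name/age columns referenced later in this walkthrough
CREATE TABLE iceberg.db.my_table (name string, age int) USING iceberg;
-- the age value is illustrative
INSERT INTO iceberg.db.my_table VALUES ('Steve', 28);
With a record in place, let’s update it.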
UPDATE iceberg.db.my_table SET name='Alex' WHERE name='Steve';
Let’s also update the table schema.
ALTER TABLE iceberg.db.my_table ADD COLUMNS (email string);
Now add another record.
INSERT INTO iceberg.db.my_table VALUES ('John', 56, '[email protected]');
Now let’s query our data.
SELECT * FROM iceberg.db.my_table;
We can now see all the records we created. This may all seem pretty easy, like you’re working with a normal run-of-the-mill transactional database, and that’s the beauty of it. With Apache Iceberg tables, we can not only query data but also insert, update, and delete records and modify schemas on large distributed datasets quickly and safely across engines and file formats.
If we wanted to create an Iceberg table from an existing source, we would just create a view from the source data and then use a CTAS (CREATE TABLE AS SELECT) to create an Iceberg table from that view. (Learn more about migrating data into Iceberg and try this Iceberg migration exercise.)
As an example, in the Docker container, there is a sample file in ~/sampledata/Woorker_Coop.csv. To create a SQL View from this data, we simply enter the following command:
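A sketch of what that can look like, assuming the CSV has a header row and letting Spark infer the schema (the view name and the choice to partition by Borough are assumptions; adjust the file path to wherever the sample data lives in your container):
-- expose the CSV as a temporary view; replace ~ with the actual home directory, since Spark will not expand it
CREATE OR REPLACE TEMPORARY VIEW worker_coop_source
USING csv
OPTIONS (path '~/sampledata/Woorker_Coop.csv', header 'true', inferSchema 'true');
-- CTAS: build the Iceberg table from the view; partitioning by Borough is an assumption
-- chosen to match the partitioned file layout we inspect later
CREATE TABLE iceberg.db.worker_coop
USING iceberg
PARTITIONED BY (Borough)
AS SELECT * FROM worker_coop_source;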
Now you know how to create tables from scratch and from sources like data files and other existing tables using a CTAS.
Now let’s run a few DML transactions on this new table before inspecting the table under the hood.
Let’s delete all entries from the Bronx.
DELETE FROM iceberg.db.worker_coop
WHERE Borough = 'BRONX';
Now update all Staten Island entries to Richmond County.
UPDATE iceberg.db.worker_coop
SET Borough = 'RICHMOND COUNTY'
WHERE Borough = 'STATEN IS';
Inspecting Our Table
We can view a lot of information about our table using the Iceberg SQL extensions. Here are a few examples.
We can inspect our table's history:
SELECT * FROM iceberg.db.worker_coop.history;
We can inspect our table's snapshots:
SELECT * FROM iceberg.db.worker_coop.snapshots;
We can inspect our table's files:
SELECT file_path FROM iceberg.db.worker_coop.files;
Notice in the output that the file paths are organized into folders by the partitioning scheme we chose.
We can inspect our table's manifests:
SELECT path, partition_spec_id FROM iceberg.db.worker_coop.manifests;
We can inspect our table's partitions:
SELECT partition, spec_id FROM iceberg.db.worker_coop.partitions;
Looking Under The Hood
Now let’s take a look at what Apache Iceberg created when we ran all that SQL. First, let’s quit SparkSQL by running exit;. Next, change directories into my_table, which is inside the db folder within the warehouse folder we configured earlier.
cd warehouse/db/my_table
If you look inside this folder, you’ll notice two folders: data and metadata. Let’s look at their contents.
Data
ls data
This should display several Parquet files, which hold the data written by each of our operations.
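Metadata
ls metadata
This lists the table’s metadata files, manifest lists (snap-*.avro), and manifest files (*.avro). Assuming the container uses the file-based Hadoop catalog configured earlier, the metadata files are versioned as v1.metadata.json, v2.metadata.json, and so on. Open the highest-numbered one to follow the sections below (replace N with the latest version you see):
cat metadata/vN.metadata.json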
The schemas section of this file tracks our current and past schemas. You can see that the schemas array contains our old name/age schema followed by our new name/age/email schema.
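A trimmed excerpt of that section might look like the following (the exact field IDs, ordering, and surrounding keys vary by Iceberg version; this is only an illustration):
"current-schema-id": 1,
"schemas": [
  { "schema-id": 0, "type": "struct", "fields": [
    { "id": 1, "name": "name", "required": false, "type": "string" },
    { "id": 2, "name": "age", "required": false, "type": "int" } ] },
  { "schema-id": 1, "type": "struct", "fields": [
    { "id": 1, "name": "name", "required": false, "type": "string" },
    { "id": 2, "name": "age", "required": false, "type": "int" },
    { "id": 3, "name": "email", "required": false, "type": "string" } ] }
]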
The partition spec section of the metadata file is where any partitioning details would be included. However, we did not specify any partitions when we created our table; we could have done so by adding a PARTITIONED BY clause to our CREATE TABLE statement, like the following (assuming we had a born_at timestamp field):
CREATE TABLE iceberg.db.my_table (name string, age int, born_at timestamp) USING iceberg PARTITIONED BY (hour(born_at));
The snapshots section shows the current snapshot, which points to the manifest list covering that snapshot. After the current snapshot, there is an array of all snapshots, which is what enables time travel: having a reference to every previous snapshot lets you query the data as it existed at that point in time.
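Since every snapshot stays addressable, Spark 3.3 with Iceberg 0.14 lets you time travel directly in SQL. The snapshot ID and timestamp below are placeholders; you would take real values from the snapshots array in this file or from the snapshots metadata table shown earlier:
-- time travel by snapshot ID (placeholder value)
SELECT * FROM iceberg.db.my_table VERSION AS OF 1234567890123456789;
-- time travel by timestamp (placeholder value)
SELECT * FROM iceberg.db.my_table TIMESTAMP AS OF '2022-08-01 00:00:00';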
You’ve had the opportunity to create and edit an Apache Iceberg table and can see how easy it can be. Using Apache Iceberg tables, you can unlock speed and flexibility that were not possible before across different file types and engines. Apache Iceberg makes using open data architecture quite compelling.
Apache Iceberg tables can also be easily used with the Dremio Cloud platform. With Dremio Sonar you can efficiently query Iceberg tables and Dremio Arctic can serve as your Iceberg catalog, enabling git-like semantics (branching/merging) when working with Iceberg tables. You can also explore the structure of Apache Iceberg tables using Dremio.