The Apache Iceberg table format provides many benefits when working with data on your data lake, such as partition and schema evolution, time-travel, version rollback, and much more.
An issue that arises when ingesting streaming data is that it may arrive in smaller files, which are faster to write and more conducive to real-time ingestion but slower to query. At query time, it is more efficient to read fewer, larger files. With the Hive table format, rewriting small files into larger ones is a risky, error-prone process. With Iceberg, we can leverage a process called compaction.
Working with any data lake table format can require some regular maintenance to make sure the table metadata files don’t grow too large in number and that unused files are removed. Fortunately, to deal with these challenges, Iceberg provides us APIs to expire snapshots, remove old metadata files, and delete orphan files.
In this article, we’ll discuss all of these table management practices.
With too many small files relative to overall data size, processing engines hit performance issues. Compaction is the process of taking several small files and rewriting them into fewer larger files to speed up queries.
When conducting compaction on an Iceberg table, we run the rewriteDataFiles procedure, optionally specifying a filter for which files to rewrite and the desired size of the resulting files.
The old snapshots and files still exist, but queries will not use them unless they explicitly time-travel. Following compaction, queries run against the new snapshot and read the newly created larger data files, which is more efficient.
In the above example snippet, we run the rewriteDataFiles action and compact only data with event_date values from the last 7 days. This way, we can schedule the compaction job on a weekly basis.
You can find a list of other expression methods to pass to filter in the documentation.
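To make the mechanics concrete, here is a minimal plain-Java sketch of the planning step behind compaction: select the files that match a filter, then pack them into groups no larger than a target size. It is only a model and does not call the Iceberg API. The DataFile class, the planGroups method, and the sizes and dates used are all hypothetical, chosen purely for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model of compaction planning. It does NOT use the Iceberg API;
// every name here is hypothetical and exists only for illustration.
final class CompactionSketch {

    // Stand-in for a small data file written by a streaming job.
    static final class DataFile {
        final String path;
        final long sizeBytes;
        final LocalDate eventDate;

        DataFile(String path, long sizeBytes, LocalDate eventDate) {
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.eventDate = eventDate;
        }
    }

    // Keeps only files matching the filter, then greedily packs them, in order,
    // into groups whose total size stays at or under targetBytes
    // (a single file larger than the target ends up in a group of its own).
    static List<List<DataFile>> planGroups(
            List<DataFile> files, Predicate<DataFile> filter, long targetBytes) {
        List<List<DataFile>> groups = new ArrayList<>();
        List<DataFile> current = new ArrayList<>();
        long currentBytes = 0;
        for (DataFile file : files) {
            if (!filter.test(file)) {
                continue; // outside the filter: left untouched
            }
            if (!current.isEmpty() && currentBytes + file.sizeBytes > targetBytes) {
                groups.add(current); // close the group before it exceeds the target
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(file);
            currentBytes += file.sizeBytes;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        LocalDate today = LocalDate.of(2024, 1, 10); // fixed date keeps the demo repeatable
        List<DataFile> files = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            files.add(new DataFile("data/part-" + i + ".parquet", 40 * mb, today.minusDays(i)));
        }
        LocalDate cutoff = today.minusDays(7);
        List<List<DataFile>> groups =
                planGroups(files, f -> f.eventDate.isAfter(cutoff), 128 * mb);
        System.out.println(groups.size() + " larger files planned");
    }
}
```

Each group stands in for one larger file that a rewrite would produce. Files outside the filter are skipped entirely, which is what lets a scheduled job touch only recent data.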
A nice thing about the snapshots created as the Iceberg table evolves is that they can be used for time-travel and version rollback. Over time, however, you may accumulate many snapshots. As long as a snapshot is valid, the manifests listed in the snapshot and the data files listed in those manifests will not be deleted.
At a certain point, you may want to expire snapshots you no longer need for analytics to avoid unnecessary storage costs. Note that any manifest lists, manifests, and data files associated with an expired snapshot will be deleted when you expire that snapshot, unless those files are also associated with a still-valid snapshot (see the diagram below). Orphan files are not associated with any snapshot, and there is a separate process for finding and deleting them (read more on this in the section “Delete Orphan Files”).
In the code above, all snapshots that were created before the timestamp held in tsToExpire will be expired. You can read the documentation for additional methods that will allow you to expire an arbitrary number of snapshots or snapshots with a particular id. All manifest lists and manifests no longer associated with a valid snapshot will then be deleted.
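The rule about which files actually get removed can be sketched in plain Java. This is a simplified model, not the Iceberg API: the Snapshot class, the filesToDelete method, and the file names are hypothetical, and the model ignores the safeguards a real table applies. It shows only the core idea that a file is deletable when an expired snapshot references it and no retained snapshot does.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of snapshot expiration. It does NOT use the Iceberg API;
// all names are hypothetical and exist only for illustration.
final class ExpireSnapshotsSketch {

    // Stand-in for a snapshot: an id, a creation time, and the files it references.
    static final class Snapshot {
        final long id;
        final long timestampMillis;
        final Set<String> files;

        Snapshot(long id, long timestampMillis, Set<String> files) {
            this.id = id;
            this.timestampMillis = timestampMillis;
            this.files = files;
        }
    }

    // Expires every snapshot created before tsToExpire and returns the files that
    // become deletable: referenced by an expired snapshot and by no retained one.
    static Set<String> filesToDelete(List<Snapshot> snapshots, long tsToExpire) {
        Set<String> expiredFiles = new HashSet<>();
        Set<String> retainedFiles = new HashSet<>();
        for (Snapshot snapshot : snapshots) {
            if (snapshot.timestampMillis < tsToExpire) {
                expiredFiles.addAll(snapshot.files);
            } else {
                retainedFiles.addAll(snapshot.files);
            }
        }
        expiredFiles.removeAll(retainedFiles); // still-valid snapshots protect shared files
        return new TreeSet<>(expiredFiles);    // sorted for stable printing
    }

    public static void main(String[] args) {
        List<Snapshot> snapshots = List.of(
                new Snapshot(1L, 1000L, Set.of("a.parquet", "b.parquet")),
                new Snapshot(2L, 2000L, Set.of("b.parquet", "c.parquet")),
                new Snapshot(3L, 3000L, Set.of("c.parquet", "d.parquet")));
        // Snapshots 1 and 2 expire; c.parquet survives because snapshot 3 still uses it.
        System.out.println(filesToDelete(snapshots, 2500L));
    }
}
```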
Snapshot isolation is what enables many of Iceberg’s most powerful features. Part of what makes this possible is that a new metadata file is created with each update to the table’s structure or data. The number of these files can add up, especially when inserting data in real-time from streams, because each streaming commit creates a new metadata file. This can lead to unnecessary storage costs, especially since newer metadata files already contain a historical list of snapshots. Compaction can handle the problem of many small data files, but what about the accumulation of metadata files?
Iceberg allows you to turn on a setting enabling the deletion of the oldest metadata file when a new one is created. You can also set the number of metadata files the table should retain. In the example diagram below, this is set to four.
Here is the setting for turning on the deletion of the oldest metadata file when a new one is created, which defaults to false: write.metadata.delete-after-commit.enabled
Here is the setting for how many metadata files to maintain, which defaults to 100: write.metadata.previous-versions-max
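The effect of these two settings can be modeled in a few lines of plain Java. This is a sketch under stated assumptions, not the Iceberg implementation: the class, its methods, and the v1.metadata.json style file names are hypothetical. The two constructor arguments loosely mirror the write.metadata.delete-after-commit.enabled and write.metadata.previous-versions-max table properties, and the model assumes the limit counts previous metadata files in addition to the current one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Plain-Java model of metadata file retention. It does NOT use the Iceberg API;
// all names are hypothetical and exist only for illustration.
final class MetadataRetentionSketch {
    private final boolean deleteAfterCommit;  // mirrors write.metadata.delete-after-commit.enabled
    private final int maxPreviousVersions;    // mirrors write.metadata.previous-versions-max
    private final Deque<String> previous = new ArrayDeque<>();
    private String current;
    private int version = 0;

    MetadataRetentionSketch(boolean deleteAfterCommit, int maxPreviousVersions) {
        this.deleteAfterCommit = deleteAfterCommit;
        this.maxPreviousVersions = maxPreviousVersions;
    }

    // Simulates one commit: a new metadata file becomes current, and, if deletion
    // is enabled, the oldest previous files beyond the limit are removed.
    // Returns the names of the files deleted by this commit.
    List<String> commit() {
        List<String> deleted = new ArrayList<>();
        if (current != null) {
            previous.addLast(current);
        }
        version++;
        current = "v" + version + ".metadata.json";
        while (deleteAfterCommit && previous.size() > maxPreviousVersions) {
            deleted.add(previous.removeFirst());
        }
        return deleted;
    }

    // Number of metadata files this model still keeps on storage.
    int filesOnStorage() {
        return previous.size() + (current == null ? 0 : 1);
    }

    public static void main(String[] args) {
        MetadataRetentionSketch table = new MetadataRetentionSketch(true, 4);
        for (int i = 0; i < 10; i++) {
            List<String> deleted = table.commit();
            System.out.println("commit " + (i + 1) + " deleted " + deleted);
        }
        System.out.println("files on storage: " + table.filesOnStorage());
    }
}
```

With deletion turned off, the same ten commits would leave ten metadata files behind, which is the buildup these settings are meant to prevent.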
Task or job failures with different engines may result in partially written files or files not associated with any snapshots, which may not be picked up in the previous cleanup operations for deletion since there’s no reference to them in the Iceberg metadata tree. These are called orphan files.
Over time, they can accumulate, leading to unnecessary storage costs. Since these files aren’t referenced by any snapshots or metadata files, they won’t be cleaned up when expiring snapshots or removing metadata files. So, we need another procedure to scan the table’s directories for these stray files.
Iceberg can help you clean these files up using its deleteOrphanFiles action.
This operation will go through each valid snapshot to determine which files are reachable from those snapshots. The files that are in the table’s data directory and not reachable from any valid snapshot will be deleted. Since a table’s files can also be stored outside the table’s data directory, ways to handle that case are covered below.
This can be a lengthy process, so it should only be run periodically. Some helper methods you can find in the documentation include:
The olderThan method, which can help prevent the deletion of temporary files from an in-progress write happening when you run this action.
The location method, which points the operation to a particular directory location in case your data files aren’t located in the table’s main data directory, which may be the case when you migrate pre-existing tables to Iceberg using the migrate procedure.
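The selection logic behind orphan file cleanup, including the role of an olderThan cutoff, can be sketched in plain Java. This is a simplified model and not the Iceberg API: the class, the findOrphans method, and the file names and timestamps are hypothetical. A file counts as an orphan here when it appears in the directory listing, is not referenced by any valid snapshot, and was last modified before the cutoff.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java model of orphan file detection. It does NOT use the Iceberg API;
// all names are hypothetical and exist only for illustration.
final class OrphanFilesSketch {

    // listing:         path -> last-modified time for every file found under the location
    // referenced:      paths reachable from any valid snapshot
    // olderThanMillis: only files last modified before this time are eligible,
    //                  which protects files from writes that are still in progress
    static SortedSet<String> findOrphans(
            Map<String, Long> listing, Set<String> referenced, long olderThanMillis) {
        SortedSet<String> orphans = new TreeSet<>();
        for (Map.Entry<String, Long> entry : listing.entrySet()) {
            boolean unreferenced = !referenced.contains(entry.getKey());
            boolean oldEnough = entry.getValue() < olderThanMillis;
            if (unreferenced && oldEnough) {
                orphans.add(entry.getKey());
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        Map<String, Long> listing = Map.of(
                "data/a.parquet", 1000L,            // referenced by a snapshot
                "data/failed-job.parquet", 1000L,   // left behind by a failed write
                "data/in-progress.parquet", 9000L); // being written right now
        Set<String> referenced = Set.of("data/a.parquet");
        // Only the old, unreferenced file is reported; the in-progress file is spared.
        System.out.println(findOrphans(listing, referenced, 5000L));
    }
}
```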
Whether it’s a buildup of small files, snapshots, metadata files, or orphan files, Iceberg has built-in tools to help maintain and optimize the files within your table.
Tools such as:
rewriteDataFiles, to optimize read times by combining small files into larger files.
expireSnapshots, to expire snapshots created prior to a certain point in time, one far enough back that you don’t think you’ll ever need to roll back or time-travel to it.
deleteOrphanFiles, so you aren’t storing unreferenced files.