12 minute read · April 8, 2022
Maintaining Iceberg Tables – Compaction, Expiring Snapshots, and More
· Senior Tech Evangelist, Dremio
Maintenance Tasks with Iceberg Tables
The Apache Iceberg table format offers many benefits when working with data on your data lake, such as partition and schema evolution, time-travel, version rollback, and much more.
An issue that arises when ingesting streaming data is that it may arrive in smaller files that are faster to write and more conducive to real-time ingestion but not as fast to query. When it comes to querying the data it would be more efficient to have fewer larger files with more data. With the Hive table format, this is a risky, error-prone process. With Iceberg, we can leverage a process called compaction.
Working with any data lake table format can require some regular maintenance to make sure the table metadata files don’t grow too large in number and that unused files are removed. Fortunately, to deal with these challenges, Iceberg provides us APIs to expire snapshots, remove old metadata files, and delete orphan files.
In this article, we’ll discuss all of these table management practices.
Compaction
With too many small files relative to overall data size, processing engines hit performance issues. Compaction is the process of taking several small files and rewriting them into fewer larger files to speed up queries.
When conducting compaction on an Iceberg table:
- We execute the rewriteDataFiles procedure, optionally specifying a filter of which files to rewrite and the desired size of the resulting files.
- Spark reads the existing small files, combines the data in these files into larger batches, then writes these batches out in new larger data files.
- Spark then goes through the rest of the Iceberg write process, writing manifest files, manifest lists, and table metadata files, and finally committing these changes to the Iceberg catalog.
The old snapshots and files still exist, but queries will not use them unless the query specifically uses time-travel. Following compaction, queries run against the new snapshot and read the newly created larger data files, which is more efficient.
To run a compaction job on your Iceberg tables, you can use the RewriteDataFiles action, which is supported by Spark 3 and Flink. Below is an example of using this feature in Spark.
In the above example snippet, we run the rewriteDataFiles action and specify that only data with event_date values from the last 7 days should be compacted. This way, we can schedule this compaction job on a weekly basis.
You can find a list of other expression methods for passing to filter here in the documentation.
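To make the idea of combining small files concrete, here is a minimal Python sketch of the planning step. It is a toy model, not Iceberg's implementation: the DataFile class, the plan_compaction function, the file names, and the first-fit packing rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def plan_compaction(files, target_size_bytes):
    """Group small files into bins whose total size stays at or under the target."""
    # Only files smaller than the target are candidates; largest first.
    small = sorted(
        (f for f in files if f.size_bytes < target_size_bytes),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    bins = []  # each bin is a list of files to be rewritten as one output file
    for f in small:
        # First-fit: place the file in the first bin that still has room.
        for b in bins:
            if sum(x.size_bytes for x in b) + f.size_bytes <= target_size_bytes:
                b.append(f)
                break
        else:
            bins.append([f])
    # A bin holding a single file gains nothing from being rewritten.
    return [b for b in bins if len(b) > 1]


MB = 1024 * 1024
files = [
    DataFile("a.parquet", 40 * MB),
    DataFile("b.parquet", 60 * MB),
    DataFile("c.parquet", 30 * MB),
    DataFile("d.parquet", 512 * MB),  # already large, left alone
    DataFile("e.parquet", 70 * MB),
]
plan = plan_compaction(files, target_size_bytes=128 * MB)
for group in plan:
    print([f.path for f in group])
```

Each group in the plan stands for one new, larger data file. The real action additionally rewrites the data and commits a new snapshot, which this sketch leaves out.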
Expire Snapshots
A nice thing about the snapshots that are created as the Iceberg table evolves is that they can be used for time-travel and version rollback. Over time, however, you may accumulate many snapshots. As long as a snapshot is valid, the manifests listed in the snapshot and the data files listed in those manifests will not be deleted.
At a certain point, you may want to expire snapshots you no longer need for analytics to avoid unnecessary storage costs. Note that any manifest lists, manifests, and data files associated with an expired snapshot will be deleted when that snapshot is expired, unless those files are also associated with a still-valid snapshot (see the diagram below). Orphan files are not associated with any snapshot, and there is a separate process for finding and deleting them (read more on this in the section “Delete Orphan Files”).
In the code above, all snapshots that were created before the timestamp held in tsToExpire will be expired. You can read the documentation for additional methods that will allow you to expire an arbitrary number of snapshots or snapshots with a particular id. All manifest lists and manifests no longer associated with a valid snapshot will then be deleted.
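The rule that files survive as long as any valid snapshot still references them can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Iceberg API: the Snapshot class, the expire_snapshots function, and the file names are hypothetical, and the sketch assumes the most recent snapshot is always retained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    files: frozenset  # every manifest and data file reachable from this snapshot


def expire_snapshots(snapshots, expire_older_than_ms):
    """Return (retained snapshots, files that are now safe to delete)."""
    # Assumption: the most recent snapshot is the current table state and is kept.
    current = max(snapshots, key=lambda s: s.timestamp_ms)
    retained = [
        s for s in snapshots
        if s.timestamp_ms >= expire_older_than_ms or s is current
    ]
    retained_ids = {s.snapshot_id for s in retained}
    expired = [s for s in snapshots if s.snapshot_id not in retained_ids]

    still_referenced = set().union(*(s.files for s in retained))
    candidates = set().union(*(s.files for s in expired))
    # Only files that no retained snapshot references may be deleted.
    return retained, candidates - still_referenced


snapshots = [
    Snapshot(1, 1000, frozenset({"m1.avro", "d1.parquet", "d2.parquet"})),
    Snapshot(2, 2000, frozenset({"m2.avro", "d2.parquet", "d3.parquet"})),
    Snapshot(3, 3000, frozenset({"m3.avro", "d3.parquet", "d4.parquet"})),
]
retained, deletable = expire_snapshots(snapshots, expire_older_than_ms=2500)
print([s.snapshot_id for s in retained], sorted(deletable))
```

Here d3.parquet belongs to an expired snapshot but is kept, because the retained snapshot 3 still references it.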
Removing Old Metadata files
Snapshot isolation is what enables many of Iceberg’s most powerful features. Part of what makes this possible is that a new metadata file is created with each update to the table’s structure or data. The number of these files can add up, especially when inserting data in real-time from streams, because each commit from a stream creates a new metadata file. This can lead to unnecessary file storage costs, especially since newer metadata files already have a historical list of snapshots. Compaction can handle the problem of many small data files, but what about handling the accumulation of metadata files?
Iceberg allows you to turn on a setting enabling the deletion of the oldest metadata file when a new one is created. You can also set the number of metadata files the table should retain. In the example diagram below, this is set to four.
Here is the setting for turning on the deletion of the oldest metadata file when a new one is created, which defaults to false.
Here is the setting for how many metadata files to maintain, which defaults to 100.
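In the Iceberg documentation, the table properties matching these two descriptions are write.metadata.delete-after-commit.enabled (default false) and write.metadata.previous-versions-max (default 100). The Python sketch below models how the two interact. It is an illustration only: the commit function and the file names are hypothetical, and it assumes the limit counts previous versions, with the current metadata file kept in addition.

```python
def commit(metadata_log, new_file, delete_after_commit, previous_versions_max):
    """Append new_file as the current metadata file; return the files to delete."""
    metadata_log.append(new_file)
    if not delete_after_commit:
        return []
    to_delete = []
    # The last entry is the current file; everything before it is a previous version.
    while len(metadata_log) - 1 > previous_versions_max:
        to_delete.append(metadata_log.pop(0))  # drop the oldest first
    return to_delete


log = []
deleted = []
for version in range(1, 8):
    deleted += commit(
        log,
        f"v{version}.metadata.json",
        delete_after_commit=True,
        previous_versions_max=3,
    )
print(log)
print(deleted)
```

With the deletion setting left off, the same seven commits would keep all seven metadata files, which is why frequent streaming commits make this setting worth enabling.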
Delete Orphan Files
Task or job failures with different engines may result in partially written files or files not associated with any snapshots, which may not be picked up in the previous cleanup operations for deletion since there’s no reference to them in the Iceberg metadata tree. These are called orphan files.
Over time, they can accumulate, leading to unnecessary storage costs. Since these files aren’t referenced by any snapshots or metadata files, they won’t be cleaned up when expiring snapshots or removing metadata files. So, we need another procedure to scan the table’s directories for these stray files.
Iceberg can help you clean these files up using its deleteOrphanFiles action, as shown below.
This operation will go through each valid snapshot to determine which files are accessible by those snapshots. The files that are in the table’s data directory and not accessible by any valid snapshot will be deleted. Since a table’s files can also be stored outside the table’s data directory, ways to deal with that are mentioned below.
This can be a lengthy process, so it should only be done periodically. Some helper methods you can find in the documentation include:
- the olderThan method, which can help prevent the deletion of temporary files from an in-progress write happening when you run this action
- the location method to point the operation to a particular directory location in case your data files aren’t located in the table’s main data directory, which may be the case when you migrate pre-existing tables to Iceberg using the migrate procedure.
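The orphan check itself boils down to a set difference with an age guard. Here is a minimal Python sketch of that logic, assuming we already have a directory listing with modification times and the set of paths reachable from valid snapshots. The find_orphans function, its parameters, and the file names are hypothetical and are not part of the Iceberg API.

```python
def find_orphans(listed_files, referenced_paths, older_than_ms):
    """Return the listed paths that no snapshot references and that are old enough.

    listed_files maps each path found in the directory to its last-modified
    time in milliseconds; referenced_paths holds every path reachable from a
    valid snapshot.
    """
    return sorted(
        path
        for path, modified_ms in listed_files.items()
        # The age guard plays the role of olderThan: recent files may belong
        # to a write that has not committed yet.
        if path not in referenced_paths and modified_ms < older_than_ms
    )


listed = {
    "data/a.parquet": 1000,
    "data/b.parquet": 1000,
    "data/failed-task.parquet": 1500,   # left behind by a failed job
    "data/in-progress.parquet": 9500,   # being written right now
}
referenced = {"data/a.parquet", "data/b.parquet"}
orphans = find_orphans(listed, referenced, older_than_ms=9000)
print(orphans)
```

The in-progress file is unreferenced but too recent to pass the age guard, so only the file from the failed task is reported.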
Conclusion
Whether it’s a buildup of small files, snapshots, metadata files, or orphan files, Iceberg has built-in tools to help maintain and optimize the files within your table.
Tools such as:
- Compaction with rewriteDataFiles to optimize read times by combining small files into larger files.
- Snapshot expiration with expireSnapshots to expire snapshots created prior to a certain point in time, one far enough back that you don’t think you’ll ever need to roll back or time-travel to it.
- Old metadata file removal. Set the max number so old metadata files are removed while new ones are added.
- Orphan file removal with deleteOrphanFiles so you aren’t storing unreferenced files.
You can find more details on table maintenance here in the Iceberg documentation and in this recent talk from the Subsurface conference.