
April 8, 2022

Maintaining Iceberg Tables – Compaction, Expiring Snapshots, and More

Alex Merced · Developer Advocate, Dremio

Maintenance Tasks with Iceberg Tables

The Apache Iceberg table format provides many benefits when working with data on your data lake, such as partition/schema evolution, time-travel, version rollback, and much more.

An issue that arises when ingesting streaming data is that it often arrives in small files that are fast to write and conducive to real-time ingestion, but slow to query. For querying, it is more efficient to have fewer, larger files with more data in each. With the Hive table format, rewriting small files into larger ones is a risky, error-prone process. With Iceberg, we can safely leverage a process called compaction.

Working with any data lake table format can require some regular maintenance to make sure the table's metadata files don't grow too large in number and that unused files are removed. Fortunately, to deal with these challenges, Iceberg provides APIs to expire snapshots, remove old metadata files, and delete orphan files.

In this article, we’ll discuss all of these table management practices.

Compaction

With too many small files relative to overall data size, processing engines hit performance issues. Compaction is the process of taking several small files and rewriting them into fewer larger files to speed up queries. 

When conducting compaction on an Iceberg table: 

  1. We execute the rewriteDataFiles procedure, optionally specifying a filter of which files to rewrite and the desired size of the resulting files.
  2. Spark reads the existing small files, combines the data in these files into larger batches, then writes these batches out as new, larger data files.
  3. Spark then goes through the rest of the Iceberg write process, writing manifest files, manifest lists, and table metadata files, and finally committing these changes to the Iceberg catalog.

The old snapshots and files still exist, but queries will not use them unless a query specifically uses time-travel. Following compaction, queries run against the new snapshot and its newly created larger data files, and therefore run more efficiently.

[Diagram: many small data files being rewritten into fewer, larger data files]

To run a compaction job on your Iceberg tables, you can use the rewriteDataFiles action, which is supported by Spark 3 and Flink. Below is an example of using this feature in Spark.

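The original snippet appears as an image in the source post; the following is a minimal sketch of what such a compaction call looks like with Iceberg's Java SparkActions API. The table variable, the event_date filter value, and the target file size are illustrative assumptions.

```java
import java.time.LocalDate;

import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.spark.actions.SparkActions;

// 'table' is assumed to be an Iceberg Table already loaded from your catalog
Table table = loadMyTable();  // hypothetical helper; use your catalog's loadTable()

// Only rewrite data whose event_date falls within the last 7 days
String sevenDaysAgo = LocalDate.now().minusDays(7).toString();

SparkActions.get()
    .rewriteDataFiles(table)
    .filter(Expressions.greaterThanOrEqual("event_date", sevenDaysAgo))
    .option("target-file-size-bytes", Long.toString(512L * 1024 * 1024)) // ~512 MB output files
    .execute();
```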

In the above example, we run the rewriteDataFiles action and specify a filter so that only data with event_date values from the last 7 days is compacted; this way, we can schedule the compaction job to run weekly.

You can find a list of other expression methods for passing to filter in the Iceberg documentation.

Expire Snapshots

A nice thing about the snapshots created as an Iceberg table evolves is that they can be used for time-travel and version rollback. However, over time you may accumulate many snapshots. As long as a snapshot is valid, the manifests listed in that snapshot and the data files listed in those manifests will not be deleted.

At a certain point, you may want to expire snapshots you no longer need for analytics to avoid unnecessary storage costs. Note that when you expire a snapshot, any manifest lists, manifests, and data files associated with it will be deleted, unless those files are also associated with a still-valid snapshot (see the diagram below). Orphan files are not associated with any snapshot, and there is a separate process for finding and deleting them (read more on this in the section “Delete Orphan Files”).

[Diagram: snapshots older than a target date being expired, with files shared by valid snapshots retained]
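The snapshot-expiration snippet is also an image in the source post; here is a minimal sketch of the equivalent call using the Iceberg Table API, assuming table is an already-loaded Table and using an illustrative five-day retention window for tsToExpire.

```java
import java.util.concurrent.TimeUnit;

import org.apache.iceberg.Table;

// 'table' is assumed to be an Iceberg Table already loaded from your catalog
// Expire every snapshot created more than 5 days ago (the window is illustrative)
long tsToExpire = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(5);

table.expireSnapshots()
    .expireOlderThan(tsToExpire)
    .commit();
```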

In the code above, all snapshots that were created before the timestamp held in tsToExpire will be expired. You can read the documentation for additional methods that will allow you to expire an arbitrary number of snapshots or snapshots with a particular id. All manifest lists and manifests no longer associated with a valid snapshot will then be deleted.

Removing Old Metadata Files

Snapshot isolation is what enables many of Iceberg’s most powerful features. Part of what makes this possible is that a new metadata file is created with each update to the table’s structure or data. The number of these files can add up, especially when inserting data in real time from streams, because each commit creates a new metadata file. This can lead to unnecessary file storage costs, especially since newer metadata files already contain the historical list of snapshots. Compaction handles the problem of many small data files, but what about the accumulation of metadata files?

Iceberg allows you to turn on a setting enabling the deletion of the oldest metadata file when a new one is created. You can also set the number of metadata files the table should retain. In the example diagram below, this is set to four.

[Diagram: only the four most recent metadata files are retained as new ones are created]

Here is the setting for turning on deletion of the oldest metadata file when a new one is created; it defaults to false.

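The table property in question is write.metadata.delete-after-commit.enabled. The original image isn't reproduced here, but as a sketch (assuming table is an already-loaded Iceberg Table), it can be enabled through the Java UpdateProperties API:

```java
// Enable automatic deletion of the oldest tracked metadata files after each commit
table.updateProperties()
    .set("write.metadata.delete-after-commit.enabled", "true")
    .commit();
```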

Here is the setting for how many previous metadata files to retain; it defaults to 100.

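The corresponding property is write.metadata.previous-versions-max. A sketch of setting it with the same API, where the value 100 shown here matches the default:

```java
// Keep at most the 100 most recent previous metadata files
table.updateProperties()
    .set("write.metadata.previous-versions-max", "100")
    .commit();
```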

Delete Orphan Files

Task or job failures in different engines may result in partially written files or files not associated with any snapshot. Since there’s no reference to them in the Iceberg metadata tree, they won’t be picked up for deletion by the previous cleanup operations. These are called orphan files.

Over time, they can accumulate, leading to unnecessary storage costs. Since these files aren’t referenced by any snapshots or metadata files, they won’t be cleaned up when expiring snapshots or removing metadata files. So, we need another procedure to scan the table’s directories for these stray files.

[Diagram: orphan files sitting outside the catalog’s metadata tree of snapshots, manifests, and data files]

Iceberg can help you clean these files up using its deleteOrphanFiles action, as shown below.

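The original code is shown as an image in the source post; the following is a minimal sketch of the call using the Java SparkActions API, again assuming table is an already-loaded Iceberg Table. The three-day cutoff shown with olderThan is illustrative.

```java
import java.util.concurrent.TimeUnit;

import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;

// 'table' is assumed to be an Iceberg Table already loaded from your catalog
// Only delete orphan files older than 3 days, so in-progress writes aren't touched
SparkActions.get()
    .deleteOrphanFiles(table)
    .olderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3))
    .execute();
```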

This operation will go through each valid snapshot to determine which files are accessible by those snapshots. The files that are in the table’s data directory and not accessible by any valid snapshot will be deleted. Since a table’s files can also be stored outside the table’s data directory, ways to deal with that are mentioned below.

This can be a lengthy process, so it should only be run periodically. Some helper methods you can find in the documentation include:

  • the olderThan method, which helps prevent the deletion of temporary files from a write that is still in progress when you run this action
  • the location method to point the operation at a particular directory in case your data files aren’t located in the table’s main data directory, which may be the case when you migrate pre-existing tables to Iceberg using the migrate procedure (see the sketch after this list).
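For example, if a migrated table still has data files under their original directory, a sketch combining both helpers (using the same imports and table variable as above) might look like this; the path is purely illustrative:

```java
// Scan a specific directory for orphan files instead of the table's default data location
SparkActions.get()
    .deleteOrphanFiles(table)
    .location("s3://my-bucket/legacy/events/data")  // illustrative path
    .olderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3))
    .execute();
```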

Conclusion

Whether it’s a buildup of small files, snapshots, metadata files, or orphan files, Iceberg has built-in tools to help maintain and optimize the files within your table.

Tools such as:

  • Compaction with rewriteDataFiles to optimize read times by combining small files into larger files.
  • Snapshot expiration with expireSnapshots to remove snapshots prior to a certain point in time, one far enough back that you don’t expect to ever roll back or time-travel to it.
  • Old metadata file removal, by setting the maximum number of metadata files to retain so old ones are removed as new ones are added.
  • Orphan file removal with deleteOrphanFiles so you aren’t storing unreferenced files.

You can find more details on table maintenance in the Iceberg documentation and in a recent talk from the Subsurface conference.
