4 minute read · December 15, 2023

Announcing Automated Iceberg Table Cleanup

Ben Hudson · Principal Product Manager, Dremio

Manoj Raheja · Principal Product Manager, Dremio

Dremio is a lakehouse platform that enables companies to run enterprise analytics workloads directly on data lake storage. As part of their data lifecycle, companies can ensure optimal query performance with Dremio’s optimization capabilities for Apache Iceberg tables.

Today, we’re excited to announce Dremio’s support for automated table cleanup, which helps companies easily minimize storage utilization and adhere to data retention policies by removing snapshot, metadata, and data files that are no longer needed.

Why Table Cleanup?

Snapshots are a fundamental concept in Iceberg. Snapshots help query engines quickly determine which data files make up a table at a point in time, and are also useful for time travel and rollback scenarios. However, each write to an Iceberg table creates a new snapshot, or version, of that table. These snapshots accumulate over time, along with the metadata and data files they reference, so tables need regular cleanup to keep metadata and storage in check.

Companies can manually expire snapshots and delete unused data files for individual tables using Dremio’s VACUUM TABLE SQL command. However, this can be an arduous task for companies with hundreds or thousands of tables, which must either run the VACUUM TABLE SQL command by hand for each table or write custom schedulers to run it programmatically.
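To make the "custom scheduler" burden concrete, here is a minimal Python sketch that builds one VACUUM TABLE statement per table. The table paths, the `vacuum_table_statements` helper, and the `EXPIRE SNAPSHOTS older_than ... retain_last ...` clause are assumptions for illustration; check the exact syntax against Dremio's SQL reference before relying on it.

```python
from datetime import datetime, timedelta, timezone


def quote_path(parts):
    """Quote each part of a table path as a SQL identifier."""
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)


def vacuum_table_statements(tables, retention_days, retain_last, now=None):
    """Build one VACUUM TABLE statement per table.

    The EXPIRE SNAPSHOTS clause below is an assumed form of Dremio's
    syntax, not a verified one.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S.000")
    return [
        f"VACUUM TABLE {quote_path(t)} EXPIRE SNAPSHOTS "
        f"older_than '{cutoff}' retain_last {retain_last}"
        for t in tables
    ]


if __name__ == "__main__":
    # Hypothetical table paths, for illustration only.
    tables = [("lake", "sales", "orders"), ("lake", "sales", "customers")]
    for stmt in vacuum_table_statements(tables, retention_days=7, retain_last=5):
        print(stmt)  # a real scheduler would submit this to Dremio instead
```

Even in this small form, someone has to keep the table list current, schedule the job, and handle failures for every table, which is the overhead that catalog-level cleanup is meant to remove.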

What's New?

Dremio Software: Companies can now use the VACUUM CATALOG SQL command to expire snapshots and remove orphaned metadata and data files for all Iceberg tables in a specified catalog. This eliminates the time and effort required to run a separate VACUUM TABLE command for each table. VACUUM CATALOG is supported for Nessie Catalogs as of Dremio Software version 24.3.
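As a rough illustration, the catalog-level command reduces the cleanup job to a single statement. The sketch below only builds that statement as a string; the catalog name `my_nessie` and the `vacuum_catalog_sql` helper are hypothetical, and the bare `VACUUM CATALOG <name>` form is an assumption to verify against the Dremio documentation for your version.

```python
def vacuum_catalog_sql(catalog_name):
    """Build a VACUUM CATALOG statement for one catalog (assumed syntax)."""
    quoted = '"' + catalog_name.replace('"', '""') + '"'
    return f"VACUUM CATALOG {quoted}"


if __name__ == "__main__":
    # "my_nessie" is a placeholder catalog name.
    print(vacuum_catalog_sql("my_nessie"))
```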

Dremio Cloud: Automatic table cleanup is now enabled by default for any Dremio Cloud organization using Dremio Arctic as their Iceberg catalog. Arctic automatically performs table cleanup once a day, deleting orphaned Iceberg metadata files as well as Iceberg snapshots (i.e., versions) that are older than the customer-defined retention period. When snapshots are deleted, Arctic deletes their metadata along with any Parquet data files that are no longer referenced by a remaining snapshot.
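The retention rule described above can be sketched in a few lines of Python. This is a simplified model, not Arctic's implementation: the `Snapshot` class and `plan_cleanup` function are hypothetical, and keeping the most recent snapshot regardless of age is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: datetime
    data_files: frozenset  # files that make up the table at this snapshot


def plan_cleanup(snapshots, now, retention):
    """Return (expired snapshot ids, data files that are safe to delete)."""
    if not snapshots:
        return [], []
    cutoff = now - retention
    latest = max(snapshots, key=lambda s: s.committed_at)
    # Assumption: the current snapshot is always kept, whatever its age.
    kept_ids = {
        s.snapshot_id
        for s in snapshots
        if s.committed_at >= cutoff or s.snapshot_id == latest.snapshot_id
    }
    kept = [s for s in snapshots if s.snapshot_id in kept_ids]
    expired = [s for s in snapshots if s.snapshot_id not in kept_ids]
    still_referenced = set().union(*(s.data_files for s in kept))
    candidates = set().union(*(s.data_files for s in expired))
    # A file is deletable only if no remaining snapshot references it.
    return sorted(s.snapshot_id for s in expired), sorted(candidates - still_referenced)


if __name__ == "__main__":
    snapshots = [
        Snapshot(1, datetime(2023, 12, 1), frozenset({"a.parquet"})),
        Snapshot(2, datetime(2023, 12, 5), frozenset({"a.parquet", "b.parquet"})),
        Snapshot(3, datetime(2023, 12, 14), frozenset({"b.parquet", "c.parquet"})),
    ]
    print(plan_cleanup(snapshots, datetime(2023, 12, 15), timedelta(days=7)))
    # prints ([1, 2], ['a.parquet'])
```

In the example, snapshots 1 and 2 fall outside the seven-day window and are expired, but `b.parquet` survives because snapshot 3 still references it; only `a.parquet` is removed.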

Questions or Feedback?

If you have any questions or feedback, please post to the Dremio Community page or contact your Dremio account team and we’ll be happy to assist.
