15 minute read · October 2, 2025
Apache Iceberg Table Storage Management with Dremio’s VACUUM TABLE
· Head of DevRel, Dremio
Apache Iceberg’s snapshot-based architecture is one of its greatest strengths, enabling time travel queries, rollbacks, and strong auditability. But with every update, new snapshots are created and old data files linger. Over time, this leads to growing storage costs, expanding metadata, and, perhaps most importantly, questions around regulatory compliance. How do you ensure that data marked for deletion, such as personal information under GDPR “right to be forgotten” rules, is actually purged from storage and not just hidden by newer snapshots?
Dremio addresses this challenge with the VACUUM TABLE command. Unlike OPTIMIZE, which focuses on performance, VACUUM is about storage hygiene and compliance. It allows engineers to safely expire old snapshots, remove orphan files, and permanently delete data that should no longer exist. Used wisely, VACUUM ensures that your Iceberg tables strike the right balance between governance, compliance, and operational efficiency.
In this blog, we’ll dive into:
- How VACUUM TABLE works and its role in Iceberg storage management
- Options like EXPIRE SNAPSHOTS and REMOVE ORPHAN FILES, and when to use each
- Strategies for meeting GDPR and similar compliance requirements without sacrificing useful time-travel functionality
- Best practices for scheduling and monitoring VACUUM operations at scale
By the end, you’ll know how to keep your Iceberg tables lean, compliant, and free of unnecessary baggage, ensuring your data platform stays both efficient and trustworthy.
Snapshots, Time Travel, and the Need for VACUUM
One of the defining features of Apache Iceberg is its snapshot-based design. Every time you write new data, whether it’s an insert, update, or delete, Iceberg creates a new snapshot. Each snapshot points to the table’s files and manifests at that moment in time, allowing you to:
- Query historical versions of your data with time travel.
- Roll back to previous states if bad data or unexpected changes are introduced.
- Audit changes for compliance and debugging.
This architecture makes Iceberg extremely powerful for data engineering and analytics, but it also introduces a challenge: old snapshots accumulate. Even after data has been logically deleted, the physical files may still exist because they’re referenced by older snapshots.
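To make the lingering-file problem concrete, here is a minimal toy model in Python. It is not Iceberg's actual metadata format; the `Snapshot` class and the file names are hypothetical and only illustrate how an older snapshot keeps a logically deleted file pinned in storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: frozenset  # files visible to readers of this snapshot


# Snapshot 1 writes two files. Snapshot 2 logically deletes some rows by
# rewriting users-a.parquet into users-a-v2.parquet (hypothetical names).
s1 = Snapshot(1, frozenset({"users-a.parquet", "users-b.parquet"}))
s2 = Snapshot(2, frozenset({"users-a-v2.parquet", "users-b.parquet"}))


def files_still_referenced(snapshots):
    """Union of the files that any retained snapshot points to."""
    referenced = set()
    for snap in snapshots:
        referenced |= snap.data_files
    return referenced


current_view = s2.data_files
physically_kept = files_still_referenced([s1, s2])

# Invisible to current queries, but still pinned by snapshot 1.
lingering = physically_kept - current_view
print(sorted(lingering))  # ['users-a.parquet']
```

Until snapshot 1 is expired, the old file stays on disk even though no current query can see its rows.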
Why This Matters
- Storage Growth – Without cleanup, storage can balloon as more snapshots and files pile up.
- Compliance Risks – Regulations like GDPR or CCPA require that personal data be permanently deleted. Hiding data behind snapshots isn’t enough; organizations must ensure those files are physically removed.
- Performance Overhead – Large numbers of snapshots and metadata files increase query planning costs, even if the underlying data is no longer needed.
Enter VACUUM
The VACUUM TABLE command solves this problem by expiring snapshots and removing orphan files. Once snapshots are expired and no longer referenced, their associated data files can be deleted permanently. This not only keeps storage lean but also ensures compliance with regulations by guaranteeing that deleted records are gone for good.
In short, VACUUM is the housekeeping command for Iceberg tables. It complements optimization (which focuses on performance) by handling the lifecycle of old data, making sure your tables stay both efficient and compliant over time.
How VACUUM Works
The VACUUM TABLE command is the mechanism Dremio provides to clean up Iceberg tables. It works in two main ways:
1. Expiring Snapshots
The EXPIRE SNAPSHOTS clause deletes snapshots older than a certain point in time. Once a snapshot is expired, any data files that it referenced and that no remaining snapshot still references become eligible for deletion.
Syntax Example:
VACUUM TABLE sales_data
EXPIRE SNAPSHOTS older_than = '2025-09-01'
retain_last = 5;
- older_than – Defines the cutoff date for snapshots. Snapshots created before this timestamp are expired.
- retain_last – Ensures that a minimum number of the most recent snapshots are always kept, regardless of age.
Use Case: This is essential for enforcing data retention policies (e.g., only keep the last 30 days of snapshots) while still allowing time travel and rollback within that window.
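The interaction between older_than and retain_last is easy to misread, so here is a small Python simulation of the selection rule. It is a simplified model over a linear snapshot history, assuming that a snapshot expires only when it is both older than the cutoff and outside the most recent retain_last snapshots. The function name and sample data are hypothetical, and this is not Dremio's implementation.

```python
from datetime import datetime


def select_expired(snapshots, older_than, retain_last):
    """Return IDs of snapshots that would be expired, oldest first.

    snapshots: list of (snapshot_id, committed_at) tuples.
    A snapshot is expired only if it is older than the cutoff AND is not
    among the `retain_last` most recent snapshots.
    """
    ordered = sorted(snapshots, key=lambda s: s[1])  # oldest first
    # Guard retain_last == 0: ordered[-0:] would select the whole list.
    protected = {sid for sid, _ in ordered[-retain_last:]} if retain_last > 0 else set()
    return [sid for sid, ts in ordered if ts < older_than and sid not in protected]


history = [
    (1, datetime(2025, 8, 1)),
    (2, datetime(2025, 8, 15)),
    (3, datetime(2025, 8, 25)),
    (4, datetime(2025, 8, 30)),
    (5, datetime(2025, 9, 5)),
    (6, datetime(2025, 9, 10)),
]
cutoff = datetime(2025, 9, 1)

# Four snapshots predate the cutoff, but retain_last = 5 protects 2 through 6.
print(select_expired(history, cutoff, retain_last=5))  # [1]
# With a smaller buffer, everything before the cutoff is expired.
print(select_expired(history, cutoff, retain_last=1))  # [1, 2, 3, 4]
```

Under this model, a generous retain_last can keep snapshots alive well past the older_than cutoff, which matters when the goal is to purge specific data by a deadline.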
2. Removing Orphan Files
The REMOVE ORPHAN FILES clause scans the table’s storage location for files that are not referenced by any snapshot in the table’s metadata and deletes them.
Syntax Example:
VACUUM TABLE sales_data
REMOVE ORPHAN FILES older_than = '2025-09-01';
- older_than – Ensures only files older than a given date are deleted, which avoids removing files that might still be in the process of being written.
- location – (Optional) Lets you specify a subdirectory within the table’s root path to target.
Use Case: Useful for cleaning up after interrupted jobs, failed writes, or manually deleted partitions that left stray files behind.
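Conceptually, orphan detection is a set difference guarded by an age check. The Python sketch below illustrates that idea, assuming a storage listing is available as a mapping of path to last-modified time. The function name, paths, and timestamps are hypothetical, and the real command works against table metadata and object storage rather than in-memory dictionaries.

```python
from datetime import datetime


def find_orphans(storage_listing, referenced_files, older_than):
    """Return paths that no snapshot references and that predate the cutoff.

    storage_listing: dict mapping file path -> last-modified datetime.
    referenced_files: set of paths referenced by the table's snapshots.
    """
    return sorted(
        path
        for path, modified in storage_listing.items()
        if path not in referenced_files and modified < older_than
    )


listing = {
    "data/part-001.parquet": datetime(2025, 8, 1),    # referenced, kept
    "data/part-002.parquet": datetime(2025, 8, 10),   # left by a failed write
    "data/part-003.parquet": datetime(2025, 9, 20),   # possibly mid-write
}
referenced = {"data/part-001.parquet"}

orphans = find_orphans(listing, referenced, older_than=datetime(2025, 9, 1))
print(orphans)  # ['data/part-002.parquet']
```

The recent unreferenced file survives because of the age guard, which is why a conservative older_than value is the safer choice when jobs may still be writing.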
Why Both Matter
- EXPIRE SNAPSHOTS is about lifecycle management, keeping only the snapshots (and thus historical data) that are relevant to your retention policies.
- REMOVE ORPHAN FILES is about hygiene, removing clutter that isn’t tied to any snapshot but still takes up space.
Together, these two clauses ensure that your Iceberg tables not only perform well but also remain compliant, lean, and free of unnecessary baggage.
Compliance and GDPR Strategies
One of the most important responsibilities of data teams today is ensuring that sensitive information is not just hidden, but permanently deleted when required. Iceberg’s snapshot model means that even after a logical delete, the original data may still exist in older snapshots until they’re expired and the files physically removed.
GDPR and the “Right to Be Forgotten”
Under regulations like GDPR and CCPA, organizations must ensure that personal data is completely purged when a deletion request is made. This creates tension with Iceberg’s powerful time-travel feature: you want the ability to roll back or query history, but you cannot legally keep data that a user has asked you to erase.
VACUUM TABLE provides the mechanism to enforce this requirement:
- Use EXPIRE SNAPSHOTS aggressively for tables containing sensitive data, reducing how long deleted records remain in storage.
- Set retention policies that strike a balance between compliance (ensuring personal data is gone quickly) and operational needs (keeping a minimal rollback window).
Strategies for Balancing Compliance and Time Travel
- Tiered Retention Windows – Apply stricter snapshot expiration to tables with sensitive data, while keeping longer retention for fact tables or operational logs that don’t contain personal identifiers.
- Immediate Expiration on Deletion – For GDPR-triggered deletes, schedule an immediate VACUUM TABLE … EXPIRE SNAPSHOTS run to ensure files are removed promptly.
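- Automated Retention Policies – Use Iceberg table properties (like history.expire.max-snapshot-age-ms) to define the default maximum snapshot age that expiration applies. The property sets the policy rather than triggering cleanup itself, so pair it with a scheduled expiration job to enforce retention without manual intervention.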
- Audit and Validate – Incorporate VACUUM into your governance pipeline, logging which snapshots were expired and verifying that deleted data files are no longer accessible.
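One way to apply tiered retention windows is to generate the VACUUM statement per table from a small policy map. The Python sketch below assumes the EXPIRE SNAPSHOTS syntax shown earlier in this post. The tier names, day counts, and table names are hypothetical, and table names are assumed to come from trusted configuration rather than user input.

```python
from datetime import datetime, timedelta

# Hypothetical tiers: days of snapshot history to keep and the minimum
# number of recent snapshots to retain.
RETENTION_TIERS = {
    "sensitive": {"days": 7, "retain_last": 5},
    "standard": {"days": 30, "retain_last": 10},
}


def build_vacuum_statement(table, tier, now):
    """Build a VACUUM TABLE ... EXPIRE SNAPSHOTS statement for one table."""
    policy = RETENTION_TIERS[tier]
    cutoff = (now - timedelta(days=policy["days"])).strftime("%Y-%m-%d")
    return (
        f"VACUUM TABLE {table} "
        f"EXPIRE SNAPSHOTS older_than = '{cutoff}' "
        f"retain_last = {policy['retain_last']};"
    )


run_date = datetime(2025, 10, 2)
sensitive_sql = build_vacuum_statement("customer_profiles", "sensitive", run_date)
standard_sql = build_vacuum_statement("sales_data", "standard", run_date)

print(sensitive_sql)
print(standard_sql)
```

Logging each generated statement alongside its run date gives the governance pipeline a simple record of which retention policy was applied to which table.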
Minimizing Impact on Operations
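While compliance drives the need to expire snapshots quickly, you don’t want to lose the operational benefits of Iceberg. A practical approach is to retain a small buffer of recent snapshots (for example, the last 5–10) so that accidental changes can still be rolled back. Keep in mind that retained snapshots still reference the files they were written with, so for erasure requests the buffer must roll past the delete before the data is physically gone. Sized with that in mind, the buffer keeps your tables both compliant and operationally resilient.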
Best Practices for Scheduling and Monitoring VACUUM
Running VACUUM TABLE isn’t something you want to leave to chance. The command directly controls how long snapshots are retained, when old files are deleted, and how much storage is reclaimed. Done too aggressively, it can break rollback workflows; done too infrequently, it can bloat storage and risk compliance issues. The key is to make VACUUM a regular, well-monitored part of your maintenance pipeline.
1. Align Frequency with Business Needs
- Compliance-focused tables – For datasets containing personal information, run VACUUM frequently (e.g., daily) to ensure deletions are enforced quickly.
- Analytical fact tables – For non-sensitive data, a less frequent cadence (weekly or monthly) may suffice, since the tradeoff leans toward preserving time-travel history.
- Streaming-heavy workloads – Since small files accumulate quickly, consider pairing frequent OPTIMIZE runs with VACUUM to keep tables lean.
2. Schedule During Off-Peak Hours
VACUUM can be resource-intensive, especially when expiring a large number of snapshots or scanning for orphan files. Running it during low-traffic windows reduces the impact on production queries.
3. Use Monitoring and Alerts
Keep an eye on:
- Snapshot count – Too many retained snapshots can increase planning overhead.
- Storage footprint – If storage is growing faster than expected, check if old snapshots are being properly expired.
- VACUUM runtime – A sudden increase in job duration may indicate skewed partitions or an inefficient retention policy.
Set up alerts so you know when snapshot counts or storage growth exceed defined thresholds.
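A threshold check like this can be sketched in a few lines of Python. The metric names, values, and limits below are hypothetical, and how the metrics are collected (for example, from table metadata or job history) is left out of scope.

```python
def check_table_health(metrics, thresholds):
    """Return an alert message for every metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # Metrics that were not collected are skipped rather than alerted on.
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts


# Hypothetical thresholds and one table's collected metrics.
thresholds = {"snapshot_count": 100, "storage_gb": 500, "vacuum_runtime_s": 1800}
metrics = {"snapshot_count": 240, "storage_gb": 320, "vacuum_runtime_s": 2400}

alerts = check_table_health(metrics, thresholds)
for message in alerts:
    print(message)
# snapshot_count=240 exceeds threshold 100
# vacuum_runtime_s=2400 exceeds threshold 1800
```

Tracking the same metrics over time also makes it easier to tell a one-off spike from a retention policy that is steadily falling behind.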
4. Balance Retention and Rollback
Always set a minimum number of snapshots to retain (e.g., 5–10) to preserve rollback options. This ensures you can still recover from operational errors while maintaining compliance and storage efficiency.
5. Automate Where Possible
Dremio’s Enterprise Catalog can automate aspects of snapshot expiration. Combine this with scheduled VACUUM jobs to create a fully automated lifecycle management strategy that requires minimal manual intervention.
Conclusion
Apache Iceberg’s snapshot model is a game-changer for time travel, auditing, and recovery, but it comes with a responsibility: old data must be managed carefully. Without proactive cleanup, tables can accumulate unnecessary files, driving up storage costs, slowing queries, and even creating compliance risks.
Dremio’s VACUUM TABLE command provides the control data engineers and architects need to:
- Expire outdated snapshots, keeping only the versions that align with retention policies.
- Permanently remove deleted data to meet GDPR and CCPA requirements.
- Clean up orphan files to ensure storage remains lean and predictable.
When combined with OPTIMIZE, which focuses on performance, VACUUM completes the picture of Iceberg table lifecycle management. OPTIMIZE keeps queries fast, while VACUUM ensures that data is governed, compliant, and storage-efficient.
For organizations building production-grade data lakehouses, the key is to treat VACUUM not as an occasional cleanup but as a regular governance tool. With thoughtful retention strategies, scheduling, and monitoring, you can ensure that your Iceberg tables remain both trustworthy and efficient, delivering not just performance, but also compliance and cost savings.