This article has been revised and updated from its original version published in 2022 to reflect the latest Apache Iceberg developments, including V3 deletion vectors and modern data privacy patterns.
The right to be forgotten (Article 17 of GDPR) requires organizations to permanently delete personal data upon request. On traditional data lakes built on immutable file formats like Parquet, this is extraordinarily difficult: you can't simply delete a row from a Parquet file. You must read the entire file, remove the matching rows, and write a new file.
Apache Iceberg makes GDPR compliance on the lakehouse not only possible but practical. Its row-level delete operations, combined with snapshot expiry, provide a complete pipeline for permanent data erasure that satisfies regulatory requirements.
The GDPR Challenge on Object Storage
The stakes for GDPR non-compliance are significant. Regulators can impose fines of up to 4% of global annual revenue or €20 million, whichever is greater. Beyond financial penalties, data breaches and compliance failures cause reputational damage that can take years to recover from. Organizations need a reliable, automated, and auditable process for handling deletion requests at scale. For official documentation, refer to the Iceberg delete format specification.
The right to be forgotten doesn't just mean hiding data from queries; it means physically removing the data from storage so that it cannot be recovered, even by someone with direct access to the underlying files. This requirement creates a unique technical challenge on object storage.
Object storage is append-only: files cannot be modified in place. That puts it in fundamental tension with data deletion requirements, because removing a single record means rewriting every file that contains it.
Why Simple DELETE Isn't Enough
When you run DELETE FROM users WHERE user_id = 42, Iceberg logically deletes the row: subsequent queries won't return it. But the original data file containing user 42's data still exists on storage. The data is logically invisible but physically present.
For GDPR compliance, you need physical deletion: the actual bytes must be removed from storage. Logical deletion alone is insufficient because:
Direct storage access (bypassing Iceberg) could still read the data
Backups and replicas may contain the data
Audit requirements demand proof of permanent erasure
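For instance, until the old snapshots are expired, a time-travel query can still reach the "deleted" row (Spark SQL syntax shown; the timestamp is illustrative):

```sql
-- A pre-deletion snapshot still contains user 42
SELECT *
FROM users TIMESTAMP AS OF '2024-01-01 00:00:00'
WHERE user_id = 42;
```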
The Three-Step Erasure Pipeline
Complete GDPR deletion in Iceberg requires three operations:
| Step | Operation | What It Does |
|---|---|---|
| 1 | DELETE | Logically removes the row (creates a delete file or rewrites the data file) |
| 2 | Compaction | Physically removes the row from data files by rewriting them without it |
| 3 | Snapshot Expiry + Orphan Cleanup | Removes the old data files that contained the deleted row |
Each step is necessary. Skipping any step leaves the personal data physically accessible on storage.
Step 1: Execute the DELETE
Using Merge-on-Read (Fast Initial Compliance)
DELETE FROM users WHERE user_id = 42;
With MOR mode, this writes a small delete file and commits instantly. The user's data is immediately invisible to all queries, providing fast initial compliance with the deletion request.
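If the table isn't already configured for merge-on-read deletes, the write mode can be set first through standard Iceberg table properties (shown here in Spark SQL):

```sql
-- Use merge-on-read so DELETE writes small delete files
ALTER TABLE users SET TBLPROPERTIES ('write.delete.mode' = 'merge-on-read');
```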
Using Copy-on-Write (Immediate Physical Deletion in Data Files)
-- Set COW mode for deletes
ALTER TABLE users SET TBLPROPERTIES ('write.delete.mode' = 'copy-on-write');
DELETE FROM users WHERE user_id = 42;
COW rewrites every affected data file without the deleted rows. This is slower, but it eliminates the need for a separate compaction step: the data is physically absent from the new data files immediately. Note that the old files still exist on storage until snapshots are expired.
Batch Deletions
GDPR requests often arrive in batches. Process them efficiently:
-- Batch delete using a staging table
DELETE FROM users
WHERE user_id IN (SELECT user_id FROM gdpr_deletion_requests WHERE status = 'pending');
-- Update request status
UPDATE gdpr_deletion_requests SET status = 'completed' WHERE status = 'pending';
Step 2: Compact to Physically Remove Data
If you used MOR for the delete, the original data files still contain the deleted rows. Compaction rewrites the data files with the delete files applied; the output files contain only surviving rows.
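A sketch of that compaction step using the rewrite_data_files procedure (catalog and database names are placeholders):

```sql
-- Rewrite data files, folding in any outstanding delete files
CALL catalog.system.rewrite_data_files(
  table => 'db.users',
  strategy => 'binpack'
);
```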
Step 3: Expire Snapshots and Remove Orphan Files
Compaction writes new data files, but the old ones remain on storage and stay reachable through time travel until the snapshots that reference them are expired. Expiring all pre-deletion snapshots, followed by orphan file cleanup, removes those files permanently. After these three steps, the personal data is permanently and irreversibly removed from storage. No time travel or rollback can recover it.
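The Step 3 calls can look like the following (the cutoff timestamp is a placeholder; choose one after the deletion commit):

```sql
-- Expire every snapshot that predates the deletion...
CALL catalog.system.expire_snapshots(
  table => 'db.users',
  older_than => TIMESTAMP '2024-01-01 00:00:00',
  retain_last => 1
);

-- ...then delete any files no longer referenced by live metadata
CALL catalog.system.remove_orphan_files(table => 'db.users');
```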
V3 Deletion Vectors and GDPR
V3 deletion vectors make the initial DELETE operation even faster. Instead of writing a separate delete file per commit, V3 records the deleted row positions in a compact bitmap (a deletion vector) referenced from the manifest entry. The three-step pipeline remains the same: you still need compaction and snapshot expiry for complete physical erasure, but Step 1 is nearly instantaneous.
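Conceptually, a deletion vector is just a set of row positions that readers filter out at scan time. A toy Python sketch of that merge-on-read behavior (not Iceberg's actual implementation, which uses compressed roaring bitmaps stored in Puffin files):

```python
# Toy model of a deletion vector: a bitmap of deleted row positions
# that readers apply at scan time, without rewriting the data file.
data_file = ["alice", "bob", "carol", "dave"]  # stands in for an immutable Parquet file

deletion_vector = set()   # Iceberg uses a compressed bitmap; a set suffices here
deletion_vector.add(1)    # DELETE the row at position 1 ("bob")

def scan(rows, dv):
    """Merge-on-read: skip the positions marked in the deletion vector."""
    return [row for pos, row in enumerate(rows) if pos not in dv]

print(scan(data_file, deletion_vector))  # ['alice', 'carol', 'dave']
```

The data file is untouched; only the bitmap changes, which is why the initial DELETE is so cheap, and why compaction is still needed for physical erasure.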
Automating the GDPR Pipeline
Scheduled Pipeline with Spark
from pyspark.sql import SparkSession
from datetime import datetime, timedelta

spark = SparkSession.builder.getOrCreate()

# Step 1: Process pending deletion requests
spark.sql("""
    DELETE FROM users
    WHERE user_id IN (SELECT user_id FROM gdpr_requests WHERE status = 'pending')
""")

# Step 2: Compact affected tables
tables = ["users", "orders", "events", "profiles"]
for table in tables:
    spark.sql(f"""
        CALL catalog.system.rewrite_data_files(
            table => 'db.{table}',
            strategy => 'sort'
        )
    """)

# Step 3: Expire old snapshots and remove orphan files
cutoff = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d %H:%M:%S')
for table in tables:
    spark.sql(f"""
        CALL catalog.system.expire_snapshots(
            table => 'db.{table}',
            older_than => TIMESTAMP '{cutoff}',
            retain_last => 1
        )
    """)
    spark.sql(f"""
        CALL catalog.system.remove_orphan_files(
            table => 'db.{table}',
            older_than => TIMESTAMP '{cutoff}'
        )
    """)
With Dremio
Dremio's managed catalog automates compaction and snapshot expiry. After running the DELETE, Dremio handles the physical cleanup automatically, simplifying GDPR compliance to a single SQL statement.
GDPR Compliance Checklist
| Requirement | Iceberg Capability | Verification |
|---|---|---|
| Data subject requests deletion | DELETE FROM table WHERE user_id = X | Query returns 0 rows |
| Data physically removed | Compaction + snapshot expiry | Data file no longer contains the row |
| Deletion is irreversible | Snapshot expiry removes historical access | Time travel cannot retrieve the data |
| Audit trail | Snapshot log records the delete operation | SELECT * FROM table.snapshots |
| Timely response (30 days) | Automated pipeline processes requests daily | Pipeline execution logs |
Multi-Table Deletion
GDPR requests often require deleting a user's data from multiple tables (orders, profiles, events, logs). Coordinate across tables:
-- Delete from all tables containing user data
DELETE FROM users WHERE user_id = 42;
DELETE FROM orders WHERE customer_id = 42;
DELETE FROM events WHERE user_id = 42;
DELETE FROM user_profiles WHERE user_id = 42;
For atomic multi-table deletions, use Nessie branches to ensure all deletes succeed or none do.
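With the Nessie Spark SQL extensions, that coordination can look roughly like this (branch and catalog names are placeholders, and the exact syntax varies by Nessie version):

```sql
-- Stage all deletes on an isolated branch
CREATE BRANCH gdpr_batch IN catalog FROM main;
USE REFERENCE gdpr_batch IN catalog;

DELETE FROM users WHERE user_id = 42;
DELETE FROM orders WHERE customer_id = 42;

-- Publish every table's deletes in one atomic step
MERGE BRANCH gdpr_batch INTO main IN catalog;
```

If any delete fails, the branch is simply dropped and main never sees a partial erasure.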
Case Study: Automating Deletion Requests at a Bank
A European bank receives an average of 500 data subject deletion requests per month. Customer data is spread across 12 Iceberg tables (accounts, transactions, profiles, communications, risk_scores, etc.) totaling 8TB.
Manual approach (pre-Iceberg): Each request required a data engineer to identify all tables containing the customer's data, write custom scripts to rewrite affected files, and validate the deletion. Average processing time: 4 hours per request. At 500 requests/month, this consumed 2,000 engineer-hours, effectively 12 full-time engineers just for GDPR compliance.
Automated Iceberg pipeline: A nightly Spark job processes all pending requests:
Reads the gdpr_requests staging table
Executes batch DELETEs against all 12 tables (using MOR for speed)
Runs compaction on affected partitions
Expires pre-deletion snapshots
Generates an audit report confirming physical erasure
Total processing time: ~45 minutes per night, fully automated. Zero engineer hours for routine requests.
Beyond GDPR: Other Privacy Regulations
The same Iceberg erasure pipeline applies to other privacy regulations:
| Regulation | Region | Deletion Requirement | Iceberg Approach |
|---|---|---|---|
| GDPR (Article 17) | EU/EEA | Right to erasure upon request | DELETE + Compact + Expire |
| CCPA/CPRA | California, USA | Right to delete personal information | Same pipeline |
| LGPD | Brazil | Right to deletion of unnecessary data | Same pipeline |
| POPIA | South Africa | Right to deletion | Same pipeline |
| PIPEDA | Canada | Retention limits + deletion rights | Same pipeline + TTL policies |
The three-step pipeline (DELETE → Compact → Expire) is universal. Only the triggering conditions and response timeframes differ between regulations.
Crypto-Shredding as an Alternative
For extreme compliance requirements or very large tables where full compaction is expensive, consider crypto-shredding:
Encrypt all personal data columns with per-user encryption keys
Store encryption keys in a separate key management system
When a deletion request arrives, destroy the encryption key
The personal data becomes cryptographically unrecoverable without physical file deletion. This approach is faster but requires encryption infrastructure:
-- Pseudocode: encrypt personal data at write time
INSERT INTO users
SELECT
user_id,
ENCRYPT(name, get_user_key(user_id)),
ENCRYPT(email, get_user_key(user_id)),
non_personal_column
FROM staging_users;
-- To "delete": destroy the key (no data rewrite needed)
CALL key_management.delete_key(user_id => 42);
Crypto-shredding is compliant with most regulations and eliminates the need for the three-step pipeline. However, it adds encryption/decryption overhead to every read and write operation.
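A minimal, illustration-only sketch of the idea in Python. The "cipher" here is a SHA-256 counter keystream purely for demonstration; a real deployment would use an external KMS and an authenticated cipher such as AES-GCM:

```python
import hashlib
import secrets

key_store = {}  # stand-in for an external key management system

def _keystream(key: bytes, n: int) -> bytes:
    # Derive n pseudo-random bytes from the key (demo only, NOT real cryptography)
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_for_user(user_id: int, plaintext: bytes) -> bytes:
    # Each user gets their own key, created on first write
    key = key_store.setdefault(user_id, secrets.token_bytes(32))
    return bytes(a ^ b for a, b in zip(plaintext, _keystream(key, len(plaintext))))

def decrypt_for_user(user_id: int, ciphertext: bytes) -> bytes:
    key = key_store[user_id]  # raises KeyError once the key is shredded
    return bytes(a ^ b for a, b in zip(ciphertext, _keystream(key, len(ciphertext))))

ct = encrypt_for_user(42, b"alice@example.com")
assert decrypt_for_user(42, ct) == b"alice@example.com"

del key_store[42]  # "crypto-shred": destroying the key makes ct unrecoverable
```

After the `del`, the ciphertext still sits in the data files, but without the key it is indistinguishable from random bytes, which is exactly the compliance argument crypto-shredding rests on.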
Data Lineage for Compliance Audits
Iceberg's metadata provides a natural audit trail for GDPR compliance. The snapshot log records:
When the deletion occurred (snapshot timestamp)
What operation was performed (DELETE operation in snapshot summary)
How many rows were affected (rows_deleted in snapshot summary)
Which files were modified (manifest diff between snapshots)
This metadata satisfies auditor requirements for demonstrating that deletion requests were processed correctly and within the required timeframe.
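For example, the snapshot log can be queried directly through Iceberg's snapshots metadata table (Spark SQL shown; column names follow the Iceberg metadata tables, and exact syntax may vary by engine):

```sql
-- Every commit, its operation type, and its summary counters
SELECT committed_at, operation, summary
FROM catalog.db.users.snapshots
ORDER BY committed_at DESC;
```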
Frequently Asked Questions
How do I prove data was physically deleted?
After completing all three steps (DELETE → Compact → Expire), verify:
Query with time travel to the pre-deletion snapshot; it should fail (snapshot expired)
List the files in the table's data directory; the old files should be absent
Check the snapshot log for the DELETE operation timestamp and summary
What happens to Dremio Reflections when source data is deleted?
When source data is deleted and Reflections are refreshed, the Reflection data is updated to exclude the deleted rows. Dremio handles this automatically; no additional steps are required.
Can I delete individual rows from a partitioned table?
Yes, but the DELETE statement should target the specific rows, not whole partitions. Iceberg's hidden partitioning keeps the DELETE efficient: only data files in the relevant partitions are scanned.
Can a deletion be rolled back?
After snapshot expiry, rollback to pre-deletion snapshots is impossible; this is by design. GDPR requires irreversible deletion, and snapshot expiry ensures it. Always complete the full three-step pipeline before considering the deletion finalized.
How do I handle GDPR "right to be forgotten" requests at scale?
Use Iceberg's row-level delete operations with merge-on-read mode for efficient deletions across large tables. Create a dedicated deletion pipeline that processes batch deletion requests on a Nessie branch, validates the deletions, then merges to main. After merging, run snapshot expiry and orphan file cleanup to ensure deleted data is physically removed from storage and cannot be recovered through time travel.
Time travel is a fantastic feature that’s available when you adopt a data lakehouse table format like Apache Iceberg, but sometimes this useful feature runs contrary to regulatory requirements such as GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act). This is true of other systems that offer time travel as well, such as Snowflake.
This is an issue because regulations often require you to ensure that you no longer have access to protected consumer information. Typically, this can be done by merely deleting the data in question, but with systems that provide time travel, deletions are soft and the data still exists if you use time travel to query data at a previous point in time.
Apache Iceberg table format, however, offers several ways to manage tables so you can ensure compliance with the proper regulations, and guarantee data that needs to be forgotten is forgotten.
Method #1: Hard Deletes with Copy-on-Write
With Copy-on-Write, managing hard deletes is straightforward. It just takes one step:
Expire snapshots older than X days.
Since the data files are rewritten on every update or delete, newer snapshots will refer to files that don’t have the deleted records. So expiring snapshots past your desired length of retention will clean up data files with any deleted records since they are not referenced in your newer snapshots.
-- if your table uses copy-on-write
ALTER TABLE catalog.db.users SET TBLPROPERTIES (
'write.delete.mode'='copy-on-write',
'write.update.mode'='copy-on-write',
'write.merge.mode'='copy-on-write'
)
-- To hard delete files just expire snapshots prior
-- To the deletion of records
CALL catalog.system.expire_snapshots(
table => 'db.users',
older_than => TIMESTAMP '2022-06-30 00:00:00.000',
retain_last => 10
)
Method #2: Hard Deletes with Merge-on-Read
Merge-on-Read is a little more complicated because newer snapshots where records are deleted may still refer to data files with deleted records. These records are reconciled using delete files that list deleted records. Expiring snapshots alone may not delete the desired data files in this situation. So the proper flow to ensure hard deletions would be:
Run compaction to reconcile all delete files into new data files periodically.
Expire snapshots older than X days.
Since all snapshots from before the compaction job are expired, the data files with those deleted records can now be deleted.
-- if your table uses merge-on-read
ALTER TABLE catalog.db.users SET TBLPROPERTIES (
'write.delete.mode'='merge-on-read',
'write.update.mode'='merge-on-read',
'write.merge.mode'='merge-on-read'
)
-- Run compaction to eliminate delete files
CALL catalog.system.rewrite_data_files(
table => 'db.users',
strategy => 'sort',
sort_order => 'zorder(role, state)'
)
-- To hard delete files expire all prior snapshots
CALL catalog.system.expire_snapshots(
table => 'db.users',
older_than => TIMESTAMP '2022-06-30 00:00:00.000',
retain_last => 1
)
That said, if you run compaction regularly, you can develop a cadence that deletes files when needed while still leaving some snapshots available for time travel.
For example, assume weekly compaction jobs and a policy that hard deletes must take effect within 30 days. At the end of each month, you can expire all snapshots older than the most recent four compaction jobs. That leaves roughly a month's worth of snapshots for time travel while hard-deleting the files that need to be gone.
Method #3: Use Key Rotation
Another option that can complement the previous methods or be used on its own is to control access to the files using key rotation.
This means using different keys to encrypt your files depending on when or where you store them. For example, say you change the encryption key each month. If you want to eliminate access to data older than three months, you can drop the older keys, rendering those older files inaccessible even though they still physically exist.
Implementation differs depending on what vendor handles your file storage or how intricate you want the process to be. You may already be implementing some level of key rotation to comply with PCI DSS regulations.
Other Regulatory Best Practices
Here are some other considerations as you design your compliance plan:
Don’t commingle protected data with unprotected data in the same table, so you don’t lose data unnecessarily when deleting a row of protected data (e.g., put email and phone numbers in a different table than data that denotes user application preferences or activity).
Delete the data in batches. Instead of deleting as you go, set a flag on records that need to be deleted, then run a batch delete operation on all records that have been flagged. This flag can be tracked as an insert into a separate table that tracks delete requests, which also allows you to track metadata around the deletes like the time of the request, time of actual deletion, and so forth. The benefit is that it clusters all the deletes into fewer snapshots making it easier to predict what files will be cleaned up when expiring snapshots vs the deletes occurring over several snapshots.
Conclusion
Regulatory requirements can make data engineering more complex than it otherwise would be. Fortunately, whether you need hard deletes to occur on a Copy-on-Write or Merge-on-Read table or instead use a key rotation strategy, there are techniques to get all the benefits of time travel while still maintaining compliance.