Dremio Blog

27 minute read · September 29, 2022

Apache Iceberg and the Right to Be Forgotten

Alex Merced, Head of DevRel, Dremio

This article has been revised and updated from its original version published in 2022 to reflect the latest Apache Iceberg developments, including V3 deletion vectors and modern data privacy patterns.

The right to be forgotten (Article 17 of GDPR) requires organizations to permanently delete personal data upon request. On traditional data lakes built on immutable file formats like Parquet, this is extraordinarily difficult: you can't simply delete a row from a Parquet file. You must read the entire file, remove the matching rows, and write a new file.

Apache Iceberg makes GDPR compliance on the lakehouse not only possible but practical. Its row-level delete operations, combined with snapshot expiry, provide a complete pipeline for permanent data erasure that satisfies regulatory requirements.

The GDPR Challenge on Object Storage

The stakes for GDPR non-compliance are significant. Regulators can impose fines of up to 4% of global annual revenue or €20 million, whichever is greater. Beyond financial penalties, data breaches and compliance failures cause reputational damage that can take years to recover from. Organizations need a reliable, automated, and auditable process for handling deletion requests at scale. For official documentation, refer to the Iceberg delete format specification.

The right to be forgotten doesn't just mean hiding data from queries; it means physically removing the data from storage so that it cannot be recovered, even by someone with direct access to the underlying files. This requirement creates a unique technical challenge on object storage.

Object storage is append-only; files cannot be modified in place. This creates a fundamental tension with data deletion requirements.


Why Simple DELETE Isn't Enough

When you run DELETE FROM users WHERE user_id = 42, Iceberg logically deletes the row: subsequent queries won't return it. But the original data file containing user 42's records still exists on storage. The data is logically invisible but physically present.

For GDPR compliance, you need physical deletion: the actual bytes must be removed from storage. Logical deletion alone is insufficient because:

  1. Direct storage access (bypassing Iceberg) could still read the data
  2. Backups and replicas may contain the data
  3. Audit requirements demand proof of permanent erasure

The Three-Step Erasure Pipeline

Complete GDPR deletion in Iceberg requires three operations:

| Step | Operation | What It Does |
| --- | --- | --- |
| 1 | DELETE | Logically removes the row (creates a delete file or rewrites the data file) |
| 2 | Compaction | Physically removes the row from data files by rewriting them without it |
| 3 | Snapshot Expiry + Orphan Cleanup | Removes old data files that contained the deleted row |

Each step is necessary. Skipping any step leaves the personal data physically accessible on storage.

Step 1: Execute the DELETE

Using Merge-on-Read (Fast Initial Compliance)

DELETE FROM users WHERE user_id = 42;

With MOR mode, this writes a small delete file and commits instantly. The user's data is immediately invisible to all queries, providing fast initial compliance with the deletion request.
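If you want to see the delete file this wrote, Iceberg's metadata tables expose it. The `delete_files` and `snapshots` metadata tables shown here follow Spark's naming convention; the catalog and table identifiers are placeholders:

```sql
-- Delete files currently tracked for the table (MOR leaves these behind)
SELECT file_path, content, record_count
FROM catalog.db.users.delete_files;

-- The latest snapshot should record the delete operation
SELECT operation, summary['deleted-records'] AS deleted_records
FROM catalog.db.users.snapshots
ORDER BY committed_at DESC
LIMIT 1;
```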

Using Copy-on-Write (Immediate Physical Deletion in Data Files)

-- Set COW mode for deletes
ALTER TABLE users SET TBLPROPERTIES ('write.delete.mode' = 'copy-on-write');

DELETE FROM users WHERE user_id = 42;

COW rewrites every affected data file without the deleted rows. This is slower but eliminates the need for a separate compaction step: the data is physically absent from the new data files immediately.
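One way to confirm the rewrite happened is to inspect the newest snapshot; with COW the delete commits as an overwrite that swaps data files. The summary keys shown are the commonly emitted ones and may vary by engine version:

```sql
-- With COW, the delete commits as an 'overwrite' that replaces data files
SELECT snapshot_id, operation,
       summary['deleted-data-files'] AS files_replaced,
       summary['added-data-files']   AS files_written
FROM catalog.db.users.snapshots
ORDER BY committed_at DESC
LIMIT 1;
```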

Batch Deletions

GDPR requests often arrive in batches. Process them efficiently:

-- Batch delete using a staging table
DELETE FROM users 
WHERE user_id IN (SELECT user_id FROM gdpr_deletion_requests WHERE status = 'pending');

-- Update request status
UPDATE gdpr_deletion_requests SET status = 'completed' WHERE status = 'pending';
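The gdpr_deletion_requests staging table referenced above could look something like this; the schema is illustrative, not prescriptive:

```sql
-- Hypothetical staging table for tracking deletion requests
CREATE TABLE gdpr_deletion_requests (
  request_id   BIGINT,
  user_id      BIGINT,
  requested_at TIMESTAMP,
  status       STRING,      -- 'pending' or 'completed'
  completed_at TIMESTAMP
) USING iceberg
PARTITIONED BY (days(requested_at));
```

Tracking requests in a table also gives you the timestamps auditors ask for: when the request arrived and when it was fulfilled.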

Step 2: Compact to Physically Remove Data

If you used MOR for the delete, the original data files still contain the deleted rows. Compaction rewrites data files with the delete files applied; the output files contain only surviving rows:

CALL catalog.system.rewrite_data_files(
  table => 'db.users',
  strategy => 'sort'
);

After compaction, no data file contains the deleted user's data.
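A quick check, assuming Spark's metadata tables: after compaction the table should carry no outstanding delete files.

```sql
-- Expect 0 once compaction has applied and retired all delete files
SELECT count(*) AS remaining_delete_files
FROM catalog.db.users.delete_files;
```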

Step 3: Expire Snapshots and Clean Up Orphans

Old snapshots still reference the pre-deletion data files. Snapshot expiry removes these references:

-- Expire all snapshots older than the deletion
CALL catalog.system.expire_snapshots(
  table => 'db.users',
  older_than => TIMESTAMP '2024-03-15 00:00:00',
  retain_last => 1
);

Then remove the orphaned data files from storage:

CALL catalog.system.remove_orphan_files(
  table => 'db.users',
  older_than => TIMESTAMP '2024-03-15 00:00:00'
);

After these three steps, the personal data is permanently and irreversibly removed from storage. No time travel or rollback can recover it.

V3 Deletion Vectors and GDPR

V3 deletion vectors make the initial DELETE operation even faster. Instead of writing a separate delete file, V3 records the deleted row positions in a compact bitmap (a deletion vector) referenced from the table's metadata. The three-step pipeline remains the same (you still need compaction and snapshot expiry for complete physical erasure), but Step 1 is nearly instantaneous.
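Where your engine supports the V3 spec, opting an existing table in is a table-property change. Engine support for deletion vectors is still rolling out, so treat this as a sketch rather than a universal recipe:

```sql
-- Upgrade the table's spec version (requires V3-capable readers and writers)
ALTER TABLE catalog.db.users SET TBLPROPERTIES ('format-version' = '3');
```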

Automating the GDPR Pipeline

Scheduled Pipeline with Spark

from pyspark.sql import SparkSession
from datetime import datetime, timedelta

spark = SparkSession.builder.getOrCreate()

# Step 1: Process pending deletion requests
spark.sql("""
    DELETE FROM users 
    WHERE user_id IN (SELECT user_id FROM gdpr_requests WHERE status = 'pending')
""")

# Step 2: Compact affected tables
tables = ["users", "orders", "events", "profiles"]
for table in tables:
    spark.sql(f"""
        CALL catalog.system.rewrite_data_files(
            table => 'db.{table}',
            strategy => 'sort'
        )
    """)

# Step 3: Expire old snapshots
cutoff = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d %H:%M:%S')
for table in tables:
    spark.sql(f"""
        CALL catalog.system.expire_snapshots(
            table => 'db.{table}',
            older_than => TIMESTAMP '{cutoff}',
            retain_last => 1
        )
    """)
    spark.sql(f"""
        CALL catalog.system.remove_orphan_files(
            table => 'db.{table}',
            older_than => TIMESTAMP '{cutoff}'
        )
    """)
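The bank scenario later in this article adds an audit report as a final pipeline step. A minimal, Spark-free sketch of what that report might contain follows; the record schema and field names are illustrative assumptions, not part of any Iceberg API:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class ErasureAuditRecord:
    """One audit entry per table per pipeline run (illustrative schema)."""
    table: str
    requests_processed: int
    snapshot_cutoff: str   # the cutoff passed to expire_snapshots
    completed_at: str

def build_audit_report(table_counts: dict[str, int], cutoff: datetime) -> str:
    """Serialize one audit record per table as JSON lines."""
    now = datetime.now(timezone.utc).isoformat()
    records = [
        ErasureAuditRecord(table, count, cutoff.isoformat(), now)
        for table, count in sorted(table_counts.items())
    ]
    return "\n".join(json.dumps(asdict(record)) for record in records)
```

Each nightly run can append these lines to an audit log alongside the pipeline's execution logs.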

With Dremio

Dremio's managed catalog automates compaction and snapshot expiry. After running the DELETE, Dremio handles the physical cleanup automatically, simplifying GDPR compliance to a single SQL statement.

GDPR Compliance Checklist

| Requirement | Iceberg Capability | Verification |
| --- | --- | --- |
| Data subject requests deletion | DELETE FROM table WHERE user_id = X | Query returns 0 rows |
| Data physically removed | Compaction + snapshot expiry | Data file no longer contains the row |
| Deletion is irreversible | Snapshot expiry removes historical access | Time travel cannot retrieve the data |
| Audit trail | Snapshot log records the delete operation | SELECT * FROM table.snapshots |
| Timely response (30 days) | Automated pipeline processes requests daily | Pipeline execution logs |

Multi-Table Deletion

GDPR requests often require deleting a user's data from multiple tables (orders, profiles, events, logs). Coordinate across tables:

-- Delete from all tables containing user data
DELETE FROM users WHERE user_id = 42;
DELETE FROM orders WHERE customer_id = 42;
DELETE FROM events WHERE user_id = 42;
DELETE FROM user_profiles WHERE user_id = 42;

For atomic multi-table deletions, use Nessie branches to ensure all deletes succeed or none do.
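With a Nessie catalog, that coordination can be sketched using Nessie's Spark SQL extensions. The catalog and branch names below are placeholders; check the Nessie documentation for the exact syntax your version supports:

```sql
-- Stage all deletes on an isolated branch
CREATE BRANCH IF NOT EXISTS gdpr_batch IN nessie FROM main;
USE REFERENCE gdpr_batch IN nessie;

DELETE FROM nessie.db.users WHERE user_id = 42;
DELETE FROM nessie.db.orders WHERE customer_id = 42;
DELETE FROM nessie.db.events WHERE user_id = 42;
DELETE FROM nessie.db.user_profiles WHERE user_id = 42;

-- Validate on the branch, then publish every delete atomically
MERGE BRANCH gdpr_batch INTO main IN nessie;
```

Until the merge, main is untouched; after it, all four deletes become visible in a single commit.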

Compared to Other Approaches

| Approach | Compliance Level | Complexity | Time to Complete |
| --- | --- | --- | --- |
| Iceberg DELETE + Compact + Expire | Full physical erasure | Automated | Hours (batch processing) |
| Overwrite entire dataset | Full physical erasure | Manual | Days (on large tables) |
| Mark as deleted (soft delete) | Logical only, NOT compliant | Simple | Minutes (but insufficient) |
| Encrypt and destroy keys | Crypto-shredding, compliant | Complex (key management) | Minutes |

Real-World Scenario: Financial Services GDPR Pipeline

A European bank receives an average of 500 data subject deletion requests per month. Customer data is spread across 12 Iceberg tables (accounts, transactions, profiles, communications, risk_scores, etc.) totaling 8TB.

Manual approach (pre-Iceberg): Each request required a data engineer to identify all tables containing the customer's data, write custom scripts to rewrite affected files, and validate the deletion. Average processing time: 4 hours per request. At 500 requests/month, this consumed 2,000 engineer-hours, effectively 12 full-time engineers just for GDPR compliance.

Automated Iceberg pipeline: A nightly Spark job processes all pending requests:

  1. Reads the gdpr_requests staging table
  2. Executes batch DELETEs against all 12 tables (using MOR for speed)
  3. Runs compaction on affected partitions
  4. Expires pre-deletion snapshots
  5. Generates an audit report confirming physical erasure

Total processing time: ~45 minutes per night, fully automated. Zero engineer hours for routine requests.

Beyond GDPR: Other Privacy Regulations

The same Iceberg erasure pipeline applies to other privacy regulations:

| Regulation | Region | Deletion Requirement | Iceberg Approach |
| --- | --- | --- | --- |
| GDPR (Article 17) | EU/EEA | Right to erasure upon request | DELETE + Compact + Expire |
| CCPA/CPRA | California, USA | Right to delete personal information | Same pipeline |
| LGPD | Brazil | Right to deletion of unnecessary data | Same pipeline |
| POPIA | South Africa | Right to deletion | Same pipeline |
| PIPEDA | Canada | Retention limits + deletion rights | Same pipeline + TTL policies |

The three-step pipeline (DELETE → Compact → Expire) is universal. Only the triggering conditions and response timeframes differ between regulations.

Crypto-Shredding as an Alternative

For extreme compliance requirements or very large tables where full compaction is expensive, consider crypto-shredding:

  1. Encrypt all personal data columns with per-user encryption keys
  2. Store encryption keys in a separate key management system
  3. When a deletion request arrives, destroy the encryption key

The personal data becomes cryptographically unrecoverable without physical file deletion. This approach is faster but requires encryption infrastructure:

-- Pseudocode: encrypt personal data at write time
INSERT INTO users 
SELECT 
  user_id,
  ENCRYPT(name, get_user_key(user_id)),
  ENCRYPT(email, get_user_key(user_id)),
  non_personal_column
FROM staging_users;

-- To "delete": destroy the key (no data rewrite needed)
CALL key_management.delete_key(user_id => 42);

Crypto-shredding is compliant with most regulations and eliminates the need for the three-step pipeline. However, it adds encryption/decryption overhead to every read and write operation.
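To make the mechanic concrete, here is a toy, self-contained sketch: a per-user random pad stands in for a real KMS-managed key, and destroying it makes the ciphertext unrecoverable. This illustrates the pattern only; it is not production cryptography or a real KMS API.

```python
import os

class ToyKeyStore:
    """In-memory stand-in for a real key management system."""
    def __init__(self) -> None:
        self._keys: dict[int, bytes] = {}

    def key_for(self, user_id: int, length: int) -> bytes:
        # One random pad per user; this sketch assumes one value per user
        return self._keys.setdefault(user_id, os.urandom(length))

    def shred(self, user_id: int) -> None:
        """Destroy the key: existing ciphertext becomes unrecoverable."""
        self._keys.pop(user_id, None)

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy one-time-pad encrypt/decrypt (the same operation both ways)."""
    return bytes(a ^ b for a, b in zip(data, key))
```

The ciphertext written to the lake never has to move; once shred(user_id) runs, no key exists to reverse it.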

Data Lineage for Compliance Audits

Iceberg's metadata provides a natural audit trail for GDPR compliance. The snapshot log records:

  • When the deletion occurred (snapshot timestamp)
  • What operation was performed (DELETE operation in snapshot summary)
  • How many rows were affected (rows_deleted in snapshot summary)
  • Which files were modified (manifest diff between snapshots)

This metadata satisfies auditor requirements for demonstrating that deletion requests were processed correctly and within the required timeframe.
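Those fields can be pulled straight from the snapshots metadata table (Spark naming; the summary keys shown are the commonly emitted ones):

```sql
-- Audit trail: when each deletion committed and how many rows it removed
SELECT
  committed_at,
  operation,
  summary['deleted-records']    AS rows_deleted,
  summary['deleted-data-files'] AS data_files_removed
FROM catalog.db.users.snapshots
ORDER BY committed_at DESC;
```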

Frequently Asked Questions

How do I prove data was physically deleted?

After completing all three steps (DELETE → Compact → Expire), verify:

  1. Query with time travel to the pre-deletion snapshot, it should fail (snapshot expired)
  2. List files in the table's data directory, old files should be absent
  3. Check the snapshot log for the DELETE operation timestamp and summary
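In Spark SQL, the first two checks might look like this (the identifiers and timestamp are placeholders):

```sql
-- 1. Time travel to a pre-deletion point should now error out
SELECT * FROM catalog.db.users TIMESTAMP AS OF '2024-03-14 00:00:00';

-- 2. The files metadata table should list only post-compaction files
SELECT file_path FROM catalog.db.users.files;
```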

What about data in Dremio Reflections?

When source data is deleted and Reflections are subsequently refreshed, the Reflection data is updated to exclude the deleted rows. Dremio handles this automatically; no additional steps are required.

Can I delete data from specific partitions only?

Yes, but the DELETE statement should target the specific rows, not partitions. Iceberg's hidden partitioning keeps the DELETE efficient; only data files in the relevant partitions are scanned.

How does GDPR deletion interact with rollback?

After snapshot expiry, rollback to pre-deletion snapshots is impossible; this is by design. GDPR requires irreversible deletion, and snapshot expiry ensures it. Always complete the full three-step pipeline before considering the deletion finalized.

How do I handle GDPR "right to be forgotten" requests at scale?

Use Iceberg's row-level delete operations with merge-on-read mode for efficient deletions across large tables. Create a dedicated deletion pipeline that processes batch deletion requests on a Nessie branch, validates the deletions, then merges to main. After merging, run snapshot expiry and orphan file cleanup to ensure deleted data is physically removed from storage and cannot be recovered through time travel.



Legacy Copy

Time travel is a fantastic feature that’s available when you adopt a data lakehouse table format like Apache Iceberg, but sometimes this useful feature runs contrary to regulatory requirements such as GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act). This is true of other systems that offer time travel as well, such as Snowflake.

This is an issue because regulations often require you to ensure that you no longer have access to protected consumer information. Typically, this can be done by merely deleting the data in question, but with systems that provide time travel, deletions are soft and the data still exists if you use time travel to query data at a previous point in time.

The Apache Iceberg table format, however, offers several ways to manage tables so you can ensure compliance with the proper regulations and guarantee that data that needs to be forgotten is forgotten.

Method #1: Hard Deletes with Copy-on-Write

With Copy-on-Write, managing hard deletes is straightforward. It just takes one step:

  1. Expire snapshots older than X days.

Since the data files are rewritten on every update or delete, newer snapshots will refer to files that don’t have the deleted records. So expiring snapshots past your desired length of retention will clean up data files with any deleted records since they are not referenced in your newer snapshots.

-- if your table uses copy-on-write
ALTER TABLE catalog.db.users SET TBLPROPERTIES (
  'write.delete.mode'='copy-on-write',
  'write.update.mode'='copy-on-write',
  'write.merge.mode'='copy-on-write'
)

-- To hard-delete files, just expire snapshots prior
-- to the deletion of records
CALL catalog.system.expire_snapshots(
  table => 'db.users',
  older_than => TIMESTAMP '2022-06-30 00:00:00.000',
  retain_last => 10
)

Method #2: Hard Deletes with Merge-on-Read

Merge-on-Read is a little more complicated because newer snapshots where records were deleted may still reference data files containing those records; the deletions are reconciled at read time using delete files. Expiring snapshots alone may not remove the desired data files in this situation. So the proper flow to ensure hard deletions is:

  1. Run compaction to reconcile all delete files into new data files periodically.
  2. Expire snapshots older than X days.

Since all snapshots from before the compaction job are expired, the data files with those deleted records can now be deleted. 

-- if your table uses merge-on-read
ALTER TABLE catalog.db.users SET TBLPROPERTIES (
  'write.delete.mode'='merge-on-read',
  'write.update.mode'='merge-on-read',
  'write.merge.mode'='merge-on-read'
)


-- Run compaction to eliminate delete files
CALL catalog.system.rewrite_data_files(
  table => 'db.users',
  strategy => 'sort',
  sort_order => 'zorder(role, state)'
)

-- To hard delete files expire all prior snapshots
CALL catalog.system.expire_snapshots(
  table => 'db.users',
  older_than => TIMESTAMP '2022-06-30 00:00:00.000',
  retain_last => 1
)

That said, if you run compaction regularly, you can develop a cadence that deletes files when needed while still leaving some snapshots for time travel.

For example, let’s assume weekly compaction jobs with hard deletes require data deleted for 30 days or more. At the end of each month, you can run an expiration of all snapshots before the most recent four compaction jobs, which still leaves you with a month's worth of snapshots for time travel while hard deleting the files that need to be hard deleted.
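That cadence boils down to a cutoff calculation. A small illustrative helper (the function and its defaults are assumptions matching the example above, not an Iceberg API):

```python
from datetime import datetime, timedelta

def expiry_cutoff(now: datetime,
                  compactions_to_keep: int = 4,
                  compaction_interval: timedelta = timedelta(weeks=1)) -> datetime:
    """Expire snapshots older than the Nth most recent weekly compaction,
    leaving roughly a month of snapshots for time travel."""
    return now - compactions_to_keep * compaction_interval
```

Pass the result as the older_than argument to expire_snapshots at the end of each month.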

Method #3: Use Key Rotation

Another option that can complement the previous methods or be used on its own is to control access to the files using key rotation. 

This means you may use different keys to encrypt your files based on where you store them. For example, say you change the encryption key each month. If you want to eliminate access to data older than three months, you can drop the older keys, rendering those files inaccessible even though they still physically exist.

Implementation differs depending on what vendor handles your file storage or how intricate you want the process to be. You may already be implementing some level of key rotation to comply with PCI DSS regulations.

Other Regulatory Best Practices

Here are some other considerations as you design your compliance plan:

  • Don’t commingle protected data with unprotected data in the same table, so you don’t lose data unnecessarily when deleting a row of protected data (e.g., put email and phone numbers in a different table than data that denotes user application preferences or activity).
  • Delete the data in batches. Instead of deleting as you go, set a flag on records that need to be deleted, then run a batch delete operation on all flagged records. The flag can be an insert into a separate table that tracks delete requests, which also lets you record metadata around the deletes (time of the request, time of actual deletion, and so forth). The benefit is that batching clusters all the deletes into fewer snapshots, making it easier to predict which files will be cleaned up when expiring snapshots than if the deletes occurred across many snapshots.

Conclusion

Regulatory requirements can make data engineering more complex than it otherwise would be. Fortunately, whether you need hard deletes to occur on a Copy-on-Write or Merge-on-Read table or instead use a key rotation strategy, there are techniques to get all the benefits of time travel while still maintaining compliance.
