Get Started Free
No time limit - totally free - just the way you like it.Sign Up Now
Time travel is a fantastic feature that’s available when you adopt a data lakehouse table format like Apache Iceberg, but sometimes this useful feature runs contrary to regulatory requirements such as GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act). This is true of other systems that offer time travel as well, such as Snowflake.
This is an issue because regulations often require you to ensure that you no longer have access to protected consumer information. Typically, this can be done by merely deleting the data in question, but with systems that provide time travel, deletions are soft and the data still exists if you use time travel to query data at a previous point in time.
Apache Iceberg table format, however, offers several ways to manage tables so you can ensure compliance with the proper regulations, and guarantee data that needs to be forgotten is forgotten.
With Copy-on-Write, managing hard deletes is straightforward. It just takes one step:
Since the data files are rewritten on every update or delete, newer snapshots will refer to files that don’t have the deleted records. So expiring snapshots past your desired length of retention will clean up data files with any deleted records since they are not referenced in your newer snapshots.
-- if your table uses copy-on-write ALTER TABLE catalog.db.users SET TBLPROPERTIES ( 'write.delete.mode'='copy-on-write', 'write.update.mode'='copy-on-write', 'write.merge.mode'='copy-on-write' ) -- To hard delete files just expire snapshots prior -- To the deletion of records CALL catalog.system.expire_snapshots( table => 'db.users', older_than => TIMESTAMP '2022-06-30 00:00:00.000', retain_last => 10 )
Merge-on-Read is a little more complicated because newer snapshots where records are deleted may still refer to data files with deleted records. These records are reconciled using delete files that list deleted records. Expiring snapshots alone may not delete the desired data files in this situation. So the proper flow to ensure hard deletions would be:
Since all snapshots from before the compaction job are expired, the data files with those deleted records can now be deleted.
-- if your table uses merge-on-read ALTER TABLE catalog.db.users SET TBLPROPERTIES ( 'write.delete.mode'='merge-on-read', 'write.update.mode'='merge-on-read', 'write.merge.mode'='merge-on-read' ) -- Run compaction to eliminate delete files CALL catalog.system.rewrite_data_files( table => 'db.users', strategy => 'sort', sort_order => 'zorder(role, state)' ) -- To hard delete files expire all prior snapshots CALL catalog.system.expire_snapshots( table => 'db.users', older_than => TIMESTAMP '2022-06-30 00:00:00.000', retain_last => 1 )
Although, if you run compaction regularly, you should be able to develop a cadence that’ll delete files when needed, while leaving some snapshots for time travel.
For example, let’s assume weekly compaction jobs with hard deletes require data deleted for 30 days or more. At the end of each month, you can run an expiration of all snapshots before the most recent four compaction jobs, which still leaves you with a month's worth of snapshots for time travel while hard deleting the files that need to be hard deleted.
Another option that can complement the previous methods or be used on its own is to control access to the files using key rotation.
This means you may use different keys to encrypt your files based on where you store them. For example, let’s say you change the encryption key each month. If you want to eliminate access to data before three months old you can drop older keys rendering those older files inaccessible even though they still physically exist.
Implementation differs depending on what vendor handles your file storage or how intricate you want the process to be. You may already be implementing some level of key rotation to comply with PCI DSS regulations.
Here are some other considerations as you design your compliance plan:
Regulatory requirements can make data engineering more complex than it otherwise would be. Fortunately, whether you need hard deletes to occur on a Copy-on-Write or Merge-on-Read table or instead use a key rotation strategy, there are techniques to get all the benefits of time travel while still maintaining compliance.