March 1, 2023

12:20 pm - 12:50 pm PST

Rewriting History: Migrating Petabytes of Data to Apache Iceberg Using Trino

Dataset interoperability between data platform components remains a difficult hurdle, and it often results in siloed data and frustrated users. Although open table formats like Apache Iceberg aim to break down these silos by providing a consistent and scalable table abstraction, migrating a pre-existing data archive to a new format can still be daunting. This talk will outline the challenges we faced when rewriting petabytes of Shopify’s data into the Iceberg table format using the Trino engine. Because this landscape is evolving rapidly, I will highlight recent contributions to Trino’s Iceberg integration that made our work possible, and illustrate how we designed our system to scale. Topics will include what to consider when designing your migration strategy, how we optimized Trino’s write performance, and how to recover from corrupt table states. Finally, we will compare the query performance of the old and migrated datasets, using Shopify’s datasets as benchmarks.

Session Id: BO405

Topics Covered

Open Source

