Apache Iceberg 101 – Your Guide to Learning Apache Iceberg Concepts and Practices
Apache Iceberg is an open-source data lakehouse table format that has seen rapid adoption across the big data analytics world.
In this article, you’ll find a 101 video course along with an aggregation of all the resources you’ll need to get up to speed on Apache Iceberg in concept and practice.
The Apache Iceberg 101 Course
Below is a series of videos that introduce Apache Iceberg and show how to use Iceberg tables in your data platform. After the course you’ll find an index of resources from around the web to continue expanding your Iceberg knowledge.
- Introduction to the course
- The Problem and the Solution (Iceberg’s Origin Story)
- Iceberg and the Data Lakehouse
- Overview of Apache Iceberg’s Architecture
- Iceberg Transactions Step by Step
- Iceberg Catalogs
- Copy-on-write and Merge-on-read
- Table Tuning with Table Properties
- Migrating to Iceberg
- Time-Travel
- Maintaining Iceberg Tables
- Hard-Deletions and GDPR
Directory of Additional Iceberg Resources
After watching the series of videos above you should have a pretty good understanding of Apache Iceberg and the concepts around it.
Below is a list of additional resources for learning more about Apache Iceberg, including hands-on exercises and articles from companies detailing how they use it.
Apache Iceberg Core Concepts
Below are several resources for understanding what Apache Iceberg is and how it works at a conceptual level.
- [Blog] Apache Iceberg: An Architectural Look Under the Covers
- [Webinar] Apache Iceberg: An Architectural Look Under the Covers
- [Blog] Life of a Write Query
- [Blog] Life of a Read Query
- [Blog] How to Migrate a Hive Table to an Iceberg Table
Apache Iceberg Features
Below are resources to learn more about the many features of Apache Iceberg.
- [Blog] Fewer Accidental Full Table Scans Brought to You by Apache Iceberg’s Hidden Partitioning
- [Blog] Partition Evolution
- [Blog] Table Evolution in Apache Iceberg
- [Blog] Apache Iceberg Top 5 Features
- [Blog] Puffins and Icebergs: Additional Stats for Apache Iceberg Tables
- [Docs] Table Evolution
- [Docs] Fast Scan Planning
- [Docs] Reliability/Correctness
- [Blog] Time Travel with Dremio and Apache Iceberg
Hands-on Apache Iceberg Exercises
The resources below walk you through exercises and tutorials for trying Apache Iceberg with different tools.
- [Blog] Hands-On Introduction to Apache Iceberg - Data Lakehouse Engineering
- [Blog] Introduction to Apache Iceberg with Spark
- [Blog] Configuring Spark for Apache Iceberg
- [Blog] Managing Data as Code with Dremio Arctic – Easily ensure data quality in your data lakehouse
- [Blog] Multi-Table Transactions on the Lakehouse – Enabled by Dremio Arctic
- [Blog] A Notebook for getting started with Project Nessie, Apache Iceberg, and Apache Spark
- [Blog] Managing Data as Code with Dremio Arctic: Support Machine Learning Experimentation in Your Data Lakehouse
- [Blog] A Hands-On Look at the Structure of an Apache Iceberg Table
- [Blog] Streaming Data into Apache Iceberg Tables Using AWS Kinesis and AWS Glue
- [Blog] Hands-on exercise migrating Hive tables to Apache Iceberg
- [Blog] Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg!
- [Blog] Using Spark in EMR with Apache Iceberg
- [Blog] Deep Dive into Apache Iceberg via Apache Zeppelin
- [Blog] Real-time ingestion to Iceberg with Kafka Connect — Apache Iceberg Sink
- [Blog] Getting Started with Apache Iceberg Using AWS Glue and Dremio
- [Blog] Getting Started with Apache Iceberg in Databricks
- [Video] Real-time ingested historical feature store with Iceberg, Feast and Yummy
- [Markdown] How to setup a docker container Spark/Notebook environment for Iceberg Practice
- [Markdown] PySpark Configurations for each Iceberg Catalog
- [Video] Setting up a Spark/Notebook Environment for Iceberg Practice
Iceberg Video Demos
Below are videos showing hands-on use of Apache Iceberg tables.
- [Video] How to quickly get started with Apache Iceberg tables in Dremio Cloud
- [Video] Dremio Cloud and Apache Iceberg - Iceberg Catalogs
- [Video] Accessing a Dremio Arctic Catalog from Spark
- [Video] Dremio & Apache Iceberg - DML on Iceberg tables with Dremio (Upsert Example)
- [Video] Dremio & Apache Iceberg - Converting JSON & CSV files to Iceberg Tables
- [Video] Dremio & Apache Iceberg - How to Query Iceberg Metadata from Dremio
Comparison of Apache Iceberg to Other Table Formats
The resources below cover how Apache Iceberg compares to other table formats.
- [Blog] Comparison of Data Lake Table Formats (Iceberg, Hudi and Delta Lake)
- [Blog] Table Format Governance and Community Contributions: Apache Iceberg, Apache Hudi, and Delta Lake
- [Blog] Table Format Partitioning Comparison: Apache Iceberg, Apache Hudi, and Delta Lake
- [Meetup] Comparison of Data Lakehouse Table Formats
- [Blog] Open Source Data Lake Table Formats: Evaluating Current Interest and Rate of Adoption
Companies Sharing Their Production Apache Iceberg Usage
Below are articles from companies that have documented their deployment of Apache Iceberg into production. You can read about their experiences and lessons learned.
- [Blog] Iceberg at Adobe
- [Blog] Migrating to Apache Iceberg at Adobe Experience Platform
- [Podcast] Shopify and Ingesting Data into Iceberg Tables
- [Video] Spark and Iceberg at Apple's Scale - Leveraging differential files for efficient upserts and deletes
- [Video] Adopting Apache Iceberg on LINE Data Platform - 2021 English version
Optimizing and Maintaining Apache Iceberg Tables
Once you have Apache Iceberg tables in place, you’ll want to optimize and maintain them. Below are articles that walk through different features for tuning tables for the best performance.
- [Blog] Compaction in Apache Iceberg: Fine-Tuning Your Iceberg Table’s Data Files
- [Blog] Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg
- [Blog] Maintaining Iceberg Tables – Compaction, Expiring Snapshots, and More
- [Blog] How Z-Ordering in Apache Iceberg Helps Improve Performance
- [Blog] Apache Iceberg and the Right to Be Forgotten
- [Docs] Table Maintenance
Ingesting Data into Apache Iceberg Tables
How do we get data into our Iceberg tables? The following articles cover ingesting data into Iceberg tables from different sources.
- [Docs] Spark Structured Streaming
- [Docs] Writing to Iceberg from Spark
- [Docs] Flink Streaming
- [Blog] Real-time ingestion to Iceberg with Kafka Connect - Apache Iceberg Sink
- [Blog] Flink + Iceberg: How to Construct a Whole-scenario Real-time Data Warehouse
- [Blog] How to Migrate a Hive Table to an Iceberg Table
- [Blog] Hands-on exercise migrating Hive tables to Apache Iceberg
- [Docs] Using Iceberg with Google Dataproc
- [Blog] How to Analyze CDC Data in Iceberg Data Lake Using Flink
Working with Cloud Object Storage
Object storage has become the standard for storing data in a data lakehouse, and the resources below cover Apache Iceberg in the context of cloud object storage.
- [Blog] Apache Iceberg and Cloud Object Storage
- [Blog] Using Iceberg’s S3FileIO Implementation To Store Your Data In MinIO
- [Podcast] Apache Iceberg and Object Storage
The Java and Python APIs
Below are articles and documentation on Apache Iceberg’s Java and Python APIs.
- [Blog] An Introduction To The Iceberg Java API - Part 1
- [Blog] An Introduction to the Iceberg Java API Part 2 - Table Scans
- [Docs] Java API
- [Docs] Python API
Streaming with Apache Iceberg
Streaming data raises many considerations that don’t exist in batch processing. Below are resources on using Apache Iceberg with streaming data.
- [Video] Streaming Event-Time Partitioning With Apache Flink and Apache Iceberg - Julia Bennett
- [Video] MEETUP: Apple employees discuss; Streaming from Iceberg Data Lake & Multi Cluster Kafka Source
- [Video] Backfill Flink Data Pipelines with Iceberg Connector
- [Docs] Spark Structured Streaming
- [Docs] Flink Streaming
Miscellaneous Blog Articles
Here is a list of other great Apache Iceberg articles you can learn from.
- [Blog] Integrated Audits: Streamlined Data Observability With Apache Iceberg
- [Blog] Iceberg FileIO: Cloud Native Tables
- [Blog] Using Flink CDC to synchronize data from MySQL sharding tables and build real-time data lake
- [Blog] Metadata Indexing in Iceberg
- [Blog] Using Debezium to Create a Data Lake with Apache Iceberg
- [Blog] High Throughput Ingestion with Iceberg
- [Blog] FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format
- [Blog] Taking Query Optimizations to the Next Level with Iceberg
- [Video] Open Source and the Data Lakehouse - presented at Tampa Bay Data Engineering Group
Iceberg Subsurface Conference Talks
Here is a list of Subsurface conference talks on Apache Iceberg.
- What can Iceberg do for you?
- The Write-Audit-Publish Pattern via Apache Iceberg
- Streaming from an Apache Iceberg Data Lake
- Tuning Row-Level operations in Apache Iceberg
- An Open Data Architecture in Action with Apache Iceberg
- Lessons Learned from running Apache Iceberg at Petabyte Scale
- Why and How Netflix created and migrated to a new table format
- Iceberg Case Studies
- Enabling Analysts to build a lakehouse with SparkSQL and Iceberg
- Iceberg at Adobe: Challenges, Lessons and Achievements
- Hiveberg: Integrating Apache Iceberg with the Hive Metastore
- Deep Dive into Apache Iceberg SQL Extensions
- Lessons Learned making open table formats Enterprise Ready
- Building a Historical Financial Data Lake at Bloomberg
- Unsolved Challenges in Data Infrastructure