39 minute read · September 12, 2022
Apache Iceberg 101 – Your Guide to Learning Apache Iceberg Concepts and Practices
· Senior Tech Evangelist, Dremio
Apache Iceberg is an open-source data lakehouse table format that has taken the big data analytics world by storm.
In this article, you’ll find a 101 video course along with an aggregation of all the resources you’ll need to get up to speed on Apache Iceberg in concept and practice.
What's a Data Lakehouse?
The Apache Iceberg 101 Course
Below are videos to educate you about Apache Iceberg and how to use Iceberg tables to enhance your data experience. After the course, you’ll find an index of resources from around the web to continue expanding your Iceberg knowledge.
- Introduction to the course
- The Problem and the Solution (Iceberg’s Origin Story)
- Iceberg and the Data Lakehouse
- Overview of Apache Iceberg’s Architecture
- Iceberg Transactions Step by Step
- Iceberg Catalogs
- Copy-on-write and Merge-on-read
- Table Tuning with Table Properties
- Migrating to Iceberg
- Time-Travel
- Maintaining Iceberg Tables
- Hard-Deletions and GDPR
Tutorial: Get Hands-On with Iceberg on Your Laptop
Additional hands-on tutorials you can do from your laptop:
- Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop
- From SQLServer -> Apache Iceberg -> BI Dashboard
- From MongoDB -> Apache Iceberg -> BI Dashboard
- From Postgres -> Apache Iceberg -> BI Dashboard
- From MySQL -> Apache Iceberg -> BI Dashboard
- From Elasticsearch -> Apache Iceberg -> BI Dashboard
- From Kafka -> Apache Iceberg -> Dremio
Tutorial Videos: Apache Iceberg Lakehouse Engineering
Next Step Guides
- What is a Data Lakehouse Table Format?
- Comparison of Lakehouse Solutions (Iceberg, Hudi, Paimon, Delta Lake)
- Guide to Migration to an Apache Iceberg Lakehouse
- Guide to Maintaining an Apache Iceberg Lakehouse
Directory of Additional Iceberg Resources
After watching the series of videos above, you should have a pretty good understanding of Apache Iceberg and its concepts.
Below is a list of additional resources for learning more about Apache Iceberg, including hands-on exercises, articles from companies detailing their usage of Apache Iceberg, and more.
Apache Iceberg Core Concepts
Below are several resources for understanding what Apache Iceberg is and how it works at a conceptual level.
- [Blog] Apache Iceberg: An Architectural Look Under the Covers
- [Webinar] Apache Iceberg: An Architectural Look Under the Covers
- [Blog] Life of a Write Query
- [Blog] Life of a Read Query
- [Blog] How to Migrate a Hive Table to an Iceberg Table
- [Blog] How Iceberg is Designed for Optimized Performance
- [Blog] The Evolution of Apache Iceberg Catalogs
- [Video] Apache Iceberg Migration Whiteboard Style Overview
- [Video] Apache Iceberg Overview Whiteboard Style
Apache Iceberg Features
Below are resources to learn more about the many features of Apache Iceberg.
- [Blog] Fewer Accidental Full Table Scans Brought to You by Apache Iceberg’s Hidden Partitioning
- [Blog] Partition Evolution
- [Blog] Table Evolution in Apache Iceberg
- [Blog] Apache Iceberg Top 5 Features
- [Blog] Puffins and Icebergs: Additional Stats for Apache Iceberg Tables
- [Docs] Table Evolution
- [Docs] Fast Scan Planning
- [Docs] Reliability/Correctness
- [Blog] Time Travel with Dremio and Apache Iceberg
- [Video] COPY INTO and ROLLBACK for Iceberg tables on Dremio
- [Blog] Dealing with Data Incidents Using the Rollback Feature in Apache Iceberg
- [Blog] Partition and File Pruning for Dremio’s Apache Iceberg-backed Reflections
- [Blog] Exploring Branch & Tags in Apache Iceberg using Spark
- [Blog] Streamlining Data Quality in Apache Iceberg with write-audit-publish & branching
Hands-on Apache Iceberg Exercises
The resources below walk you through exercises and tutorials for trying Apache Iceberg with different tools.
- [Blog] Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop
- [Blog] BI Dashboards with Apache Iceberg Using AWS Glue and Apache Superset
- [Blog] Run Graph Queries on Apache Iceberg Tables with Dremio & Puppygraph
- [Blog] Streaming and Batch Data Lakehouses with Apache Iceberg, Dremio and Upsolver
- [Blog] End-to-End Apache Iceberg Data Lakehouse Engineering with Spark, Nessie and Dremio
- [Blog] Hands-On Introduction to Apache Iceberg - Data Lakehouse Engineering
- [Blog] Introduction to Apache Iceberg with Spark
- [Blog] Configuring Spark for Apache Iceberg
- [Blog] Managing Data as Code with Dremio Arctic – Easily ensure data quality in your data lakehouse
- [Blog] Multi-Table Transactions on the Lakehouse – Enabled by Dremio Arctic
- [Blog] A Notebook for getting started with Project Nessie, Apache Iceberg, and Apache Spark
- [Blog] Managing Data as Code with Dremio Arctic: Support Machine Learning Experimentation in Your Data Lakehouse
- [Blog] A Hands-On Look at the Structure of an Apache Iceberg Table
- [Blog] Streaming Data into Apache Iceberg Tables Using AWS Kinesis and AWS Glue
- [Blog] Hands-on exercise migrating Hive tables to Apache Iceberg
- [Blog] Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg!
- [Blog] Using Spark in EMR with Apache Iceberg
- [Blog] Deep Dive into Apache Iceberg via Apache Zeppelin
- [Blog] Real-time ingestion to Iceberg with Kafka Connect — Apache Iceberg Sink
- [Blog] Getting Started with Apache Iceberg Using AWS Glue and Dremio
- [Blog] Getting Started with Apache Iceberg in Databricks
- [Video] Real-time ingested historical feature store with Iceberg, Feast and Yummy
- [Markdown] How to setup a docker container Spark/Notebook environment for Iceberg Practice
- [Markdown] PySpark Configurations for each Iceberg Catalog
- [Video] Setting up a Spark/Notebook Environment for Iceberg Practice
- [Blog] Deep Dive Into Configuring Your Apache Iceberg Catalog with Apache Spark
A Unified Apache Iceberg Lakehouse
- Value Prop I: Unified Analytics
- Value Prop II: Performance
- Value Prop III: Self-Service
- 3 Reasons to have a Hybrid Apache Iceberg Lakehouse
Apache Iceberg and BI Dashboards
- [Blog] Connecting Tableau to Apache Iceberg Tables with Dremio
- [Blog] 5 Easy Steps to Migrate an Apache Superset Dashboard to Your Lakehouse
Iceberg Video Demos
Videos showing hands-on use of Apache Iceberg tables.
- [Video] How to quickly get started with Apache Iceberg tables in Dremio Cloud
- [Video] Dremio Cloud and Apache Iceberg - Iceberg Catalogs
- [Video] Accessing a Dremio Arctic Catalog from Spark
- [Video] Dremio & Apache Iceberg - DML on Iceberg tables with Dremio (Upsert Example)
- [Video] Dremio & Apache Iceberg - Converting JSON & CSV files to Iceberg Tables
- [Video] Dremio & Apache Iceberg - How to Query Iceberg Metadata from Dremio
- [Video] Iceberg Automated Optimization
- [Video] pyIceberg + AWS Glue + Apache Iceberg
- [Video] Ingesting Data into Iceberg/Nessie with Flink
Comparison of Apache Iceberg to Other Table Formats
The resources below cover how Apache Iceberg compares to other table formats.
- [Blog] Exploring the Architecture of Apache Iceberg, Delta Lake and Apache Hudi
- [Blog] Comparison of Data Lake Table Formats (Iceberg, Hudi and Delta Lake)
- [Blog] Table Format Governance and Community Contributions: Apache Iceberg, Apache Hudi, and Delta Lake
- [Blog] Table Format Partitioning Comparison: Apache Iceberg, Apache Hudi, and Delta Lake
- [Meetup] Comparison of Data Lakehouse Table Formats
- [Blog] Open Source Data Lake Table Formats: Evaluating Current Interest and Rate of Adoption
- [Blog] Iceberg and Hudi ACID Guarantees
Companies Sharing Their Production Apache Iceberg Usage
Below are articles from companies that have documented their deployment of Apache Iceberg into production. You can read about their experiences and lessons learned.
- [Blog] Iceberg at Adobe
- [Blog] Migrating to Apache Iceberg at Adobe Experience Platform
- [Podcast] Shopify and Ingesting Data into Iceberg Tables
- [Video] Spark and Iceberg at Apple's Scale - Leveraging differential files for efficient upserts and deletes
- [Video] Adopting Apache Iceberg on LINE Data Platform - 2021 English version
Optimizing and Maintaining Apache Iceberg Tables
Once you have Apache Iceberg tables in place, you’ll want to optimize and maintain them. Below are articles that walk through different features for engineering tables for best performance.
- [Blog] Compaction in Apache Iceberg: Fine-Tuning Your Iceberg Table’s Data Files
- [Blog] Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg
- [Blog] Maintaining Iceberg Tables – Compaction, Expiring Snapshots, and More
- [Blog] How Z-Ordering in Apache Iceberg Helps Improve Performance
- [Blog] Apache Iceberg and the Right to Be Forgotten
- [Blog] How not to use Apache Iceberg !
- [Docs] Table Maintenance
Ingesting Data into Apache Iceberg Tables
How do we get data into our Iceberg tables? The following articles cover ingesting data into Iceberg tables from different sources.
- [Docs] Spark Structured Streaming
- [Docs] Writing to Iceberg from Spark
- [Docs] Flink Streaming
- [Blog] Real-time ingestion to Iceberg with Kafka Connect - Apache Iceberg Sink
- [Blog] Flink + Iceberg: How to Construct a Whole-scenario Real-time Data Warehouse
- [Blog] How to Migrate a Hive Table to an Iceberg Table
- [Blog] Hands-on exercise migrating Hive tables to Apache Iceberg
- [Docs] Using Iceberg with Google Dataproc
- [Blog] How to Analyze CDC Data in Iceberg Data Lake Using Flink
- [Video] Ingesting Data into Iceberg with Fivetran and Querying it With Dremio
- [Blog] Building Your Data Lakehouse Just Got a Whole Lot Easier with Dremio & Fivetran
- [Blog] How to Convert CSV Files into an Apache Iceberg table with Dremio
- [Blog] 3 Ways to Convert a Delta Lake Table Into an Apache Iceberg Table
- [Blog] How to Convert JSON Files Into an Apache Iceberg Table with Dremio
- [Blog] Build your open data lakehouse on Apache Iceberg tables with Dremio and Fivetran
- [Blog] Building a Data Lakehouse with Dremio, Airbyte, S3 and Apache Iceberg
- [Video] Ingesting Data in Apache Iceberg with Airbyte OSS and querying with Dremio
Working with Cloud Object Storage
Object storage has become the standard for storing data in a data lakehouse and the resources below highlight Apache Iceberg in the context of cloud object storage.
- [Blog] Apache Iceberg and Cloud Object Storage
- [Blog] Using Iceberg’s S3FileIO Implementation To Store Your Data In MinIO
- [Podcast] Apache Iceberg and Object Storage
The Java and Python APIs
Below are articles on Apache Iceberg’s Java and Python APIs.
- [Blog] An Introduction To The Iceberg Java API - Part 1
- [Blog] An Introduction to the Iceberg Java API Part 2 - Table Scans
- [Docs] Java API
- [Docs] Python API
- [Blog] 3 Ways to Use Python with Apache Iceberg
Streaming with Apache Iceberg
Streaming data can require many considerations that don’t exist in batch processing. Below are resources on using Apache Iceberg with streaming data.
- [Blog] Getting Started with Apache Flink, Apache Iceberg and Nessie Tutorial
- [Blog] Getting Started with Flink SQL and Apache Iceberg
- [Video] Streaming Event-Time Partitioning With Apache Flink and Apache Iceberg - Julia Bennett
- [Video] MEETUP: Apple employees discuss; Streaming from Iceberg Data Lake & Multi Cluster Kafka Source
- [Video] Backfill Flink Data Pipelines with Iceberg Connector
- [Docs] Spark Structured Streaming
- [Docs] Flink Streaming
- [Blog] Apache Iceberg Sync for Apache Kafka
- [Blog] Streaming Event Data to Iceberg with Kafka Connect
Data as Code
Take your Apache Iceberg tables to the next level with Project Nessie/Dremio Arctic catalog, which allows you to create catalog-level branches for isolating ETL, catalog rollback, multi-table transactions, and more. Here are some talks and blogs on the subject.
- [Video] Managing your data with data as code (Presentation and Demo)
- [Video] What is project Nessie
- [Video] Demonstrating Data as Code
- [Blog] What is Nessie and Why as a Data Engineer or Architect you should care?
- [Blog] Resources for Learning more about Catalog level versioning with Project Nessie & Dremio Arctic (Rollbacks, Branching, Tagging and Multi-Table Txns)
- [Video] Where DataOps and Data Lakehouses Converge
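To make the idea of catalog-level branching concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Project Nessie or Dremio Arctic API: the `ToyCatalog` class, its methods, and the snapshot ids are all hypothetical names chosen for illustration, and the merge simply overwrites the target branch without any conflict detection. It only shows why a branch isolates in-progress ETL work and why merging it publishes changes to several tables at once.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical in-memory model of catalog-level branching (not the Nessie API)."""

    # Each branch maps table names to the snapshot id that table currently points at.
    branches: dict = field(default_factory=lambda: {"main": {}})

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's table pointers.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, changes):
        # Apply pointer updates for one or more tables to a single branch.
        self.branches[branch] = {**self.branches[branch], **changes}

    def merge(self, source, target="main"):
        # Naive merge: copy every table pointer from source to target in one step,
        # so a multi-table change becomes visible on the target all at once.
        self.branches[target] = dict(self.branches[source])


catalog = ToyCatalog()
catalog.commit("main", {"orders": "snap-1", "customers": "snap-1"})

# Run ETL on an isolated branch; readers of "main" are unaffected.
catalog.create_branch("etl")
catalog.commit("etl", {"orders": "snap-2", "customers": "snap-2"})
isolated = catalog.branches["main"]["orders"]

# Publish both table updates together.
catalog.merge("etl")
published = catalog.branches["main"]
```

In this sketch, `isolated` holds the snapshot id that `main` pointed to while the ETL branch was being modified, and `published` holds the table pointers on `main` after the merge. A real catalog would also track commit history, which is what makes rollback possible.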
Apache Iceberg Office Hours
Recordings of Apache Iceberg Office Hours, held as part of the Gnarly Data Waves podcast.
- Office Hours #1 (December 7th, 2022)
- Office Hours #2 (February 7th, 2023)
- Office Hours #3 (August 8th, 2023)
Miscellaneous Blog Articles
Here is a list of other great Apache Iceberg articles you can learn from.
- [Blog] Integrated Audits: Streamlined Data Observability With Apache Iceberg
- [Blog] Iceberg FileIO: Cloud Native Tables
- [Blog] Metadata Indexing in Iceberg
- [Blog] Using Debezium to Create a Data Lake with Apache Iceberg
- [Blog] High Throughput Ingestion with Iceberg
- [Blog] FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format
- [Blog] Taking Query Optimizations to the Next Level with Iceberg
- [Blog] Lakehouse Migration with Apache XTable
- [Video] Open Source and the Data Lakehouse - presented at Tampa Bay Data Engineering Group
Gnarly Data Waves
Episodes of the Gnarly Data Waves podcast dedicated to Apache Iceberg. Subscribe on YouTube or Spotify.
- Migrating from Delta Lake to Apache Iceberg
- Managing Your Data-as-Code
- Building your Apache Iceberg Data Lakehouse with Fivetran and Iceberg
- Optimizing your data files in Apache Iceberg
- How to Modernize your Hive Data Lakehouse with Apache Iceberg and Dremio
- Automatic Apache Iceberg Table Optimization with Dremio Arctic
- What’s New in the Apache Iceberg Project: Version 1.2.0 Updates, PyIceberg, Compute Engines
- Versioning and the Data Lakehouse
Iceberg Subsurface Conference Talks
Here is a list of Subsurface conference talks on Apache Iceberg.
- What can Iceberg do for you?
- The Write-Audit-Publish Pattern via Apache Iceberg
- Streaming from an Apache Iceberg Data Lake
- Tuning Row-Level operations in Apache Iceberg
- An Open Data Architecture in Action with Apache Iceberg
- Lessons Learned from running Apache Iceberg at Petabyte Scale
- Why and How Netflix created and migrated to a new table format
- Iceberg Case Studies
- Enabling Analysts to build a lakehouse with SparkSQL and Iceberg
- Iceberg at Adobe: Challenges, Lessons and Achievements
- Hiveberg: Integrating Apache Iceberg with the Hive Metastore
- Deep Dive into Apache Iceberg SQL Extensions
- Lessons Learned making open table formats Enterprise Ready
- Building a Historical Financial Data Lake at Bloomberg
- Unsolved Challenges in Data Infrastructure
Even more talks from Subsurface Live 2023!
- The State of Apache Iceberg
- DataOps in Action with Nessie, Iceberg and Great Expectations
- Apache Iceberg's Best Kept Secret, Metadata Tables
- How Insider's Iceberg Migration Saved them 90% on S3 Costs
- How to speed up Object Storage with Dremio and Apache Iceberg
- Managing Data Files in Apache Iceberg
- What's New In Apache Iceberg
- CI/CD on the Data Lakehouse with Apache Iceberg and Dremio Arctic
- Dive Deep on Apache Iceberg in AWS
- How to Migrate to Apache Iceberg
- Partition and File Pruning for Dremio's Apache Iceberg based Data Reflections
- The Technical Evolution of Apache Iceberg
- Migrating Petabytes of Data to Apache Iceberg
- Taming the Small Files Problem for Streaming Ingestion into Apache Iceberg
- Scaling Row Level Deletions at Pinterest
- Smart Iceberg Table Optimizer