Dremio Blog

31 minute read · April 15, 2022

How to Migrate a Hive Table to an Iceberg Table

Alex Merced Head of DevRel, Dremio

Start For Free

Copied to clipboard

How to Migrate a Hive Table to an Iceberg Table

Why Migrate from Hive to Iceberg?

Migration Approach 1: In-Place Migration (Recommended)

Migration Approach 2: Snapshot Migration (Safest)

Migration Approach 3: CTAS Migration (Full Copy)

Post-Migration Steps

Migration with Dremio

Real-World Migration: E-Commerce Platform

Common Migration Pitfalls

Dremio's Role in Hive-to-Iceberg Migration

Nessie Catalog: Git-Like Version Control for Migrated Tables

Large-Scale Migration Patterns

Migration Planning Checklist

Frequently Asked Questions

Free Resources to Continue Your Iceberg Journey

Legacy Content

Option 1: In-Place Data Migration of Hive Tables

Option 2: Shadow Migration of Hive Tables

This article has been revised and updated from its original version published in 2022 to reflect the latest Apache Iceberg migration techniques and tools.

Migrating from Hive tables to Apache Iceberg unlocks time travel, hidden partitioning, schema evolution, and dramatically faster queries through file-level pruning. The good news: you don't have to rewrite your data to migrate. Iceberg provides zero-copy migration paths that add Iceberg metadata on top of existing Parquet files.

Organizations that have migrated from Hive to Iceberg consistently report massive improvements: query planning times drop from minutes to seconds (because Iceberg eliminates slow directory listing operations), partition management becomes instant instead of requiring multi-hour rewrites, and data teams gain access to modern features like time travel and rollback that simply don't exist in Hive. For organizations already using Dremio, the migration unlocks even greater value through Dremio's Reflections, which are stored as Iceberg tables and can deliver sub-second query performance on top of the migrated data.

This guide covers three migration approaches (in-place migration, snapshot migration, and CTAS migration) with detailed trade-offs and production considerations for each.

Why Migrate from Hive to Iceberg?

For official documentation, refer to the Iceberg migration procedures.

Capability	Hive Tables	Iceberg Tables
Time travel	Not supported	Full support
Rollback	Not supported	Instant rollback
Partition evolution	Requires full rewrite	Metadata-only change
Schema evolution	Limited (add columns only)	Full (add, rename, reorder, widen)
Row-level updates	Not supported	Full COW/MOR support
Query planning speed	Slow (file listing)	Fast (manifest-based)
ACID guarantees	Limited	Full serializable isolation
Cloud storage optimization	Poor	Excellent

Migration Approach 1: In-Place Migration (Recommended)

In-place migration converts the Hive table to Iceberg without copying data. Iceberg creates metadata files that point to the existing Parquet data files:

CALL catalog.system.migrate('hive_db.orders');

What Happens

Iceberg reads the Hive Metastore to discover all partitions and files
Creates manifest files listing every existing data file with column statistics
Creates a metadata file with the table schema, partition spec, and initial snapshot
Updates the catalog entry to point to the new metadata file

Advantages

Zero data copy: No data files are written
Fast: Only metadata operations, typically completes in seconds to minutes
Preserves data layout: Existing partitioning and file organization remain

Considerations

Irreversible: The original Hive metadata is replaced
Requires compatible file formats: Data files must be Parquet or ORC
Hive partition directories: Existing directory-based partition layout is preserved

Migration Approach 2: Snapshot Migration (Safest)

Snapshot migration creates an Iceberg table that points to the same data files as the Hive table, without modifying the original Hive table:

CALL catalog.system.snapshot('hive_db.orders', 'iceberg_db.orders');

What Happens

Iceberg reads the Hive table's metadata to discover all files
Creates a new Iceberg table (iceberg_db.orders) with manifests pointing to the same data files
The original Hive table (hive_db.orders) remains unchanged

Advantages

Non-destructive: The original Hive table is untouched
Parallel operation: Both Hive and Iceberg can read the same data
Easy rollback: If migration fails, drop the Iceberg table; Hive table is unaffected

Considerations

Writes go to Iceberg only: New writes to the Iceberg table create new data files not visible to Hive
Eventual cut-over: You must eventually decommission the Hive table to avoid confusion

Migration Approach 3: CTAS Migration (Full Copy)

Create Table As Select creates a completely new Iceberg table with a full data copy:

CREATE TABLE iceberg_db.orders 
USING iceberg
PARTITIONED BY (month(order_date))
AS SELECT * FROM hive_db.orders;

When to Use CTAS

When you want to change the partition scheme during migration
When data files need to be reorganized or optimally sized
When migrating from non-Parquet formats (CSV, JSON, Avro)
When you want to apply Z-ordering during migration

Post-Migration Steps

1. Verify Data Integrity

-- Compare row counts
SELECT COUNT(*) FROM hive_db.orders;
SELECT COUNT(*) FROM iceberg_db.orders;

-- Compare a sample of rows
SELECT SUM(amount), COUNT(DISTINCT customer_id) FROM hive_db.orders;
SELECT SUM(amount), COUNT(DISTINCT customer_id) FROM iceberg_db.orders;

2. Evolve Partitions

Now that you're on Iceberg, you can evolve partitions without rewriting data:

ALTER TABLE iceberg_db.orders SET PARTITION SPEC (day(order_date), bucket(16, customer_id));

3. Run Compaction

Hive tables often have suboptimal file sizes. Run compaction to right-size files:

CALL catalog.system.rewrite_data_files(table => 'iceberg_db.orders');

4. Compute Statistics

Generate Puffin statistics for optimal query planning:

ANALYZE TABLE iceberg_db.orders COMPUTE STATISTICS FOR ALL COLUMNS;

5. Update Downstream Consumers

Update all pipelines, dashboards, and applications to point to the new Iceberg table. With Dremio, you can use virtual datasets (views) as an abstraction layer, retarget the view from the Hive source to the Iceberg source without changing downstream consumers.

Migration with Dremio

Dremio simplifies Hive-to-Iceberg migration:

Connect both sources: Register your Hive Metastore and Iceberg catalog as Dremio sources
Create Iceberg tables: Use Dremio's CTAS to create optimized Iceberg tables
Use Reflections: Create Reflections on Iceberg tables for accelerated queries
Cut over: Update views to point to Iceberg sources

Real-World Migration: E-Commerce Platform

A SaaS company migrated 200 Hive tables (50TB total) to Iceberg over three weeks:

Week 1: Snapshot migration of read-heavy dimension tables
Week 2: In-place migration of fact tables
Week 3: CTAS migration of tables needing partition changes

Results after migration:

Query planning time: 45 seconds → 2 seconds (Iceberg eliminates file listing)
Storage costs: 20% reduction after compaction right-sized files
Operational overhead: Zero table maintenance windows (partition evolution replaces full rewrites)

Common Migration Pitfalls

Pitfall 1: Not Validating Data After Migration

Always compare row counts, checksums, and sample queries between old and new tables.

Pitfall 2: Forgetting to Update ACLs

Migrating metadata doesn't migrate access controls. Ensure the new Iceberg catalog has appropriate permissions.

Pitfall 3: Leaving Hive Tables Active

After cut-over, decommission Hive tables to prevent confusion. Mark them as deprecated first, then drop after a validation period.

Dremio's Role in Hive-to-Iceberg Migration

Dremio simplifies every phase of Hive-to-Iceberg migration beyond what Spark alone can provide:

Phase 1: Pre-Migration Assessment with Dremio

Connect your existing Hive Metastore as a Dremio source. Dremio can query Hive tables natively, allowing you to profile your data before migration:

-- Profile table sizes and partition counts through Dremio
SELECT 
  COUNT(*) as total_rows,
  COUNT(DISTINCT order_date) as unique_dates,
  MIN(order_date) as earliest,
  MAX(order_date) as latest
FROM hive_source.db.orders;

This assessment helps you plan the optimal Iceberg partition strategy before migration begins.

Phase 2: Migration with Dremio CTAS

Dremio supports creating Iceberg tables directly using CTAS from any connected source, including Hive tables:

-- Create an Iceberg table from a Hive source through Dremio
CREATE TABLE nessie_catalog.db.orders AS
SELECT * FROM hive_source.db.orders;

Dremio handles the data movement, writing optimally-sized Parquet files with the partition spec you configure.

Phase 3: Post-Migration Optimization

After migration, use Dremio's built-in table maintenance commands:

-- Optimize file sizes and sort order
OPTIMIZE TABLE nessie_catalog.db.orders;

-- Clean up old snapshots
VACUUM TABLE nessie_catalog.db.orders EXPIRE SNAPSHOTS older_than = '2024-01-01 00:00:00';

These commands handle compaction, manifest optimization, and snapshot cleanup in a single operation.

Phase 4: Query Acceleration with Reflections

After migration, create Dremio Reflections on your new Iceberg tables for sub-second query performance:

Raw Reflections for frequently queried datasets with optimal sort order
Aggregation Reflections for BI dashboards with pre-computed GROUP BY results
Starflake Reflections for star schema joins that BI tools generate

Reflections are themselves stored as Iceberg tables, so they benefit from all the same performance optimizations.

Nessie Catalog: Git-Like Version Control for Migrated Tables

When migrating to Iceberg, consider using a Nessie catalog instead of a simple Hive-compatible catalog. Nessie provides Git-like branching and tagging for your entire data catalog:

-- Create a migration branch to test without affecting production
CREATE BRANCH migration_test IN nessie_catalog;

-- Switch to the branch and test migration
USE BRANCH migration_test IN nessie_catalog;
CREATE TABLE orders AS SELECT * FROM hive_source.db.orders;

-- Validate the migration
SELECT COUNT(*) FROM orders;

-- If everything looks good, merge to main
MERGE BRANCH migration_test INTO main IN nessie_catalog;

This branching model gives you a safe, reversible migration process, you can test the migration on a branch, validate results, and only merge to production when you're confident everything is correct.

Large-Scale Migration Patterns

Pattern 1: Incremental Migration by Partition

For petabyte-scale tables, migrate one partition range at a time:

-- Migrate 2024 data first
CREATE TABLE iceberg_db.orders_2024 AS
SELECT * FROM hive_db.orders WHERE order_date >= '2024-01-01';

-- Then migrate 2023, 2022, etc.
INSERT INTO iceberg_db.orders_2024
SELECT * FROM hive_db.orders WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31';

Pattern 2: Dual-Write During Transition

During the migration period, write to both Hive and Iceberg:

Set up the new Iceberg table with the desired partition spec
Configure your ETL pipeline to write to both targets
Backfill historical data from Hive to Iceberg in the background
Once backfill is complete, cut over reads to Iceberg
Decommission the Hive write path

Pattern 3: View-Based Abstraction

Use Dremio views to abstract the migration from downstream consumers:

-- Create a view that consumers use
CREATE VIEW analytics.orders AS
  SELECT * FROM iceberg_catalog.db.orders;

-- During migration, point the view at either source
-- After migration, consumers never change their queries

Migration Planning Checklist

Before starting migration, verify:

[ ] All source data files are in a compatible format (Parquet or ORC)
[ ] Target catalog is configured (Nessie, Glue, or REST catalog)
[ ] Iceberg partition strategy is planned based on actual query patterns
[ ] Downstream consumers are identified and informed
[ ] Rollback plan is documented (keep Hive table until validation period ends)
[ ] Post-migration optimization steps are scheduled (compaction, statistics)
[ ] Monitoring is in place for Iceberg table health (file counts, snapshot growth)
[ ] Access controls and permissions are configured on the new catalog

Frequently Asked Questions

How long does a zero-copy migration take?

The in-place and snapshot migration approaches create only metadata, no data files are copied. For a table with 100,000 data files, migration typically completes in 30-120 seconds. The time scales linearly with the number of files, not with data volume.

Can I migrate while the Hive table is being written to?

Yes, for snapshot migration. The snapshot captures the current state of the Hive table. New writes to Hive after the snapshot won't appear in the Iceberg table. For in-place migration, pause writes during the conversion to avoid race conditions.

Do I need to change my downstream SQL after migration?

No, if you use Dremio views as an abstraction layer. Otherwise, you may need to update table references from hive_db.table to iceberg_catalog.db.table. The SQL syntax for querying data (SELECT, JOIN, WHERE) remains identical.

What about tables with complex Hive SerDe formats?

Tables using custom SerDe formats (JSON SerDe, CSV SerDe, custom Avro) cannot be migrated zero-copy. Use CTAS to convert the data to Parquet during migration. This is an opportunity to also apply optimal partitioning and Z-ordering.

Free Resources to Continue Your Iceberg Journey

Iceberg Lakehouse Books from Dremio Authors

Legacy Content

So, you’ve heard about the trend that data lakes are migrating to Apache Iceberg as their table format. You’ve heard how Apache Iceberg can enable ACID transactions, partition evolution, schema evolution, time travel, version rollback, concurrent writing, and more on your data lake. If you aren’t familiar with Apache Iceberg read this great deep dive into its architecture.

This all sounds great, but all your existing data lake tables are currently in the legacy Hive format and you’d like to know how best to migrate this data. In this article, I hope to illuminate the differences between the two migration options, along with the benefits and drawbacks of each approach:

in-place migration
shadow migration

After reading this article laying out the principles of migrating make sure to checkout this article with a hands-on exercise to see these steps in action.

Option 1: In-Place Data Migration of Hive Tables

In-place data migration avoids rewriting the data. Instead, you write new Apache Iceberg tables that comprise the already existing data files in your data lake.

In-Place Benefits

In-place migration can be less time-consuming as all the data won’t be restated.
If there is an error in writing the Iceberg metadata, you only need to re-write the metadata, not the data as well.
Data lineage is preserved, as the metadata from the pre-existing catalog still exists.

In-Place Drawbacks

If data is added to the dataset during the metadata write you’ll need to reattempt to include the new data.
To prevent having to reattempt this operation you’ll have write downtime, which may not be feasible in some situations.
This approach won’t work if any of the data needs to be restated.

To allow for this kind of migration, Iceberg has several procedures built into its Spark extensions. Which procedure you should use will be based on whether you are using a Spark Catalog.

To replace an existing Hive table, making it an Iceberg table

Use the migrate procedure to write a new table using the source table’s data files. The new Iceberg table uses the same data files as the source Hive table. Ideally, first you’d create a temporary test table using the snapshot procedure and then use the migrate procedure when ready to migrate. The source data files must be Parquet, AVRO, or ORC.

To add the data files from an existing Hive table to an existing Iceberg table

Use the add_files procedure. This procedure adds the existing data files in the source Hive table to an existing Iceberg table with a new snapshot that includes the files. The source Hive table and the destination Iceberg table will both be referring to the same data files.

Option 2: Shadow Migration of Hive Tables

With a shadow migration you are not only creating a new Iceberg table and all the metadata that comes with it, but also restating all data files. This new table is known as the shadow. It becomes the main table once all data is in sync. For larger datasets, you may want to restate them in batches. This approach will have its own set of pros and cons

Shadow Benefits

Shadow migration allows you to audit and validate the data, as the data files must read to be restated.
You’re able to apply schema and partition changes up front.
Data corruption issues are unlikely, as data is audited, validated, and counted through the migration process.

Shadow Drawbacks

You will double the storage for the dataset, as you will store both the original table and the new Iceberg table. (You will retire the old table once the migration is complete, so it’s a temporary issue.)
Because you are writing data and metadata, this migration path can take much longer than in-place migration.
It is difficult to keep tables in sync if there are changes to the original table during the migration. (This issue is also temporary, as syncing updates won’t matter after you retire the original table at migration completion.)

For smaller datasets, restating a table in whole can easily be done with CTAS statements. For larger datasets, you may want to restate an initial batch of data with a CTAS statement and follow up with batched inserts until the entire source is ingested into the new table.

Architecting the Migration

Whether you restate the data with a shadow migration or avoid restating the data with an in-place migration, the actual migration process needs careful architecture.

In general, it should go through a four-phase process:

As the process begins, the new Iceberg table is not yet created or in sync with the source. User-facing read and write operations remain operating on the source table.
The table is created but not fully in sync. Read operations are applied on the source table and write operations are applied to the source and new table.
Once the new table is in sync, you can switch to read operations on the new table. Until you are certain the migration is successful, continue to apply write operations to the source and new table.
When everything is tested, synced, and working properly, you can apply all read and write operations to the new Iceberg table and retire the source table

As you progress through these phases, be sure to check for consistency, logging errors, and logging progress. These checks can help you manage quality, troubleshoot, and measure the length of the migration.

While Iceberg’s procedures and CTAS statements give you a simple path to migrating existing Hive tables, the true art is in architecting your migration plan to minimize or eliminate disruption. Follow the advice in this article to have a solid migration plan and you’ll be enjoying the benefits of Apache Iceberg on your data lake in no time.

Try Dremio Cloud free for 30 days

Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.

Start For Free

Article Topics

Dremio Blog: Various Insights

Blog coverpage for Ingesting Data into Aparche Iceberg with Dremio

Feb 1, 2024 Product Insights from the Dremio Blog

Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg

By unifying data from diverse sources, simplifying data operations, and providing powerful tools for data management, Dremio stands out as a comprehensive solution for modern data needs. Whether you are a data engineer, business analyst, or data scientist, harnessing the combined power of Dremio and Apache Iceberg will undoubtedly be a valuable asset in your data management toolkit.

Alex Merced

Sep 22, 2023 Dremio Blog: Open Data Insights

Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop

We're always looking for ways to better handle and save money on our data. That's why the "data lakehouse" is becoming so popular. It offers a mix of the flexibility of data lakes and the ease of use and performance of data warehouses. The goal? Make data handling easier and cheaper. So, how do we […]

Alex Merced

Oct 12, 2023 Product Insights from the Dremio Blog

Table-Driven Access Policies Using Subqueries

This blog helps you learn about table-driven access policies in Dremio Cloud and Dremio Software v24.1+.

Albert Vernon

Table of Contents

Why Migrate from Hive to Iceberg?

Migration Approach 1: In-Place Migration (Recommended)

What Happens

Advantages

Considerations

Migration Approach 2: Snapshot Migration (Safest)

What Happens

Advantages

Considerations

Migration Approach 3: CTAS Migration (Full Copy)

When to Use CTAS

Post-Migration Steps

1. Verify Data Integrity

2. Evolve Partitions

3. Run Compaction

4. Compute Statistics

5. Update Downstream Consumers

Migration with Dremio

Real-World Migration: E-Commerce Platform

Common Migration Pitfalls

Pitfall 1: Not Validating Data After Migration

Pitfall 2: Forgetting to Update ACLs

Pitfall 3: Leaving Hive Tables Active

Dremio's Role in Hive-to-Iceberg Migration

Phase 1: Pre-Migration Assessment with Dremio

Phase 2: Migration with Dremio CTAS

Phase 3: Post-Migration Optimization

Phase 4: Query Acceleration with Reflections

Nessie Catalog: Git-Like Version Control for Migrated Tables

Large-Scale Migration Patterns

Pattern 1: Incremental Migration by Partition

Pattern 2: Dual-Write During Transition

Pattern 3: View-Based Abstraction

Migration Planning Checklist

Frequently Asked Questions

How long does a zero-copy migration take?

Can I migrate while the Hive table is being written to?

Do I need to change my downstream SQL after migration?

What about tables with complex Hive SerDe formats?

Free Resources to Continue Your Iceberg Journey

Iceberg Lakehouse Books from Dremio Authors

Legacy Content

Option 1: In-Place Data Migration of Hive Tables

In-Place Benefits

In-Place Drawbacks

To replace an existing Hive table, making it an Iceberg table

To add the data files from an existing Hive table to an existing Iceberg table

Option 2: Shadow Migration of Hive Tables

Shadow Benefits

Shadow Drawbacks

Architecting the Migration

Try Dremio Cloud free for 30 days

Related Dremio Articles

Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg

Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop

Table-Driven Access Policies Using Subqueries

Ready to Get Started?