Dremio Blog

31 minute read · April 15, 2022

How to Migrate a Hive Table to an Iceberg Table

Alex Merced, Head of DevRel, Dremio

This article has been revised and updated from its original version published in 2022 to reflect the latest Apache Iceberg migration techniques and tools.

Migrating from Hive tables to Apache Iceberg unlocks time travel, hidden partitioning, schema evolution, and dramatically faster queries through file-level pruning. The good news: you don't have to rewrite your data to migrate. Iceberg provides zero-copy migration paths that add Iceberg metadata on top of existing Parquet files.

Organizations that have migrated from Hive to Iceberg consistently report massive improvements: query planning times drop from minutes to seconds (because Iceberg eliminates slow directory listing operations), partition management becomes instant instead of requiring multi-hour rewrites, and data teams gain access to modern features like time travel and rollback that simply don't exist in Hive. For organizations already using Dremio, the migration unlocks even greater value through Dremio's Reflections, which are stored as Iceberg tables and can deliver sub-second query performance on top of the migrated data.

This guide covers three migration approaches (in-place migration, snapshot migration, and CTAS migration) with detailed trade-offs and production considerations for each.

Why Migrate from Hive to Iceberg?

For official documentation, refer to the Iceberg migration procedures.

Capability                   | Hive Tables                 | Iceberg Tables
-----------------------------|-----------------------------|----------------------------------
Time travel                  | Not supported               | Full support
Rollback                     | Not supported               | Instant rollback
Partition evolution          | Requires full rewrite       | Metadata-only change
Schema evolution             | Limited (add columns only)  | Full (add, rename, reorder, widen)
Row-level updates            | Not supported               | Full COW/MOR support
Query planning speed         | Slow (file listing)         | Fast (manifest-based)
ACID guarantees              | Limited                     | Full serializable isolation
Cloud storage optimization   | Poor                        | Excellent


Migration Approach 1: In-Place Migration (Fastest)

In-place migration converts the Hive table to Iceberg without copying data. Iceberg creates metadata files that point to the existing Parquet data files:

CALL catalog.system.migrate('hive_db.orders');

What Happens

  1. Iceberg reads the Hive Metastore to discover all partitions and files
  2. Creates manifest files listing every existing data file with column statistics
  3. Creates a metadata file with the table schema, partition spec, and initial snapshot
  4. Updates the catalog entry to point to the new metadata file
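
To make the metadata-only nature of these steps concrete, here is a simplified, purely illustrative Python model of what the migration builds: manifest-style entries that point at the existing data files rather than copying them. All names here are hypothetical; a real migration uses Iceberg's migrate procedure, not hand-rolled metadata.

```python
# Simplified model of in-place migration: wrap EXISTING data files in
# new manifest entries. No data file is read or rewritten; only the
# file paths, partition values, and statistics are recorded.

def build_manifest(hive_partitions):
    """Turn a Hive partition->files mapping into manifest-style entries."""
    entries = []
    for partition, files in hive_partitions.items():
        for path, row_count in files:
            entries.append({
                "data_file": path,        # points at the existing file
                "partition": partition,
                "record_count": row_count,
            })
    return entries

# Hypothetical layout discovered from the Hive Metastore
hive_partitions = {
    "order_date=2024-01-01": [("s3://lake/orders/order_date=2024-01-01/part-0.parquet", 1000)],
    "order_date=2024-01-02": [("s3://lake/orders/order_date=2024-01-02/part-0.parquet", 1200)],
}
manifest = build_manifest(hive_partitions)
print(len(manifest))                                   # one entry per data file
print(sum(e["record_count"] for e in manifest))        # total rows tracked in metadata
```

Because only metadata like this is written, the cost of the migration scales with the number of files, not the volume of data.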

Advantages

  • Zero data copy: No data files are written
  • Fast: Only metadata operations, typically completes in seconds to minutes
  • Preserves data layout: Existing partitioning and file organization remain

Considerations

  • Irreversible: The original Hive metadata is replaced
  • Requires compatible file formats: Data files must be Parquet, ORC, or Avro
  • Hive partition directories: Existing directory-based partition layout is preserved

Migration Approach 2: Snapshot Migration (Safest)

Snapshot migration creates an Iceberg table that points to the same data files as the Hive table, without modifying the original Hive table:

CALL catalog.system.snapshot('hive_db.orders', 'iceberg_db.orders');

What Happens

  1. Iceberg reads the Hive table's metadata to discover all files
  2. Creates a new Iceberg table (iceberg_db.orders) with manifests pointing to the same data files
  3. The original Hive table (hive_db.orders) remains unchanged

Advantages

  • Non-destructive: The original Hive table is untouched
  • Parallel operation: Both Hive and Iceberg can read the same data
  • Easy rollback: If migration fails, drop the Iceberg table; Hive table is unaffected

Considerations

  • Writes go to Iceberg only: New writes to the Iceberg table create new data files not visible to Hive
  • Eventual cut-over: You must eventually decommission the Hive table to avoid confusion

Migration Approach 3: CTAS Migration (Full Copy)

Create Table As Select creates a completely new Iceberg table with a full data copy:

CREATE TABLE iceberg_db.orders 
USING iceberg
PARTITIONED BY (month(order_date))
AS SELECT * FROM hive_db.orders;

When to Use CTAS

  • When you want to change the partition scheme during migration
  • When data files need to be reorganized or optimally sized
  • When migrating from non-Parquet formats (CSV, JSON, Avro)
  • When you want to apply Z-ordering during migration

Post-Migration Steps

1. Verify Data Integrity

-- Compare row counts
SELECT COUNT(*) FROM hive_db.orders;
SELECT COUNT(*) FROM iceberg_db.orders;

-- Compare a sample of rows
SELECT SUM(amount), COUNT(DISTINCT customer_id) FROM hive_db.orders;
SELECT SUM(amount), COUNT(DISTINCT customer_id) FROM iceberg_db.orders;

2. Evolve Partitions

Now that you're on Iceberg, you can evolve partitions without rewriting data:

ALTER TABLE iceberg_db.orders SET PARTITION SPEC (day(order_date), bucket(16, customer_id));

3. Run Compaction

Hive tables often have suboptimal file sizes. Run compaction to right-size files:

CALL catalog.system.rewrite_data_files(table => 'iceberg_db.orders');

4. Compute Statistics

Generate Puffin statistics for optimal query planning:

ANALYZE TABLE iceberg_db.orders COMPUTE STATISTICS FOR ALL COLUMNS;

5. Update Downstream Consumers

Update all pipelines, dashboards, and applications to point to the new Iceberg table. With Dremio, you can use virtual datasets (views) as an abstraction layer: retarget the view from the Hive source to the Iceberg source without changing downstream consumers.

Migration with Dremio

Dremio simplifies Hive-to-Iceberg migration:

  1. Connect both sources: Register your Hive Metastore and Iceberg catalog as Dremio sources
  2. Create Iceberg tables: Use Dremio's CTAS to create optimized Iceberg tables
  3. Use Reflections: Create Reflections on Iceberg tables for accelerated queries
  4. Cut over: Update views to point to Iceberg sources

Real-World Migration: E-Commerce Platform

A SaaS company migrated 200 Hive tables (50TB total) to Iceberg over three weeks:

  • Week 1: Snapshot migration of read-heavy dimension tables
  • Week 2: In-place migration of fact tables
  • Week 3: CTAS migration of tables needing partition changes

Results after migration:

  • Query planning time: 45 seconds → 2 seconds (Iceberg eliminates file listing)
  • Storage costs: 20% reduction after compaction right-sized files
  • Operational overhead: Zero table maintenance windows (partition evolution replaces full rewrites)

Common Migration Pitfalls

Pitfall 1: Not Validating Data After Migration

Always compare row counts, checksums, and sample queries between old and new tables.
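
One way to make this validation repeatable is a small comparison helper. This is a hypothetical sketch: the metric values would come from the row-count and checksum queries run against each table, and are hard-coded here only for illustration.

```python
# Compare summary metrics (row counts, sums, distinct counts) pulled
# from the Hive table and the migrated Iceberg table.

def validate_migration(hive_metrics, iceberg_metrics, tolerance=0.0):
    """Return the names of metrics that differ beyond `tolerance`."""
    mismatches = []
    for name, hive_value in hive_metrics.items():
        iceberg_value = iceberg_metrics.get(name)
        if iceberg_value is None or abs(hive_value - iceberg_value) > tolerance:
            mismatches.append(name)
    return mismatches

# Illustrative values, as if fetched from both tables
hive = {"row_count": 1_000_000, "sum_amount": 52_340_100.25, "distinct_customers": 84_211}
iceberg = {"row_count": 1_000_000, "sum_amount": 52_340_100.25, "distinct_customers": 84_211}
print(validate_migration(hive, iceberg))  # [] means the tables agree
```

Running this after each migrated table gives you a machine-checkable sign-off instead of eyeballing query results.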

Pitfall 2: Forgetting to Update ACLs

Migrating metadata doesn't migrate access controls. Ensure the new Iceberg catalog has appropriate permissions.

Pitfall 3: Leaving Hive Tables Active

After cut-over, decommission Hive tables to prevent confusion. Mark them as deprecated first, then drop after a validation period.

Dremio's Role in Hive-to-Iceberg Migration

Dremio simplifies every phase of Hive-to-Iceberg migration beyond what Spark alone can provide:

Phase 1: Pre-Migration Assessment with Dremio

Connect your existing Hive Metastore as a Dremio source. Dremio can query Hive tables natively, allowing you to profile your data before migration:

-- Profile table sizes and partition counts through Dremio
SELECT 
  COUNT(*) as total_rows,
  COUNT(DISTINCT order_date) as unique_dates,
  MIN(order_date) as earliest,
  MAX(order_date) as latest
FROM hive_source.db.orders;

This assessment helps you plan the optimal Iceberg partition strategy before migration begins.

Phase 2: Migration with Dremio CTAS

Dremio supports creating Iceberg tables directly using CTAS from any connected source, including Hive tables:

-- Create an Iceberg table from a Hive source through Dremio
CREATE TABLE nessie_catalog.db.orders AS
SELECT * FROM hive_source.db.orders;

Dremio handles the data movement, writing optimally sized Parquet files with the partition spec you configure.

Phase 3: Post-Migration Optimization

After migration, use Dremio's built-in table maintenance commands:

-- Optimize file sizes and sort order
OPTIMIZE TABLE nessie_catalog.db.orders;

-- Clean up old snapshots
VACUUM TABLE nessie_catalog.db.orders EXPIRE SNAPSHOTS older_than = '2024-01-01 00:00:00';

These commands handle compaction, manifest optimization, and snapshot cleanup in a single operation.

Phase 4: Query Acceleration with Reflections

After migration, create Dremio Reflections on your new Iceberg tables for sub-second query performance:

  • Raw Reflections for frequently queried datasets with optimal sort order
  • Aggregation Reflections for BI dashboards with pre-computed GROUP BY results
  • Starflake Reflections for star schema joins that BI tools generate

Reflections are themselves stored as Iceberg tables, so they benefit from all the same performance optimizations.

Nessie Catalog: Git-Like Version Control for Migrated Tables

When migrating to Iceberg, consider using a Nessie catalog instead of a simple Hive-compatible catalog. Nessie provides Git-like branching and tagging for your entire data catalog:

-- Create a migration branch to test without affecting production
CREATE BRANCH migration_test IN nessie_catalog;

-- Switch to the branch and test migration
USE BRANCH migration_test IN nessie_catalog;
CREATE TABLE orders AS SELECT * FROM hive_source.db.orders;

-- Validate the migration
SELECT COUNT(*) FROM orders;

-- If everything looks good, merge to main
MERGE BRANCH migration_test INTO main IN nessie_catalog;

This branching model gives you a safe, reversible migration process: you can test the migration on a branch, validate results, and merge to production only when you're confident everything is correct.

Large-Scale Migration Patterns

Pattern 1: Incremental Migration by Partition

For petabyte-scale tables, migrate one partition range at a time:

-- Migrate 2024 data first
CREATE TABLE iceberg_db.orders AS
SELECT * FROM hive_db.orders WHERE order_date >= '2024-01-01';

-- Then migrate 2023, 2022, etc.
INSERT INTO iceberg_db.orders
SELECT * FROM hive_db.orders WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31';

Pattern 2: Dual-Write During Transition

During the migration period, write to both Hive and Iceberg:

  1. Set up the new Iceberg table with the desired partition spec
  2. Configure your ETL pipeline to write to both targets
  3. Backfill historical data from Hive to Iceberg in the background
  4. Once backfill is complete, cut over reads to Iceberg
  5. Decommission the Hive write path
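
The dual-write step (step 2 above) can be sketched as a thin wrapper around your two sinks. This is a minimal illustration: `write_hive` and `write_iceberg` stand in for whatever your real ETL pipeline uses to commit a batch to each table.

```python
# During the transition, every batch goes to both targets; after
# cut-over, the Hive write path is dropped by flipping one flag.

hive_sink, iceberg_sink = [], []          # stand-ins for the real tables

def write_hive(batch):
    hive_sink.extend(batch)

def write_iceberg(batch):
    iceberg_sink.extend(batch)

def dual_write(batch, hive_decommissioned=False):
    write_iceberg(batch)                  # Iceberg is always written
    if not hive_decommissioned:           # Hive only until cut-over completes
        write_hive(batch)

dual_write([{"order_id": 1}])                            # transition period
dual_write([{"order_id": 2}], hive_decommissioned=True)  # after cut-over
print(len(hive_sink), len(iceberg_sink))
```

Keeping the flag in pipeline configuration (rather than code) lets you decommission the Hive path without a redeploy.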

Pattern 3: View-Based Abstraction

Use Dremio views to abstract the migration from downstream consumers:

-- Create a view that consumers use
CREATE VIEW analytics.orders AS
  SELECT * FROM iceberg_catalog.db.orders;

-- During migration, point the view at either source
-- After migration, consumers never change their queries

Migration Planning Checklist

Before starting migration, verify:

  • [ ] All source data files are in a compatible format (Parquet or ORC)
  • [ ] Target catalog is configured (Nessie, Glue, or REST catalog)
  • [ ] Iceberg partition strategy is planned based on actual query patterns
  • [ ] Downstream consumers are identified and informed
  • [ ] Rollback plan is documented (keep Hive table until validation period ends)
  • [ ] Post-migration optimization steps are scheduled (compaction, statistics)
  • [ ] Monitoring is in place for Iceberg table health (file counts, snapshot growth)
  • [ ] Access controls and permissions are configured on the new catalog

Frequently Asked Questions

How long does a zero-copy migration take?

The in-place and snapshot migration approaches create only metadata; no data files are copied. For a table with 100,000 data files, migration typically completes in 30-120 seconds. The time scales linearly with the number of files, not with data volume.

Can I migrate while the Hive table is being written to?

Yes, for snapshot migration. The snapshot captures the current state of the Hive table. New writes to Hive after the snapshot won't appear in the Iceberg table. For in-place migration, pause writes during the conversion to avoid race conditions.

Do I need to change my downstream SQL after migration?

No, if you use Dremio views as an abstraction layer. Otherwise, you may need to update table references from hive_db.table to iceberg_catalog.db.table. The SQL syntax for querying data (SELECT, JOIN, WHERE) remains identical.

What about tables with complex Hive SerDe formats?

Tables using custom SerDe formats (JSON SerDe, CSV SerDe, custom Avro) cannot be migrated zero-copy. Use CTAS to convert the data to Parquet during migration. This is an opportunity to also apply optimal partitioning and Z-ordering.




Legacy Content

So, you’ve heard about the trend that data lakes are migrating to Apache Iceberg as their table format. You’ve heard how Apache Iceberg can enable ACID transactions, partition evolution, schema evolution, time travel, version rollback, concurrent writing, and more on your data lake. If you aren’t familiar with Apache Iceberg, read this great deep dive into its architecture.

This all sounds great, but all your existing data lake tables are currently in the legacy Hive format and you’d like to know how best to migrate this data. In this article, I hope to illuminate the differences between the two migration options, along with the benefits and drawbacks of each approach:

  1. in-place migration 
  2. shadow migration

After reading this article laying out the principles of migrating, make sure to check out this article with a hands-on exercise to see these steps in action.

Option 1: In-Place Data Migration of Hive Tables

In-place data migration avoids rewriting the data. Instead, you write new Apache Iceberg tables that comprise the already existing data files in your data lake.

In-Place Benefits

  • In-place migration can be less time-consuming as all the data won’t be restated.
  • If there is an error in writing the Iceberg metadata, you only need to re-write the metadata, not the data as well.
  • Data lineage is preserved, as the metadata from the pre-existing catalog still exists.

In-Place Drawbacks

  • If data is added to the dataset during the metadata write you’ll need to reattempt to include the new data.
  • To prevent having to reattempt this operation you’ll have write downtime, which may not be feasible in some situations.
  • This approach won’t work if any of the data needs to be restated.

To allow for this kind of migration, Iceberg has several procedures built into its Spark extensions. Which procedure you should use depends on whether you want to replace the existing Hive table outright or add its data files to an existing Iceberg table.

To replace an existing Hive table, making it an Iceberg table

Use the migrate procedure to write a new table using the source table’s data files. The new Iceberg table uses the same data files as the source Hive table. Ideally, first you’d create a temporary test table using the snapshot procedure and then use the migrate procedure when ready to migrate. The source data files must be Parquet, Avro, or ORC.

To add the data files from an existing Hive table to an existing Iceberg table

Use the add_files procedure. This procedure adds the existing data files in the source Hive table to an existing Iceberg table with a new snapshot that includes the files. The source Hive table and the destination Iceberg table will both be referring to the same data files.

Option 2: Shadow Migration of Hive Tables

With a shadow migration you are not only creating a new Iceberg table and all the metadata that comes with it, but also restating all data files. This new table is known as the shadow. It becomes the main table once all data is in sync. For larger datasets, you may want to restate them in batches. This approach has its own set of pros and cons.

Shadow Benefits

  • Shadow migration allows you to audit and validate the data, as the data files must be read to be restated.
  • You’re able to apply schema and partition changes up front.
  • Data corruption issues are unlikely, as data is audited, validated, and counted through the migration process.

Shadow Drawbacks

  • You will double the storage for the dataset, as you will store both the original table and the new Iceberg table. (You will retire the old table once the migration is complete, so it’s a temporary issue.)
  • Because you are writing data and metadata, this migration path can take much longer than in-place migration.
  • It is difficult to keep tables in sync if there are changes to the original table during the migration. (This issue is also temporary, as syncing updates won’t matter after you retire the original table at migration completion.)

For smaller datasets, restating a table in whole can easily be done with CTAS statements. For larger datasets, you may want to restate an initial batch of data with a CTAS statement and follow up with batched inserts until the entire source is ingested into the new table.
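
The batched follow-up inserts can be generated rather than hand-written. A minimal sketch, assuming the hypothetical table names used earlier in this article and a yearly batching scheme:

```python
# Emit one INSERT per year so a large Hive table can be restated in
# manageable chunks after the initial CTAS, instead of one giant copy.

def backfill_statements(source, target, years):
    """Yield a batched INSERT statement for each year in `years`."""
    for year in years:
        yield (
            f"INSERT INTO {target} SELECT * FROM {source} "
            f"WHERE order_date BETWEEN '{year}-01-01' AND '{year}-12-31'"
        )

stmts = list(backfill_statements("hive_db.orders", "iceberg_db.orders", [2023, 2022]))
print(stmts[0])
```

Each generated statement can then be submitted to your engine as its own job, so a failed batch is retried without restarting the whole backfill.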

Architecting the Migration

Whether you restate the data with a shadow migration or avoid restating the data with an in-place migration, the actual migration process needs careful architecture.

In general, it should go through a four-phase process:

  1. As the process begins, the new Iceberg table is not yet created or in sync with the source. User-facing read and write operations remain operating on the source table.
  2. The table is created but not fully in sync. Read operations are applied on the source table and write operations are applied to the source and new table.
  3. Once the new table is in sync, you can switch to read operations on the new table. Until you are certain the migration is successful, continue to apply write operations to the source and new table.
  4. When everything is tested, synced, and working properly, you can apply all read and write operations to the new Iceberg table and retire the source table.

As you progress through these phases, be sure to check for consistency, logging errors, and logging progress. These checks can help you manage quality, troubleshoot, and measure the length of the migration.
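
The four phases above boil down to routing reads and writes to the right targets. A minimal sketch of that routing table, with hypothetical "hive"/"iceberg" labels standing in for your real connections:

```python
# Each phase maps to (read targets, write targets). Advancing the
# phase number is the only change needed to progress the migration.

PHASES = {
    1: ({"hive"}, {"hive"}),                # new table not yet created/in sync
    2: ({"hive"}, {"hive", "iceberg"}),     # dual-write, reads stay on Hive
    3: ({"iceberg"}, {"hive", "iceberg"}),  # reads cut over, writes still dual
    4: ({"iceberg"}, {"iceberg"}),          # source table retired
}

def route(phase):
    """Return the (reads, writes) targets for the current phase."""
    reads, writes = PHASES[phase]
    return reads, writes

reads, writes = route(3)
print(sorted(reads), sorted(writes))
```

Driving this from a single phase value in configuration makes the consistency checks easy to scope: at any point you know exactly which targets should agree.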

While Iceberg’s procedures and CTAS statements give you a simple path to migrating existing Hive tables, the true art is in architecting your migration plan to minimize or eliminate disruption. Follow the advice in this article to have a solid migration plan and you’ll be enjoying the benefits of Apache Iceberg on your data lake in no time.
