This article has been revised and updated from its original version published in 2022 to reflect the latest partitioning approaches, including Delta Lake's Liquid Clustering, Iceberg's V3 improvements, and Hudi's dynamic partitioning.

Partitioning is one of the most impactful decisions in data lakehouse design. The right partition strategy can make queries 100x faster by eliminating unnecessary data scanning. The wrong strategy leads to full table scans, small file proliferation, and expensive partition rewrites that consume engineering time instead of delivering business value.

The cost of poor partitioning compounds over time. A table that starts with monthly partitioning may need daily partitioning as data volume grows. In Hive and most traditional systems, changing partitioning means rewriting every data file, an operation that can take hours or days on large tables, requires maintenance windows, and risks data corruption if interrupted. This is why partition evolution, the ability to change partition strategy without rewriting data, is such a critical differentiator among table formats.

Each of the three major table formats (Apache Iceberg, Delta Lake, and Apache Hudi) takes a fundamentally different approach to partitioning. These differences affect not just performance, but also operational complexity, data model flexibility, and long-term maintainability. Understanding them is essential for any data engineer choosing a table format.

Dremio's unified lakehouse platform is built on Apache Iceberg because Iceberg's hidden partitioning and partition evolution provide the most developer-friendly and performant partitioning model. Combined with Dremio's file-level pruning and Reflections, queries on partitioned Iceberg tables achieve sub-second response times.

The Hive Legacy: Why Traditional Partitioning Fails

For official documentation, refer to the Iceberg partitioning spec. Before open table formats, Hive defined the partitioning model for data lakes:

-- Hive-style partitioning
CREATE TABLE orders (
  order_id BIGINT,
  amount DECIMAL(12,2)
)
PARTITIONED BY (year INT, month INT, day INT);

-- Users must specify partition values explicitly
INSERT INTO orders PARTITION(year=2024, month=3, day=15)
SELECT order_id, amount FROM staging_orders;

Problems with Hive partitioning:

Partition columns leak into the schema: users must know and specify partition columns in queries
Over-partitioning: daily partitions on large tables create thousands of tiny directories
No evolution: changing partition strategy requires full table rewrite
User errors: forgetting partition filters causes accidental full table scans

Apache Iceberg: Hidden Partitioning and Evolution

Iceberg's partitioning model is architecturally distinct from Hive, Delta, and Hudi:

Hidden Partitioning

Iceberg applies partition transforms automatically, so users never see partition columns in queries:

-- Create table with hidden partitioning
CREATE TABLE orders (
  order_id BIGINT,
  customer_id INT,
  order_date TIMESTAMP,
  amount DECIMAL(12,2),
  region VARCHAR
)
PARTITION BY (month(order_date), bucket(16, customer_id));

Queries don't reference partition columns:

-- Iceberg automatically prunes to the correct month partition
SELECT * FROM orders WHERE order_date >= '2024-03-01' AND order_date < '2024-04-01';

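Filters on order_date are pruned without any extra predicates because the query engine evaluates partition transforms against the WHERE clause. This removes the accidental full table scans that a forgotten partition filter causes in Hive, although a query with no filter on a partition source column still reads the whole table. See the detailed look at hidden partitioning for more.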
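To make the pruning step concrete, here is a minimal Python sketch of how a range filter on a timestamp column could be mapped onto month partitions. It is an illustration only, not Iceberg's planner: the function names and the list of partitions are hypothetical, and the month value is modeled as months since January 1970.

```python
from datetime import datetime

def month_ordinal(ts: datetime) -> int:
    """Months since 1970-01, an integer form of a month() partition value."""
    return (ts.year - 1970) * 12 + (ts.month - 1)

def prune_month_partitions(partitions, lower: datetime, upper: datetime):
    """Keep only month partitions that can hold rows with lower <= ts <= upper."""
    lo, hi = month_ordinal(lower), month_ordinal(upper)
    return [p for p in partitions if lo <= p <= hi]

# Hypothetical table with one partition per month of 2024.
partitions_2024 = [month_ordinal(datetime(2024, m, 1)) for m in range(1, 13)]

kept = prune_month_partitions(
    partitions_2024,
    datetime(2024, 3, 1),
    datetime(2024, 3, 31, 23, 59, 59),
)
print(kept)  # [650] -> only the March 2024 partition is scanned
```

The user-facing predicate never mentions a partition column; the transform is applied to the filter bounds instead.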

Partition Evolution

Iceberg's most distinctive feature is the ability to change the partition strategy without rewriting data:

-- SET PARTITION SPEC is one engine's syntax (for example, Hive); other
-- engines such as Spark use ADD / DROP PARTITION FIELD for the same change.

-- Original: monthly partitioning
ALTER TABLE orders SET PARTITION SPEC (month(order_date));

-- As data grows, switch to daily partitioning
ALTER TABLE orders SET PARTITION SPEC (day(order_date));

-- Add bucket partitioning for high-cardinality columns
ALTER TABLE orders SET PARTITION SPEC (day(order_date), bucket(16, customer_id));

This is a metadata-only operation. Old data retains the original partition spec; new data uses the updated spec. The query planner evaluates both specs during planning, ensuring correct results regardless of when data was written.
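The idea of two specs coexisting can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Iceberg's implementation: the DataFile class, the spec ids (0 for month, 1 for day), and the file paths are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class DataFile:
    path: str
    spec_id: int      # partition spec the file was written under
    partition: tuple  # partition value under that spec

# Hypothetical specs: 0 = month(order_date), 1 = day(order_date).
def to_partition(spec_id: int, d: date) -> tuple:
    if spec_id == 0:
        return (d.year, d.month)
    return (d.year, d.month, d.day)

def plan_scan(files, lower: date, upper: date):
    """Select files whose partition may overlap [lower, upper], per spec."""
    selected = []
    for f in files:
        # Project the filter bounds through the spec the file was written with.
        lo = to_partition(f.spec_id, lower)
        hi = to_partition(f.spec_id, upper)
        if lo <= f.partition <= hi:
            selected.append(f.path)
    return selected

files = [
    DataFile("old/2024-02.parquet", 0, (2024, 2)),
    DataFile("old/2024-03.parquet", 0, (2024, 3)),
    DataFile("new/2024-03-14.parquet", 1, (2024, 3, 14)),
    DataFile("new/2024-03-15.parquet", 1, (2024, 3, 15)),
]
print(plan_scan(files, date(2024, 3, 15), date(2024, 3, 15)))
# ['old/2024-03.parquet', 'new/2024-03-15.parquet']
```

Old files are pruned at month granularity and new files at day granularity, yet the caller supplies a single date filter.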

Available Partition Transforms

Transform	Input	Output	Example
`year(col)`	Timestamp/Date	Year integer	`year(order_date)` → 2024
`month(col)`	Timestamp/Date	Year-month	`month(order_date)` → 2024-03
`day(col)`	Timestamp/Date	Date	`day(order_date)` → 2024-03-15
`hour(col)`	Timestamp	Date-hour	`hour(event_ts)` → 2024-03-15-14
`bucket(N, col)`	Any	Hash bucket 0..N-1	`bucket(16, id)` → 7
`truncate(L, col)`	String/Int	Truncated value	`truncate(4, zipcode)` → "1002"
`void(col)`	Any	Null (removes partitioning)	`void(old_col)` → null
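For readers who want to see what these transforms compute, the following Python sketch mirrors them. The table shows human-readable values; the Iceberg spec stores time transforms as integer offsets from the Unix epoch, which is what the functions below return. The function names are hypothetical, and the bucket hash is a stand-in: Iceberg specifies a 32-bit Murmur3 hash, while crc32 is used here only to keep the sketch within the standard library, so the bucket numbers will not match a real table.

```python
import zlib
from datetime import date, datetime

EPOCH = date(1970, 1, 1)

def year_transform(d: date) -> int:
    return d.year - 1970                      # years since 1970

def month_transform(d: date) -> int:
    return (d.year - 1970) * 12 + (d.month - 1)  # months since 1970-01

def day_transform(d: date) -> int:
    if isinstance(d, datetime):
        d = d.date()
    return (d - EPOCH).days                   # days since 1970-01-01

def hour_transform(ts: datetime) -> int:
    return day_transform(ts) * 24 + ts.hour   # hours since the epoch

def bucket_transform(n: int, value) -> int:
    # Stand-in hash (assumption): real Iceberg uses Murmur3, not crc32.
    h = zlib.crc32(str(value).encode("utf-8"))
    return (h & 0x7FFFFFFF) % n

def truncate_transform(width: int, value):
    if isinstance(value, str):
        return value[:width]                  # keep the first `width` characters
    return value - (value % width)            # round integers down to a multiple

d = date(2024, 3, 15)
print(year_transform(d))                          # 54
print(month_transform(d))                         # 650
print(day_transform(d))                           # 19797
print(hour_transform(datetime(2024, 3, 15, 14)))  # 475142
print(truncate_transform(4, "10025"))             # 1002
print(truncate_transform(10, 47))                 # 40
```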

Delta Lake: From Hive-Style to Liquid Clustering

Traditional Delta Partitioning

Delta Lake originally used Hive-style explicit partitioning:

-- Delta Lake traditional partitioning
CREATE TABLE orders (
  order_id BIGINT,
  order_date DATE,
  amount DECIMAL(12,2)
)
USING delta
PARTITIONED BY (order_date);

This requires users to be aware of partition columns and suffers from the same over-partitioning and evolution problems as Hive.

Liquid Clustering (Recent Addition)

Delta Lake added Liquid Clustering to address these limitations:

-- Delta Lake Liquid Clustering
CREATE TABLE orders (
  order_id BIGINT, order_date DATE,
  amount DECIMAL(12,2), region STRING
)
USING delta
CLUSTER BY (order_date, region);

Key differences from Iceberg:

Requires OPTIMIZE: Clustering is applied during OPTIMIZE runs, not at write time
Data rewrite required: Changing clustering columns requires rewriting affected data
Not hidden: Users must still be aware of clustering columns for optimal queries
Incremental: Only new/modified data is re-clustered during OPTIMIZE

Apache Hudi: Explicit Partitioning

Hudi uses Hive-compatible explicit partitioning with no evolution support:

-- Hudi partitioning
CREATE TABLE orders (
  order_id BIGINT,
  order_date DATE,
  amount DECIMAL(12,2),
  region STRING
)
USING hudi
PARTITIONED BY (region);

Limitations:

No partition evolution: Changing partition columns requires full table rewrite
No hidden partitioning: Users must include partition columns in queries
No transform support: Partitions must be explicit column values (no month(), bucket())

Comprehensive Comparison

Feature	Apache Iceberg	Delta Lake	Apache Hudi
Hidden partitioning	✅ Full support	❌ Not supported	❌ Not supported
Partition evolution	✅ Metadata-only	⚠️ Liquid Clustering (data rewrite)	❌ Full rewrite required
Partition transforms	✅ 7 transforms	❌ None (explicit columns)	❌ None
Accidental full scan prevention	✅ Automatic	❌ User must know partitions	❌ User must know partitions
Multi-spec support	✅ Old + new specs coexist	❌ Single clustering spec	❌ Single partition spec
Partition pruning	✅ Manifest-level + file-level	✅ Log-level + file-level	✅ Metadata-table-level

Partitioning with Dremio

Dremio uses Iceberg's partitioning to deliver optimal query performance:

Creating Partitioned Tables

-- Dremio CTAS with hidden partitioning
CREATE TABLE nessie.db.orders
PARTITION BY (month(order_date), bucket(8, region))
AS SELECT * FROM staging.raw_orders;

Evolving Partitions

-- Change partition strategy without data rewrite
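ALTER TABLE nessie.db.orders DROP PARTITION FIELD month(order_date);
ALTER TABLE nessie.db.orders ADD PARTITION FIELD day(order_date);
ALTER TABLE nessie.db.orders DROP PARTITION FIELD bucket(8, region);
ALTER TABLE nessie.db.orders ADD PARTITION FIELD bucket(16, region);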

Optimizing Partitioned Tables

-- Compact files within each partition
OPTIMIZE TABLE nessie.db.orders;

-- Clean up expired snapshots
VACUUM TABLE nessie.db.orders 
EXPIRE SNAPSHOTS older_than = '2024-01-01 00:00:00';

Reflection Acceleration on Partitioned Tables

Dremio's Reflections inherit Iceberg partition pruning automatically. When a Reflection is created on a partitioned table, queries against the Reflection benefit from both:

Reflection-level pruning: Only scan Reflection data for matching partitions
File-level pruning: Column statistics eliminate individual files within partitions

This double pruning is why Dremio + Iceberg delivers sub-second query performance on tables that would take minutes with Hive or Delta partitioning.
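The two pruning stages can be illustrated with a short Python sketch. It is a generic model of partition pruning followed by min/max file pruning, not Dremio's or Iceberg's actual code; the FileStats class, its fields, and the sample files are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileStats:
    path: str
    partition_month: str  # e.g. "2024-03"
    min_amount: float     # column statistics recorded for the file
    max_amount: float

def prune(files, month: str, min_wanted: float):
    """Two-stage pruning for a filter like: month = ? AND amount >= ?"""
    # Stage 1: partition pruning drops whole partitions.
    in_partition = [f for f in files if f.partition_month == month]
    # Stage 2: min/max statistics drop files that cannot contain a match.
    return [f.path for f in in_partition if f.max_amount >= min_wanted]

files = [
    FileStats("2024-02/a.parquet", "2024-02", 5.0, 900.0),
    FileStats("2024-03/a.parquet", "2024-03", 5.0, 120.0),
    FileStats("2024-03/b.parquet", "2024-03", 150.0, 980.0),
]
print(prune(files, "2024-03", 500.0))  # ['2024-03/b.parquet']
```

Only one of the three files is opened: the first is removed by its partition value, the second by its column statistics.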

Real-World Partitioning Strategy Guide

Time-Series Data (IoT, Logs, Events)

-- Start with daily partitioning
PARTITION BY (day(event_timestamp))
-- If queries also filter by source, add bucket
PARTITION BY (day(event_timestamp), bucket(8, source_id))

Transactional Data (Orders, Payments)

-- Monthly partition with region bucketing
PARTITION BY (month(order_date), bucket(16, region))

Slowly Changing Dimensions (Customers, Products)

-- Light partitioning or no partitioning for small tables
PARTITION BY (bucket(4, customer_segment))

Partition Maintenance Operations Comparison

Beyond defining partitions, each format handles ongoing partition maintenance differently:

Operation	Apache Iceberg	Delta Lake	Apache Hudi
Add new partition field	`ALTER TABLE ADD PARTITION FIELD` (metadata-only)	Not supported (requires recreate)	Not supported
Drop partition field	`ALTER TABLE DROP PARTITION FIELD` (metadata-only)	Not supported	Not supported
Replace partition spec	`ALTER TABLE SET PARTITION SPEC`	Liquid Clustering change	Full rewrite
Compact files within partitions	`OPTIMIZE TABLE` in Dremio	`OPTIMIZE`	Automatic compaction
Drop a partition's data	`DELETE WHERE` or expiry	`DELETE WHERE`	`DELETE WHERE`
Rebalance partition data	No need (metadata-only evolution)	`OPTIMIZE` with repartitioning	Full clustering job

Iceberg's metadata-only partition operations mean that a data engineer can adjust partition strategy in real time based on evolving query patterns: no maintenance window, no data movement, and no risk of a failed job corrupting data.

Common Partitioning Anti-Patterns

Anti-Pattern 1: Over-Partitioning

Creating daily partitions on a table that receives only a few megabytes of data per day produces thousands of tiny files over time:

-- BAD: Daily partitioning produces 365 tiny partitions per year
PARTITION BY (day(event_timestamp))

-- BETTER: Monthly partitioning for moderate-volume tables
PARTITION BY (month(event_timestamp))

Anti-Pattern 2: High-Cardinality Partition Keys

Using a column with millions of unique values as a partition key creates millions of partitions:

-- BAD: User ID has millions of values → millions of partitions
PARTITION BY (user_id)

-- BETTER: Bucket high-cardinality columns
PARTITION BY (bucket(64, user_id))
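The arithmetic behind this anti-pattern is simple enough to sketch in Python. The helper names and the cardinality figure below are hypothetical; the point is that an identity partition field creates up to one partition per distinct value, a bucket field caps that at N, and multiple fields multiply.

```python
import math
from typing import Optional

def max_partitions(distinct_values: int, buckets: Optional[int] = None) -> int:
    """Upper bound on partitions produced by one partition field."""
    if buckets is None:
        return distinct_values            # identity: one partition per value
    return min(distinct_values, buckets)  # bucket(N, col): at most N partitions

def combined_partitions(*per_field: int) -> int:
    """Partition fields multiply: every combination can become a partition."""
    return math.prod(per_field)

users = 5_000_000  # hypothetical number of distinct user_id values

print(max_partitions(users))      # 5000000
print(max_partitions(users, 64))  # 64
# One year of daily partitions combined with 64 user buckets:
print(combined_partitions(365, max_partitions(users, 64)))  # 23360
```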

Anti-Pattern 3: Forgetting Partition Evolution

Starting with overly coarse partitioning and not evolving as data grows:

-- Started yearly when table was small
PARTITION BY (year(event_timestamp))

-- Table grew 10x → partitions are too large now
-- With Iceberg, simply evolve (metadata-only):
ALTER TABLE events SET PARTITION SPEC (month(event_timestamp))

This evolution is only possible with Apache Iceberg. Delta Lake and Hudi users must rewrite their entire table to achieve the same effect.

Partitioning Performance Impact: Real-World Benchmarks

The right partition strategy produces dramatic performance differences:

Query Pattern	No Partitioning	Monthly Partitioning	Daily + Bucket Partitioning
Full scan (SELECT *)	120s (all files)	120s (all partitions)	120s (all partitions)
Single month filter	120s	10s (11/12 eliminated)	8s (further file pruning)
Single day filter	120s	10s	0.3s (365x elimination)
Day + region filter	120s	10s	0.05s (partition + bucket)

The key insight: partitioning only helps if your WHERE clause aligns with the partition spec. Iceberg's hidden partitioning ensures this alignment happens automatically: users write WHERE order_date = '2024-03-15' and the query planner handles partition pruning transparently.

With Dremio's C3 cache and Reflections on top of partitioned Iceberg tables, queries that scan terabytes of raw data return results in sub-second latency. This layered optimization, Iceberg partitioning + file pruning + Z-ordering + Reflections + caching, is why Dremio achieves data warehouse performance on open lakehouse storage.

Frequently Asked Questions

What happens when I query across evolved partition specs?

Iceberg's query planner evaluates all active partition specs during planning. Data written under the old spec is pruned using the old spec's rules; data written under the new spec uses the new rules. This is transparent to the user: you write a normal WHERE clause and Iceberg handles the rest.

How many partitions should I have?

A good rule of thumb: aim for partition directories containing 100 MB - 1 GB of data each. Over-partitioning (thousands of tiny partitions) creates small file problems. Under-partitioning (one giant partition) prevents effective pruning.
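As a rough way to apply this rule of thumb, the Python sketch below picks the finest time transform whose partitions still reach a minimum size. The function name, the 30-day month, and the 100 MB default are assumptions for illustration; real tables also need to account for skew and for bucket fields that split each time partition further.

```python
def suggest_time_granularity(daily_mb: float, min_mb: float = 100.0) -> str:
    """Return the finest time transform whose partitions reach min_mb."""
    candidates = [
        ("hour", daily_mb / 24),
        ("day", daily_mb),
        ("month", daily_mb * 30),   # approximation: 30 days per month
        ("year", daily_mb * 365),
    ]
    for name, partition_mb in candidates:
        if partition_mb >= min_mb:
            return name
    return "year"  # even yearly partitions are small; stay as coarse as possible

print(suggest_time_granularity(10))      # month
print(suggest_time_granularity(1024))    # day
print(suggest_time_granularity(51200))   # hour
```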

Can I combine hidden partitioning with Z-ordering?

Yes. Partitioning provides coarse-grained pruning (eliminating entire partitions), while Z-ordering provides fine-grained pruning within partitions (co-locating related data within files for multi-column predicate pushdown).
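For intuition on how Z-ordering co-locates rows, here is a minimal Python sketch of a Morton (Z-order) code that interleaves the bits of two integer columns. It illustrates the general technique only; engines may normalize values differently or use another space-filling curve, and the function name and sample rows are hypothetical.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z

print(z_order(3, 5))  # 39

# Sorting rows by the combined code keeps rows that are close in both
# columns near each other, so file-level min/max ranges stay tight.
rows = [(3, 3), (0, 1), (1, 0), (0, 0)]
print(sorted(rows, key=lambda r: z_order(*r)))
# [(0, 0), (1, 0), (0, 1), (3, 3)]
```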

Legacy Content

In an earlier article, we compared several features of the three major data lake table formats: Apache Iceberg, Apache Hudi, and Delta Lake. Below is a summary of the findings of that article:

One of the areas we compared was partitioning features. In this article, we will dive deeper into the details of partitioning for each table format.

What Is Partitioning?

Partitioning is a strategy for optimizing how you store your data by subdividing the data by the values in one or more fields. For example, for sales data where queries often rely on when the sale occurred, you may partition the data by day, month, or year of the sale. Or, if you are looking at voter registration data, you may want to partition by party registration or precinct.

The benefit of partitioning occurs when a query filters the data by a partitioned column – the query engine can limit its scan to only subdivisions/partitions that apply to the query, resulting in much faster query performance.

How to Partition a Table

In most situations, the partitioning scheme is determined when the table is created by specifying columns whose values you want to partition the data by. The following shows what this would look like in each of the table formats.

Apache Iceberg

In Apache Iceberg you can partition the table in your CREATE TABLE statements, as shown.

CREATE TABLE IF NOT EXISTS catalog.db.players (
    id int, 
    name string, 
    team string
) USING iceberg PARTITIONED BY (team)

Beyond selecting a particular column to partition by, you can select a “transform” and partition the table by the transformed value of the column.

Transforms available in Iceberg include:

Day
Month
Year
Hour
Bucket (partitions data into a specified number of buckets using a hash function)
Truncate (partitions the table based on the truncated value of the field, and can specify the width of truncated value)

Some examples:

-- partitioned by first letter of the team field
CREATE TABLE catalog.db.players (
  id int, 
  name string, 
  team string
) USING iceberg PARTITIONED BY (truncate(1, team))

-- partitioned by day of a timestamp (day(ts) already encodes month and year;
-- Iceberg rejects redundant time transforms on the same column)
CREATE TABLE catalog.db.players (
  id int, 
  name string, 
  team string, 
  ts timestamp
) USING iceberg PARTITIONED BY (day(ts))

-- partitioned into 8 hash buckets
CREATE TABLE catalog.db.players (
  id int, 
  name string, 
  team string
) USING iceberg PARTITIONED BY (bucket(8, team))

Apache Hudi

In Apache Hudi you also specify your partitions in your CREATE TABLE statements. In Hudi you can’t specify transforms – just a column whose values you want to be used for partitioning.

CREATE TABLE IF NOT EXISTS default.players (
  id int,
  name string,
  team string 
) USING hudi
PARTITIONED BY (team);

Delta Lake

In Delta Lake, you also declare how the table is partitioned when the table is created, for example:

CREATE TABLE default.players (
  id int,
  name STRING,
  team STRING
)
USING DELTA
PARTITIONED BY (team)

While you can’t specify a transform in your partitions like Apache Iceberg, you can create “generated columns” which are columns whose values are calculated based on another field. You can then partition based on that field. The example below will partition based on the first letter of the team name like the truncate() example from Iceberg. This feature creates a new column so that the generated value is stored.

(Note: The generated columns feature is fully functional only in Databricks Delta Lake and can use any Spark-supported SQL function aside from user-defined functions, window functions, aggregation functions, and those that return multiple rows. For the full functionality to exist in Delta Lake OSS some updates to Spark OSS are required which are expected in the Spark 3.4 release according to this GitHub issue. Keep in mind that this may cause issues when running other engines that may not support these Spark functions, making it difficult to get the open value of lakehouses.)

CREATE TABLE default.players (
  id INT,
  name STRING,
  team STRING,
  team_first_letter STRING GENERATED ALWAYS AS (SUBSTRING(team, 1, 1)) 
)
USING DELTA
PARTITIONED BY (team_first_letter)

How to Maximize the Benefits of Partitioning

In the Hive world, partitioning by a particular column or columns worked well to improve queries in many situations but had some areas of difficulty which modern table formats try to improve upon in different ways.

While you may have the table partitioned, query engines may not automatically know to take advantage of partitioning. In Hive tables, this required explicit filtering using the partition columns (which were often derived from other columns and resulted in long-winded queries). To see how verbose Hive queries can become, examine the following query:

SELECT * 
FROM orders 
WHERE order_time BETWEEN '2022-06-01 10:00:00' AND '2022-07-15 10:00:00' 
  AND order_year = 2022 
  AND order_month BETWEEN 6 AND 7;

As you can see, querying by the timestamp field isn’t enough to accelerate the query with partitioning, so you have to add filters on the derived month and year columns, which are assumed to be the partition columns for the table. Not having the filters on the order_month and order_year columns would result in a full table scan. Data consumers may not know enough about the engineering of a table to add these extra filters, resulting in slower queries. Plus, you really shouldn’t have to know the physical layout of the table to have a good experience with it.
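A small Python sketch shows why the extra filters matter in a Hive-style layout. It models partitions as key=value directory names and prunes only on columns that appear in those names; the function name and directory list are hypothetical, and real engines do this through the metastore rather than string parsing.

```python
def hive_prune(partition_dirs, filters):
    """Keep directories whose key=value segments satisfy every partition filter."""
    kept = []
    for d in partition_dirs:
        values = dict(seg.split("=") for seg in d.split("/"))
        ok = True
        for column, predicate in filters.items():
            # A filter on a non-partition column (e.g. order_time) cannot prune.
            if column in values and not predicate(int(values[column])):
                ok = False
        if ok:
            kept.append(d)
    return kept

dirs = [f"order_year=2022/order_month={m}" for m in range(1, 13)]

# Filtering only on the raw timestamp column: every directory is still scanned.
only_ts = hive_prune(dirs, {"order_time": lambda v: True})

# Adding the derived partition columns prunes to June and July.
with_parts = hive_prune(dirs, {
    "order_year": lambda y: y == 2022,
    "order_month": lambda m: 6 <= m <= 7,
})
print(len(only_ts), len(with_parts))  # 12 2
```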

Let’s look at how the different table formats try to improve upon this situation.

Apache Iceberg

In Apache Iceberg there is a feature called “hidden partitioning” that makes getting the maximum benefit of partitioning quite easy. When you partition the data, you specify a column and a transform like day(ts), so any query that filters on the source column (ts) will automatically benefit from the partitioning and avoid a full table scan where logically possible.

So instead of this:

SELECT * 
FROM orders 
WHERE order_time BETWEEN '2022-06-01 10:00:00' AND '2022-07-15 10:00:00' 
  AND order_year = 2022 
  AND order_month BETWEEN 6 AND 7;

Iceberg simplifies it to this:

SELECT * 
FROM orders 
WHERE order_time BETWEEN '2022-06-01 10:00:00' AND '2022-07-15 10:00:00';

Iceberg does even more to reduce the amount of data an engine needs to read. Within the metadata files that Iceberg uses to track tables are all sorts of column metadata tracking counts, minimum values, maximum values, and ranges at the manifest and data file level for further pruning based on any filtered column.

Apache Hudi

Using partitions in Hudi works like traditional Hive partitioning – there are no transforms at partition declaration like in Iceberg so any partition columns must be explicitly stated in your query like in Hive.
To compensate for this, Hudi stores column metadata in a column stats index in the metadata table that optimizes file pruning. Using this index, it can prune files based on their metadata and avoid the need for additional column partitioning such as day, month, or year columns (referred to as data skipping). So if data skipping and the metadata table are enabled on your Hudi table, a query filter on a timestamp field as shown below can be optimized using transforms in your filter.

SELECT * FROM orders WHERE date_format(
  order_time,
  'yyyy-MM-dd'
) BETWEEN '2022-06-01' AND '2022-07-15';

The query engine will then use the column stats index to skip data files that don’t include relevant data to improve scan times. However, this requires you to know how to write a query like this, which isn’t the typical ANSI SQL way to write it.

Delta Lake

In Delta Lake, if you have added generated columns to your table, it will automatically add filter predicates on those columns when you filter by the source column. A query in Delta Lake would look like the following:

SELECT * 
FROM orders 
WHERE order_time BETWEEN '2022-06-01' AND '2022-07-15';

If you have generated columns order_year, order_month, and order_day and those are specified as partition columns, then Delta Lake would generate the predicates to take advantage of the partition columns.
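The kind of predicate derivation described here can be approximated in Python. This is a simplified sketch under stated assumptions, not Delta Lake's optimizer: the function name is hypothetical, and it only derives inclusive ranges for the order_year, order_month, and order_day columns when the bounds make that safe.

```python
from datetime import datetime

def derive_partition_filters(lower: datetime, upper: datetime) -> dict:
    """Derive inclusive ranges on generated columns from lower <= order_time <= upper."""
    filters = {"order_year": (lower.year, upper.year)}
    # A month range is only valid when both bounds fall in the same year.
    if lower.year == upper.year:
        filters["order_month"] = (lower.month, upper.month)
        # Likewise, a day range needs both bounds in the same month.
        if lower.month == upper.month:
            filters["order_day"] = (lower.day, upper.day)
    return filters

print(derive_partition_filters(datetime(2022, 6, 1), datetime(2022, 7, 15)))
# {'order_year': (2022, 2022), 'order_month': (6, 7)}
```

The derived ranges match the filters a Hive user would have had to add by hand for the same query.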

Like Iceberg and Hudi, Delta Lake will also attempt further file pruning using metadata. In Delta Lake’s case, it collects statistics (such as minimum and maximum values) on the first 32 columns in your table (this number can be reduced or increased) and uses them to skip files when those columns are filtered.

How to Evolve the Partition Scheme

As your data and use cases grow and evolve, you’ll sometimes find that your current partitioning scheme needs to change. Oftentimes to change the partitioning scheme, you have to run the time-consuming, expensive, and disruptive operation of re-writing the entire table.

Apache Iceberg

In Apache Iceberg, all the information on how files are partitioned is tracked in metadata. Each of the three tiers of metadata files (metadata files, manifest lists, and manifests) has information on the partitioning of the table. This makes it possible to change how the table is partitioned going forward (referred to as partition evolution).

Consider a table that was originally partitioned by month and whose partition scheme was then evolved to partition by day beginning January 2009. The new scheme applies to all new data added to the table after the change is made. If new 2008 data is added after the change in the partition spec, it will have the new spec applied, and during query planning it will be grouped with the other data that shares that partition spec.

When planning a query, a separate plan will be made for data in each partition scheme so each segment is optimized for the best query performance.

Updating the partition scheme is as simple as running an ALTER TABLE statement.

-- Removing old partition field
ALTER TABLE prod.db.booking_table 
DROP PARTITION FIELD month(date);
-- Adding a new partition field
ALTER TABLE prod.db.booking_table 
ADD PARTITION FIELD day(date);

Apache Hudi

You cannot evolve the partition scheme in Apache Hudi without rewriting the table.

Delta Lake

You cannot evolve the partition scheme in Delta Lake without rewriting the table.

Conclusion

Partitioning is important to delivering performant queries on large data sets. All three of the main data lake table formats have different approaches to the role of partitioning in how they optimize file pruning for performant queries. When deciding on which format best suits your needs, be sure to ask and answer the following questions:

What are my current partitioning practices?
Will my partitions evolve in the future?
Which format will be most ergonomic and easy to use for my data consumers?
