This article has been revised and updated from its original version published in 2022 to reflect the latest Apache Iceberg developments, including V3 partition transforms and multi-engine ecosystem improvements.

One of the most common performance killers in traditional data lakes is the accidental full table scan. A user writes WHERE order_date = '2024-03-15' but the table is partitioned by year and month columns, since the query didn't filter on those exact partition columns, the engine scans every file. Apache Iceberg eliminates this entire category of problems with hidden partitioning.

Hidden partitioning is one of Iceberg's most impactful features for real-world query performance. It means users never need to know how a table is partitioned. They write natural predicates, and Iceberg maps them to the correct partitions automatically.

The Problem with Traditional Partitioning

In Hive-style partitioning, partitions are physical directories on storage: For official documentation, refer to the Iceberg partitioning spec.

s3://warehouse/orders/year=2024/month=03/data.parquet
s3://warehouse/orders/year=2024/month=04/data.parquet

To use these partitions, users must write queries that filter on the exact partition columns:

-- Works: Filter matches partition columns
SELECT * FROM orders WHERE year = 2024 AND month = 3;

-- BROKEN: Full table scan! order_date is not a partition column
SELECT * FROM orders WHERE order_date = '2024-03-15';

This creates several problems:

Users Must Know the Partition Scheme

Every analyst, data scientist, and BI tool must be aware of how the table is physically organized. This is a leaky abstraction, implementation details bleeding into the query layer.

Partition Columns Pollute the Schema

The year and month columns are synthetic, they exist only for partitioning. They shouldn't be part of the logical table definition, but in Hive, they are.

Changing the Partition Scheme Is Destructive

If you want to change from monthly to daily partitions, you must rewrite the entire table. There's no way to evolve the partition scheme without a full data migration.

Wrong Queries Cause Silent Performance Disasters

There's no warning when a query doesn't use partition columns. It just scans everything and runs 100x slower than expected. On a petabyte table, this can cost thousands of dollars in compute.

How Iceberg Hidden Partitioning Works

Iceberg separates the logical partition definition from query predicates. You define partition transforms on source columns, and Iceberg handles the mapping.

Defining Partitions

CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date TIMESTAMP,
    amount DECIMAL(10,2),
    region STRING
)
PARTITIONED BY (month(order_date), region);

Notice: there are no synthetic year or month columns. The partition is defined as a transform function month() applied to the existing order_date column.

How Queries Map to Partitions

When a user writes:

SELECT * FROM orders WHERE order_date = '2024-03-15';

Iceberg's planning step:

Reads the partition spec: month(order_date)
Evaluates: month('2024-03-15') = 2024-03
Uses the manifest list to skip all partitions except 2024-03

The user wrote a natural predicate on order_date. Iceberg automatically mapped it to the correct partition. No partition column awareness required.

Available Partition Transforms

Transform	Input Type	Example	Use Case
`identity(col)`	Any	`identity(region)`	Low cardinality columns
`year(col)`	Timestamp/Date	`year(order_date)`	Annual reporting
`month(col)`	Timestamp/Date	`month(order_date)`	Monthly batch workloads
`day(col)`	Timestamp/Date	`day(order_date)`	Daily ingestion
`hour(col)`	Timestamp	`hour(event_ts)`	High-volume streaming
`bucket(N, col)`	Any	`bucket(16, user_id)`	Even distribution for joins
`truncate(W, col)`	String/Int	`truncate(3, zip_code)`	Grouping by prefix

Range Predicate Support

Hidden partitioning works with range predicates too:

-- Scans only partitions for March, April, May 2024
SELECT * FROM orders WHERE order_date BETWEEN '2024-03-01' AND '2024-05-31';

Iceberg evaluates the range against partition bounds and prunes partitions that fall entirely outside the range.

Partition Pruning in the Read Path

Hidden partitioning connects directly to Iceberg's three-level query pruning:

Manifest list level: Each manifest entry includes partition summaries. Iceberg skips manifests whose partition ranges don't overlap with the query predicate.
Manifest file level: Within surviving manifests, each data file entry includes the partition value. Files with non-matching partitions are skipped.
Data file level: Column statistics within each surviving file enable further pruning.

For a detailed walkthrough, see How Apache Iceberg Is Built for Performance.

Partition Evolution: Changing Without Rewriting

One of the most powerful consequences of hidden partitioning is partition evolution. Since partitions are defined as transforms (not physical directory schemes), you can change the partition scheme without rewriting data:

-- Table was partitioned by month; traffic grew, now need daily partitions
ALTER TABLE orders SET PARTITION SPEC (day(order_date), region);

After this change:

Old data stays in monthly partitions, no rewrite needed
New data is written to daily partitions
Queries spanning both time periods work correctly, Iceberg evaluates the predicate against both partition specs

This is impossible with Hive-style partitioning, where changing the partition scheme requires rewriting every file.

Multi-Column Partitioning and Query Planning

Tables can be partitioned by multiple transforms:

PARTITIONED BY (day(order_date), bucket(8, customer_id));

When a query filters on order_date, Iceberg prunes by date partitions but scans all customer buckets. When a query filters on both order_date AND customer_id, both partition dimensions are used for pruning.

For multi-column filter patterns that don't map to partitions, consider Z-ordering to improve within-file pruning.

How Different Engines Handle Hidden Partitioning

Engine	Support Level	Notes
Dremio	Full (reads + writes)	Automatic pruning, Reflections enhance further
Apache Spark	Full	Most mature implementation
Apache Flink	Full	Streaming writes respect partition spec
Trino	Full	Reads and writes
DuckDB	Read-only	Partition pruning on reads
Snowflake	Read-only	External Iceberg tables only

Compared to Other Table Formats

Feature	Iceberg	Delta Lake	Hudi
Hidden partitioning	Yes (transforms on source columns	No) uses Liquid Clustering as alternative	No, explicit partition columns
Partition evolution	Yes (no data rewrite	No) requires rewrite	No, requires rewrite
Range predicate pruning	Yes (automatic	Yes) via file statistics	Partial
Multi-column transforms	Yes	No	No

For a full comparison, see Table Format Partitioning: Iceberg vs. Hudi vs. Delta Lake.

Best Practices for Partition Strategy

Start Broad, Evolve Narrow

Begin with month(ts) partitioning. If data volume grows and queries become more granular, evolve to day(ts). You can always add granularity without rewriting.

Avoid Over-Partitioning

Too many partitions create too many small files. A partition should contain at least one full-size data file (128-256MB). If each partition has only a few megabytes, consolidate.

Combine with Sort Order

Within each partition, sort data by frequently filtered columns. This tightens column statistics and improves file pruning.

Use Bucket for High-Cardinality Joins

bucket(N, join_key) evenly distributes data for parallel join execution across engines. Good for customer_id, user_id, or other high-cardinality columns.

use Dremio Reflections

Dremio's Reflections work with hidden partitions, you can create Reflections with different sort orders and partition schemes optimized for specific query patterns.

SQL Examples

Create a Table with Hidden Partitions

CREATE TABLE events (
    event_id BIGINT,
    user_id BIGINT,
    event_ts TIMESTAMP,
    event_type STRING,
    payload STRING
)
PARTITIONED BY (day(event_ts), bucket(16, user_id));

Query Without Knowing the Partition Scheme

-- Works perfectly, Iceberg maps to day partition automatically
SELECT * FROM events 
WHERE event_ts > TIMESTAMP '2024-03-01 00:00:00'
  AND user_id = 42;

Evolve the Partition Scheme

ALTER TABLE events SET PARTITION SPEC (hour(event_ts), bucket(32, user_id));

Hidden partitioning is one of the key advantages that makes Iceberg the most performant open table format for production workloads. It eliminates a class of performance bugs that plague traditional data lakes and enables non-technical users to write fast queries without understanding data layout.

Real-World Scenarios

Scenario 1: E-Commerce Order Analytics

An e-commerce company has an orders table with 500 million rows across 3 years. Analysts query by date range, region, and product category.

Without hidden partitioning (Hive):

-- Analyst must know: table is partitioned by year and month columns
SELECT region, SUM(total) FROM orders 
WHERE year = 2024 AND month = 3 AND region = 'US';
-- If analyst writes WHERE order_date = '2024-03-15', full table scan!

With hidden partitioning (Iceberg):

-- Analyst writes natural queries, Iceberg handles partition mapping
SELECT region, SUM(total) FROM orders 
WHERE order_date BETWEEN '2024-03-01' AND '2024-03-31' AND region = 'US';
-- Automatically maps to month=2024-03 partition, scans only matching files

The Iceberg version is safer (no accidental full scans), more readable (natural predicates), and forward-compatible (partition scheme can evolve without changing queries).

Scenario 2: IoT Sensor Data with Time-Based Access

An IoT platform ingests sensor readings partitioned by hour(event_ts). Data scientists query by specific time windows, while dashboards query by day or week.

-- Data scientist: query specific 2-hour window
SELECT * FROM sensor_data 
WHERE event_ts BETWEEN TIMESTAMP '2024-03-15 14:00:00' AND TIMESTAMP '2024-03-15 16:00:00';
-- Iceberg prunes to exactly 2 hour-partitions

-- Dashboard: daily rollup
SELECT sensor_id, AVG(reading) FROM sensor_data 
WHERE event_ts >= TIMESTAMP '2024-03-15 00:00:00' 
  AND event_ts < TIMESTAMP '2024-03-16 00:00:00'
GROUP BY sensor_id;
-- Iceberg prunes to 24 hour-partitions

Both queries work correctly and efficiently without either user knowing the partition scheme.

Scenario 3: Multi-Tenant SaaS with Bucket Partitioning

A SaaS application uses bucket(32, tenant_id) partitioning to evenly distribute data across partitions for parallel processing. Queries always filter by tenant:

SELECT * FROM events WHERE tenant_id = 'acme-corp' AND event_type = 'login';
-- Iceberg evaluates bucket(32, 'acme-corp') = bucket 17
-- Scans only 1 of 32 buckets, 97% of data pruned

Frequently Asked Questions

Does hidden partitioning work with every query engine?

Yes, the partition spec is stored in the table metadata, and every engine that reads Iceberg tables evaluates it during query planning. The pruning happens at the metadata level, not in engine-specific code.

Can I use hidden partitioning with existing data?

Yes. You can add a partition spec to an existing table, and new writes will follow the new partition scheme. Existing data stays unpartitioned, but future writes are partitioned. Over time, as old data is compacted or replaced, the entire table becomes partitioned. See partition evolution for details.

How does hidden partitioning affect write performance?

There's minimal overhead. The engine evaluates the partition transform for each row during writes to route it to the correct partition. For transforms like month() or day(), this is a simple timestamp extraction, negligible compared to the cost of serializing and uploading Parquet files.

Does Dremio support all partition transforms?

Yes. Dremio supports all standard Iceberg partition transforms including identity, year, month, day, hour, bucket, and truncate. Dremio's Reflections can also add additional sort orders and partition schemes optimized for specific dashboard query patterns.

What about cost savings?

By pruning 80-95% of files before scanning, hidden partitioning directly reduces object storage read costs. On AWS, each S3 GET request costs $0.0004 per 1,000 requests. A query that scans 10,000 files instead of 500 costs 20x more in S3 API fees alone, not counting the compute time difference.

Free Resources to Continue Your Iceberg Journey

Iceberg Lakehouse Books from Dremio Authors

Legacy Content

Partitioning Up Until Now

When it comes to running queries on your data lake, Hive delivers some great conveniences in being able to write SQL queries that can be converted into MapReduce jobs on HDFS (Hadoop Distributed File System). Hive provides users the ability to interact with tables by putting files in a particular directory – but this means when you query that table it will scan through every record (referred to as a full table scan), which makes queries take a long time to plan and to execute, especially at scale. To address this problem, Hive went with the tried-and-true method of partitioning, which enables you to specify a column or columns to split the dataset into smaller pieces.

For example, if the queries on a dataset commonly filter by month, you may want to partition the data by month. However, in most events and datasets, there isn’t an “event_month” field – rather, it’s usually event_date or event_timestamp. So, in order to partition by month, the ETL job writing the data to this table would have to duplicate the data in the event_timestamp field, truncate it to the month granularity, and write it to a new column “event_month.” This creates a directory in the table directory for each month, and queries that filter by this derived “event_month” field would take advantage of partition pruning and avoid a full table scan, resulting in much better response times.

However, many users of the dataset, especially new users, may not know about this duplicative field. They’re typically used to filtering by when the event really happened – the event_timestamp. If a user’s query filters for a single month, but not on the new event_month field (e.g., event_timestamp BETWEEN ‘2022-01-01’ AND ‘2022-01-31’), the query will not do any partition pruning, resulting in a full table scan and painfully slow response times.

By relying on the physical file system structure to determine what is a table and a partition, Hive tables had several disadvantages:

The wrong partition scheme could result in many subdirectories and files which could make query planning and execution take longer as they iterate through directories and files, aka the infamous small file problem.
If you wanted to re-partition, it would involve a complete rewrite of the data because the data would have to be reorganized along with the directory structure which would be very costly.
Partition columns must be used explicitly to avoid a full table scan. For some partition schemes, you need to create a new column (for example, taking a timestamp and making a year, month or date column). To benefit from the partition you always have to include the new month column in your query even if you specified a “where” clause on the original timestamp column.

There’s a need for a better way of tracking tables and partitions whether using Hive with HDFS, Cloud Object Storage, or any other storage layer.

Smarter Partitioning with Apache Iceberg

There have been several attempts to create a “table format” that can track tables in a way that eliminates many of these pain points. Each has different limitations, such as what file types you can work with, what types of transactions are possible, and more. At Netflix, a solution was created that is now an Apache Foundation Project called Apache Iceberg.

The Apache Iceberg table format separates the definition of tables and partitions from the physical structure of the files through three layers of metadata that track the overall table, its snapshots, and the list of data files, respectively. This brings many benefits:

Since partitions are tracked within metadata files, you don’t have to worry about partition subdirectories slowing down your queries based on your partitioning.

You can evolve your partitions. For example, if you were partitioning data by year and wanted to partition by month going forward you can do so without rewriting the table. The old data will follow the old partition, new data will follow the new partition, and Iceberg will be made aware in order to plan queries appropriately. Evolving your partitions can be a lot less expensive than a complete repartitioning rewrite.

In particular, an Iceberg feature called “hidden partitioning” helps overcome many of the challenges associated with partitioning Hive tables.

Let’s say you have a table of dentist visits with a timestamp of when the visit occurred as a column. Hive would need the timestamp column transformed into a new column to partition the data by month. In the query below, the data is taken from one non-partitioned table and created in a new partitioned table.

The problem arises when you want to query a particular set of days within a month (example below). Even though the where clause already captures the time window based on the same visit_time column that visit_month is based on, you still have to explicitly query the month column to get the benefit of the partition (without it a full table scan would occur). This could result in someone who is not familiar with the partitioning of the table running longer, more expensive queries than they need to.

The experience is significantly improved with the Iceberg format. Instead of creating a new field to be the partition column, you can specify a column and transform it. Both are tracked within the metadata without having to add a physical column to the table.

So when it comes to writing the files and organizing the data files and the metadata around them, such as manifests, Iceberg knows the table is partitioned by the column visit_time and the transform month. So when anyone queries the visit_time column there is no need to specify a different, partition-specific column in the query. That step is taken care of by Iceberg.

Manifest file(s)

A manifest file is a metadata file that lists a subset of data files that make up a snapshot of the table at a certain point in time.

Manifest file(s)

u003ca href=u0022https://iceberg.apache.org/terms/#manifest-listu0022u003eu003cspan style=u0022font-weight: 400;u0022u003eA manifest file is a metadata file that lists a subset of data files that make up a snapshot of the table at a certain point in time.u003c/spanu003eu003c/au003e

When this query is being planned, Iceberg picks up on the column by which the table is physically partitioned and automatically generates two predicates, the scan predicate, and the partition predicate to filter which manifests or delete files (files that specify deleted records that shouldn’t show up in query results of the current snapshot) the query will be executed against.

Delete file(s)

Files that track records to be removed/ignored from certain data files in a snapshot.

Delete file(s)

u003ca href=u0022https://iceberg.apache.org/spec/#delete-formatsu0022u003eu003cspan style=u0022font-weight: 400;u0022u003eFiles that track records to be removed/ignored from certain data files in a snapshot. u003c/spanu003eu003c/au003e

Hidden partitioning allows for simplified management on the table definition and data writing side. On the data access side, data consumers don’t have to know precisely how the data is partitioned, if they query based on any defined permutation of a partitioned field they will automatically get the benefits from the query planning without having to add an explicit clause to take advantage of the partition.

Embrace an Open Architecture with Apache Iceberg

Iceberg metadata management provides huge gains in efficiency when writing/organizing/accessing business-critical data in the data lake. Hidden partitioning adds some very user-friendly functionality to make Iceberg easier to use and simpler to manage. Maximize your data lake by building on robust open formats like Apache Iceberg.

Try Dremio Cloud free for 30 days

Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.

Start For Free

Article Topics

Dremio Blog: Various Insights

Blog coverpage for Ingesting Data into Aparche Iceberg with Dremio

Feb 1, 2024 Product Insights from the Dremio Blog

Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg

By unifying data from diverse sources, simplifying data operations, and providing powerful tools for data management, Dremio stands out as a comprehensive solution for modern data needs. Whether you are a data engineer, business analyst, or data scientist, harnessing the combined power of Dremio and Apache Iceberg will undoubtedly be a valuable asset in your data management toolkit.

Alex Merced

Sep 22, 2023 Dremio Blog: Open Data Insights

Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop

We're always looking for ways to better handle and save money on our data. That's why the "data lakehouse" is becoming so popular. It offers a mix of the flexibility of data lakes and the ease of use and performance of data warehouses. The goal? Make data handling easier and cheaper. So, how do we […]

Alex Merced

Oct 12, 2023 Product Insights from the Dremio Blog

Table-Driven Access Policies Using Subqueries

This blog helps you learn about table-driven access policies in Dremio Cloud and Dremio Software v24.1+.

Albert Vernon

Table of Contents

The Problem with Traditional Partitioning

Users Must Know the Partition Scheme

Partition Columns Pollute the Schema

Changing the Partition Scheme Is Destructive

Wrong Queries Cause Silent Performance Disasters

How Iceberg Hidden Partitioning Works

Defining Partitions

How Queries Map to Partitions

Available Partition Transforms

Range Predicate Support

Partition Pruning in the Read Path

Partition Evolution: Changing Without Rewriting

Multi-Column Partitioning and Query Planning

How Different Engines Handle Hidden Partitioning

Compared to Other Table Formats

Best Practices for Partition Strategy

Start Broad, Evolve Narrow

Avoid Over-Partitioning

Combine with Sort Order

Use Bucket for High-Cardinality Joins

use Dremio Reflections

SQL Examples

Create a Table with Hidden Partitions

Query Without Knowing the Partition Scheme

Evolve the Partition Scheme

Real-World Scenarios

Scenario 1: E-Commerce Order Analytics

Scenario 2: IoT Sensor Data with Time-Based Access

Scenario 3: Multi-Tenant SaaS with Bucket Partitioning

Frequently Asked Questions

Does hidden partitioning work with every query engine?

Can I use hidden partitioning with existing data?

How does hidden partitioning affect write performance?

Does Dremio support all partition transforms?

What about cost savings?

Free Resources to Continue Your Iceberg Journey

Iceberg Lakehouse Books from Dremio Authors

Legacy Content

Partitioning Up Until Now

Smarter Partitioning with Apache Iceberg

Manifest file(s)

Manifest file(s)

Delete file(s)

Delete file(s)

Embrace an Open Architecture with Apache Iceberg

Try Dremio Cloud free for 30 days

Related Dremio Articles

Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg

Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop

Table-Driven Access Policies Using Subqueries

Ready to Get Started?