18 minute read · October 2, 2025

Apache Iceberg Table Performance Management with Dremio’s OPTIMIZE

Alex Merced · Head of DevRel, Dremio

Apache Iceberg provides a powerful foundation for managing large analytical datasets, but as with any data system, performance depends heavily on how well the data is organized on disk. Over time, frequent writes, schema evolution, and streaming ingestion can leave tables fragmented with many small files, or with oversized files that hurt query speed. Left unmanaged, this leads to slower scans, longer planning times, and more compute spent on queries.

This is where Dremio’s OPTIMIZE command comes into play. Designed specifically for Iceberg tables, OPTIMIZE rewrites data and metadata files into more efficient layouts, merging small files, splitting overly large files, and even reclustering data when needed. With a few simple parameters, data engineers can fine-tune the balance between optimization depth and job duration, tailoring compaction to their workloads. And with Dremio’s built-in auto-optimization capabilities, much of this work can even be automated, reducing operational burden while keeping performance consistently high.

In this blog, we’ll take a deep dive into:

  • What the OPTIMIZE command does under the hood
  • How to use its arguments to control file sizes, partition targeting, and manifest rewrites
  • Strategies for balancing optimization quality with job runtime and cost
  • How to leverage Dremio’s auto-optimization to keep tables healthy without constant manual intervention

By the end, you’ll have a clear understanding of how to incorporate Dremio’s OPTIMIZE into your Iceberg performance management strategy, ensuring queries stay fast, metadata remains lean, and maintenance fits seamlessly into your data platform.

What OPTIMIZE Does and Why It Matters

Every write to an Iceberg table, whether from batch loads, streaming pipelines, or incremental updates, creates new Parquet data files and metadata entries. Over time, these writes can result in:

  • Too many small files – Common with streaming ingestion, these increase metadata overhead and force queries to open many files, slowing scans.
  • Overly large files – Typical of bulk loads, these reduce query parallelism, making large scans less efficient.
  • Unbalanced partitions – Data may cluster unevenly, leaving some partitions with thousands of files and others with very few.
  • Bloated manifests – Metadata files can fragment, forcing the query planner to read more metadata than necessary.

The OPTIMIZE command is Dremio’s way of addressing these issues. When you run OPTIMIZE, Dremio rewrites the table’s files to meet target sizes and clustering rules, while also optionally compacting manifests. This results in:

  • Fewer, better-sized files – Improving scan efficiency and reducing open-file overhead.
  • Balanced partitions – Ensuring data is evenly distributed for parallel execution.
  • Leaner metadata – Compact manifests reduce query planning latency.
  • Improved query performance – Optimized tables consistently deliver faster and more predictable queries.

For clustered tables (where data is stored in a sorted order on specific keys), OPTIMIZE also attempts to re-cluster the data so that records are co-located according to clustering rules. This can significantly improve query performance when filtering or joining on those keys, though it may require multiple runs to fully converge the data layout.

Ultimately, OPTIMIZE ensures that as your Iceberg tables evolve, they remain performant and efficient. Without it, query latency creeps upward, metadata grows unwieldy, and maintenance costs rise, especially at scale.

Basic Syntax and Clauses

At its core, the OPTIMIZE TABLE command is a straightforward SQL operation, but its flexibility comes from the options you can apply. Here’s the general structure:

OPTIMIZE TABLE <table_name>
   [REWRITE DATA [USING BIN_PACK] [FOR PARTITIONS <predicate>]
      [(TARGET_FILE_SIZE_MB=…, MIN_FILE_SIZE_MB=…, MAX_FILE_SIZE_MB=…, MIN_INPUT_FILES=…)]]
   [REWRITE MANIFESTS];

1. REWRITE DATA

This is the heart of OPTIMIZE. By default, Dremio uses the bin-pack strategy, which merges small files and splits overly large ones to reach a balanced file size. Specifying REWRITE DATA explicitly lets you control its behavior and tune it for your workload.

  • USING BIN_PACK – This is the default strategy and is almost always the right choice.
  • FOR PARTITIONS <predicate> – Allows you to focus optimization on specific partitions (for example, FOR PARTITIONS (date = '2025-10-01')). This is useful when only certain partitions are “hot” and receiving frequent updates.
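For example, a minimal sketch that compacts only a single hot partition might look like this (the sales_data table and order_date partition column are illustrative placeholders):

OPTIMIZE TABLE sales_data
   REWRITE DATA USING BIN_PACK
   FOR PARTITIONS (order_date = '2025-10-01');

Partitions that don’t match the predicate are left untouched, which keeps the job small and focused on recently written data.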

2. Tuning Parameters

Inside REWRITE DATA, you can pass parameters to control how aggressively compaction is performed:

  • TARGET_FILE_SIZE_MB – The ideal size for output files (default 256 MB). Larger values reduce metadata overhead; smaller values increase query parallelism.
  • MIN_FILE_SIZE_MB – Files smaller than this threshold are considered for merging.
  • MAX_FILE_SIZE_MB – Files larger than this threshold are split.
  • MIN_INPUT_FILES – The minimum number of files that must be present before an OPTIMIZE job runs (default 5).

Together, these parameters let you balance optimization quality against job runtime. For example, lowering TARGET_FILE_SIZE_MB can make jobs finish faster while still improving file layout.
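As a sketch, a job that passes all four parameters might look like this (the events table and the specific values are illustrative, not recommendations):

OPTIMIZE TABLE events
   REWRITE DATA (TARGET_FILE_SIZE_MB=128, MIN_FILE_SIZE_MB=32, MAX_FILE_SIZE_MB=230, MIN_INPUT_FILES=10);

Here the smaller target keeps rewrite volume down, while the higher MIN_INPUT_FILES keeps the job from running until enough small files have accumulated to make it worthwhile.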

3. REWRITE MANIFESTS

Manifests track the files and partitions within an Iceberg table. Over time, these metadata files can become fragmented and numerous. Adding REWRITE MANIFESTS to your command compacts them into fewer, larger manifests (default target size ~8 MB). This reduces query planning time and is a lightweight step you can run frequently, even outside heavy compaction jobs.
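A manifests-only pass can be expressed on its own, as in this sketch (the table name is illustrative):

OPTIMIZE TABLE sales_data
   REWRITE MANIFESTS;

Because no data files are rewritten, it typically completes quickly and can be scheduled far more often than full compaction.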

Key Parameters in Detail

The flexibility of OPTIMIZE lies in its tuning options. By adjusting these parameters, you can decide whether to prioritize query performance, job speed, or resource efficiency. Let’s look at the main parameters:

1. TARGET_FILE_SIZE_MB

  • Default: 256 MB
  • What it does: Sets the ideal size for output files after compaction.
  • Why it matters:
    • Larger target sizes reduce the number of files and metadata entries, lowering query planning overhead.
    • Smaller targets allow greater query parallelism, since more executors can scan in parallel.
  • Example use case: For large analytic scans, keep the target high (e.g., 512 MB). For workloads with many small queries, lower it to 128–256 MB to increase parallelism.
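As a sketch, a scan-heavy fact table might be compacted toward larger files like this (the fact_orders table and the 512 MB target are illustrative):

OPTIMIZE TABLE fact_orders
   REWRITE DATA (TARGET_FILE_SIZE_MB=512);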

2. MIN_FILE_SIZE_MB

  • Default: ~24% of target file size (≈ 61 MB if target = 256 MB).
  • What it does: Files smaller than this threshold are considered for merging.
  • Why it matters: Helps eliminate the “small files problem” that hurts performance.
  • Example use case: In streaming ingestion pipelines, raise this value to aggressively merge small files and avoid metadata bloat.
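For a streaming target, a sketch might raise the merge threshold so more of the small incoming files get picked up (the clickstream_events table and the values are illustrative):

OPTIMIZE TABLE clickstream_events
   REWRITE DATA (TARGET_FILE_SIZE_MB=256, MIN_FILE_SIZE_MB=128);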

3. MAX_FILE_SIZE_MB

  • Default: ~180% of target file size (≈ 460 MB if target = 256 MB).
  • What it does: Files larger than this threshold are split into smaller files.
  • Why it matters: Prevents oversized files that reduce parallelism.
  • Example use case: If your ingestion jobs produce giant files (1–2 GB each), lowering this value ensures queries can run across more threads in parallel.
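A sketch for breaking up oversized bulk-load files might lower the cap like this (the bulk_load_staging table and the values are illustrative):

OPTIMIZE TABLE bulk_load_staging
   REWRITE DATA (TARGET_FILE_SIZE_MB=256, MAX_FILE_SIZE_MB=350);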

4. MIN_INPUT_FILES

  • Default: 5
  • What it does: Sets the minimum number of files required for an optimization job to trigger.
  • Why it matters: Prevents OPTIMIZE from running on too few files, where the cost outweighs the benefits.
  • Example use case: Lower this to 2–3 for smaller tables where you still want frequent optimization, or raise it higher for large workloads to reduce unnecessary jobs.
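For a small dimension table that you still want compacted regularly, a sketch might lower the trigger threshold (the dim_customers table is illustrative):

OPTIMIZE TABLE dim_customers
   REWRITE DATA (MIN_INPUT_FILES=2);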

Balancing Act

These parameters allow you to trade off optimization depth for job duration and compute cost. For example:

  • Lowering TARGET_FILE_SIZE_MB and MIN_INPUT_FILES yields faster jobs but leaves more files behind.
  • Increasing them means deeper optimization but longer runtimes.
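To make the trade-off concrete, here is a sketch of a quick pass versus a deeper pass on the same hypothetical events table (values are illustrative only):

-- Quick pass: smaller target files, triggers with few inputs, finishes fast
OPTIMIZE TABLE events
   REWRITE DATA (TARGET_FILE_SIZE_MB=128, MIN_INPUT_FILES=2);

-- Deeper pass: larger target files, waits for more inputs, runs longer
OPTIMIZE TABLE events
   REWRITE DATA (TARGET_FILE_SIZE_MB=512, MIN_INPUT_FILES=20);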

For most production environments, the right approach is to start with defaults, then tune incrementally based on table size, ingestion pattern, and workload characteristics.

Manifest Rewriting

Optimizing Iceberg tables isn’t just about the size of the data files; it’s also about the metadata that tracks them. Iceberg uses manifest files to list and describe all the data files belonging to a table. Each manifest file contains information about file paths, partition data, and statistics. Over time, frequent writes can create many small manifest files, increasing the cost of query planning because the engine must scan them all before execution.

The REWRITE MANIFESTS option in Dremio’s OPTIMIZE command solves this by compacting many small manifests into fewer, larger ones.

Why It Matters

  • Faster Query Planning – With fewer manifests to scan, queries start faster.
  • Reduced Metadata Overhead – Smaller, consolidated manifests lower memory and CPU use during planning.
  • Healthier Metadata Layer – Cleaner manifests improve the scalability of your Iceberg table as it grows.

When to Use It

  • After High-Frequency Writes – If ingestion jobs are writing small batches often, manifests can fragment quickly.
  • Before Peak Query Hours – Running a lightweight REWRITE MANIFESTS ensures queries won’t hit planning slowdowns during busy periods.
  • Alongside Data Rewrites – Combining data compaction and manifest rewriting keeps both files and metadata in sync.

Example

OPTIMIZE TABLE sales_data
   REWRITE DATA (TARGET_FILE_SIZE_MB=256)
   REWRITE MANIFESTS;

This command merges small files into ~256 MB targets and compacts manifests at the same time, giving you the benefits of both data and metadata optimization in one run.

Low-Cost Maintenance

Unlike full data compaction, manifest rewriting is relatively lightweight. It can be run more frequently, even daily or multiple times per day, without putting major pressure on your compute resources. This makes it an excellent “quick win” for keeping tables responsive without waiting for heavier optimization jobs.

Auto-Optimization in Dremio

While manual OPTIMIZE jobs give you precise control, many data teams don’t want to spend time scheduling and tuning compaction runs themselves. To solve this, Dremio offers auto-optimization, a feature that keeps Iceberg tables healthy in the background with little intervention.

How It Works

When auto-optimization is enabled at the catalog or table level:

  • Dremio automatically compacts small files and manages metadata for you.
  • Jobs run on a regular schedule (default every 3 hours for data optimization, every 24 hours for vacuum/cleanup).
  • Optimizations focus on partitions that have been updated recently, reducing wasted effort on cold data.

This ensures that your most active tables stay performant without the constant need for manual maintenance.

Why It’s Useful

  • Reduces Operational Overhead – Engineers no longer need to manually script or schedule OPTIMIZE jobs for every ingestion pipeline.
  • Keeps Hot Data Fast – Tables that see frequent inserts or updates are automatically tuned for query performance.
  • Consistent Performance – Query users get predictable response times without waiting for manual optimization to catch up.

Where It Fits Best

  • High-ingestion tables – Streaming or micro-batch pipelines where small files accumulate quickly.
  • Critical reporting datasets – Tables that must deliver consistent performance to BI dashboards or production applications.
  • Shared environments – Teams with many users and queries benefit from an automatically managed baseline of performance.

Example: Enabling Auto-Optimization

In Dremio’s UI, you can enable auto-optimization for the entire catalog or for specific tables. Once enabled, Dremio continuously monitors and optimizes behind the scenes. For most organizations, this becomes the default layer of maintenance, with manual OPTIMIZE reserved for exceptional cases where deeper tuning is required.

Conclusion

Performance management for Apache Iceberg tables isn’t just about cleaning up small files; it’s about ensuring your data layout evolves in step with your ingestion patterns and query workloads. Dremio’s OPTIMIZE command provides the precision engineers need: merging, splitting, and reclustering data into efficient layouts while keeping metadata lean. With its flexible parameters, you can tailor compaction jobs to strike the right balance between optimization depth, runtime, and cost.

At the same time, Dremio’s auto-optimization features mean you don’t always have to run these jobs manually. By letting Dremio continuously monitor and optimize Iceberg tables in the background, your most critical datasets stay query-ready without the overhead of constant maintenance.

For data engineers and architects, the best approach often combines both worlds:

  • Auto-optimization for ongoing, incremental maintenance that keeps hot data performant.
  • Manual OPTIMIZE for targeted, deeper compaction or special cases where fine-tuned control is required.

Together, these strategies ensure your Iceberg tables remain fast, scalable, and cost-efficient, so your data teams can focus less on managing files and more on delivering insights.
