October 2, 2025
Minimizing Iceberg Table Management with Smart Writing
Head of DevRel, Dremio
Managing Apache Iceberg tables effectively is often a balancing act between write performance, storage efficiency, and query speed. While Iceberg’s flexibility enables powerful features like time travel and schema evolution, many teams find themselves running frequent OPTIMIZE jobs to compact small files and rebalance partitions. These jobs improve performance but also consume valuable compute resources, extend maintenance windows, and can complicate low-latency or streaming pipelines.
The good news is that many of these challenges can be mitigated before they ever appear. By designing ingestion pipelines, both batch and streaming, to write optimized data up front, data engineers and architects can reduce or even eliminate the need for heavy post-write optimization. This not only lowers operational costs but also keeps systems responsive in real-time scenarios where every second counts.
In this blog, we’ll explore practical strategies for smart writing in Iceberg, including:
- Choosing effective file sizes and compression formats
- Designing partitions and clustering to align with query patterns
- Buffering and batching streaming writes to avoid “small files”
- Leveraging Iceberg table properties for compaction-friendly layouts
By the end, you’ll have a clear framework for building ingestion flows that minimize the need for reactive maintenance, helping you focus less on running compaction jobs and more on delivering value from your data.
Batch Ingestion Best Practices
When loading data into Iceberg tables in batch mode, whether via scheduled ETL jobs or one-time bulk imports, your choices at write time set the stage for future performance and maintenance. Poorly sized files, fragmented partitions, and inefficient compression schemes all increase the likelihood that you’ll need frequent OPTIMIZE jobs later. The goal is to design your batch ingestion so that each write produces data that is “close to optimal” from the beginning.
1. Write Fewer, Larger Files
Iceberg performs best when files are in the 128 MB – 512 MB range. Writing too many small files creates metadata overhead and slows down queries, while very large files reduce query parallelism. Most engines, including Spark and Flink, provide options to control the target file size (for example, via the write.target-file-size-bytes table property). Setting this appropriately ensures your batch jobs emit well-sized Parquet files right away.
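As a concrete illustration, here is a minimal PySpark sketch; the catalog, database, and table names are placeholders, and it assumes a Spark session already configured with an Iceberg catalog. It sets the target file size property and reins in the number of output tasks so each task actually fills its files:

```python
# Minimal sketch: raise the target data file size for batch writes.
# Assumes an Iceberg-enabled Spark session; all names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ask Iceberg to roll to a new data file around ~512 MB.
spark.sql("""
    ALTER TABLE my_catalog.db.events
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')
""")

# The property is only an upper bound per write task; tiny tasks still emit
# tiny files, so also keep the number of output tasks in check.
df = spark.table("my_catalog.db.staging_events")
(df.coalesce(8)                      # fewer, larger output files
   .writeTo("my_catalog.db.events")
   .append())
```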
2. Use Compression and Row Group Settings Wisely
Dremio and Iceberg both recommend Zstandard compression as the default because it offers an excellent balance between size reduction and decompression speed. In addition, configuring row group sizes to align with your target file size keeps files compact while still enabling predicate pushdown and efficient scans.
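A short sketch of both settings, reusing the `spark` session from the earlier example and the same placeholder table name; it sets Zstandard compression and roughly 128 MB row groups via standard Iceberg write properties:

```python
# Sketch: align Parquet compression and row-group size with the file size
# target, so a ~512 MB file holds a handful of row groups.
# Assumes an existing Iceberg-enabled `spark` session; names are placeholders.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'write.parquet.compression-codec'    = 'zstd',
        'write.parquet.row-group-size-bytes' = '134217728'
    )
""")
```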
3. Align Partitioning with Query Workloads
Partitioning is one of the most powerful features of Iceberg, but it can be a double-edged sword if not designed carefully. When writing batch data:
- Partition on fields that are frequently used in filters (e.g., date, region).
- Avoid over-partitioning, which creates many small partitions and leads to scattered small files.
- Use hidden partitioning in Iceberg when possible to simplify write logic and reduce ingestion complexity.
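For instance, here is a minimal sketch of hidden partitioning; the schema, column names, and table name are illustrative. Iceberg derives the day from the timestamp at write time, so neither writers nor queries need to manage a separate date column:

```python
# Sketch: hidden partitioning on a timestamp plus one categorical dimension.
# Assumes an existing Iceberg-enabled `spark` session; names are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.db.events (
        event_id  BIGINT,
        event_ts  TIMESTAMP,
        region    STRING,
        payload   STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts), region)
""")
```

Queries that filter on event_ts automatically prune the derived day partitions, with no extra column for ingestion to populate.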
4. Consider Pre-Clustering or Sorting Data
If your queries often scan by a particular column (such as customer_id or event_time), you can pre-sort your batch data before writing. This ensures related records are physically co-located, reducing the need for costly reclustering later.
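A small PySpark sketch of this idea, again with placeholder names and the same assumed session: rows are grouped by the partition column and sorted within each write task before the append, so related records land in the same files and row groups:

```python
# Sketch: pre-cluster batch data before writing to Iceberg.
# Assumes an existing Iceberg-enabled `spark` session; names are placeholders.
df = spark.table("my_catalog.db.staging_events")
(df.repartition("region")                       # group rows by partition value
   .sortWithinPartitions("event_ts")            # co-locate related records
   .writeTo("my_catalog.db.events")
   .append())
```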
5. Monitor and Adjust Over Time
Batch ingestion isn’t “set it and forget it.” Regularly check the distribution of file sizes and partition balance in your Iceberg tables. If you consistently see small files being produced, tune your job settings, such as reducing write parallelism or increasing the flush size, to correct course before small-file accumulation becomes a bigger issue.
Streaming Ingestion Strategies
While batch jobs give you full control over file sizes and partitioning, streaming pipelines introduce new challenges. Continuous ingestion, especially with tools like Kafka Connect, Flink, or Spark Structured Streaming, often leads to many small files being written to your Iceberg tables. Left unchecked, this creates the classic small files problem, which degrades query performance and increases the need for frequent optimization jobs. With the right techniques, however, you can significantly reduce this overhead.
1. Buffer Events into Micro-Batches
Instead of writing every incoming event as a new file, configure your streaming framework to buffer records into small in-memory batches before flushing to storage. For example:
- Adjust checkpoint intervals in Spark or Flink so files flush every few minutes rather than seconds.
- Set flush thresholds (by record count or size) in Kafka Connect so that each commit produces files in the 128–256 MB range.
This approach keeps latency low while still producing files large enough for efficient reads.
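As a rough sketch, here is a Structured Streaming job that buffers about five minutes of events per commit. The Kafka broker, topic, checkpoint path, and table names are all placeholders, and it assumes the Spark Kafka connector and an Iceberg catalog are already configured in the session:

```python
# Sketch: micro-batch streaming into Iceberg with a multi-minute trigger.
# Assumes an Iceberg-enabled `spark` session and the Kafka connector package.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
          .option("subscribe", "events")                      # placeholder topic
          .load()
          .selectExpr("CAST(value AS STRING) AS payload",
                      "timestamp AS event_ts"))

query = (events.writeStream
         .format("iceberg")
         .outputMode("append")
         .trigger(processingTime="5 minutes")          # buffer ~5 min per commit
         .option("fanout-enabled", "true")              # helps if the target table is partitioned
         .option("checkpointLocation", "s3://bucket/checkpoints/events")  # placeholder
         .toTable("my_catalog.db.raw_events"))

query.awaitTermination()
```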
2. Tune File Size Properties for Streaming Workloads
Iceberg provides table properties like write.target-file-size-bytes, which defaults to ~512 MB. For streaming, you may want to lower this target so that files don’t take too long to fill up while still avoiding tiny files. Choosing a target between 128–256 MB often balances latency and efficiency.
3. Leverage Partitioning that Matches Stream Characteristics
Streaming data often arrives sorted by time or another event attribute. Use that to your advantage:
- Partition by event time windows (e.g., date, hour) to keep related records together.
- Avoid overly granular partitions (like minute), which can generate sparse partitions and increase metadata overhead.
4. Align Flush Intervals with Downstream SLAs
Every streaming pipeline has a tradeoff between data freshness and file quality. If downstream systems require data to be queryable within seconds, you may accept smaller files. But if a few minutes of latency is acceptable, you can buffer longer and produce well-sized files that minimize the need for compaction later.
5. Monitor and Adjust in Real Time
Unlike batch ingestion, streaming workloads evolve with traffic patterns. Monitor your file size distribution and number of files per partition daily. If you see a buildup of undersized files, adjust your stream’s buffer or interval settings immediately. Some organizations even implement adaptive flush logic, where flush intervals scale dynamically with throughput.
Partitioning and Clustering Design
Even if your ingestion pipelines are writing data at optimal file sizes, the way those files are organized across partitions and sort orders has a huge impact on query performance and maintenance overhead. Poorly chosen partitions can scatter data, create thousands of small files, and force the system to scan unnecessary data. On the other hand, smart partitioning and clustering strategies ensure that queries read only what they need, while minimizing the burden of post-ingestion optimization.
1. Partition for Your Most Common Queries
Partition columns should reflect how your users and applications filter the data. For example:
- Time-based partitions (e.g., date, hour) are effective when most queries slice data by time ranges.
- Categorical partitions (e.g., region, customer_id) make sense when workloads are strongly aligned with those dimensions.
The key is to choose high-cardinality fields carefully: while they can reduce scan sizes, they may also produce too many small partitions.
2. Avoid Over-Partitioning
It’s tempting to create highly granular partitions (like date + hour + region) for precision, but this often leads to an explosion of small files. Remember that Iceberg supports hidden partitioning, which automatically derives partitions (like extracting year/month/day from a timestamp) without exposing this complexity to the query layer. This keeps ingestion simple while still optimizing file placement.
3. Use Clustering (Sorting) for Secondary Optimizations
Partitioning is not always enough, especially when queries filter by multiple dimensions. In these cases, clustering (sorting) your data at write time can yield big performance gains:
- Sorting by event_time within each date partition makes range scans much more efficient.
- Sorting by customer_id ensures all rows for a customer are co-located, speeding up point lookups and joins.
Dremio’s OPTIMIZE command can recluster data post-write, but pre-clustering during ingestion means fewer heavy compaction jobs later.
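If you write with Spark, one option is to record a table-level sort order using the Iceberg Spark SQL extensions (they must be enabled in the session); the column and table names below are illustrative. Engines that honor the declared sort order will cluster rows on every write:

```python
# Sketch: declare a write sort order so compliant writers cluster rows
# by customer and time. Requires the Iceberg Spark SQL extensions;
# names are placeholders.
spark.sql("""
    ALTER TABLE my_catalog.db.events
    WRITE ORDERED BY customer_id, event_ts
""")
```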
4. Balance Query Flexibility with Maintenance Cost
Every partition and clustering decision represents a tradeoff: more selective queries versus more complex ingestion. As a rule of thumb:
- Start with time-based partitions plus one high-value dimension.
- Add clustering on secondary dimensions if queries consistently filter by them.
- Reevaluate your partition strategy as workloads evolve; what works for a batch warehouse may not work for a real-time application.
5. Monitor Partition Health
Partition skew, where some partitions grow very large while others stay tiny, can reduce parallelism and create hotspots. Regularly check partition distribution to ensure your strategy is still aligned with data volume and query behavior. Adjust as needed before it snowballs into a costly optimization problem.
Table Properties Tuning
Beyond ingestion strategies and partitioning, Iceberg’s table properties give data engineers and architects fine-grained control over how data is written and managed. Tuning these settings at the table level helps ensure files are sized, compressed, and organized properly from the start, reducing reliance on heavy post-processing like OPTIMIZE.
1. Control File Sizes at Write Time
The property write.target-file-size-bytes sets the desired file size for new data files. By default, it targets ~512 MB, which works well for large batch jobs but may be too high for streaming workloads. Adjusting this value to 128–256 MB in low-latency scenarios helps avoid small files while still producing query-friendly sizes.
2. Optimize Parquet Write Behavior
Iceberg exposes a range of Parquet-specific settings, including:
- write.parquet.compression-codec – defaults to Zstandard for a balance of size and speed.
- write.parquet.row-group-size-bytes – controls how data is chunked within files; larger row groups reduce overhead but may increase scan sizes.
- write.parquet.page-size-bytes – manages the encoding page size, affecting predicate pushdown performance.
Fine-tuning these properties allows you to align storage format with query patterns while keeping files compact and efficient.
3. Manage Metadata Growth
Iceberg uses manifests to track file metadata. Left unchecked, manifest files can accumulate and increase query planning costs. Properties like commit.manifest.target-size-bytes (default ~8 MB) let you control the target size of manifests, ensuring they don’t fragment into too many small files. Pair this with occasional manifest compaction (via OPTIMIZE … REWRITE MANIFESTS) to keep metadata lean.
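A short sketch of both steps, with placeholder names and an assumed Spark session: it pins the manifest target size and compacts manifests with Spark's rewrite_manifests procedure; in Dremio, the equivalent maintenance is the OPTIMIZE … REWRITE MANIFESTS command mentioned above.

```python
# Sketch: keep manifests from fragmenting and compact them occasionally.
# Assumes an Iceberg-enabled `spark` session; names are placeholders.
spark.sql("""
    ALTER TABLE my_catalog.db.events
    SET TBLPROPERTIES ('commit.manifest.target-size-bytes' = '8388608')
""")

# Merge small manifests into fewer, larger ones.
spark.sql("CALL my_catalog.system.rewrite_manifests('db.events')")
```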
4. Configure Snapshot Expiration Defaults
Snapshots drive time travel in Iceberg but also consume storage. Table properties such as history.expire.max-snapshot-age-ms (default 5 days) and history.expire.min-snapshots-to-keep (default 1) determine how long old snapshots are retained. Setting these appropriately can reduce storage overhead and automate cleanup without needing to run VACUUM too frequently.
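For example, a minimal sketch (placeholder names and retention values) that keeps roughly three days of history and at least five snapshots, for engines and maintenance jobs that honor these properties when expiring snapshots:

```python
# Sketch: tighten snapshot retention defaults at the table level.
# Assumes an Iceberg-enabled `spark` session; names and values are placeholders.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms'   = '259200000',
        'history.expire.min-snapshots-to-keep' = '5'
    )
""")
```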
5. Tune for Your Workload, Not Just Defaults
The default values provided by Iceberg and Dremio are designed for general-purpose use. In practice, every workload is different:
- High-throughput batch jobs may benefit from larger target file sizes and row groups.
- Streaming ingestion may require lower targets and smaller flush intervals.
- Metadata-heavy pipelines may need more aggressive manifest sizing.
Monitoring table health (file counts, file sizes, metadata size) will help you decide when tuning is necessary.
Monitoring and Adjustment
Even with the best ingestion patterns, partitioning strategies, and table property tuning, data systems are not static. Data volumes, query patterns, and workload priorities evolve over time. To ensure your Iceberg tables remain optimized with minimal post-processing, you need to actively monitor their health and adjust settings as needed.
1. Track File Size Distribution
Regularly analyze the distribution of file sizes within your Iceberg tables. If you see a buildup of files below 128 MB, that’s a clear signal that your ingestion job parameters, like flush intervals, checkpoint settings, or file size targets, need adjustment. Conversely, if you’re consistently creating files above 512 MB, consider lowering the target size to improve parallelism in query execution.
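One lightweight way to do this from Spark is to bucket file sizes straight out of the table's files metadata table; the thresholds and names below are illustrative, and an Iceberg-enabled session is assumed:

```python
# Sketch: summarize data file sizes to spot a build-up of undersized files.
# Assumes an Iceberg-enabled `spark` session; names and thresholds are placeholders.
spark.sql("""
    SELECT
        CASE
            WHEN file_size_in_bytes < 128 * 1024 * 1024 THEN 'under 128 MB'
            WHEN file_size_in_bytes > 512 * 1024 * 1024 THEN 'over 512 MB'
            ELSE '128-512 MB'
        END AS size_bucket,
        COUNT(*) AS file_count
    FROM my_catalog.db.events.files
    GROUP BY 1
""").show()
```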
2. Watch Partition Balance
Partition skew can silently degrade query performance. If a few partitions contain the bulk of data while others remain sparse, queries may overburden certain executors while others stay idle. Monitoring partition-level row counts and file counts helps you identify when partitioning logic needs to be rethought.
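A quick sketch against the partitions metadata table (placeholder names, session assumed) surfaces the heaviest partitions and their file counts so skew is easy to spot:

```python
# Sketch: list the largest partitions by row count, with file counts.
# Assumes an Iceberg-enabled `spark` session; names are placeholders.
spark.sql("""
    SELECT partition, record_count, file_count
    FROM my_catalog.db.events.partitions
    ORDER BY record_count DESC
    LIMIT 20
""").show(truncate=False)
```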
3. Monitor Metadata Growth
Over time, manifest files and snapshots accumulate. If query planning starts slowing down, check the number and size of metadata files. This is often an early indicator that you need to adjust manifest sizing properties, expire old snapshots, or schedule lightweight maintenance like REWRITE MANIFESTS.
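A simple sketch (placeholder names, session assumed) that counts snapshots and manifests as a rough gauge of metadata growth:

```python
# Sketch: track metadata volume via the snapshots and manifests metadata tables.
# Assumes an Iceberg-enabled `spark` session; names are placeholders.
spark.sql("""
    SELECT COUNT(*) AS snapshot_count
    FROM my_catalog.db.events.snapshots
""").show()

spark.sql("""
    SELECT COUNT(*) AS manifest_count, SUM(length) AS total_manifest_bytes
    FROM my_catalog.db.events.manifests
""").show()
```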
4. Establish KPIs for Table Health
To stay proactive, define metrics that represent “healthy” tables in your environment, such as:
- Average file size in the 128–512 MB range
- Partition balance within acceptable skew ratios
- Metadata size staying within predictable growth patterns
By monitoring these KPIs, you can react to issues before they require heavy-handed optimization jobs.
5. Iterate as Workloads Change
As data grows, query patterns shift, or new applications come online, revisit your ingestion and table configurations. What worked at 1 TB of data may not scale at 100 TB. Iterative tuning ensures your system remains performant and minimizes the need for frequent OPTIMIZE operations.
Conclusion
The real secret to minimizing Iceberg table maintenance isn’t running more optimization jobs, it’s writing smarter data from the very beginning. By combining batch and streaming ingestion best practices, designing thoughtful partitioning and clustering strategies, tuning table properties, and monitoring file health, you can dramatically reduce the frequency and cost of downstream operations like OPTIMIZE.
For data engineers and architects, this shift in mindset is powerful:
- Lower operational overhead – fewer compaction jobs mean less scheduling, fewer compute spikes, and simpler pipelines.
- More predictable performance – consistently well-sized files and balanced partitions lead to stable, high-performing queries.
- Future-ready systems – as workloads evolve, your ingestion design adapts more easily than constantly relying on reactive optimizations.
Ultimately, smart writing turns maintenance into a proactive practice rather than a reactive burden. Instead of firefighting small files and partition skew, your team can focus on building reliable, performant data platforms that scale cleanly with growth.