11 minute read · September 5, 2025
Scaling Data Lakes: Moving from Raw Parquet to Iceberg Lakehouses

Head of DevRel, Dremio

When you save data, the format you choose makes all the difference. Think about it like keeping notes: writing them in a plain text file is simple, but finding or analyzing specific details later can be a chore. Put those same notes in a spreadsheet, and suddenly you can filter, sort, and calculate with ease. Data formats work much the same way; your choice determines how easy (or painful) it is to store, process, and analyze information at scale.
For years, many teams leaned on raw CSV and JSON files. They’re human-readable and flexible, but not very efficient when datasets grow into the billions of rows. Enter columnar formats like Apache Parquet. By storing data column by column instead of row by row, Parquet dramatically cuts down on storage size and speeds up queries, especially when you only need a handful of columns out of a massive dataset.
But as powerful as Parquet is, it only gets you part of the way there. Once you start dealing with thousands of Parquet files, things get messy. Managing schema changes, tracking versions, and keeping governance under control all become headaches. That’s where technologies like Apache Iceberg and Apache Polaris step in, turning raw collections of files into structured, discoverable, and governable datasets that truly power the modern lakehouse.
The Benefit of Parquet over Raw CSV and JSON
CSV and JSON files are like the “default settings” of data storage. They’re easy to create, easy to share, and just about every tool on the planet can read them. That’s why so many data projects start there. But as soon as your datasets grow beyond a few gigabytes, their weaknesses start to show.
CSV stores everything as plain text, which makes files bulky and slows down processing. JSON adds structure with nested fields, but it’s verbose; those curly braces and repeated keys add up fast. And in both cases, your query engine has to scan the entire file, even if you only need one column. It’s like flipping through every single page of a phone book just to find one person’s phone number.
Apache Parquet flips the script by organizing data column by column instead of row by row. This simple shift unlocks several significant benefits:
- Smaller storage footprint: Similar values are stored together, allowing for better compression and significantly reducing storage costs.
- Faster analytics: Query engines can skip irrelevant columns, scanning only the data you actually need. If your query just asks for “total sales by region,” there’s no reason to read customer names or product descriptions (the sketch after this list shows this in code).
- Compatibility with modern analytics tools: Spark, Dremio, Trino, and just about every analytics engine speaks Parquet fluently, making it a go-to choice for data lakes.
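To make the column-skipping benefit concrete, here’s a minimal sketch using PyArrow; the file names and column names are placeholders, not anything from a real dataset. It converts a CSV export to Parquet once, then reads back only the two columns a query actually needs.

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# One-time conversion: read the CSV and write it out as compressed,
# columnar Parquet.
table = pv.read_csv("sales.csv")
pq.write_table(table, "sales.parquet", compression="zstd")

# Later reads can skip irrelevant columns entirely: only "region" and
# "sale_amount" are decoded here, no matter how wide the table is.
subset = pq.read_table("sales.parquet", columns=["region", "sale_amount"])
print(subset.num_rows, subset.schema.names)
```

The same query against the original CSV would have to parse every row and every field, only to throw most of them away.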
In short, Parquet turns big, unwieldy datasets into something leaner and faster to work with. However, as we’ll see, storing data as “just Parquet files” still lacks a crucial layer of organization once your data lake grows beyond a handful of datasets.
The Limitations of Using Just Parquet Files
Parquet is a considerable step up from raw CSV or JSON, but as soon as you scale from a single file to a whole lake of them, cracks begin to show. Think of it like a folder on your computer: one spreadsheet is easy to manage, but a folder with 10,000 spreadsheets, each slightly different, quickly becomes a nightmare.
Here are some of the biggest pain points teams run into when managing datasets spread across many Parquet files:
- No version control: If someone updates or deletes a file, there’s no built-in way to roll back or see what the dataset looked like last week. You’re left with manual backups (if you remembered to make them).
- Schema drift: Over time, new fields are added, old ones are renamed, and not every file gets updated consistently. Queries start to fail because the dataset is no longer uniform, and detecting the drift is entirely on you (see the sketch after this list).
- Small files problem: Ingesting data in micro-batches often results in thousands of small Parquet files. Query engines like Spark, Trino, or Dremio have to open and scan every one, which makes performance tank.
- No transactions: If you’re writing to multiple Parquet files at once and something fails midway, you can end up with a corrupted or inconsistent dataset.
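To see how little the format itself helps here, the sketch below uses PyArrow to read each file’s footer and compare schemas; the directory path is hypothetical. With bare Parquet files, a consistency check like this is something you have to build and run yourself.

```python
import glob

import pyarrow.parquet as pq

# Group the files in one "dataset" directory by their schema. Nothing in
# plain Parquet enforces that they agree with each other.
schemas = {}
for path in glob.glob("warehouse/events/*.parquet"):
    schema = pq.read_schema(path)  # reads only the footer, not the data
    schemas.setdefault(str(schema), []).append(path)

if len(schemas) > 1:
    print(f"Schema drift: {len(schemas)} distinct schemas in one dataset")
else:
    print("All files share the same schema (for now)")
```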
These limitations don’t just make life harder for data engineers; they ripple out to analysts, data scientists, and business teams who rely on trustworthy, up-to-date data. This is the gap Apache Iceberg was created to fill: providing a table-like structure and metadata management for collections of Parquet files.
How Apache Iceberg Brings Metadata and Structure
Apache Iceberg was designed to address the challenges associated with managing raw Parquet files at scale. Instead of leaving you with a giant pile of files to sort through, Iceberg layers in a rich metadata system that makes your data lake feel more like a traditional database, while still keeping the openness and flexibility of the lake.
Here’s what Iceberg adds on top of Parquet:
- Centralized metadata: Iceberg tracks every file in your dataset through manifests and metadata files. That means your query engine doesn’t need to scan the entire storage bucket to figure out what exists; it just consults the metadata.
- ACID transactions: Whether you’re inserting, updating, or deleting, Iceberg ensures consistency. Multiple jobs can safely write to the same dataset at the same time without corrupting it.
- Time travel: Every change creates a new snapshot. Want to see your sales table as it looked last quarter? Just query the snapshot from that point in time.
- Schema evolution: Add, drop, or rename columns without breaking queries or rewriting all your old data. Iceberg keeps track of changes safely across files.
- Hidden partitioning: Instead of manually managing partitions (and hoping you got it right), Iceberg automatically handles partition logic behind the scenes, making queries faster without extra work.
A good way to picture Iceberg is like an index in a library. Without it, you’d wander through endless shelves (files), hoping to find the right book (record). With Iceberg, the index tells you exactly where everything is, what versions exist, and how it’s organized.
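As a rough sketch of what that looks like in practice, here’s how these features surface through PyIceberg; the catalog URI and the analytics.sales table below are assumptions for illustration, not real endpoints.

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.types import DoubleType

# Load a table through an Iceberg catalog (URI and table name are placeholders).
catalog = load_catalog("lakehouse", uri="http://localhost:8181")
table = catalog.load_table("analytics.sales")

# Scans are planned from Iceberg metadata, not by listing every file in the bucket.
current = table.scan(selected_fields=("region", "amount")).to_arrow()

# Time travel: every commit produces a snapshot you can query by id.
oldest_snapshot_id = table.history()[0].snapshot_id
previous = table.scan(snapshot_id=oldest_snapshot_id).to_arrow()

# Schema evolution: add a column without rewriting existing data files.
with table.update_schema() as update:
    update.add_column("discount", DoubleType())

print(current.num_rows, previous.num_rows)
```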
This layer of intelligence transforms Parquet from “a file format” into a true table format, setting the foundation for the modern data lakehouse. But Iceberg doesn’t work alone; it still needs a way to manage and expose these tables across your organization. That’s where catalogs like Apache Polaris come in.
The Role of Lakehouse Catalogs like Apache Polaris
Apache Iceberg makes individual datasets smarter by layering in metadata and structure, but in a real-world data platform, you’re rarely dealing with just one table. You might have hundreds, or even thousands, of Iceberg tables across different domains, teams, and projects. Managing them consistently becomes its own challenge.
That’s where lakehouse catalogs come in. A catalog acts like the brain of your lakehouse, keeping track of:
- Which tables exist and where they live
- Who has access to what data
- How multiple tools (Spark, Trino, Dremio, Flink, etc.) can find and query them
Without a catalog, each tool operates independently, and governance becomes a patchwork of ad hoc permissions and tribal knowledge. With a catalog, you get a single source of truth for discovery, access control, and metadata management.
Apache Polaris is one of the most important pieces in this puzzle. It’s an open-source implementation of the Apache Iceberg REST Catalog specification, meaning:
- Any engine that speaks Iceberg can talk to Polaris over HTTP.
- Governance features, such as credential vending, multi-tenant support, and role-based access, are built in.
- It avoids vendor lock-in by adhering to open standards.
And this isn’t just theory: Polaris already powers major platforms like Dremio Enterprise Catalog and Snowflake’s Open Catalog, with more adoption on the horizon. By pairing Iceberg’s table-level intelligence with Polaris’s catalog-level governance, you unlock a full-fledged lakehouse that feels unified and discoverable, no matter how many tools or datasets you’re working with.
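Concretely, connecting to a Polaris-backed catalog is just generic Iceberg REST configuration. The sketch below uses PyIceberg, and the endpoint, credential, and warehouse values are placeholders you’d swap for your own deployment’s.

```python
from pyiceberg.catalog import load_catalog

# All of the connection details below are placeholders; any Iceberg
# REST-capable engine uses the same kind of settings.
catalog = load_catalog(
    "polaris",
    **{
        "type": "rest",
        "uri": "https://polaris.example.com/api/catalog",
        "credential": "<client_id>:<client_secret>",
        "warehouse": "analytics_warehouse",
    },
)

# Discovery goes through the catalog, not through object-storage listings.
for namespace in catalog.list_namespaces():
    for identifier in catalog.list_tables(namespace):
        print(identifier)
```

Spark, Trino, Dremio, or Flink would point at the same endpoint through their own Iceberg REST settings, which is exactly what makes the catalog a single source of truth.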
Together, Iceberg and Polaris transform a data lake from a loose collection of files into a governed, high-performance lakehouse foundation.
Conclusion: Building an Open, Governed, and Scalable Lakehouse
The journey from raw CSV and JSON files to a fully governed lakehouse shows just how far data architecture has evolved. CSV and JSON made it simple to get started, but fell short once scale and performance entered the picture. Parquet solved those problems by introducing columnar efficiency, but it still left gaps around governance, schema evolution, and multi-file consistency.
Apache Iceberg closed that gap by transforming collections of Parquet files into true tables, complete with ACID transactions, schema flexibility, and time travel capabilities. And with Apache Polaris sitting on top as a catalog, organizations finally have a way to manage all those Iceberg tables consistently, delivering centralized access, discovery, and governance across every tool in the stack.
Together, Iceberg and Polaris form the backbone of the modern lakehouse: open, performant, and enterprise-ready. They empower data teams to move beyond file management and focus instead on delivering trustworthy insights and AI-ready datasets.
If your organization is still juggling raw Parquet files, now is the time to take the next step. Explore Apache Iceberg to bring structure to your data lake, and look to Polaris to unify governance and access. The result isn’t just cleaner architecture; it’s a foundation that can scale with your business, your teams, and the future of AI-driven analytics.