Dremio Blog

14 minute read · April 11, 2026

Open Source and the Data Lakehouse (Apache Parquet, Apache Iceberg, Apache Polaris and Apache Arrow)

Alex Merced, Head of DevRel, Dremio

Every data warehouse, every database, every analytics platform is built from the same four components: storage, a table format, a catalog, and a query engine. Traditional systems bundle all four into a single proprietary product. You get convenience, but you also get lock-in, data silos, and a compounding infrastructure bill.

The data lakehouse takes a different approach. It deconstructs these components into modular, interchangeable layers, each built on open-source standards. This post walks through the Apache Software Foundation projects that form the core of the open lakehouse stack, what each one does, and how Dremio integrates them into a production-ready platform with built-in AI capabilities.

The Four Components Every Data System Needs

Whether you're running Oracle, Snowflake, or a Postgres instance on your laptop, four layers are always present:

| Component | What It Does |
| --- | --- |
| Storage | Where data physically lives: disk, SSD, object storage |
| Table Format | How data is organized into tables with schemas, partitions, and transaction guarantees |
| Catalog | The registry that tracks what tables exist, where they are, and who can access them |
| Query Engine | The software that reads data, optimizes query plans, and returns results |

In a traditional data warehouse, these four components are welded together. The storage format is proprietary. The catalog is internal. The engine only works with its own data. You buy the whole stack from one vendor, and your data lives inside their system.


The Cost of Bundled, Proprietary Systems

When storage, metadata, catalog, and engine are all integrated into one product, every system becomes a silo. Customer data in your CRM can't be joined with revenue data in your warehouse without first copying it through an ETL pipeline. That copy creates a second version of the data, one that can drift out of sync with the source.

Multiply this across an organization's full tool stack and you get a pattern: five copies of customer data in five systems, each slightly different, none authoritative. The cost isn't just storage. It's the ETL pipelines you maintain, the governance gaps between copies, and the engineering time spent reconciling conflicting numbers every quarter.

You also can't swap out one component. If you want a faster query engine, you migrate your entire warehouse. If the vendor raises prices, you pay it or start a multi-month migration project.

The Lakehouse Decouples the Stack

The data lakehouse architecture separates these four components into independent layers. Each layer can be chosen, configured, and replaced independently.

  • Storage becomes cheap, durable object storage (S3, Azure Blob, Google Cloud Storage) that you own and control.
  • Table format becomes an open standard that adds database-level reliability on top of files.
  • Catalog becomes an independent service that any engine can connect to.
  • Query engine becomes interchangeable: use the best engine for each workload.

To maximize interoperability across these layers, you want open-source components. And for infrastructure-level building blocks where vendor neutrality matters most, Apache Software Foundation projects are the strongest choice.

The ASF is a non-profit that stewards 320+ open-source projects under vendor-neutral governance. Projects operate under "The Apache Way": transparent decision-making, merit-based leadership, and community over code. No single company can unilaterally change the API, license, or direction of an ASF project. That's why every layer of the open lakehouse stack has an Apache project at its core.

Apache Parquet: The Storage Format

Data on object storage needs a file format. CSV works but is slow (the engine reads every column even if a query only needs two) and space-inefficient. Apache Parquet solves both problems.

Parquet is a columnar file format, meaning it stores data by column rather than by row. This design enables three performance advantages:

  • Column pruning: A query that only needs customer_id and revenue reads only those two columns from disk. Row-based formats read everything.
  • Efficient compression: Same-type data in a column compresses well. Parquet files are typically 75-90% smaller than equivalent CSV files.
  • Predicate pushdown: Parquet embeds min/max statistics per column per row group. A query with WHERE revenue > 1000 can skip entire row groups that fall outside that range without reading any data.

Every major analytics engine reads Parquet natively: Spark, Trino, Dremio, DuckDB, Snowflake, Athena. Choosing Parquet as your storage format means your data is accessible to any engine on day one.

Apache Iceberg: The Table Format

Parquet files are efficient, but a folder full of Parquet files isn't a table. There's no schema enforcement, no transaction guarantees, and no way for two engines to safely write at the same time. Apache Iceberg fills that gap.

Iceberg is an open table format that sits on top of Parquet files (or ORC/Avro) and adds the reliability features you'd expect from a database:

  • ACID transactions: Concurrent readers and writers don't corrupt data or see partial updates.
  • Schema evolution: Add, drop, rename, or reorder columns without rewriting data files.
  • Partition evolution: Change your partitioning strategy (e.g., from daily to hourly) without rewriting historical data.
  • Hidden partitioning: Users write WHERE event_date = '2026-01-15', and Iceberg maps that to the correct physical partition automatically.
  • Time travel: Query any historical snapshot of the table for auditing, debugging, or reproducibility.

Because Iceberg is engine-agnostic, the same table can be read and written by Spark, Flink, Trino, Dremio, and Snowflake concurrently. Your data isn't locked into any engine's proprietary format.
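To make time travel concrete, here is a toy model (not the real Iceberg implementation or the pyiceberg API): each commit appends an immutable snapshot to the table's metadata, and a query "as of" a timestamp resolves to the latest snapshot committed at or before that time.

```python
# Toy illustration of Iceberg time travel: resolving a timestamp
# to a snapshot. Real Iceberg stores the snapshot log in the
# table's metadata JSON; the structures here are simplified.
from dataclasses import dataclass

@dataclass
class Snapshot:
    snapshot_id: int
    committed_at: int   # epoch milliseconds
    manifest_list: str  # pointer to the data files in this version

def snapshot_as_of(snapshots, ts_millis):
    """Return the latest snapshot committed at or before ts_millis."""
    eligible = [s for s in snapshots if s.committed_at <= ts_millis]
    if not eligible:
        raise ValueError("no snapshot exists at or before that time")
    return max(eligible, key=lambda s: s.committed_at)

history = [
    Snapshot(1, 1_000, "s3://bucket/meta/snap-1.avro"),
    Snapshot(2, 2_000, "s3://bucket/meta/snap-2.avro"),
    Snapshot(3, 3_000, "s3://bucket/meta/snap-3.avro"),
]
print(snapshot_as_of(history, 2_500).snapshot_id)  # 2
```

Because old snapshots are immutable, reading "as of" a past timestamp is just reading a different manifest list; no data is copied or restored.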

Apache Polaris: The Catalog

Tables need a catalog, a central registry that tells engines "table X exists, its current metadata is at location Y, and here are the access rules." Apache Polaris is an open-source REST catalog purpose-built for Iceberg.

Polaris graduated to an Apache Top-Level Project in March 2026, meaning it has met the ASF's standards for community diversity, governance maturity, and production readiness. Key capabilities:

  • Iceberg REST API implementation: Any engine that speaks the Iceberg REST protocol can connect to Polaris. One catalog, many engines.
  • Unified access control: Security rules are enforced at the catalog layer, independent of which engine is running the query.
  • Multi-cloud support: Polaris works across S3, Azure, and GCS. Your catalog isn't tied to one cloud provider.

Without an open catalog, each engine maintains its own view of what tables exist. Polaris gives you a single source of truth.
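As a rough sketch of what a catalog tracks, here is a hypothetical in-memory registry (not the Polaris API): each table identifier maps to its current metadata location, and access rules are checked at the catalog layer before any engine gets that pointer.

```python
# Toy model of a catalog's core responsibilities: track where each
# table's current metadata lives and enforce access rules centrally.
# Hypothetical data structure, not the Apache Polaris REST API.

class Catalog:
    def __init__(self):
        self._tables = {}  # "namespace.table" -> metadata location
        self._grants = {}  # "namespace.table" -> allowed principals

    def register(self, identifier, metadata_location, readers):
        self._tables[identifier] = metadata_location
        self._grants[identifier] = set(readers)

    def load_table(self, identifier, principal):
        """Return the metadata pointer only if the principal may read."""
        if principal not in self._grants.get(identifier, set()):
            raise PermissionError(f"{principal} cannot read {identifier}")
        return self._tables[identifier]

catalog = Catalog()
catalog.register("sales.orders", "s3://bucket/orders/metadata/v7.json",
                 readers={"analyst", "etl_job"})
print(catalog.load_table("sales.orders", "analyst"))
```

Because every engine asks the same catalog, the access decision is made once, in one place, regardless of which engine issues the query.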

Apache Arrow: The Performance Layer

Once data reaches the query engine, it needs to be processed in memory. Traditional engines read Parquet (columnar on disk) and convert it into row-based in-memory structures for processing. That conversion, the "serialization tax," wastes significant CPU time.

Apache Arrow eliminates this by defining a columnar in-memory format that engines can use natively. Key performance advantages:

  • Zero-copy interoperability: Different systems (Python, Java, C++) share data in Arrow format without converting it. No serialization, no deserialization.
  • Vectorized processing: Arrow's contiguous memory layout enables SIMD (Single Instruction, Multiple Data) instructions, allowing CPUs to process multiple values simultaneously.
  • Arrow Flight: A gRPC-based transport layer that streams columnar data over the network at speeds that approach raw network throughput. This replaces the legacy JDBC/ODBC row-at-a-time transport with parallel columnar streaming.

Engines built on Arrow (like Dremio, which co-created the project) read Parquet files into Arrow buffers and process them without format conversion. That's the architectural reason Arrow-native engines are fast: not bigger clusters, not more hardware, just less wasted work.

The Integration Challenge

These four Apache projects together form a complete open-source lakehouse stack. The tradeoff: assembling them yourself is real work.

You need to provision and configure object storage (S3 buckets, IAM policies, encryption). Deploy and manage a Polaris catalog. Configure Iceberg table properties (partitioning, compaction schedules, snapshot expiration). Choose and deploy a query engine. Set up security and access control at each layer independently. Build monitoring across all of it.

This is doable, and many organizations with strong platform engineering teams do exactly this. But it's a meaningful investment in infrastructure expertise and ongoing operations.

Dremio: Open Standards, Integrated for the AI Era

Dremio takes the same four Apache projects and packages them into a unified lakehouse platform that's production-ready out of the box.

  • Storage: Works with your S3, Azure, or GCS buckets, or Dremio-managed storage.
  • Table Format: Native Apache Iceberg with automatic table optimization (compaction, vacuum, manifest rewriting) that runs in the background.
  • Catalog: Open Catalog built on Apache Polaris, extended with federated source integration and fine-grained access control (row-level security, column masking).
  • Engine: High-performance MPP query engine built on Apache Arrow with vectorized processing, Columnar Cloud Cache (C3), and Autonomous Reflections that optimize query performance automatically.

Dremio also adds query federation: query data in PostgreSQL, MongoDB, Snowflake, and other sources alongside your Iceberg tables, without copying data into the lakehouse. Predicates are pushed down to each source system so filtering runs close to the data.

For the AI era, Dremio layers three capabilities on top:

  • AI Agent: Ask analytical questions in plain English directly in the Dremio console. The agent writes SQL, returns results, and generates visualizations.
  • MCP Server: An open-source Model Context Protocol server that connects external AI tools (ChatGPT, Claude, custom agents) to your Dremio environment with full security and governance.
  • AI SQL Functions: Run LLM operations (AI_GENERATE, AI_CLASSIFY, AI_COMPLETE) directly inside SQL queries. Extract structured data from unstructured sources, classify text, or summarize results without leaving the SQL engine.

The foundation is open. Your data stays in your storage, in Apache Iceberg format, managed by an Apache Polaris catalog, processed by an Apache Arrow-based engine. No proprietary formats, no lock-in. And when you need AI-powered analytics, the agentic layer is built directly on top of that open foundation.

Try Dremio Cloud free for 30 days →

Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.