Every data warehouse, every database, every analytics platform is built from the same four components: storage, a table format, a catalog, and a query engine. Traditional systems bundle all four into a single proprietary product. You get convenience, but you also get lock-in, data silos, and a compounding infrastructure bill.
The data lakehouse takes a different approach. It deconstructs these components into modular, interchangeable layers, each built on open-source standards. This post walks through the Apache Software Foundation projects that form the core of the open lakehouse stack, what each one does, and how Dremio integrates them into a production-ready platform with built-in AI capabilities.
The Four Components Every Data System Needs
Whether you're running Oracle, Snowflake, or a Postgres instance on your laptop, four layers are always present:
Storage: Where data physically lives: disk, SSD, object storage.
Table Format: How data is organized into tables with schemas, partitions, and transaction guarantees.
Catalog: The registry that tracks what tables exist, where they are, and who can access them.
Query Engine: The software that reads data, optimizes query plans, and returns results.
In a traditional data warehouse, these four components are welded together. The storage format is proprietary. The catalog is internal. The engine only works with its own data. You buy the whole stack from one vendor, and your data lives inside their system.
The Cost of Bundled, Proprietary Systems
When storage, metadata, catalog, and engine are all integrated into one product, every system becomes a silo. Customer data in your CRM can't be joined with revenue data in your warehouse without first copying it through an ETL pipeline. That copy creates a second version of the data, one that can drift out of sync with the source.
Multiply this across an organization's full tool stack and you get a pattern: five copies of customer data in five systems, each slightly different, none authoritative. The cost isn't just storage. It's the ETL pipelines you maintain, the governance gaps between copies, and the engineering time spent reconciling conflicting numbers every quarter.
You also can't swap out one component. If you want a faster query engine, you migrate your entire warehouse. If the vendor raises prices, you pay it or start a multi-month migration project.
The Lakehouse Decouples the Stack
The data lakehouse architecture separates these four components into independent layers. Each layer can be chosen, configured, and replaced independently.
Storage becomes cheap, durable object storage (S3, Azure Blob, Google Cloud Storage) that you own and control.
Table format becomes an open standard that adds database-level reliability on top of files.
Catalog becomes an independent service that any engine can connect to.
Query engine becomes interchangeable: use the best engine for each workload.
To maximize interoperability across these layers, you want open-source components. And for infrastructure-level building blocks where vendor neutrality matters most, Apache Software Foundation projects are the strongest choice.
The ASF is a non-profit that stewards 320+ open-source projects under vendor-neutral governance. Projects operate under "The Apache Way": transparent decision-making, merit-based leadership, and community over code. No single company can unilaterally change the API, license, or direction of an ASF project. That's why every layer of the open lakehouse stack has an Apache project at its core.
Apache Parquet: The Storage Format
Data on object storage needs a file format. CSV works but is slow (the engine reads every column even if a query only needs two) and space-inefficient. Apache Parquet solves both problems.
Parquet is a columnar file format, meaning it stores data by column rather than by row. This design enables three performance advantages:
Column pruning: A query that only needs customer_id and revenue reads only those two columns from disk. Row-based formats read everything.
Efficient compression: Same-type data in a column compresses well. Parquet files are typically 75-90% smaller than equivalent CSV files.
Predicate pushdown: Parquet embeds min/max statistics per column per row group. A query with WHERE revenue > 1000 can skip entire row groups that fall outside that range without reading any data.
Every major analytics engine reads Parquet natively: Spark, Trino, Dremio, DuckDB, Snowflake, Athena. Choosing Parquet as your storage format means your data is accessible to any engine on day one.
Apache Iceberg: The Table Format
Parquet files are efficient, but a folder full of Parquet files isn't a table. There's no schema enforcement, no transaction guarantees, and no way for two engines to safely write at the same time. Apache Iceberg fills that gap.
Iceberg is an open table format that sits on top of Parquet files (or ORC/Avro) and adds the reliability features you'd expect from a database:
ACID transactions: Concurrent readers and writers don't corrupt data or see partial updates.
Schema evolution: Add, drop, rename, or reorder columns without rewriting data files.
Partition evolution: Change your partitioning strategy (e.g., from daily to hourly) without rewriting historical data.
Hidden partitioning: Users write WHERE event_date = '2026-01-15', and Iceberg maps that to the correct physical partition automatically.
Time travel: Query any historical snapshot of the table for auditing, debugging, or reproducibility.
Because Iceberg is engine-agnostic, the same table can be read and written by Spark, Flink, Trino, Dremio, and Snowflake concurrently. Your data isn't locked into any engine's proprietary format.
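The features above all rest on Iceberg's snapshot-based metadata: every commit writes a new immutable snapshot rather than mutating old state, which is what makes time travel and safe concurrent access possible. A toy plain-Python sketch of that idea (this is a conceptual model, not the real Iceberg metadata format):

```python
from dataclasses import dataclass, field

# Toy model of snapshot-based table metadata: each commit produces a new
# immutable snapshot listing the data files valid at that point in time.
@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple

@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def commit(self, data_files):
        # A commit never mutates old snapshots; it only appends a new one.
        sid = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(sid, tuple(data_files)))
        return sid

    def current(self):
        return self.snapshots[-1]

    def time_travel(self, snapshot_id):
        # Reading an old version is just reading an old snapshot's file list.
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id)

t = Table()
v1 = t.commit(["data/f1.parquet"])
v2 = t.commit(["data/f1.parquet", "data/f2.parquet"])
print(t.current().data_files)        # newest view of the table
print(t.time_travel(v1).data_files)  # historical view at snapshot v1
```

Real Iceberg adds manifests, partition specs, and atomic catalog pointer swaps on top of this, but the core mechanism is the same: immutable snapshots, never in-place edits.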
Apache Polaris: The Catalog
Tables need a catalog, a central registry that tells engines "table X exists, its current metadata is at location Y, and here are the access rules." Apache Polaris is an open-source REST catalog purpose-built for Iceberg.
Polaris graduated to an Apache Top-Level Project in March 2026, meaning it has met the ASF's standards for community diversity, governance maturity, and production readiness. Key capabilities:
Iceberg REST API implementation: Any engine that speaks the Iceberg REST protocol can connect to Polaris. One catalog, many engines.
Unified access control: Security rules are enforced at the catalog layer, independent of which engine is running the query.
Multi-cloud support: Polaris works across S3, Azure, and GCS. Your catalog isn't tied to one cloud provider.
Without an open catalog, each engine maintains its own view of what tables exist. Polaris gives you a single source of truth.
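To illustrate the "one catalog, many engines" point: a client library that speaks the Iceberg REST protocol, such as PyIceberg, can point at a Polaris endpoint with a few lines of configuration. A hedged sketch of a `.pyiceberg.yaml` entry, where the URI, warehouse name, and credentials are placeholders:

```yaml
catalog:
  polaris:
    uri: https://polaris.example.com/api/catalog   # placeholder endpoint
    credential: <client-id>:<client-secret>        # placeholder OAuth credential
    warehouse: my_warehouse                        # placeholder warehouse name
```

Any other REST-speaking engine (Spark, Trino, Flink) connects with an equivalent handful of properties, and all of them see the same tables and the same catalog-enforced access rules.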
Apache Arrow: The Performance Layer
Once data reaches the query engine, it needs to be processed in memory. Traditional engines read Parquet (columnar on disk) and convert it into row-based in-memory structures for processing. That conversion, the "serialization tax," wastes significant CPU time.
Apache Arrow eliminates this by defining a columnar in-memory format that engines can use natively. Key performance advantages:
Zero-copy interoperability: Different systems (Python, Java, C++) share data in Arrow format without converting it. No serialization, no deserialization.
Vectorized processing: Arrow's contiguous memory layout enables SIMD (Single Instruction, Multiple Data) instructions, allowing CPUs to process multiple values simultaneously.
Arrow Flight: A gRPC-based transport layer that streams columnar data over the network at speeds that approach raw network throughput. This replaces the legacy JDBC/ODBC row-at-a-time transport with parallel columnar streaming.
Engines built on Arrow (like Dremio, which co-created the project) read Parquet files into Arrow buffers and process them without format conversion. That's the architectural reason for the speed advantage over row-based engines: not bigger clusters, not more hardware, just less wasted work.
The Integration Challenge
These four Apache projects together form a complete open-source lakehouse stack. The tradeoff: assembling them yourself is real work.
You need to provision and configure object storage (S3 buckets, IAM policies, encryption). Deploy and manage a Polaris catalog. Configure Iceberg table properties (partitioning, compaction schedules, snapshot expiration). Choose and deploy a query engine. Set up security and access control at each layer independently. Build monitoring across all of it.
This is doable, and many organizations with strong platform engineering teams do exactly this. But it's a meaningful investment in infrastructure expertise and ongoing operations.
Dremio: Open Standards, Integrated for the AI Era
Dremio takes the same four Apache projects and packages them into a unified lakehouse platform that's production-ready out of the box.
Storage: Works with your S3, Azure, or GCS buckets, or Dremio-managed storage.
Table Format: Native Apache Iceberg with automatic table optimization (compaction, vacuum, manifest rewriting) that runs in the background.
Catalog: Open Catalog built on Apache Polaris, extended with federated source integration and fine-grained access control (row-level security, column masking).
Engine: High-performance MPP query engine built on Apache Arrow with vectorized processing, Columnar Cloud Cache (C3), and Autonomous Reflections that optimize query performance automatically.
Dremio also adds query federation: query data in PostgreSQL, MongoDB, Snowflake, and other sources alongside your Iceberg tables, without copying data into the lakehouse. Predicate pushdown sends filtering to each source system so only relevant rows cross the network.
For the AI era, Dremio layers three capabilities on top:
AI Agent: Ask analytical questions in plain English directly in the Dremio console. The agent writes SQL, returns results, and generates visualizations.
MCP Server: An open-source Model Context Protocol server that connects external AI tools (ChatGPT, Claude, custom agents) to your Dremio environment with full security and governance.
AI SQL Functions: Run LLM operations (AI_GENERATE, AI_CLASSIFY, AI_COMPLETE) directly inside SQL queries. Extract structured data from unstructured sources, classify text, or summarize results without leaving the SQL engine.
The foundation is open. Your data stays in your storage, in Apache Iceberg format, managed by an Apache Polaris catalog, processed by an Apache Arrow-based engine. No proprietary formats, no lock-in. And when you need AI-powered analytics, the agentic layer is built directly on top of that open foundation.