Dremio Blog

27 minute read · June 16, 2026

Agentic Lakehouse: The Architecture Built for AI-Native Analytics

Alex Merced, Head of DevRel, Dremio

The Agentic Lakehouse solves a problem that most data teams haven't fully articulated yet: the architecture you built for human analysts is the wrong architecture for AI agents. A traditional lakehouse is optimized for predictable SQL from BI tools, tuned by DBAs who know the query patterns, and governed through access controls that assume a human is on the other end. AI agents operate at machine speed, generate novel queries no DBA anticipated, need business context to interpret data accurately, and require governance that doesn't pause for human review. Dremio coined the term Agentic Lakehouse to describe an architecture that treats AI agents as first-class consumers, not afterthoughts.

This is the definitive guide to what the Agentic Lakehouse is, why it matters, and how Dremio's platform is built to deliver it.

Why Traditional Lakehouses Were Not Built for AI Agents

Most data lakehouses in production today were designed around a specific mental model: a data engineer builds pipelines, a DBA tunes performance, and a human analyst runs SQL via a BI tool. That model shaped every architectural decision, from how performance is optimized to how governance is enforced.

Predictable Queries vs. Machine-Speed Novelty

BI tools fire the same queries repeatedly. A dashboard for monthly revenue might run the same GROUP BY five hundred times a day. DBAs can see those patterns and create the right indexes, materialized views, or partitions. Autonomous optimization is optional, because human operators can do it manually.

AI agents don't repeat themselves in the same way. An agent asked to "analyze customer churn across all segments and compare it to last quarter's cohort retention rates" generates a query that may never have been run before. The next request generates another novel query. Manual tuning cannot keep up with this volume and variety.

The Missing Business Context Problem

A traditional lakehouse stores data. It doesn't explain what that data means. Tables are named fct_rev_q3_2024_final_v2. Columns have names like c_amt_net_usd. A human analyst who has worked with this data for two years knows what these mean. An AI agent has no such institutional knowledge.

When an agent queries the wrong table or misinterprets a metric because the platform provides no business context, it produces confidently wrong answers. This is not a model problem. It is an architecture problem. The semantic layer is missing.

Governance Bottlenecks at Human Speed

Traditional governance assumes a human submits a data access request, a data steward reviews it, and an administrator grants the permission. That cycle takes hours or days. AI agents can generate thousands of queries in minutes, accessing data at a rate that makes human-speed approval gates completely unworkable.

Governance for the agentic era must be enforced automatically, at the query engine level, with fine-grained precision: which rows an agent can see, which columns it can read, and a full audit trail of every action it took.

Interfaces Designed for Humans, Not Machines

Traditional lakehouses expose data through SQL consoles, BI tool connectors, and JDBC/ODBC drivers. These are designed for humans or human-written applications. AI agents need standardized machine-readable interfaces: REST APIs with clear schemas, Model Context Protocol (MCP) endpoints for agent frameworks, and metadata APIs that let agents discover what data exists without browsing a UI.

The interface layer of a traditional lakehouse is not equipped for this.

What Is the Agentic Lakehouse?

The Agentic Lakehouse is a data lakehouse architecture specifically designed to serve as the operational foundation for autonomous AI agents. It is built to be queried by agents, governed at machine speed, and maintained autonomously rather than by DBAs.

Two characteristics define it:

Built for agents means the architecture actively serves AI workloads. This includes a semantic layer that gives agents business context, fine-grained access control enforced at the query engine, standardized machine interfaces, and the ability to handle high-concurrency unpredictable queries without degradation.

Managed by agents means the infrastructure itself is maintained through automation rather than human operators. Materialized views are created and refreshed autonomously. Table optimization runs without manual scheduling. Metadata is generated by AI. The platform manages itself so engineering teams don't spend their time keeping the lights on.

This dual characteristic is what separates the Agentic Lakehouse from a traditional lakehouse with an AI chatbot bolted on. It is not a feature addition. It is a different class of architecture.

The Four Technology Layers of a Lakehouse Stack

Every modern data lakehouse is built from four modular layers. Understanding each layer is necessary to understand where agentic capabilities must be built in.

Object Storage: The Foundation

Object storage (Amazon S3, Azure Data Lake Storage, Google Cloud Storage) serves as the commodity storage substrate. It is cheap, durable, and decoupled from compute. Every other layer sits on top of it. The Agentic Lakehouse inherits this from the traditional lakehouse and adds no lock-in here.

Apache Iceberg: The Open Table Format

Apache Iceberg is the open table format that turns object storage files into queryable tables with ACID transactions, schema evolution, time travel, and partition management. Any engine that implements the Iceberg spec (Spark, Flink, Trino, Dremio, DuckDB) can read and write the same data without translation.

Iceberg V3 adds capabilities particularly relevant to AI workloads: deletion vectors for efficient row-level deletes without full file rewrites, the Variant type for semi-structured data that AI pipelines commonly produce, and row-lineage tracking for data provenance.

Apache Polaris: The Open Catalog

Apache Polaris is an open-source Iceberg REST catalog that Dremio co-created with Snowflake and that is now developed under the Apache Software Foundation. It provides a standardized HTTP interface for registering and discovering tables across multiple engines, enforcing access control, and vending short-lived storage credentials to clients.

Dremio's Open Catalog is a production-managed deployment of Apache Polaris. Any engine that speaks the Iceberg REST spec can connect to it.

Dremio: The Query Engine

Dremio sits at the top of the stack as the query and intelligence layer. It executes SQL across all sources (not just Iceberg), enforces fine-grained access policies, maintains the semantic layer, and exposes the interfaces that AI agents use to interact with data. This is where the Agentic Lakehouse capabilities are primarily delivered.

What Makes Each Layer Agentic?

The four-layer stack is the foundation. The transformation from a traditional lakehouse to an Agentic Lakehouse happens at each layer through specific capabilities.

Storage: Open by Default

Object storage becomes agentic through the absence of lock-in. Because files are in open Iceberg format on commodity storage, AI agents using Python libraries like PyArrow or DuckDB can access them directly without going through a proprietary query API. ML training pipelines can read the same files that your BI dashboards query. No data copying required.

Iceberg V3: Features AI Workloads Demand

AI pipelines frequently need fast row-level updates. A recommendation model that tracks user events needs to delete stale records efficiently. Iceberg's deletion vectors handle this without rewriting entire Parquet files, which was the performance bottleneck in earlier formats. The Variant type stores semi-structured JSON natively in Iceberg tables, which matters for AI workloads that produce varied output schemas.
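To make the deletion-vector idea concrete, here is a minimal Python sketch of merge-on-read deletes. It is a toy model, not Iceberg's implementation: the `DataFile`, `DeletionVector`, and `scan` names are made up for illustration, and the real format stores deleted positions as a compact bitmap rather than a Python set.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """An immutable data file: rows are never edited in place."""
    rows: tuple


@dataclass
class DeletionVector:
    """Positions of deleted rows in one data file (illustrative only)."""
    deleted_positions: set = field(default_factory=set)

    def delete(self, position: int) -> None:
        self.deleted_positions.add(position)


def scan(data_file: DataFile, dv: DeletionVector) -> list:
    """Merge-on-read: skip positions marked deleted, leave the file untouched."""
    return [row for pos, row in enumerate(data_file.rows)
            if pos not in dv.deleted_positions]


events = DataFile(rows=(("u1", "click"), ("u2", "view"), ("u3", "click")))
dv = DeletionVector()
dv.delete(1)  # mark the stale record for user u2 as deleted

live_rows = scan(events, dv)
print(live_rows)  # [('u1', 'click'), ('u3', 'click')]
```

The delete touches only the small side structure. The data file still holds all three rows until a later compaction rewrites it.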

Hidden partitioning (an earlier Iceberg feature) means agents do not need to know how a table is partitioned when writing queries. An agent filters on the source column, such as a timestamp, and the format derives the partition values and handles pruning automatically.
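The pruning behavior can be sketched in a few lines of Python. This is a conceptual toy, assuming a day-level transform on a hypothetical `event_ts` column; `PartitionedTable` and its methods are illustrative names, not an Iceberg API.

```python
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Partition transform: derive the partition value from the timestamp."""
    return ts.date()


class PartitionedTable:
    def __init__(self):
        self.partitions = defaultdict(list)  # partition value -> rows

    def append(self, row: dict) -> None:
        # The writer never supplies a partition column; it is derived here.
        self.partitions[day_transform(row["event_ts"])].append(row)

    def scan(self, start: datetime, end: datetime):
        """Filter on event_ts only; return (rows, partitions_read)."""
        wanted = [d for d in self.partitions
                  if day_transform(start) <= d <= day_transform(end)]
        rows = [r for d in wanted for r in self.partitions[d]
                if start <= r["event_ts"] <= end]
        return rows, len(wanted)


table = PartitionedTable()
for day in (1, 2, 3):
    table.append({"event_ts": datetime(2026, 6, day, 12, 0), "user": f"u{day}"})

rows, partitions_read = table.scan(datetime(2026, 6, 2, 0, 0),
                                   datetime(2026, 6, 2, 23, 59))
print(len(rows), partitions_read)  # 1 1
```

The query mentions only the timestamp range, yet just one of the three partitions is read.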

Polaris Catalog: The Governance Layer for Agents

The Polaris catalog does three things that make it specifically agentic. First, Fine-Grained Access Control (FGAC) enforces row filters, column masks, and table-level permissions as part of the catalog spec, so any engine connecting through the catalog inherits those controls automatically. Second, credential vending means agents never hold permanent storage keys. The catalog issues short-lived credentials per request, limiting exposure. Third, REST-based semantic search allows agents to discover datasets by meaning rather than navigating a directory hierarchy.
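Credential vending is easier to reason about with a small model. The sketch below is not the Polaris API: `ToyCatalog`, `VendedCredential`, the 900-second lifetime, and the principal and table names are all assumptions chosen for illustration.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class VendedCredential:
    token: str
    table: str
    expires_at: float  # epoch seconds

    def is_valid(self, now: float) -> bool:
        return now < self.expires_at


class ToyCatalog:
    """Hypothetical catalog that vends short-lived, table-scoped credentials."""

    def __init__(self, grants: dict, ttl_seconds: int = 900):
        self.grants = grants            # principal -> set of table names
        self.ttl_seconds = ttl_seconds

    def vend(self, principal: str, table: str, now: float) -> VendedCredential:
        if table not in self.grants.get(principal, set()):
            raise PermissionError(f"{principal} may not read {table}")
        return VendedCredential(secrets.token_urlsafe(16), table,
                                now + self.ttl_seconds)


catalog = ToyCatalog({"churn_agent": {"marketing.events"}})
now = time.time()
cred = catalog.vend("churn_agent", "marketing.events", now)
print(cred.is_valid(now), cred.is_valid(now + 3600))  # True False
```

Because the credential is scoped to one table and expires quickly, a leaked token exposes far less than a permanent storage key would.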

Dremio Query Engine: Where Intelligence Lives

Dremio transforms data access from passive SQL execution into an active intelligence service. Five distinct capabilities deliver this, described in detail in the next section.

The Five Pillars of Dremio's Agentic Lakehouse

Dremio organizes its Agentic Lakehouse capabilities into five pillars. Each one addresses a specific failure mode of the traditional lakehouse when serving AI agents.

Pillar 1: Query Federation and Access to Everything Without Moving It

AI agents need access to all relevant data, not just what lives in the data lake. A customer churn analysis might need event data from the data lake, subscription records from Salesforce, financial data from Snowflake, and transaction history from a PostgreSQL database. In a traditional architecture, someone must ETL all of that into one place before the agent can query it.

Dremio's Query Federation eliminates that requirement. A single SQL query can join across a Snowflake table, an Iceberg dataset on S3, and a live PostgreSQL table without moving any data. The AI agent writes one query; Dremio routes it to the right sources and merges the results.
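As a rough analogy for what a federated join looks like from the agent's side, the following Python sketch uses SQLite's `ATTACH DATABASE` to join tables that live in two separate databases with one statement. It only illustrates the single-query shape; it says nothing about how Dremio plans or pushes down work, and the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                 # stands in for the lake
conn.execute("ATTACH DATABASE ':memory:' AS crm")  # stands in for a second system

conn.execute("CREATE TABLE main.events (customer_id INTEGER, event TEXT)")
conn.execute("CREATE TABLE crm.subscriptions (customer_id INTEGER, plan TEXT)")
conn.executemany("INSERT INTO main.events VALUES (?, ?)",
                 [(1, "login"), (1, "cancel_click"), (2, "login")])
conn.executemany("INSERT INTO crm.subscriptions VALUES (?, ?)",
                 [(1, "pro"), (2, "free")])

# One statement spans both databases; no rows were copied between them first.
federated_rows = conn.execute("""
    SELECT s.plan, COUNT(*) AS event_count
    FROM main.events AS e
    JOIN crm.subscriptions AS s ON s.customer_id = e.customer_id
    GROUP BY s.plan
    ORDER BY s.plan
""").fetchall()
print(federated_rows)  # [('free', 1), ('pro', 2)]
```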

This matters for agents specifically because agents don't always know in advance which sources they'll need. Federation allows agents to explore across systems without requiring pre-positioned data pipelines.

Pillar 2: Autonomous Performance for Unpredictable Query Patterns

AI agents generate query patterns no DBA planned for. Waiting for a human to notice performance degradation and create a new materialized view is not viable when agents run thousands of queries per day.

Autonomous Reflections solve this. Dremio observes incoming query patterns and automatically creates, updates, and retires materialized views (called Reflections) that accelerate those patterns. When a cluster of similar queries hits a data source that could benefit from pre-aggregation, Dremio creates the Reflection without any operator action. When the pattern changes, Dremio updates or removes the Reflection.
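The observe-then-materialize loop described above can be sketched as a toy. This is not how Dremio implements Reflections; `ToyAccelerator`, its threshold of three, and the `sum_amount_by` method are hypothetical, and a real system would also refresh and retire the materialization as data and patterns change.

```python
from collections import Counter


class ToyAccelerator:
    """Hypothetical stand-in: materializes an aggregate once a pattern is hot."""

    def __init__(self, rows, threshold=3):
        self.rows = rows                  # list of dicts (the base table)
        self.threshold = threshold
        self.pattern_counts = Counter()
        self.materialized = {}            # group-by column -> precomputed totals

    def _aggregate(self, group_col):
        totals = Counter()
        for row in self.rows:
            totals[row[group_col]] += row["amount"]
        return dict(totals)

    def sum_amount_by(self, group_col):
        """Answer SUM(amount) GROUP BY group_col; return (result, source)."""
        if group_col in self.materialized:
            return self.materialized[group_col], "materialized"
        self.pattern_counts[group_col] += 1
        result = self._aggregate(group_col)
        if self.pattern_counts[group_col] >= self.threshold:
            self.materialized[group_col] = result  # no operator action needed
        return result, "base_table"


rows = [{"segment": "smb", "amount": 10}, {"segment": "ent", "amount": 40},
        {"segment": "smb", "amount": 5}]
engine = ToyAccelerator(rows)
sources = [engine.sum_amount_by("segment")[1] for _ in range(4)]
print(sources)  # ['base_table', 'base_table', 'base_table', 'materialized']
```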

C3 (Columnar Cloud Cache) adds an SSD-tier cache layer across Dremio's executor nodes. Frequently accessed Iceberg files are cached locally, eliminating the latency of cold object storage hits. For agents running high-frequency analytical queries, C3 can reduce query latency by an order of magnitude compared to reading from S3 on every request.

The results cache handles exact-repeat queries. If two agents ask the same question within a time window, the second gets the cached result instantly. Learn more about how this works in the Autonomous Performance blog.
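A time-windowed results cache can be sketched in a few lines. The class below is an illustration, not Dremio's cache: the 60-second window and the whitespace-only normalization are assumptions, and a production cache would also key on user permissions and invalidate entries when the underlying data changes.

```python
import hashlib


class ResultsCache:
    """Toy exact-match cache keyed by whitespace-normalized SQL, with a TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # key -> (stored_at, result)

    @staticmethod
    def _key(sql: str) -> str:
        normalized = " ".join(sql.split())  # collapse runs of whitespace
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, sql: str, now: float):
        entry = self._entries.get(self._key(sql))
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]
        return None

    def put(self, sql: str, result, now: float) -> None:
        self._entries[self._key(sql)] = (now, result)


cache = ResultsCache(ttl_seconds=60)
cache.put("SELECT COUNT(*) FROM orders", 42, now=0)
hit = cache.get("SELECT COUNT(*)   FROM orders", now=30)   # inside the window
miss = cache.get("SELECT COUNT(*) FROM orders", now=120)   # window has expired
print(hit, miss)  # 42 None
```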

Pillar 3: AI Semantic Layer and Business Context for AI Accuracy

The semantic layer is arguably the most important pillar for AI agent accuracy. Without it, agents operate on raw table names and undocumented columns. With it, agents work with a governed, documented, business-friendly representation of your data.

Dremio's AI Semantic Layer consists of several components:

Virtual datasets are named, documented SQL views that expose data through business-friendly names. Instead of fct_rev_q3_2024_final_v2, an agent queries quarterly_revenue_by_segment. The view handles the underlying complexity.

AI-generated wikis attach natural-language descriptions to every table, column, and virtual dataset. Dremio's AI generates these from the data itself, reducing the documentation burden on data teams while giving agents the context they need to interpret results correctly.

Labels are categorical tags that help agents filter the catalog. An agent can search for "all finance-approved datasets" or "datasets related to customer behavior" and get a governed list of relevant tables.

Governed metrics define business measures like total_revenue or active_user_count once, in a controlled location. Every agent querying those metrics gets the same definition, preventing inconsistency across agent responses.

Semantic search allows agents to query the catalog by intent. An agent framework can ask "find datasets about customer lifetime value" and receive a ranked list of relevant virtual datasets. This is fundamentally different from browsing a table list or running SHOW TABLES.
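The components above can be approximated with a small Python sketch that uses SQLite. A plain SQL view plays the role of a virtual dataset, and a dictionary stands in for the wiki and label metadata. The `catalog` structure and `find_datasets` helper are hypothetical, and the label lookup is exact-match filtering rather than true semantic search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_rev_q3_2024_final_v2 (seg TEXT, c_amt_net_usd REAL)")
conn.executemany("INSERT INTO fct_rev_q3_2024_final_v2 VALUES (?, ?)",
                 [("smb", 100.0), ("enterprise", 900.0), ("smb", 50.0)])

# A view plays the role of a virtual dataset with business-friendly names.
conn.execute("""
    CREATE VIEW quarterly_revenue_by_segment AS
    SELECT seg AS segment, SUM(c_amt_net_usd) AS total_revenue
    FROM fct_rev_q3_2024_final_v2
    GROUP BY seg
""")

# Hypothetical catalog entries: a wiki description plus labels per dataset.
catalog = {
    "quarterly_revenue_by_segment": {
        "wiki": "Net revenue in USD for the quarter, summed per customer segment.",
        "labels": {"finance-approved", "revenue"},
    },
}


def find_datasets(label: str) -> list:
    """Return dataset names carrying the given label."""
    return sorted(name for name, meta in catalog.items()
                  if label in meta["labels"])


dataset = find_datasets("finance-approved")[0]
revenue = conn.execute(
    f"SELECT segment, total_revenue FROM {dataset} ORDER BY segment"
).fetchall()
print(revenue)  # [('enterprise', 900.0), ('smb', 150.0)]
```

The agent never sees `c_amt_net_usd`. It discovers a labeled dataset, reads the description, and queries `total_revenue`, whose definition lives in one place.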

Pillar 4: Agentic Interfaces, MCP Server, and the Built-In AI Agent

Traditional lakehouses expose data via JDBC, ODBC, or proprietary connectors designed for specific tools. AI agent frameworks need something different: a standardized protocol that handles authentication, tool discovery, and structured query execution in a way that agent orchestration layers can consume automatically.

Dremio's MCP Server implements the Model Context Protocol, an emerging open standard for connecting AI agents to tools and data sources. Any agent framework that supports MCP (including LangChain, LlamaIndex, and Claude's tooling) can connect to Dremio's MCP Server and immediately gain access to the full semantic layer, governed datasets, and query execution.

OAuth handles agent identity, so every query carries the agent's credentials. Every action is audit-logged with agent identity, not just "API user." This distinction matters for compliance: you can answer "which agent accessed which data at what time" with precision.

Dremio's Built-In AI Agent goes a step further. It is a conversational interface native to the Dremio platform that speaks to the semantic layer, not to raw tables. When a user asks "what was our top-performing product line last quarter and how does that compare to the prior year?", the built-in agent uses the governed metrics and virtual datasets to answer accurately, with SQL it can show and audit. This is not a general-purpose chatbot repurposed for data. It is a data agent built on an architecture specifically designed to support it.

Pillar 5: AI SQL Functions and Native Machine Intelligence

SQL is the lingua franca of analytics, and Dremio extends it with AI-native functions that agents can use directly in queries. VECTOR_SEARCH enables similarity search over embedding columns stored in Iceberg tables, allowing an agent to find semantically similar records without a separate vector database. NLP embedding functions convert text columns to vector representations in-query. Conversion functions like CONVERT_TO_JSON simplify the transformation of Iceberg Variant columns for downstream AI processing.

These functions mean AI capabilities live where the data lives, rather than requiring data to be extracted to an external AI service and brought back.
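For readers unfamiliar with what a similarity search computes, here is a minimal pure-Python sketch of ranking rows by cosine similarity. It illustrates the concept only; it does not show the signature or behavior of any Dremio SQL function, and the three-dimensional embeddings are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=2):
    """Rank (id, embedding) rows by cosine similarity to query_vec."""
    scored = [(cosine_similarity(query_vec, emb), row_id) for row_id, emb in rows]
    return [row_id for _, row_id in sorted(scored, reverse=True)[:k]]


rows = [
    ("ticket_refund", [0.9, 0.1, 0.0]),
    ("ticket_billing", [0.8, 0.3, 0.1]),
    ("ticket_login", [0.0, 0.2, 0.9]),
]
nearest = top_k([1.0, 0.0, 0.0], rows)
print(nearest)  # ['ticket_refund', 'ticket_billing']
```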

Traditional Lakehouse vs. Agentic Lakehouse

The differences between a traditional lakehouse and an Agentic Lakehouse are architectural, not cosmetic. The following table maps eight dimensions where the architectures diverge:

| Dimension | Traditional Lakehouse | Agentic Lakehouse |
| --- | --- | --- |
| Primary consumer | Human analysts | Human analysts and AI agents |
| Query patterns | Predictable (BI queries) | Unpredictable (agent-generated) |
| Performance tuning | Manual DBA intervention | Autonomous Reflections, C3, results cache |
| Business context | Not present in the platform | AI Semantic Layer with wikis, labels, governed metrics |
| Data access | SQL from a single connected engine | Federated across lake, warehouse, SaaS, databases |
| Agent interfaces | None | MCP Server, Built-In AI Agent, REST API, Python libs |
| Governance model | Role-based at the database level | FGAC at the engine level, per-agent audit logging |
| AI capabilities | External only, via ETL | Native AI SQL functions, autonomous tuning |

The most significant row is governance. A traditional lakehouse controls access at the database level, which means a service account with read access to a table can read all rows and all columns. An Agentic Lakehouse enforces fine-grained control at the engine level: specific agents see specific rows based on dynamic filters, specific columns based on masking policies, and all of it is logged per agent identity.
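The difference between table-level and fine-grained control can be sketched as follows. This is a toy enforcement layer, not Dremio's policy engine: the agent names, the `Policy` class, and the `governed_scan` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Policy:
    row_filter: Callable[[dict], bool]
    masked_columns: set = field(default_factory=set)


POLICIES = {  # hypothetical agents and rules
    "general_agent": Policy(lambda r: r["region"] == "EU", {"email"}),
    "finance_agent": Policy(lambda r: True),
}
AUDIT_LOG = []


def governed_scan(agent: str, table: str, rows: list) -> list:
    """Apply the agent's row filter and column masks, and log the access."""
    policy = POLICIES[agent]  # unknown agents raise KeyError (deny by default)
    visible = [
        {col: ("***" if col in policy.masked_columns else val)
         for col, val in row.items()}
        for row in rows if policy.row_filter(row)
    ]
    AUDIT_LOG.append({"agent": agent, "table": table,
                      "rows_returned": len(visible)})
    return visible


customers = [
    {"id": 1, "region": "EU", "email": "a@example.com"},
    {"id": 2, "region": "US", "email": "b@example.com"},
]
general_view = governed_scan("general_agent", "customers", customers)
finance_view = governed_scan("finance_agent", "customers", customers)
print(general_view)       # [{'id': 1, 'region': 'EU', 'email': '***'}]
print(len(finance_view))  # 2
```

Both agents query the same table, but each sees a different slice, and the audit log records which identity received how many rows.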

For organizations deploying multiple AI agents with different permission levels (a general analysis agent, a privileged financial planning agent, a customer-facing agent with narrow data access), this granularity is not optional.

The Open-Source Foundation That Makes It Real

The Agentic Lakehouse is not a proprietary architecture. It is built on three open-source standards that any organization can adopt independently of Dremio.

Apache Iceberg is the table format. Dremio has been one of the most active contributors to the Iceberg project. Iceberg's open specification means your data is readable by Spark, Flink, Trino, DuckDB, and any other engine that implements the spec. If you ever change your query engine, your data files go with you.

Apache Arrow is the in-memory columnar format at the core of Dremio's query execution. Dremio co-founded the Apache Arrow project, which is now used by pandas, PyArrow, Polars, and nearly every modern data tool. Arrow's zero-copy interoperability means Dremio query results can be consumed directly by Python ML libraries without serialization overhead.

Apache Polaris is the REST catalog. Polaris, which Dremio co-created with Snowflake, is developed under the Apache Software Foundation, so the catalog is open and implementable by any vendor. Your catalog metadata is not locked to Dremio. Any engine that speaks the Iceberg REST spec can connect to a Polaris catalog.

This open-standard foundation is not just a philosophical stance. It is a practical protection against vendor lock-in. Organizations that build their data infrastructure on open formats and open protocols retain the ability to switch components, add engines, and adapt as the ecosystem evolves.

Eleven Years Building Toward This Architecture

Dremio was founded in 2015 by Tomer Shiran and Jacques Nadeau, both formerly of MapR, with a founding thesis that data should be accessible to anyone in an organization without data engineering bottlenecks. That thesis has remained constant for eleven years. What has changed is the landscape in which it operates.

In 2016, Dremio co-founded the Apache Arrow project, establishing the in-memory columnar format that now underpins much of the Python data ecosystem. Dremio then invested deeply in Apache Iceberg support, before Iceberg became the industry standard.

In 2024, Apache Polaris, which Dremio co-created with Snowflake, entered the Apache Software Foundation, creating an open, vendor-neutral REST catalog that any organization can build on.

In 2025 and 2026, these investments coalesced into the Agentic Lakehouse vision. Federated query, autonomous performance tuning, the semantic layer, and open-standard interoperability were not retooled for the AI era. They were the foundation on which the AI era arrived.

Dremio did not pivot to AI. The AI era arrived at an architecture Dremio spent a decade building toward.

How to Build an Agentic Lakehouse with Dremio

Building an Agentic Lakehouse with Dremio follows a logical progression from data foundation to AI agent connectivity.

Step 1: Connect All Sources

Start by connecting Dremio to every relevant data source: S3/ADLS/GCS buckets containing Iceberg tables, operational databases (PostgreSQL, MySQL), cloud warehouses (Snowflake, Redshift), and SaaS connectors. Dremio's connector library handles authentication and query pushdown. You are not copying data at this stage. You are making it queryable through a single interface.

Step 2: Organize with Apache Iceberg and Apache Polaris

Structure your primary analytical data as Apache Iceberg tables, registered in your Apache Polaris / Open Catalog deployment. Organize tables into namespaces that reflect your business domains: finance, marketing, operations. Configure FGAC policies at the catalog level to define which principals can access which namespaces, tables, rows, and columns.

Step 3: Build the Semantic Layer

Create virtual datasets in Dremio that expose your Iceberg tables (and federated sources) through business-friendly names and documented logic. Use Dremio's AI wiki generation to auto-populate descriptions, then review and refine them. Define governed metrics such as monthly_recurring_revenue or 30_day_active_users as named, documented virtual columns. Tag datasets with labels that reflect their domain, sensitivity, and intended audience.

A strong semantic layer is the difference between an agent that produces accurate, auditable answers and one that confidently answers the wrong question. The complete semantic layer guide covers this in depth.

Step 4: Enable Autonomous Performance

Autonomous Reflections are enabled at the space or dataset level in Dremio. Once enabled, Dremio begins observing query patterns against that data and creates Reflections when it identifies optimization opportunities. You can also seed Reflections manually for known high-traffic datasets while letting Dremio manage the rest autonomously.

C3 cache configuration happens at the cluster level. Allocate SSD storage to the C3 tier based on your working set size. Dremio handles the rest of cache management automatically.

Step 5: Connect AI Agents via MCP or Built-In Agent

To connect an external AI agent framework, point it at Dremio's MCP Server endpoint. The MCP Server publishes a tool manifest that agent frameworks can discover automatically. Your virtual datasets and governed metrics appear as named, documented tools the agent can call.

# Example: Discover available tools via Dremio's MCP Server
curl -H "Authorization: Bearer <oauth_token>" \
     https://<your-dremio-instance>/mcp/v1/tools

The response returns a list of available datasets and functions, with descriptions drawn from your semantic layer wikis and labels. An agent framework receives structured metadata it can use to decide which tool to call for a given user request.
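On the agent side, tool discovery in the Model Context Protocol is a JSON-RPC 2.0 exchange using the `tools/list` method. The Python sketch below builds that request and summarizes a response without touching the network. The sample response is entirely hypothetical, including the tool names and descriptions, and the transport and URL a given Dremio deployment exposes may differ from the curl example above.

```python
import json


def build_tools_list_request(request_id: int) -> str:
    """JSON-RPC 2.0 request body for MCP's tools/list method."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "method": "tools/list"})


def summarize_tools(response_body: str) -> dict:
    """Map each tool name to its description from a tools/list response."""
    result = json.loads(response_body)["result"]
    return {tool["name"]: tool.get("description", "")
            for tool in result["tools"]}


# Hypothetical response; the tool names and descriptions are made up.
sample_response = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"tools": [
        {"name": "run_sql", "description": "Execute a governed SQL query.",
         "inputSchema": {"type": "object",
                         "properties": {"query": {"type": "string"}}}},
        {"name": "search_catalog", "description": "Find datasets by meaning.",
         "inputSchema": {"type": "object",
                         "properties": {"text": {"type": "string"}}}},
    ]},
})

request_body = build_tools_list_request(1)
tools = summarize_tools(sample_response)
print(sorted(tools))  # ['run_sql', 'search_catalog']
```

An agent framework would pass these names, descriptions, and input schemas to the model so it can choose which tool to call for a given request.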

For teams that want immediate value without building agent orchestration, Dremio's Built-In AI Agent provides a conversational interface directly in the Dremio UI. Business users can ask questions in natural language and receive answers backed by governed, documented data, without writing SQL.

The Bottom Line: Why the Agentic Lakehouse Is the Logical Next Step

The Agentic Lakehouse is not a new name for the same architecture. It represents a genuine shift in what a data platform is responsible for. A traditional lakehouse is a managed repository. An Agentic Lakehouse is an active participant in AI workflows: it provides context, enforces governance, and optimizes itself autonomously.

The organizations most exposed to risk in the AI era are those that bolt AI capabilities onto architectures not designed for them. An AI agent operating on a platform with no semantic layer will hallucinate metrics. An agent operating without fine-grained governance will over-access data. An agent operating on a platform with manual performance tuning will either run slow or require constant DBA attention.

The Agentic Lakehouse addresses all three problems structurally. Because it is built on Apache Iceberg, Apache Arrow, and Apache Polaris, it does so without requiring you to abandon open standards or accept proprietary lock-in. Your data remains yours, readable by any Iceberg-compatible engine, cataloged by an open REST spec, and processed by a query engine built on open Apache projects.

For data architects and CTOs evaluating modern data platforms, the relevant question is not "does this platform support AI?" Nearly every platform makes that claim. The relevant question is: "Is this platform's architecture designed to serve AI agents as first-class consumers, or is it a traditional platform with AI features added?" The Agentic Lakehouse gives you a framework to answer that question rigorously.

Explore the Dremio Agentic Lakehouse solution page to see the full architecture in detail, or try Dremio Cloud free for 30 days and start building your Agentic Lakehouse today.
