Most enterprise AI projects fail at the same place. Not the model. Not the infrastructure. The data.
Specifically: the AI doesn't understand the data well enough to act on it accurately, and it can't reach most of the data anyway. These two problems compound each other. A model that can only see 30% of your data estate while misinterpreting half of what it does see isn't an AI agent. It's an expensive way to generate confident-sounding wrong answers.
The fix isn't a bigger model or a better prompt. It's a better foundation. That foundation is federated semantics.
Why Enterprise AI Agents Need More Data Than You Think
An AI agent doing real analytical work (answering business questions, surfacing anomalies, generating forecasts, triggering actions) needs to draw on data across your entire organization. Customer behavior data in your data lake. Transactional records in a relational database. Support tickets in a SaaS platform. Financial data in a warehouse. Pipeline telemetry in object storage.
None of these systems were built to talk to each other. In practice, most enterprise data estates look like a map of unconnected islands. And when you need to deploy an AI agent that reasons across all of them, you have a choice: move everything to one place, or build the intelligence that lets the AI work across the islands as they are.
Moving everything (the ETL-and-centralize approach) has been the default for decades. Its cost compounds:
Pipelines break. New source? New pipeline. Schema change? Broken pipeline.
Data duplication creates governance nightmares: which copy is authoritative?
Latency between the source and the destination means the AI is always working with yesterday's data.
At enterprise scale, the infrastructure cost of moving petabytes "just in case" is enormous.
This was a defensible tradeoff when analytics teams ran weekly batch reports. It's not defensible when you need AI agents operating in near-real-time across your full data estate.
Semantic Understanding Is the Other Half of the Problem
Access is necessary. It's not sufficient.
Even if an AI agent could physically reach every table in every system you operate, it would still fail without understanding what that data means. A column called “ARR” might mean annual recurring revenue in one context, an array data type in another, and an arrival timestamp in a third. A metric called "active customer" has a different definition in sales, support, and finance, often deliberately. The AI doesn't know this. Without explicit instruction, it guesses.
This is the hallucination problem, and it's not primarily a model quality issue. It's a data context issue. Models hallucinate more when they're given ambiguous or context-free data. Give the model a well-defined semantic layer, with consistent business logic, documented definitions, and governed relationships between entities, and accuracy improves significantly.
A semantic layer formalizes what your data means:
It is the system that translates raw data into business and technical meaning. It's where your organization's definitions live: what counts as a converted opportunity, how churn rate is calculated, which transactions qualify as revenue. Instead of those definitions existing informally in someone's head or inconsistently across a dozen analyst notebooks, the semantic layer encodes them once and makes them available to every consumer of data, human or AI, in a consistent and governed way.

When an AI agent queries your data through a semantic layer, it isn't guessing what "active customer" means in your business. It's working from the same definition your CFO approved. That's what separates an agent that produces trusted output from one that requires constant fact-checking.
The semantic layer teaches the AI your business language so it generates the right SQL, not generic SQL. That's the difference between an agent that produces actionable output and one that produces plausible-sounding nonsense.
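As a concrete sketch of what "encoding a definition once" looks like (the schema, table, and column names here are hypothetical, and the 90-day rule is purely illustrative), a semantic-layer view can pin down the governed meaning of "active customer" in one place:

```sql
-- Hypothetical semantic-layer view: the single governed definition of
-- "active customer" that every consumer, human or AI, queries against.
-- Here "active" means at least one order in the last 90 days.
CREATE VIEW semantic.active_customers AS
SELECT c.customer_id,
       c.customer_name,
       c.region
FROM   warehouse.customers c
JOIN   warehouse.orders    o
  ON   o.customer_id = c.customer_id
WHERE  o.order_date >= DATE_SUB(CURRENT_DATE, 90)
GROUP BY c.customer_id, c.customer_name, c.region;
```

An agent asked "how many active customers do we have in EMEA?" now generates SQL against this view rather than improvising its own definition of "active."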
Why Federated Semantics Is the Core Requirement
Here's where most semantic layer discussions fall short: they treat semantics as a property of a central repository. Build one clean warehouse, define your metrics there, done.
That approach has the same problem as centralized data infrastructure. It only covers the data you've moved. Everything still sitting in source systems, which is most of your data, gets no semantic treatment. The AI is working half-blind.
Federated semantics means extending the semantic layer across your full data estate, regardless of where the data physically lives. It means:
Consistent business definitions applied to data in your lake, your warehouse, your operational databases, and your SaaS tools
Metadata and documentation that travel with the data source, not just with a copy
Governance policies enforced at the point of access, not just at a central sink
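To make the second point concrete (the source name `salesforce` and the column names are hypothetical), the same pattern extends to data that never leaves its source system: a view over a live SaaS object participates in the semantic layer exactly like a managed table, so the definition and documentation attach to the source, not to a copy:

```sql
-- Hypothetical view over a live federated source: no copy, no pipeline.
-- The business definition, wiki documentation, and access policies
-- attach to this view, not to an extracted snapshot.
CREATE VIEW semantic.open_support_tickets AS
SELECT "Id"          AS ticket_id,
       "AccountId"   AS account_id,
       "CreatedDate" AS opened_at,
       "Priority"    AS priority
FROM   salesforce."Case"        -- queried in place in the SaaS platform
WHERE  "Status" <> 'Closed';
```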
This is the foundational layer. Everything else depends on it. A connector architecture that reaches 10 data sources is only valuable if the AI understands what it's reading in all 10. A high-performance query engine only helps if the query itself is semantically correct. A catalog that indexes metadata across federated sources only provides value if that metadata is meaningful.
Federated semantics isn't one component of the agentic AI stack. It's the layer that makes every other component work.
The Architecture That Makes This Real
Federated semantics doesn't exist in isolation. It requires the rest of the stack to be purpose-built around it.
A connector architecture that reaches data where it lives. The agent needs to query data in place with no ETL and no copying. This means native connectors to object storage (S3, Azure Data Lake), relational databases, SaaS platforms, data warehouses, and other catalogs. Predicate pushdown matters here: filtering work should happen at the source, not after a full table scan.
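A sketch of why pushdown matters (the source and table names are hypothetical): in the query below, a federation engine with predicate pushdown ships the filter to the source database, so only matching rows cross the network.

```sql
-- With predicate pushdown, the WHERE clause executes inside PostgreSQL
-- and only matching rows are returned. Without it, the engine would
-- pull the entire events table over the network before filtering.
SELECT user_id, event_type, occurred_at
FROM   postgres_prod.public.events
WHERE  occurred_at >= TIMESTAMP '2025-01-01 00:00:00'
  AND  event_type = 'checkout';
```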
A query engine built for federation. Joining data across a PostgreSQL database, a Parquet file in S3, and a table in Snowflake in a single query requires a massively parallel processing engine that handles heterogeneous sources transparently. The engine has to be fast, delivering sub-second response times so the AI can operate interactively. It has to work across columnar and row-based formats, and it needs acceleration technology that eliminates the penalty of cross-source joins.
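As an illustrative federated query (all source, table, and column names are hypothetical), a single statement can join an operational database, raw files in object storage, and a warehouse table, with the engine planning the heterogeneous scan transparently:

```sql
-- One query, three physical systems, one logical result.
SELECT c.customer_name,
       SUM(p.amount)  AS lifetime_value,
       t.ticket_count
FROM   postgres_prod.public.customers        c  -- row-based operational DB
JOIN   s3_lake.events."payments.parquet"     p  -- columnar files in S3
  ON   p.customer_id = c.customer_id
JOIN   snowflake_wh.analytics.ticket_rollup  t  -- warehouse table
  ON   t.customer_id = c.customer_id
GROUP BY c.customer_name, t.ticket_count;
```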
A catalog that maintains metadata across federated sources. A catalog is only as useful as its coverage. For federated semantics, the catalog needs to index not just tables you manage directly, but sources across your entire estate, including other catalogs, external databases, and object storage. And it needs to enforce fine-grained access control at query time, not just at ingestion time.
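A sketch of query-time enforcement in the style of Dremio's UDF-based row access policies (the function, table, and role names are hypothetical, and the exact DDL should be checked against current documentation):

```sql
-- Hypothetical policy function evaluated at query time: members of the
-- finance role see every row; everyone else sees only EMEA rows.
CREATE FUNCTION emea_only(region VARCHAR)
RETURNS BOOLEAN
RETURN is_member('finance') OR region = 'EMEA';

-- Attach the policy so it is enforced on every query, regardless of
-- which tool or AI agent issues the SQL.
ALTER TABLE sales.orders ADD ROW ACCESS POLICY emea_only(region);
```

Because the policy is evaluated when the query runs, not when data was ingested, it applies uniformly to federated sources and managed tables alike.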
These three layers (connectors, query engine, and catalog) are the implementation of federated semantics. They're what make semantic definitions actionable at the speed AI agents require.
How Dremio's Agentic Lakehouse Delivers This
Dremio was built around this exact architecture. The design isn't a retrofit of a traditional analytics platform. It's a system built from the ground up to give AI agents what they need.
The AI Semantic Layer is where federated semantics lives in Dremio. Virtual datasets encode consistent business logic. Wikis and labels attach context to every table and column across every connected source. AI-generated metadata means the catalog becomes self-documenting over time, reducing the overhead of maintaining semantic coverage at scale. The result is a system where the AI reads business-ready data with business-meaningful context, whether that data is in an Iceberg table Dremio manages or a live query against a Salesforce object.
The connector architecture lets Dremio query data in place across object storage, relational databases, data warehouses (including Snowflake, Redshift, and BigQuery), and other catalogs, without the risk and cost of moving anything. Every connected source participates in the semantic layer. Every connected source is accessible to the AI agent.
Dremio’s Intelligent Query Engine is built on Apache Arrow, the open columnar in-memory format that Dremio co-created. Because Arrow is Dremio's native format, the engine avoids the serialization tax that slows traditional query engines. Autonomous Reflections learn from query patterns and transparently pre-compute optimized data copies, so performance improves over time without manual tuning. The Columnar Cloud Cache (C3) brings object storage latency down to local-disk speed for hot data.
Reflections address one of the most common objections to federated architectures: speed. Querying data across multiple live source systems, each with its own latency profile, can produce slow results that make interactive AI experiences impractical. Reflections are pre-computed, physically optimized copies of datasets stored as Iceberg tables. Dremio's query optimizer substitutes them transparently when they can accelerate a query. The user writes standard SQL against a logical view spanning multiple sources. The engine rewrites the query to use the fastest available Reflection. The cross-source join penalty disappears.
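For illustration, a Reflection can also be defined manually with DDL in roughly this shape (the dataset and column names are hypothetical, and the syntax should be verified against current Dremio documentation; Autonomous Reflections create equivalents without any DDL):

```sql
-- Hypothetical aggregate Reflection: a pre-computed rollup the optimizer
-- can substitute for queries grouping by region and month.
ALTER TABLE semantic.order_facts
CREATE AGGREGATE REFLECTION agg_orders_by_region
USING DIMENSIONS (region, order_month)
     MEASURES   (revenue (SUM), order_id (COUNT));
```

A later query such as `SELECT region, SUM(revenue) FROM semantic.order_facts GROUP BY region` is then rewritten to read the Reflection's materialization instead of re-joining the live sources.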
The Open Catalog is built on Apache Polaris, the open Iceberg REST catalog standard, extended to include federated sources alongside managed Iceberg tables. One governed namespace spans both the data Dremio manages and the data that lives elsewhere. Fine-grained access control, including row-level security and column masking, is enforced at query time, which standard Iceberg REST catalogs can't do. The formula is straightforward: Dremio’s Open Catalog = Apache Polaris + federated source metadata + advanced governance + autonomous table maintenance + autonomous table optimization
On top of this foundation, the built-in AI Agent and MCP Server let AI clients, including external tools like Claude Desktop, connect directly to the Dremio environment with full semantic context and governance intact. The agent isn't working around the data infrastructure. It's working through it.
The Right Order of Operations
Getting enterprise agentic AI right means getting the layers in the right sequence.
Start with federated semantics: consistent business definitions, documented metadata, governed access across your full data estate, not just the fraction you've centralized. Layer in federation infrastructure: connectors that reach every source, a query engine that handles cross-source joins at interactive speed, a catalog that maintains context everywhere. Then deploy AI agents that can actually use what you've built.
Most organizations try to skip straight to the agents. That's why most enterprise AI projects underdeliver. The model isn't the constraint. The foundation is.
Federated semantics is that foundation.
Try Dremio Cloud free for 30 days and deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.