December 10, 2025

Let Metadata Talk: An AI Agent for Your Lakehouse

Today’s lakehouse architectures contain massive amounts of data. Running analytics on top of this data is expensive without knowing where and what to look for. The associated metadata (schema information, snapshot history, and file-level statistics from tables) can be used to narrow the search space, sharpening the focus of an investigation and reducing the cost, in both time and money, of running the analysis. In my talk I will show how to turn these hour-long detective missions into simple 10-minute conversations with a Lakehouse Metadata Agent.

Having worked extensively on lakehouse technologies at AWS and having reviewed conference papers on analytics, I have seen that working with metadata is a major challenge across the industry. Want to know which tables are burning through your storage budget? Which datasets are zombie pipelines nobody’s touched? Where PII might be hiding across 500 tables? What breaks if you change that critical column? What the write patterns look like? Legacy products create multiple dashboards to track each of these facets separately. But now that everyone has become comfortable with AI, asking these questions in plain text is a natural extension.

With the rise of AI agents, we need to feed them the minimal amount of data that maximizes the value of every token. Starting with the metadata layer as the initial input keeps the context tight and streamlined: the agent can already derive rich insights from metadata alone and then zero in on the problem area. To validate this, I built an agent that extracts metadata from all Apache Iceberg tables in the lakehouse (~1000 tables), analyzes SQL dependencies, builds vector embeddings to capture semantic relationships, and uses an LLM to answer complex questions about the tables at petabyte scale.
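The talk's actual agent is not public, but as a sketch of the extraction step: an Iceberg table's metadata JSON alone already answers several of the questions above. The keys read below (`schemas`, `snapshots`, `timestamp-ms`, and the snapshot `summary` metric `total-files-size`) come from the Iceberg table spec; the helper names, thresholds, and example tables are illustrative assumptions, not the agent's real implementation.

```python
import time


def summarize_table(name: str, metadata: dict) -> dict:
    """Condense a parsed Iceberg table-metadata JSON document into a
    compact per-table summary an agent could hand to an LLM."""
    snapshots = metadata.get("snapshots", [])
    # Latest snapshot by commit time; its summary carries table-level metrics.
    latest = max(snapshots, key=lambda s: s["timestamp-ms"], default=None)
    columns = [f["name"] for f in metadata["schemas"][0]["fields"]]
    return {
        "table": name,
        "columns": columns,
        "num_snapshots": len(snapshots),
        "last_write_ms": latest["timestamp-ms"] if latest else None,
        "total_bytes": int(latest["summary"].get("total-files-size", 0))
        if latest
        else 0,
    }


def flag_tables(summaries, stale_after_days=90, big_bytes=1 << 40, now_ms=None):
    """Answer 'which tables burn storage?' and 'which look like zombies?'
    from the summaries alone, without scanning any data files."""
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    cutoff = now_ms - stale_after_days * 86_400_000
    heavy = [s["table"] for s in summaries if s["total_bytes"] >= big_bytes]
    stale = [
        s["table"]
        for s in summaries
        if s["last_write_ms"] is None or s["last_write_ms"] < cutoff
    ]
    return {"storage_heavy": heavy, "stale": stale}
```

Because each summary is a small dict rather than a table scan, a thousand of them fit comfortably into an LLM context, which is the "tight and streamlined" input the agent starts from.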

This agent can easily be productionized and save teams tons of time in learning more about their data!

Topics Covered

Agentic AI
Apache Iceberg
Lakehouse
Use Cases
