March 2, 2022

Cross-Platform Lineage with OpenLineage

Data within today’s organizations has become increasingly distributed and heterogeneous. It can’t be contained within a single brain, a single team, or a single platform, but it still needs to be comprehensible, especially when something unexpected happens. Data lineage can help by tracing the relationships between datasets and providing a cohesive graph that places them in context.

Most data platforms have lineage capabilities and can tell you how the datasets within their scope relate to one another. But very few of us operate within a single data platform. Fortunately, OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow, Apache Spark, and dbt. The result is a panoptic view that lets a company diagnose and address data quality and efficiency issues in real time across its entire ecosystem.

In this session, Michael Collado from Datakin will show how to trace data lineage across Apache Spark and dbt. He will walk through the OpenLineage architecture and provide a live demo of a running pipeline with real-time data lineage.

Speakers

Michael Collado

Michael Collado is a seasoned developer with fifteen years of experience building websites, high-throughput services, and backend data infrastructure. Over the last decade, he has focused on big data analytics, catalog systems, and high-throughput data access. He has led multiple teams, including those behind Amazon’s biggest clickstream data ingestion and analytics system and its primary website experimentation platform. He is currently a Staff Software Engineer at Datakin.
