From Discovering Data to Trusting Data
At Lyft, we have made our analysts and data scientists over 20% more productive by making it easier to discover data. Recently, we open sourced Amundsen, and it’s now being used by ING, Square, Workday and many more.
However, we ran into an interesting challenge. Not only is it now easy to discover good, trusted data, it’s also easier to discover bad data that was previously hidden in the forgotten nooks and crannies of the data lake. Consequently, we are now asking ourselves, “How can we recommend not just any data but trusted data to our users?”
This talk will provide a quick overview of Amundsen and detail how we have tried both automated and curated metadata to showcase what’s trusted and what’s not trusted in Amundsen. It will dive deep into linking the Airflow DAG that produced the data (task-level lineage), linking which and how many dashboards are built from a given dataset (table-level lineage), and surfacing SLAs and historical landing times to give users a signal of what’s trusted.
The talk will conclude with insights into current challenges and how we may solve them in the future.
Mark Grover is the co-creator of Amundsen, the open source data catalog and metadata engine. Data scientists and analysts use Amundsen to discover, understand and trust the data they work with. At Lyft, Amundsen has 700+ active users every week, and outside of Lyft, it is used by 27 companies, including Instacart, ING and Square.