
From Discovering Data to Trusting Data
At Lyft, we have made our analysts and data scientists over 20% more productive by making it easier to discover data. We recently open-sourced Amundsen, and it is now used by ING, Square, Workday and many more.
However, we ran into an interesting challenge. Not only is it now easy to discover good, trusted data, it is also easier to discover bad data that was previously hidden in the forgotten nooks and crannies of the data lake. Consequently, we are now asking ourselves, “How can we recommend not just any data but trusted data to our users?”
This talk will provide a quick overview of Amundsen and detail how we have tried both automated and curated metadata to showcase what’s trusted and what’s not trusted in Amundsen. It will dive deep into linking the Airflow DAG that produced the data (task-level lineage), linking which and how many dashboards are built from a given dataset (table-level lineage), and using SLAs and historical landing times to give users a signal of what’s trusted.
The talk will conclude with insights into current challenges and how we may solve them in the future.
Speakers

Mark Grover
Mark Grover is the co-creator of Amundsen, the open source data catalog and metadata engine. Data scientists and analysts use Amundsen to discover, understand and trust the data they work with. At Lyft, Amundsen has 700+ active users every week, and outside of Lyft it is used by 27 companies, including Instacart, ING and Square.