Subsurface LIVE Winter 2021
From Discovering Data to Trusting Data
At Lyft, we have made our analysts and data scientists over 20% more productive by making it easier to discover data. Recently, we open sourced Amundsen and it’s now being used by ING, Square, Workday and many more.
However, we ran into an interesting challenge. Not only is it now easy to discover good trusted data, it’s also easier to discover bad data that was previously hidden in the forgotten nooks and crannies of the data lake. Consequently, we are now asking ourselves, “How can we recommend not just any data but trusted data to our users?”
This talk will provide a quick overview of Amundsen and detail how we have tried both automated and curated metadata to showcase what’s trusted and what’s not trusted in Amundsen. It will dive deep into linking the Airflow DAG which produced the data (task-level lineage), linking what and how many dashboards are built from a given dataset (table-level lineage), as well as SLAs and historical landing times to give users a signal into what’s trusted.
The talk will conclude with insights into current challenges and how we may solve them in the future.
Mark Grover, Amundsen Creator & Product Manager
Mark Grover is the co-creator of the open source data catalog and metadata engine, Amundsen. Amundsen is used by data scientists and analysts to discover, understand and trust the data they use. At Lyft, Amundsen has 700+ active users every week, and outside of Lyft, Amundsen is used by 27 companies like Instacart, ING, Square and more.
Welcome everyone to this session with Mark Grover. He is going to talk to us, From Discovering to Trusting Data.
Before we begin, a couple of housekeeping items. We are going to have a live Q and A session at the end, so please make sure that you have your microphone and your camera enabled if you want to participate in the Q and A so we can make it more [00:00:30] interactive.
And also, don’t forget to visit the Slido tab at the end of the presentation where you will see a couple of questions to provide feedback on how the session went. And without further ado, Mark, the time is yours.
Cool. Thank you so much for having me. Have you ever tried to do some analysis and wanted to find out whether you have certain data, or found that there were many different [00:01:00] tables or dashboards like that in your warehouse or in your dashboarding system, and had to figure out which one to trust?
That problem is so common. It was very common at Lyft. It’s very common in many other companies, and that’s the problem that I try to solve. That is exactly the problem we’re going to talk about today, in a talk that I call From Discovering to Trusting Data.
My name is Mark. I am co-founder of a startup called Stemma.ai, and I created an open source project called Amundsen, which is also a project [00:01:30] in this space. If you want to follow me, you can use that link.
To talk further about my past experience a little bit, I was a developer of Spark at Cloudera. And from there, I went from a platform company to a product company at Lyft, hoping that all the data that’s available to Lyft will be easily made use of, and that’s where I found the problem with discovering and trusting data.
So to talk a little more about that problem, data scientists’ [00:02:00] or analysts’ workflow is divided into four large steps. The first one is, I want to discover what data exists. Has someone done this past work before?
Then I want to go explore this data, better understand what it’s used for, how it’s used. Can I trust it? Then I actually do my ML model or my analytical model, productionalize it, and visualize and communicate it.
Turns out you would expect most of the time to be spent in the last [00:02:30] two steps of the workflow. However, over a third of the time at Lyft was spent in the first two steps. And this was wasted time, because people were just asking around, what is the source of truth of this data?
What exists? What does not? And it led to three interesting side effects. The first effect was this sense of, I don’t know what is out there. I don’t know what I can trust.
There’s a psychological impact, but there’s a very real productivity [00:03:00] impact. The second one was increased load on databases. Every time you have to explore, you run a query saying SELECT * from this table to figure out, what does the region column… Is it a three-letter abbreviation, or is it a full name? The state column, is that just a two-letter code, or is the whole name… So on and so forth.
But you’re also figuring out, what’s the latest partitions and other metadata about the table? And thirdly, [00:03:30] you’re looking for, what is it being used for?
And the most common way is Slack, and it’s me going over to my buddy and being like, “Hey, how do you usually use this table? What is this column joined with? Where is the source of truth for this?”
And these were the three side effects of this problem. And if we step back and think about why this problem became a problem, it’s because there’s this workflow of a data lifecycle. Data is ingested, then stored and processed, [00:04:00] and then later on, it’s consumed.
And you can consume it using analytical tools or ML tools, but you find that there’s a lot of innovation that’s happened in the left side of this diagram. You’ve got Stitch and [Fivetran 00:04:12] in order to ingest data.
You have Airflow and dbt for being able to orchestrate the processing. You have innovation in Snowflake, Redshift, [BigQuery 00:04:23] to store this data.
You’ve got a good chunk of work happening on the consumption side. There’s Looker, Tableau, and [Mode 00:04:29]. The [00:04:30] ML consumption space is bubbling up.
But the thing is, when you have so much data being produced and a lot of people consuming it, the center became under-invested, which led to two problems. The first problem is as a user, I don’t know what we have and what to trust; and second, as an organization, I don’t know how to govern this data because I collect so much data.
Some of it is high-risk, like PII, or sensitive. And some of it I [00:05:00] have to lock down because it’s business restricted. And I can’t govern it without fully understanding what data we collect, and that’s how this problem became a problem. So I went around evaluating solutions.
I figured there might be some industry solutions out there or some open source projects, and that is, in fact, true. Before I go further into the evaluation, I want to share what the goals of this evaluation were.
I wanted something that can capture metadata around these data entities, so a listing of all the tables, dashboards, ETL DAGs, people [00:05:30] in the company and their relationships, right? A relationship is, this DAG produces this data.
This person owns this dataset. This person consumes this dataset and so on and so forth. So the first thing is this graph of metadata. Second thing is you can explore this metadata in user-friendly ways.
So maybe I can search through it. I can track lineage through it, so I can say, “Oh, this user is my team lead. I want to see what datasets they own or they use.”
And lastly, because you can’t anticipate all applications or products you want to build on it, an API where you can expose [00:06:00] this stuff. And the other important thing was there are different sources and classes of sources.
So a class of source is a dashboard or a Kafka topic. A new source within that class is a new type of dashboard.
Maybe you have support for Tableau, but you add support for Looker. And so I wanted it to be easy to extend to new classes of sources as well as new sources within those classes.
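To make the goals above concrete, here is a minimal sketch in Python of a metadata graph with typed relationships. It is only an illustration, not Amundsen's data model: the class `MetadataGraph`, the relation names, and the node identifiers are all hypothetical.

```python
from collections import defaultdict


class MetadataGraph:
    """Tiny in-memory graph of (source, relation, target) edges (hypothetical)."""

    def __init__(self):
        self._out = defaultdict(set)

    def add_edge(self, source, relation, target):
        # Store the target under the (source, relation) pair.
        self._out[(source, relation)].add(target)

    def neighbors(self, source, relation):
        # Return targets in sorted order so results are deterministic.
        return sorted(self._out.get((source, relation), set()))


graph = MetadataGraph()
# The relationships named in the talk: a DAG produces data, people own and use it.
graph.add_edge("dag:rides_etl", "PRODUCES", "table:core.rides")
graph.add_edge("user:team_lead", "OWNS", "table:core.rides")
graph.add_edge("user:team_lead", "USES", "table:core.eta_predictions")
graph.add_edge("dashboard:weekly_rides", "READS", "table:core.rides")

# "This user is my team lead. I want to see what datasets they own or use."
owned = graph.neighbors("user:team_lead", "OWNS")
used = graph.neighbors("user:team_lead", "USES")
print(owned, used)
```

Adding a new class of source (say, a Kafka topic) in this sketch only means introducing a new node prefix and relation name, which is the kind of extensibility the goals ask for.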
Here are the [00:06:30] criteria I was looking for. I wanted to capture a certain kind of information about certain kinds of data assets. So in terms of the depth of metadata, we wanted the ABC of metadata, terminology borrowed from the Ground paper.
A means application context, which is, where is the data? Does it exist, and what are the semantics? Which is the names and the columns, the descriptions and the types.
Second one was behavior, [00:07:00] which is, who’s using it? Who created this data? And the third one is change, because data, like code, changes with time, so you want to be able to track changes in data as well as the code that’s generating the data.
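The ABC breakdown can be sketched as a small record type. This is an assumption-laden illustration in Python: the class and field names below are hypothetical and are not taken from the Ground paper or from Amundsen.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ApplicationContext:
    """A: where the data is and what it means."""
    location: str
    description: str
    columns: Dict[str, str]  # column name -> type


@dataclass
class Behavior:
    """B: who created the data and who uses it."""
    created_by: str
    frequent_users: List[str] = field(default_factory=list)


@dataclass
class Change:
    """C: how the data and the code generating it change over time."""
    schema_versions: List[str] = field(default_factory=list)
    code_commits: List[str] = field(default_factory=list)


@dataclass
class TableMetadata:
    application: ApplicationContext
    behavior: Behavior
    change: Change


# Hypothetical example record for a rides table.
rides = TableMetadata(
    application=ApplicationContext(
        location="hive://core.rides",
        description="One row per completed ride",
        columns={"ride_id": "string", "eta_seconds": "int"},
    ),
    behavior=Behavior(created_by="rides_etl", frequent_users=["alice", "bob"]),
    change=Change(schema_versions=["v1", "v2"], code_commits=["abc123"]),
)
print(rides.application.columns)
```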
So that’s the depth of the metadata. Now we talk about the breadth. The breadth we wanted to capture was in the data stores or in databases, in employees of the company, in dashboards and reports, notebooks, events, schemas, and streams.
And what I found was [00:07:30] in the depth, if I go back a slide, you find that industry tools were very good at doing A, the application context, but hadn’t done B or C. And that was really important because historically, tools have used the notion of a person that I will appoint, Alex, to go every month and say how this data is used, how it should be used, what the common gotchas are, and so on and so forth.
And that just didn’t [00:08:00] scale with the amount of data and the amount of growth that a company like Lyft was having. The second thing, around the breadth of metadata: industry tools and open source projects were very good at capturing the data stores or databases but didn’t capture as deeply the notions of people and of the dashboards and reports that were built on top of them.
I looked at Alation and Collibra as vendors, a bunch of open source projects, and some research, and ultimately came up with a [00:08:30] diagram like this. So we wanted to be able to expose this metadata in three different ways: search-based; lineage-based; and network-based, network-based being me looking at my team lead’s page and being able to say what datasets they use.
And I wanted to have support for various different systems, starting with just the databases, and nothing fit the bill. And that led to the birth of a new project called Amundsen.
It’s named after a Norwegian explorer, Roald Amundsen, [00:09:00] who was the first person who visited both poles and the first person to discover the South Pole. We thought our data discovery project was apt to name after Amundsen.
So first, I’m going to talk to you about what the product looks like, and then we will jump into what the technology behind the product is. So the first application we built on this metadata graph was a data discovery application.
And the goal is for a data scientist or a data analyst, for them to be able to use this at the first step of their workflow to figure out what data [00:09:30] exists. What’s being built on it?
How is it connected? Can I trust it? And that’s this application, so the most common use case is you come, and you search, and you can search for things like ETA. I remember there was an example pre-Amundsen at Lyft.
Data scientists were trying to optimize ETA, and if you’ve taken a Lyft ride, you know that you open up the app; we say, “Your driver is two minutes away”; and you go through the ride funnel, and this ETA keeps on changing. Then there’s the actual ETA at which the driver actually [00:10:00] gets to your doorstep.
So if a data scientist is optimizing ETA for Lyft rides, they have to find the actual ETA. And we measure ETA five times in the session. We may have 20 models that have been in production in the past.
We may have 50 that never made it to production, so 70 models times five, there are that many ETA columns in the warehouse, and you don’t know which one to trust. And that’s a problem that Amundsen was trying to fix.
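Spelled out, the arithmetic in that example looks like this. The figures are the speaker's illustrative "may have" numbers, not measured counts, and the variable names are made up for this sketch.

```python
# Illustrative numbers from the example above, not measured values.
production_models = 20
abandoned_models = 50
eta_measurements_per_session = 5

# Every model may have written one column per measurement point.
candidate_eta_columns = (
    production_models + abandoned_models
) * eta_measurements_per_session
print(candidate_eta_columns)
```

Under those assumptions an analyst faces 350 candidate ETA columns, which is why a trust signal matters more than search alone.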
So in Amundsen, you just simply come in, and [00:10:30] you type ETA. That’s the most common use case. 70, 80% of the people use search.
Now, we’ll talk about search in a moment, but if you don’t know what to search for, you can curate data through tags, and you can then browse this data through that. So you can go through marketing data or finance data.
You can bookmark datasets that you commonly visit, bookmark dashboards that you commonly visit, and you can also have a view of popular tables. Popular here means what’s used in the rest of the organization through [00:11:00] query logs.
That’s not what’s popular in the Amundsen UI. It’s what’s popular in query patterns in the databases. That’s one key paradigm you will continue to see: the premise is that we want to learn from usage patterns and build an opinionated product, through automated metadata about those patterns, around what’s trustworthy and what’s not.
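A minimal sketch of deriving popularity from query logs, in Python. The talk only says popularity comes from query usage; counting distinct users per table, the log format, and the function names here are assumptions made for illustration.

```python
def table_popularity(log):
    """Count distinct users per table (one possible popularity measure)."""
    users_by_table = {}
    for entry in log:
        users_by_table.setdefault(entry["table"], set()).add(entry["user"])
    return {table: len(users) for table, users in users_by_table.items()}


def popular_tables(log, top_n=3):
    """Most popular tables first; ties are broken alphabetically."""
    popularity = table_popularity(log)
    return sorted(popularity, key=lambda t: (-popularity[t], t))[:top_n]


# Hypothetical query log entries extracted from the warehouse.
query_log = [
    {"user": "alice", "table": "core.rides"},
    {"user": "bob", "table": "core.rides"},
    {"user": "alice", "table": "core.rides"},
    {"user": "carol", "table": "core.rides"},
    {"user": "carol", "table": "core.eta_actual"},
    {"user": "alice", "table": "core.eta_actual"},
    {"user": "dave", "table": "tmp.scratch"},
]

print(popular_tables(query_log, top_n=2))
```

Counting distinct users rather than raw queries keeps one heavy user or a scheduled job from dominating the list; that is a design choice of this sketch, not something stated in the talk.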
So you search. We give you a listing of results. In this example, we show you tables. In the Amundsen graph today, we have tables; we have dashboards; and we have people, [00:11:30] so you can search for all of those things.
The listing here is ranked based on popularity. Again, popularity is query usage in the databases. And once you get there, you click on the page, and you get a table detail page.
You get information about the description, which is seeded from the database itself but can be modified, and it goes back to the database. Often, tables have gotchas, like, “Oh, you should always filter on this column,” or so on and so forth.
And this allows an analyst, for example, to report an [00:12:00] issue, and then that issue shows up here. And it also notifies the owners that an issue has been filed. So that way, other people can see what issues may exist with the table and make their own call.
Then we have date range. Most of the metadata is automated, so here, date range is obtained from the partitioning scheme of the data and put here in the view. Last updated comes from the integration with Airflow, [00:12:30] so you can see when this table was last updated.
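As a rough sketch of how a date range could be derived from partitions, assuming Hive-style partition names such as `ds=2021-01-02` (the partition key `ds`, the naming scheme, and the function name are assumptions, not details from the talk):

```python
from datetime import date


def date_range_from_partitions(partitions, key="ds"):
    """Return (earliest, latest) dates found in 'key=YYYY-MM-DD' partitions."""
    days = []
    for partition in partitions:
        name, _, value = partition.partition("=")
        if name == key:
            days.append(date.fromisoformat(value))
    if not days:
        return None  # Unpartitioned table, or partitioned by another key.
    return min(days), max(days)


partitions = ["ds=2020-12-30", "ds=2021-01-02", "ds=2020-11-15"]
print(date_range_from_partitions(partitions))
```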
Frequent users are the most common users of this table and their names. This was meant to be a [face pile 00:12:40], kind of like Facebook, but we weren’t able to get images in the tool. So you get the first letter of the name, and we’re working on improving that.
Then you have tags, so you can attach tags to your dataset, and these are the ones that get curated on the front page we saw. And then you have owner information. That’s going to be ingested via the API, obviously, but can also be updated through the UI.
On the right-hand [00:13:00] side, you have columns. You have descriptions of the columns and their types, and you have the first kind of lineage, which is like, if this table has any dashboards built on top of it, you get to see them in the dashboards pane.
On the top right-hand side, you have the lineage, which is dataset-to-dataset lineage. This was an internal tool at Lyft, and the Amundsen community is now building open source dataset-to-dataset lineage in Amundsen.
And that’s in the works as we speak. You can also see the GitHub link, which is the code that was used [00:13:30] to build this table, and for that code you can see change management: who changed it, what changes were made, and so on and so forth.
If you have access to the data, you can see a preview. And lastly, this is the first step of exploring datasets and creating analyses and decisions, so the last step of this is that explore button, which is, I have discovered this data. This is the one I want to use. You click on the button.
It takes you to a BI tool. And it populates a skeleton query with the right database, with the right schema chosen, and it’s a [00:14:00] SELECT * query, and you can start modifying it.
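A hedged sketch of what generating such a skeleton query might look like in Python. The function name, the `LIMIT` default, and the identifier check are assumptions for illustration; the talk does not describe how the explore button builds its query.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def skeleton_query(database, schema, table, limit=100):
    """Build a starter SELECT * query for a table the user just discovered."""
    for part in (database, schema, table):
        # Reject anything that is not a plain identifier before interpolating.
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"not a plain identifier: {part!r}")
    return f"SELECT *\nFROM {database}.{schema}.{table}\nLIMIT {int(limit)}"


print(skeleton_query("hive", "core", "rides"))
```

The row limit is added here so that an exploratory query does not scan a whole table, which relates to the database-load side effect mentioned earlier in the talk.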
We also show you stats of the column. So if you click on a column, you can see what the distinct values and min, max, and other stats on the column are. I mentioned about preview already.
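The column statistics mentioned above (distinct values, min, max) can be sketched as follows. This is an illustration over an in-memory list; how the real statistics are computed and stored is not described in the talk, and `column_stats` is a hypothetical name.

```python
def column_stats(values):
    """Basic profile of one column; None is treated as a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Hypothetical sample of an eta_seconds column.
print(column_stats([120, 95, None, 120, 300]))
```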
This is an example of what happens when you click on that dashboards pane. So in this case, I took a table, and I clicked on their dashboards. I see what dashboards are built on top of it.
People often use this when they’re making a change to this table: you can learn how deep [00:14:30] and pervasive the change and its impact are going to be by looking at the reports and dashboards built on it. Obviously, the other part of this is what other ETLs and tables are built on it.
And like I said, that’s work in progress in the open source right now. You can search for dashboards as well. A use case here is that you don’t want to do work from scratch if some other work already exists, so this allows you to do that.
And once you’re on the dashboard page, [00:15:00] you see… This is the other side of lineage. You see what tables it is built on top of, and you see what queries are being used in order to build that dashboard.
So this is a use case where an existing dashboard owner or a viewer can establish if they can trust this dashboard, partly because they can see that these tables are either trustworthy or not. We can certify tables, and you can see badges of those certifications of tables here as well.
And the last use case here is searching for a coworker. [00:15:30] And so in this case, you can search for a person, and you go to their page, you see what datasets they own, what datasets they bookmarked, and what datasets they frequently use. Frequently used, actually, is often the most important and most commonly used one, and you can browse through all the tables they frequently use.
Cool. So at this point, we’re going to go towards what the technology is behind this product. We will start from the top right. You saw the front end application. Now we have behind it two services.
One is the search service, which powers [00:16:00] the search that you saw; and the other one is a graph database, which has all this metadata and connections in it. To search, we support Elasticsearch.
For metadata, we support Neo4j by default, but we also have support for AWS Neptune as well as Atlas. Atlas is a product from the Hadoop ecosystem that has tight integrations with other Hadoop ecosystem products.
There were some organizations in the now open source Amundsen community who wanted to make use of the metadata they already had in Atlas, and that’s why we brought in the integration with [00:16:30] Atlas. Underneath all this is a system called Databuilder that grabs this metadata in and puts it in the right places.
Now, this system was historically a pull-based system, hence the name crawler in this diagram, and it essentially goes and gets metadata from your databases, from your dashboarding systems, or from your HR system, and puts it all together. We are now in alpha for a functionality called push, which allows you to push metadata into the [00:17:00] Databuilder.
However, that’s relatively new. And the last thing I would say is that the metadata service is consumed by the front end service, but that’s not the only way you can consume this graph. There are other microservices built on top of this metadata that can consume the graph.
These are obviously organization-specific. At Lyft, we had a feature service that was tagging data, as it was generated, with the kind of feature it was and attributes about that feature. And we also had a security service, and we’ll talk [00:17:30] more about this later on, which was using this metadata to govern access to PII and also make sure that we were complying with [CCPA 00:17:38] and other regulations.
I briefly touched upon the pull versus push model. The pull model makes sense when you have a scrappy team, and you want to just really quickly be able to build a system up, and you don’t want to have a dependency on other teams to send you metadata.
Whereas the push model works well when you need near-real-time indexing, so if a table got [00:18:00] created and you need to know immediately instead of waiting a couple of hours, for example, as in the pull model. And the side effect of the push model is that it ends up creating dependencies within your organization, so the team that owns the dashboarding system has to then send you API calls and messages.
The team that owns the database system has to do that as well. And that’s why Amundsen was pull to begin with: because we were a small, scrappy team trying to build this product very, very quickly.
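The trade-off between the two models can be sketched in a few lines of Python. Everything here is hypothetical (`MetadataStore`, `pull_sync`, `push_event`, the event shape); it does not reflect Databuilder's actual interfaces.

```python
class MetadataStore:
    """Stand-in for the metadata graph (hypothetical)."""

    def __init__(self):
        self.tables = {}

    def upsert(self, name, metadata):
        self.tables[name] = dict(metadata)


def pull_sync(source_catalog, store):
    """Pull model: the catalog team crawls the source on its own schedule."""
    for name, metadata in source_catalog.items():
        store.upsert(name, metadata)
    return len(source_catalog)


def push_event(store, event):
    """Push model: the owning team sends a message when something changes."""
    store.upsert(event["table"], event["metadata"])


warehouse = {"core.rides": {"owner": "rides-team"}}
store = MetadataStore()
pull_sync(warehouse, store)

# A new table appears after the scheduled crawl has already run.
warehouse["core.eta_actual"] = {"owner": "eta-team"}
visible_before_next_pull = "core.eta_actual" in store.tables

# With push, the warehouse team's event makes it visible right away.
push_event(store, {"table": "core.eta_actual", "metadata": {"owner": "eta-team"}})
visible_after_push = "core.eta_actual" in store.tables
print(visible_before_next_pull, visible_after_push)
```

In the sketch, pull needs nothing from the warehouse team but lags until the next crawl, while push is immediate but only works if the owning team emits the event.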
And for the [00:18:30] discovery use case, a delay of a couple of hours was fine, and hence we didn’t see a push model as something to start with.
But we are on the path to supporting that. Cool. I wanted to share a little more about relevance and search. If you search for apple on Google and I show you an orange, that’s obviously a bad result.
But really, when you search for apple, maybe you get the fruit, but perhaps you were searching for the computer company. And that’s the same logic we use in Amundsen.
[00:19:00] We want to strike the balance between relevance and popularity. Relevance in the table world comes from what we search: descriptions, table names, column names, and tags. Popularity is querying activity, and we weigh automated activity versus ad hoc activity differently. There’s an equivalent in the dashboard world as well.
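One way to picture balancing relevance against popularity is the Python sketch below. It is not Amundsen's ranking function: the scoring formula, the weights, the assumption that ad hoc queries count more than automated ones, and all names are made up for illustration.

```python
import math

SEARCHABLE_FIELDS = ("name", "description", "columns", "tags")


def relevance(query, table):
    """Number of searchable fields that contain the query text."""
    needle = query.lower()
    score = 0
    for field_name in SEARCHABLE_FIELDS:
        value = table[field_name]
        text = " ".join(value) if isinstance(value, (list, tuple)) else value
        if needle in text.lower():
            score += 1
    return score


def popularity(table, adhoc_weight=1.0, automated_weight=0.1):
    """Log-damped query activity; the two weights are assumed values."""
    weighted = (
        adhoc_weight * table["adhoc_queries"]
        + automated_weight * table["automated_queries"]
    )
    return math.log1p(weighted)


def rank(query, tables):
    """Keep only matching tables, most relevant and popular first."""
    matches = [t for t in tables if relevance(query, t) > 0]
    return sorted(
        matches,
        key=lambda t: relevance(query, t) * (1 + popularity(t)),
        reverse=True,
    )


tables = [
    {"name": "scratch.eta_experiment_v3", "description": "Abandoned model output",
     "columns": ["ride_id", "eta_seconds"], "tags": [],
     "adhoc_queries": 2, "automated_queries": 0},
    {"name": "core.eta_actual", "description": "Actual arrival time of the driver",
     "columns": ["ride_id", "eta_seconds"], "tags": ["certified"],
     "adhoc_queries": 500, "automated_queries": 1000},
    {"name": "core.rides", "description": "One row per ride",
     "columns": ["ride_id", "city"], "tags": ["core"],
     "adhoc_queries": 900, "automated_queries": 5000},
]

ranked_names = [t["name"] for t in rank("eta", tables)]
print(ranked_names)
```

Both ETA tables match the query equally well on text, so popularity decides the order, and the heavily used table with no textual match is left out entirely.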
Why are companies choosing Amundsen? First reason is it’s a catalog for the next generation architecture, so it actually integrates with Airflow for orchestration, Hive and Spark for ETL, [00:19:30] and your newer data lakes and data warehouses.
The second one, and this is an ongoing thing for us, is lower time to value. Historically, if you take a vendor tool, which is a data catalog tool, it relies a lot on human-entered metadata: you go deprecate this table.
You go mark it as certified. You go tell how it’s used, and that wasn’t going to scale at Lyft, and I’m noticing that that doesn’t scale in other organizations. So this is an automated grabbing of metadata and trickling it to the right places, through [00:20:00] the lineage, user experience, and ML.
So the impact of this product at Lyft was that we had 750 users every week. For context, Lyft only has about 250, 300 analysts, so this number was twice as large. And the number kept on increasing since the launch.
Why was that number twice as large as apparent in this diagram? So this diagram is the penetration rate. Blue bar shows the number of people at Lyft with that title; red bar shows how many of those people came to [00:20:30] Amundsen in the last week.
So you find that the penetration rate is highest actually in the second and the third column, which is data scientists and research scientists, the exact people that we wanted to build this product for. But you find that the software engineer penetration rate is about 17%, and that’s because most of them don’t use Amundsen day-to-day.
They’re building mobile apps or services and so on and so forth, but about 20% of them are data engineers and interact with data every day, and those are [00:21:00] the ones that use this tool every day. Then there’s a long tail of less technical, more business users who come to this tool because the barrier to entry to doing analysis has now been lowered.
Amundsen is now open source. It has over 1100 community members from 150 companies with 30 companies using it in production. To give you a sense of who these companies are, companies that use Amundsen include Reddit, Asana, ING, Square, Workday, obviously created at Lyft, and Brex, Convoy, Instacart, [00:21:30] and then a whole bunch of companies in the community.
The future for Amundsen, it’s twofold. One is the future for the data discovery app, so we want to do dataset lineage at table and column level. We want to have ACL integration so only specific users can modify and edit descriptions.
We want to show improvements in search quality as well as context and have a shopping cart experience where you can request access, which integrates with an internal system. And here’s a [00:22:00] mock of what I was talking about of dataset and column lineage, and here it shows you that this particular table has three other tables derived from it.
And the red arrow shows that there was a failure, so it can also help you easily debug. And this is one addition that actually was made at Square.
And this is taking Amundsen and saying, “Amundsen is two things. It’s a data discovery application and a metadata graph.”
And Square said, “I don’t quite care about the data discovery use case. I’m going to classify all my [00:22:30] data as PII and what subject it relates to and then show a notification to owners that this data has been flagged as sensitive data. Do you agree or not?”
The owners approve or reject, and this is then used to gate access in Snowflake by creating different views for different people based on their role. This is an integration with features, which now enables feature discovery in Amundsen; in particular, this is Feast.
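The Square workflow described above (flag columns as sensitive, let owners approve or reject, then expose role-specific views) might be sketched like this in Python. It is a guess at the shape of the idea only: the function names, the proposal format, the view naming, and the generated SQL are assumptions and are not Square's implementation.

```python
def approved_pii_columns(proposals):
    """Columns whose sensitive-data flag was approved by the owner."""
    return {p["column"] for p in proposals if p["owner_decision"] == "approved"}


def view_sql(table, columns, pii_columns, role, roles_with_pii_access):
    """Generate a per-role view that hides approved PII columns."""
    if role in roles_with_pii_access:
        visible = list(columns)
    else:
        visible = [c for c in columns if c not in pii_columns]
    if not visible:
        raise ValueError("no visible columns for this role")
    return f"CREATE VIEW {table}_{role}_v AS SELECT {', '.join(visible)} FROM {table}"


# Hypothetical classifier output, reviewed by the table owner.
proposals = [
    {"column": "rider_email", "owner_decision": "approved"},
    {"column": "eta_seconds", "owner_decision": "rejected"},
]
pii = approved_pii_columns(proposals)
columns = ["ride_id", "rider_email", "eta_seconds"]

analyst_view = view_sql("rides", columns, pii, "analyst", {"privacy_admin"})
admin_view = view_sql("rides", columns, pii, "privacy_admin", {"privacy_admin"})
print(analyst_view)
print(admin_view)
```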
And I’ll finish with this. The [00:23:00] breadth of metadata means data discovery is only the first application. There’s a huge graph of metadata that we can use for other applications… security and privacy through the application Square built… and we imagine a future in which the same metadata will be used to build and partner with other applications in solving other related problems like monitoring, quality, cost management, and maintenance.
To summarize, there’s a huge pain for data discovery. That’s what led to Amundsen. [00:23:30] There’s a huge opportunity to create other metadata applications like the one for privacy that I gave you an example of.
If you want to read more about Amundsen, the website is the first link. Then you have the GitHub page, a blog post, and a blog post from the community.
There’s also another great talk tomorrow from Josh about effectively using data lakes with Amundsen and Dremio, and I would strongly encourage you to check that talk out at noon tomorrow. [00:24:00] That is all from me. Thank you so much for attending, and I look forward to your questions.
Excellent. Thank you so much, Mark. Great talk. So we have multiple questions in the chat, and we also have some participants queued up.
So I’m going to go ahead and queue up someone from the list here to do a question live while I also ask a question that we have in the chat. And this one comes from [Gopal 00:24:29], and he or she is [00:24:30] asking, where does Amundsen source the metadata from?
Okay, so the metadata is coming from a variety of different sources. So it comes from databases to dashboard systems to HR systems.
And it stores it in the graph database, so it stores it in Neo4j. I’m not sure if I’m answering this question right, but if there’s more, please do let me know.
Speaker 1: Excellent. And I believe this is a follow-up to another question. It says, does [00:25:00] Amundsen integrate with Metabase dashboards?
Today, it doesn’t. However, the integration for adding a new connector, whether that’s your dashboard or database, is fairly simple. And if you were interested in that, I would love for you to come join the community, and we can help you write that connector and connect that to your Metabase dashboard.
Excellent. Let’s see, another one here. Can [00:25:30] the automatically derived metadata be manually enhanced?
Absolutely. This is a great point. So an example of an automatically derived metadata could be that these two columns are related.
This is something that doesn’t exist but would exist in the future with lineage. Another example is a description that comes from an existing system, like your Stitch or Fivetran definition, and that trickles down, and you can manually enhance that in the user interface [00:26:00] and [inaudible 00:26:02] depending on what the source was, where it came from.
Great. I think you already mentioned this, but just in case, does Amundsen have data lineage including transformation?
Right. So three kinds of lineage that I should talk about. The first kind is dataset to dataset lineage, which you do ETL on; the second one is dataset to ETL DAG lineage, so what DAG produced this dataset?
And the third one is dataset to report or dashboard lineage. Amundsen [00:26:30] has the last two today, so you can relate an ETL DAG to the table, and you can relate a dashboard to a table. It doesn’t have dataset to dataset today, and that’s a work in progress.
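The three kinds of lineage in that answer can be laid out as a small Python sketch. The enum, the edge tuples, and the helper functions are hypothetical; only the split between what was supported at the time of the talk and what was in progress comes from the answer above.

```python
from enum import Enum


class LineageKind(Enum):
    DATASET_TO_DATASET = "dataset_to_dataset"      # work in progress per the talk
    DAG_TO_DATASET = "dag_to_dataset"              # which DAG produced this table
    DATASET_TO_DASHBOARD = "dataset_to_dashboard"  # what is built on this table


SUPPORTED_AT_TIME_OF_TALK = {
    LineageKind.DAG_TO_DATASET,
    LineageKind.DATASET_TO_DASHBOARD,
}

# Hypothetical edges as (kind, source, target).
edges = [
    (LineageKind.DAG_TO_DATASET, "rides_etl", "core.rides"),
    (LineageKind.DATASET_TO_DASHBOARD, "core.rides", "weekly_rides"),
    (LineageKind.DATASET_TO_DASHBOARD, "core.rides", "ops_overview"),
]


def producing_dags(table, lineage_edges):
    """DAGs that write the given table."""
    return sorted(
        src for kind, src, dst in lineage_edges
        if kind is LineageKind.DAG_TO_DATASET and dst == table
    )


def dashboards_built_on(table, lineage_edges):
    """Dashboards that read the given table (impact of changing it)."""
    return sorted(
        dst for kind, src, dst in lineage_edges
        if kind is LineageKind.DATASET_TO_DASHBOARD and src == table
    )


print(producing_dags("core.rides", edges))
print(dashboards_built_on("core.rides", edges))
```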
Great. And [Louise 00:26:45] is asking, I’ve seen Project Whale is building on top of the Amundsen Databuilder library. Are you surprised people are leveraging bits of your architecture?
No, I think that’s open source. It’s licensed under the Apache [00:27:00] License, version 2.0, and I think it’s intentional. Some people can use some stuff or all of it.
And I think that’s where we get the benefits of the community and getting the integrations as well. And so I think it’s a part of being an open source product as well.
Great. And then another question. Can Amundsen get metadata from EA tools like Enterprise Architect, [00:27:30] BlueDolphin, and other ones?
That’s a good question. I’m not super familiar with those tools. If you want to hit me up and tell me more about your use case, I would love to learn more and see how we can make that integration happen.
Cool. Excellent. Let me see if I’m missing anything here. Talking about BlueDolphin.
Hang on. There’s another here. How does Amundsen [00:28:00] link synonyms from different data sources?
Right, so there are two things here. One is data sources sometimes are referred by one name in the database, but they’re referred by a different name in people’s minds.
So an example is maybe there’s a passenger table, but you call them [rider 00:28:22]. But this passenger table may be coming from three different datasets.
And what you can [00:28:30] commonly do for this is that you take the rider tag, and you apply it to all the passenger tables from various different places. When you search for rider, you see all these passenger tables that are coming from various different data sources.
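The tag-as-synonym approach in that answer can be sketched as follows. The table names, the tag data, and the `search` function are hypothetical; real search in Amundsen goes through its search service rather than an in-memory dictionary.

```python
# Hypothetical tables from three sources, all tagged with the business term.
tagged_tables = {
    "rides_db.passenger": {"rider"},
    "payments_db.passenger": {"rider"},
    "support_db.passenger": {"rider"},
    "core.drivers": {"driver"},
}


def search(term, tables):
    """Match a term against table names and against their tags."""
    needle = term.lower()
    return sorted(
        name
        for name, tags in tables.items()
        if needle in name.lower() or needle in {tag.lower() for tag in tags}
    )


# Searching by the name people use finds the same tables as the database name.
print(search("rider", tagged_tables))
print(search("passenger", tagged_tables))
```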
Great. And I see a lot of people who had put themselves in queue to ask live questions, but I’m not seeing one. I don’t know if they are becoming shy, stage fright or…
Oh, there you go. So we [00:29:00] have [Haim 00:29:02]. Haim, you will have to enable your audio and video. No, they [crosstalk 00:29:07] something else.
Okay. That’s good. Let me see if we have Craig in here. It takes a little bit for it to show up. Let’s see, Victor.
All right, I think we’re getting to the end, so let’s go ahead and get a couple more questions here from the chat. Is there any effort to expose Amundsen [00:29:30] metadata to [JupyterHub 00:29:32]?
Mark Grover: Yeah, so that’s a great question. I want to talk about this in a little more detail. The first thing is the Amundsen you see is grabbing metadata from places and putting it in it.
But then the other thing is you can take this metadata, and you can start showing it in various places, so a place where you may want to show it is a BI tool. When you’re typing your Mode query or something like that, you see descriptions, or you see, when was this table last updated?
Is it being used for other things? Is it popular? Or so on and so forth, and [00:30:00] I think this is in line with what Sanjay is asking too: another application where you can show this is JupyterHub.
So thus far, I would say Amundsen has only done the job of grabbing this metadata. And now I think we are going to have to start exposing this metadata in other places.
The API is available, but we have to write Chrome plugins or integrations to actually show this metadata in the right panes for the users so they don’t have to switch windows in order to leverage this metadata. [00:30:30] That thing doesn’t exist is the short answer, but we’d love your help and support in taking us in that direction.
Great. One more before we have to go. Any plans to capture metadata from Apache Beam like Amundsen does with Airflow DAGs?
Streams are in the future. The number of streams an organization has is an order of magnitude lower than the number of tables and batch processes they have, so we have focused more on batch simply by the sheer [00:31:00] volume of things that an organization has.
It’s definitely in the vision but doesn’t exist today. If it’s interesting to you, would love for you to come join the community and collaborate.
Wonderful. Well, Mark, I want to thank you so much for this wonderful presentation, for your time. And also, all of you, the audience, thank you so much for being here today.
Please don’t forget to take a look at the Slido tab on the right-hand corner of your screen so you can provide quick feedback on the session and the entire conference. [00:31:30] I invite you to continue participating and listening to the rest of the conference.
And please, everybody, be safe and be healthy. And I’ll talk to you soon. Bye-bye.
Mark Grover: Thank you, bye.