Data Discovery at Lyft and Convoy

Session Abstract

In this talk, we will introduce why it’s so crucial to solve data discovery, and discuss the learnings from addressing this problem at Lyft and Convoy. The leading open source data catalog is used by 750 users every week at Lyft and by 80% of Convoy’s employees every month. We will share what makes a successful data catalog and the latest improvements in Amundsen, including lineage and dbt integration. We will end with what’s still not working well and how we as a community could tackle it.

Everyone has access to data but few know what exists, what’s trustworthy and how to use it. Humans solve this problem naturally through the gossip protocols of Slack and shoulder-tapping which doesn’t scale and comes at a huge productivity loss. But it gets worse. Wrong data leads to wrong conclusions.

Mark and Chad saw this problem first hand at their respective organizations – Lyft & Convoy. Analysts and data scientists were spending more than 1/3rd of their time discovering and establishing trust in the data they use. Lyft has made its analysts and data scientists over 20% more productive by creating and using the leading open source data discovery and metadata engine, Amundsen. Convoy has ~80% of the company use Amundsen for data discovery and trust.

Transcript

Mark: Thank you, Dave. Really great seeing you all. I appreciate you taking the time to… Let me get my screen figured out…to be here. Yeah.

So I’m Mark. As they mentioned, I’m the co-founder of Stemma and the co-creator of Amundsen. The reason we are here today is that the problem that exists in organizations today is they have captured so much data and they have democratized [00:00:30] access to data, to various people within the company. The big problem that exists now is knowing what data exists. How do we effectively use it to make better decisions, to make better products for the company? And the problem has now become a problem of trusting this data. And both Chad and I have done a lot of work in this space, and we are sharing with you what this problem is, how we have thought about it in our current and past companies and how we’re solving [00:01:00] that problem.

So I’m the co-creator of a leading open source project in this space called Amundsen. It’s an automated data discovery and cataloging tool. And with me is Chad.

Chad: Hey everybody. My name is Chad Sanderson. I am the head of product for the data platform team at Convoy, and we own essentially Convoy’s end-to-end data stack. And that includes data discovery as a very important piece.

Mark: Cool. So with that, let’s talk through what exactly the problem [00:01:30] is. So the problem is this: there is a lot of wasted time in the current analytical workflow. And there are various different personas out there: the analysts, the data scientists and the data engineers. We’re going to start with just focusing on the analyst.

And if you take an analyst’s workflow, you can break it up into four categories. The first part is ‘I want to discover what data we have. I want to figure out what analysis has been done in the past.’ So say I’m doing optimization for ETA for a ride sharing company, I want to be able to first figure out [00:02:00] what is the source of ETA data, and has someone already done this work.

Then let’s say no one has done this work; I’m the first person to do this work. I go explore this data. I understand what the shape of the data is, what the distributions of the various fields are, if it’s still being populated, so on and so forth. Then I do my actual model development and productionization. This is where we want analysts to spend most of their time and energy. And the last step is to actually communicate [00:02:30] and validate and test.

The problem is that you notice that the first two steps are a place where the analyst spends a third of their time, while the last two take two thirds of their time. And that one third of their time can actually be reduced significantly; that’s wasted time. So this is the place where I found at Lyft, where I was a PM before, there was a lot of wasted analyst and data scientist time being spent.

The other [00:03:00] thing that happens is that this wasted time leads to a lot of side effects. And some of these side effects are these three. The first one is there’s a lot of unknowns and just uncertainty. Does this data exist? Has prior work been done here? What is the source of truth? Who owns it? Who uses it? The next one is around people running the same queries over and over again. These are your select queries or select distinct queries that show up in the database, increase the load and just slow down real processes and real jobs. And lastly, and most [00:03:30] importantly, in my opinion, it creates an interrupt-heavy culture. At these companies, you end up having a lot of questions on Slack asking like, “Hey, what does this mean? How do I use this? Is this thing still being updated? Who else uses it? Can someone help me out?”

These questions lead to shallow work. You can’t do deep work at organizations because of this. And if we were to look back at the industry, in retrospect it makes sense why this is a problem. So you take a look and we [00:04:00] as a community, as an industry have invested a lot in ingestion and storage. So things like Fivetran, Stitch, Redshift, Snowflake, BigQuery all let you ingest data into the system and then store it very, very easily with the data warehouses that exist. And on the consumption side, BI tools have existed for a long time. So you’ve got Tableau, Looker, Mode and so many other tools that enable consumption of these data [00:04:30] products that exist in the data warehouse, and we’ve also trained more people to be SQL savvy, to be able to use data in decision-making, to be able to interpret dashboards and change the sliders in order to make decisions.

So of course the problem isn’t at the ingestion and the consumption. The problem is right in the center: we have collected so much data as an industry that a data scientist, or an analyst or a data engineer doesn’t know what exists. [00:05:00] Can I use it? And if you’re an owner of the data and you want to modify the data, you don’t know who is using it so you can notify them that this thing is going to get modified or deprecated in this way. And so what I did in my past job, where I was a product manager at Lyft, was I said, “Gee, this problem doesn’t seem unique to Lyft. I’m going to go evaluate what solutions exist in this space.” And I looked at a few solutions. There were two main requirements I was going after.

One was [00:05:30] that I could get all this metadata in an automated manner. And it’s pretty easy to get the information schema from a data warehouse or just the list of dashboards from an API of a BI tool. What’s more interesting is, can you collect more of a sense of trust? Can you get metadata from Airflow, in fact, to be able to tell what is the last time when this table’s ETL job ran? How often is this table utilized? Who uses it, how many dashboards use it? All [00:06:00] that stuff can give you a sense of what’s trustworthy without relying on a human to actually go and curate, like, this is the source of truth data. And unfortunately all the solutions that existed prior to this were all relying on this human, this data steward, and having an army of data stewards that will go curate every single piece of source of truth.

And the problem with that is that that stuff doesn’t work out in modern organizations because the people who have this context in their head are too busy, [00:06:30] and the data stewards don’t have the organizational and the data context about all these different areas to be able to populate them. You can never hire enough of them. The company’s data systems are evolving at such a pace that as soon as you document them, the documentation goes out of date. So it was very important to find a place where we could get all of this information automatically. The second one was that we integrate with the best of breed products. So like we were [00:07:00] integrating with the data warehouses of today, we’re integrating with Slack, which is where a lot of conversations still happen. So those were the two main requirements. And then you expose the metadata in various different ways.

You can power search ranking, kind of PageRank style. You can expose it through a lineage-based view where you can see what is being impacted if I change this. You can do a network-based view. So if I join Chad’s team, I could say like, “Oh, what tables or dashboards does Chad use every day?” And then lastly have an API that exposes them. [00:07:30] And so that’s what I was looking for. I looked at a few different solutions in this space. There were vendors, open-source projects, but nothing fit the bill because many of them were based on the old school warehouses and almost all of them were based on this idea of curation, which was not going to pan out. So that led me to start a project within Lyft called Amundsen, which is an automated data catalog. I’m going to do just a quick two-minute demo on that in the interest [00:08:00] of time.
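To make the idea of usage-driven search ranking concrete, here is a minimal Python sketch that orders search hits by how heavily each table is queried. It is a simplification and not Amundsen’s actual ranking code: the scoring formula, the `rank_tables` function and the sample table records are all assumptions made for illustration.

```python
import math


def rank_tables(query, tables):
    """Return names of tables matching the query, most heavily used first."""
    terms = query.lower().split()
    scored = []
    for table in tables:
        text = (table["name"] + " " + table.get("description", "")).lower()
        matches = sum(1 for term in terms if term in text)
        if matches == 0:
            continue  # no query term appears in the name or description
        # Hypothetical score: text match weighted by log-scaled usage.
        score = matches * math.log1p(table["weekly_queries"])
        scored.append((score, table["name"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Made-up catalog entries; usage counts would come from query logs.
TABLES = [
    {"name": "covid_cases_raw", "description": "unvalidated import", "weekly_queries": 3},
    {"name": "covid_cases", "description": "daily covid cases by region", "weekly_queries": 420},
    {"name": "rides", "description": "completed rides", "weekly_queries": 900},
]

ranked = rank_tables("covid cases", TABLES)
```

The point of the design is that the popular table rises to the top without anyone certifying it by hand; the usage count does the curation.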

And we’ll go on there. So this is the Stemma product, which is a hosted version of Amundsen, but the core of the product and the journeys remain the same. And so here the most common use case is that you, as an analyst, can come in and search for COVID cases data, and you get results from your data warehouse, as well as your dashboarding system. So in this case, I’m showing you results from Snowflake, and the dashboarding system is Mode, but the integrations exist with many of the data warehouses and dashboarding systems. These results [00:08:30] are ranked based on a PageRank-style system. There is an opportunity for you to curate if you so desire, but the real power of the product, in line with what I was saying before, is when you can use automation to figure out what’s trustworthy and present that as a view.

And so here you notice that the second table is more around cases, which looks to me like the one I should be using, even though it’s not certified. So I can click on this table and I have to quickly figure out, can I trust it? And the pieces of automated metadata that come out of [00:09:00] the box here are, you can see what is the last updated time for this data. So no one has to put in this information. We just show you that this job was last updated yesterday at 02:00 AM. And that’s like a really good proxy for, is this thing still up to date? We can query metadata from underneath the data warehouse to actually show you what is the first date for which the table has data and the last date for which you have data. We can also show you who are the most common users who query this data.
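The two automated trust signals mentioned here, the last updated time and the first and last dates with data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 36-hour freshness window and the sample timestamps are hypothetical, and a real deployment would read these values from the orchestrator and the warehouse rather than hard-coding them.

```python
from datetime import date, datetime, timedelta


def is_fresh(last_updated, now, max_age=timedelta(hours=36)):
    """True if the table's last load is recent enough to look up to date."""
    return now - last_updated <= max_age


def watermarks(partition_dates):
    """Return the (first, last) dates for which the table has data."""
    return min(partition_dates), max(partition_dates)


# Hypothetical metadata for one table.
now = datetime(2021, 5, 20, 9, 0)
last_updated = datetime(2021, 5, 19, 2, 0)  # "yesterday at 02:00 AM"
partitions = [date(2020, 3, 1), date(2021, 5, 18), date(2020, 12, 31)]

fresh = is_fresh(last_updated, now)
first_date, last_date = watermarks(partitions)
```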

And this comes from parsing the data warehouse’s query logs. We show you what are the common [00:09:30] dashboards that have been built on it. So maybe the analyst can actually avoid doing duplicate work. We show you what is the source where this data comes from. This is lineage. And in this example, I’m showing table-level lineage. It comes from parsing of the access logs of your data warehouse itself. There are a few other things here. I won’t go into them in the interest of time, and I’m happy to dig into any of these areas. But for example, another thing that people want to see is in what ways is the table used? So you could see, like, [00:10:00] how often this table was used in the last week.
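As a rough illustration of how query logs can yield both frequent users and table-level lineage, the Python sketch below scans a small, made-up log with regular expressions. It is not how Amundsen or any warehouse parses SQL: the regexes, the `table_lineage` and `top_users` helpers and the log entries are assumptions, and a production system would use a proper SQL parser to cope with subqueries, CTEs and quoting.

```python
import re
from collections import Counter

# Simplified patterns: the written table, and the tables that are read.
TARGET_RE = re.compile(r"\b(?:insert\s+into|create\s+table)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return (target_table, source_tables) for a statement that writes a table."""
    target = TARGET_RE.search(sql)
    if target is None:
        return None, set()
    sources = set(SOURCE_RE.findall(sql)) - {target.group(1)}
    return target.group(1), sources


def top_users(query_log, table, n=3):
    """Most frequent users whose queries read from the given table."""
    counts = Counter(user for user, sql in query_log if table in SOURCE_RE.findall(sql))
    return counts.most_common(n)


# Hypothetical (user, sql) entries from a warehouse query log.
QUERY_LOG = [
    ("etl", "INSERT INTO core.covid_cases SELECT * FROM raw.cases_feed c "
            "JOIN raw.regions r ON c.region_id = r.id"),
    ("alice", "SELECT count(*) FROM core.covid_cases"),
    ("alice", "SELECT region FROM core.covid_cases WHERE day = '2021-05-01'"),
    ("bob", "SELECT * FROM core.covid_cases LIMIT 10"),
]

target, sources = table_lineage(QUERY_LOG[0][1])
users = top_users(QUERY_LOG, "core.covid_cases")
```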

And then one addition we have made more recently in Stemma is that we show you the most common join and filter conditions, because once you have established this is something you want to use, you can quickly see in what ways should I use it? Like, how does this dataset fit with the other datasets that are at the company? Anyway, that’s all I’ll share for now. And if there are any particular questions or areas you want to dig into, I’m happy to do that in the Q&A session as well. And to share more [00:10:30] about the architecture for Amundsen here: Amundsen is basically built on microservices. So we can start with the top right corner. There is a front end service, which is the UI that you saw. Behind it is a search service, which is powered by Elasticsearch. And then on the left center you will see a metadata service. This is the service that houses essentially the graph of all the data objects. What tables exist, what columns exist, who are the people using it?

There’s a node for a human and you can go see what tables they own or query. There’s a node for a dashboard [00:11:00] and so on and so forth. Now this metadata service is most often queried by the front end service, but there are also other microservices that can query it, both to put data in here as well as to take data out. And lastly, obviously both of these, Elasticsearch and Neo4j, need to be populated, and for that we have a system called Databuilder, which integrates with our data warehouses, orchestration systems, dashboarding systems and HR systems to get a view of what the world of data looks like.
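The graph of data objects held by the metadata service can be pictured with a tiny in-memory model. The Python sketch below is only a stand-in for a real graph store: the `MetadataGraph` class, the `READS` and `OWNS` relation names and the sample nodes are hypothetical and do not reflect Amundsen’s actual schema or API.

```python
from collections import defaultdict


class MetadataGraph:
    """Tiny in-memory stand-in for a metadata graph of data objects."""

    def __init__(self):
        self.kinds = {}                 # node key -> "table", "user" or "dashboard"
        self.edges = defaultdict(set)   # (source key, relation) -> target keys

    def add_node(self, key, kind):
        self.kinds[key] = kind

    def add_edge(self, source, relation, target):
        self.edges[(source, relation)].add(target)

    def related(self, source, relation, kind=None):
        """Targets linked from source by relation, optionally filtered by kind."""
        targets = self.edges.get((source, relation), set())
        return sorted(t for t in targets if kind is None or self.kinds.get(t) == kind)


graph = MetadataGraph()
graph.add_node("chad", "user")
graph.add_node("core.shipments", "table")
graph.add_node("bi.carrier_scorecard", "table")
graph.add_node("Weekly Freight Volume", "dashboard")
graph.add_edge("chad", "READS", "core.shipments")
graph.add_edge("chad", "READS", "Weekly Freight Volume")
graph.add_edge("chad", "OWNS", "bi.carrier_scorecard")

# "What tables does Chad use every day?"
chads_tables = graph.related("chad", "READS", kind="table")
```

Keeping people, tables and dashboards as nodes in one graph is what lets a single query answer both "who uses this table" and "what does this person use".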

And so why you [00:11:30] should think about getting something like Amundsen: A, it’s an automated data catalog. Most data catalogs fail because we can’t put in enough curated information for them to be successful, and the single biggest thing you can do to help with that is to actually get as much technical metadata, business metadata and lineage as you can through automation. And that’s what you get with Amundsen. The second one is that Amundsen integrates with the best of breed tools that exist today. So it integrates with Airflow for orchestration, a bunch of data warehouses that are the state of the art today, as well as [00:12:00] the ETL execution engines, Slack for conversations and notifications, things of that sort.

So the impact that this has had at Lyft was huge. At Lyft, 80% of the data scientists, data analysts and data engineers used this product every week, and it had over 750 weekly active users. And this gives a sense of who were the people who used Amundsen. You notice that the blue bar is the number of people who actually had that role at Lyft and the red bar is [00:12:30] how many of those people used Amundsen in the last week. So the ratio is most important, and that’s the penetration rate. So penetration is super high, 80% and 70% respectively for data scientists and analysts. For SWEs, the penetration rate is low, but that’s because data engineers at Lyft have the SWE title, and the penetration rate among data engineers was super high as well.

Amundsen is open source and has a community of almost 2000 people, 150 companies in the community and almost 40 companies using it at this point. And these companies come in various [00:13:00] different shapes and sizes, like Reddit, Gusto, Workday, Square, ING, Asana. And that brings me to Chad. Chad is the head of data products at Convoy. And he ran into a version of this problem at Convoy. And he wanted to share with you what that problem looked like, and why and how they have solved it there.

Chad: Awesome. Thanks Mark. [00:13:30] Okay. Well, like Mark said, my name is Chad Sanderson. I run the data team at Convoy. Convoy is a later stage startup. It’s a series D, based in Seattle in the freight brokerage industry. So we’re a two-sided marketplace. We sit in between the shipper that is trying to transport freight to a facility and a trucker that is trying to take that freight for a fee. And we operate a marketplace that is built on machine learning to match the right shipment [00:14:00] to the right carrier. Next slide please. So the data platform team at Convoy, which is the team that I run has a pretty large surface area, meaning we cover everything from the low-level data layer, the data warehouse, ETL, all of our streaming jobs, Kafka is supported by our team, all the way up to what we call the application layer.

So how we use that data at Convoy, which includes an internal experimentation platform, an internal platform for deploying models [00:14:30] and building features, as well as common transform tools like dbt and Airflow, and data discovery as well. You can see Amundsen right in the center of that graphic of tooling there. And the goal of our team is to improve the efficiency of our data scientists by 10X. And that requires really making intelligent investments in the tools that we buy and build. Next slide please. So [00:15:00] en route to our goal of increasing efficiency of the data science team by 10X, back in 2020 we did an analysis of our largest data science problems. One of those problems that made it through the survey was data discoverability. We found that the data science team had a lot of issues discovering the data that they needed in order to build queries.

This took a very long period of time. We had a whole bunch of tools that we use [00:15:30] at Convoy that sort of serve as pseudo documentation tools. We have Shiny apps that do this. We use dbt docs; we’re very heavy dbt power users. And we had several channels in Slack where folks would ask questions about data. What does this column mean? What does this table mean? Where can I find this information? And none of that was really collected centrally. So it led to a lot of rebuilding the wheel anytime somebody wanted to go on a similar journey of discovery. [00:16:00] There were a few other huge issues, one of which was a lot of institutional knowledge at the company as well. So there were data scientists that had been at Convoy for a long time, and really only they understood the queries and tables and models that they had built. And then as with, I think, a lot of startups and a lot of companies in general, our data model was incredibly complex. Next slide, please.

So our team, we realized this was a problem. We saw a huge opportunity to solve it. And [00:16:30] there were a few requirements we had for a solution. The first one is that we wanted to make adding metadata and really informative context around all of our data entities very, very easy. At the time we used dbt for most of our data documentation and you have to write a PR for that. So oftentimes data scientists who maybe forgot about adding documentation when they were building the model [00:17:00] didn’t want the overhead of going back and writing a PR just for documentation’s sake when they already understood what the data was for. We thought search was the most ideal way to actually discover data. It seemed a lot more natural than some of the tools we were using. We wanted to add a ton more context, and then also we wanted data discovery to serve as a foundation for future data products that we would build.

Next slide please. So we did an evaluation of a bunch of products [00:17:30] in the market, many of which Mark listed in his evaluation slide. And we decided to go with Amundsen primarily because the feature set really matched what we were looking for. It was a super lightweight technology. We could get onboarded and using this thing very easily and it mapped pretty closely to Convoy’s existing tech stack, and the community, as Mark already showed, was really awesome, very, very vibrant and extremely helpful. And there were several features that we added sort of en route to really getting solid adoption at Convoy. [00:18:00] Databuilder is something Mark also showed off. So we sort of created that dbt integration. We also created a Metabase integration and an integration with Mixpanel as well for reporting and usage statistics. Okay. So now onto the fun part, how did we actually roll this out internally and how was that process received?

So there’s a few things that we did. My idea in general is we wanted to gamify this process as much as possible. Really, nobody likes being [00:18:30] forced to onboard to a new tool. So we started off with a small number of stakeholders that really cared about data discovery as a problem; they understood how important it was. Maybe they were coming from other companies where they were using something like an Alation or a DataHub. And those were going to be our internal advocates that could sell Amundsen to the rest of the business and be our early adopters. The other thing that we did was set up a documentation day, which was about a three hour time block that not everybody was required [00:19:00] to participate in, but it was on everyone’s calendars. And the goal was that during that time block, you should go into Amundsen, get familiar with it and add as much documentation as possible to the data entities that your team owns.

And we had an entire spreadsheet where we broke out all the columns and all the major tables in our core and BI schemas, which are the two primary schemas at Convoy, and attached owners to those so that people could very easily go in and figure out what they own, and then add the documentation to that. The outcome is that we had [00:19:30] 750 column descriptions added in our BI schema and 150 column descriptions added in our core schema, which is like our centrally managed data engineering schema in Snowflake. And then what I think is probably even more important than all the metadata is that more than a hundred people got extremely familiar with the tool. They understood how to search. They understood how to add metadata. They understood how to find owners. So our adoption rate, we got ramped up on Amundsen very, very quickly [00:20:00] across the entire business.

Next slide please. So in terms of adoption today, about 80% of our product and engineering team has used Amundsen at some point, and we had around 4,500 searches a month at our peak. And right now that’s leveled out to about 3,000 to 3,500 searches every single month for data. And in terms of users, it ranges all the way from data scientists [00:20:30] and analysts to product managers, program managers. And surprisingly, one of the largest groups of power users at Convoy is actually our operations managers, who are searching for data around our shippers. Here’s some feedback that we got that sort of tickled me a little bit. So this is from one of our lead data scientists at the time. And we had an Amundsen outage, something that we did pretty stupidly on our part, [00:21:00] but when Amundsen went down, he did say, it’s only when you lose something, you realize how much you actually need it. So very, very quickly Amundsen went from this is a cool little project that we’re trialing to a critical part of the data science workflow at Convoy.

One other example of this is within our quarterly planning documents. Tim, who is always tickled when I mention him in presentations, is the head of data science at Convoy, [00:21:30] and he pointed out when we were writing about the value Amundsen brought during the quarter that the entire data science team really, really loved it. And I got a lot of ad hoc qualitative feedback that it was one of the most valuable products that we implemented in terms of open source over the last one to two years. And here’s a graph of the Amundsen searches that we have. You can see when we first implemented Amundsen, we had a huge spike, [00:22:00] and really that came from that initial onboarding and documentation day that I mentioned before. And those searches have held relatively stable over the last year.

Okay, in terms of upcoming plans for Amundsen, we’ve really seen the impact and the value of data discovery. It’s definitely affected our roadmap and we want to invest in this space even more. So a lot of these features are offerings that Stemma has. So that’s certainly [00:22:30] something that we’re thinking about, but here are the big areas that we really care about and are very interested in. Data lineage is huge. We’re going through a few large scale migrations right now. So having some tools that can make it very easy for data scientists to figure out what queries need to be deprecated would be extremely important.

Being able to index all types of dashboards, specifically for Metabase, which is our BI tool of choice. We have an internal metric store and feature store. So being able to index things from those tools would also be awesome. [00:23:00] And then there are a few things that are a bit more advanced, but that I can definitely see being a huge part of data discovery at Convoy eventually, like table relativity, essentially making it very easy for a data scientist to say, “I have a table over here and a table over here. I’m not exactly sure how to join these.” Maybe that metadata can assist with that problem as well. And then also importing those conversations from Slack that I mentioned before into the tool for indexing also.

Mark: [00:23:30] Cool. Thank you, Chad. So we’re going to just spend the last minute sharing with you what the future looks like for Amundsen. So the way this project started was only for that first pillar of data discovery and trust. And that was the main goal that I was trying to solve for when I started this project at Lyft. What I have learned along the way is that we have gathered a goldmine of data, metadata specifically, that can be used to power other applications. So I’ll show you some of the work that’s happening in the data discovery [00:24:00] and trust application, but I’ll also show you one example of work that’s happening in an adjacent application, which is a security and privacy application that Square created. So in the data discovery and trust application, as Chad was saying, we are, as an open source community, working on dataset- and column-level lineage. You are seeing some of that already in the product.

And some of that we are still working on building, and the goal there would be to mark ETL jobs that have failed as red, and then be able to quickly [00:24:30] debug what was happening. We have built an integration with ML features. So you can discover ML features in ways similar to the way you were discovering data. And this is the application I was talking about around classification. So Square, with Amundsen, has built a classifier using GCP’s DLP, where you can tag a bunch of metadata on whether a particular column has PII (name, email address, so on and so forth), and then be able to surface that [00:25:00] in Amundsen and then be able to audit based on that when PII shows up in places that it shouldn’t show up in. So to summarize, data discovery is a huge pain.
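The tag-then-audit flow described here can be sketched in Python with a deliberately simple classifier. This sketch does not call any DLP service and is not Square’s implementation: it swaps in two regular expressions as a stand-in detector, and the tag names, the 0.8 match threshold, the `classify_column` and `audit` helpers and the sample columns are all assumptions for illustration.

```python
import re

# Illustrative patterns only; a managed detection service covers far more types.
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(samples, threshold=0.8):
    """Return a PII tag if at least `threshold` of the non-empty samples match it."""
    values = [s for s in samples if s]
    if not values:
        return None
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return tag
    return None


def audit(sampled_columns, allowed):
    """List (column, tag) pairs where PII appears outside the allow-list."""
    findings = []
    for column, samples in sorted(sampled_columns.items()):
        tag = classify_column(samples)
        if tag is not None and column not in allowed:
            findings.append((column, tag))
    return findings


# Hypothetical sampled values per column.
SAMPLES = {
    "users.email": ["ana@example.com", "li@example.org"],
    "shipments.origin_city": ["Seattle", "Tacoma"],
    "support.notes_contact": ["joe@example.com", "", "kim@example.net"],
}

findings = audit(SAMPLES, allowed={"users.email"})
```

The tags produced by `classify_column` are the kind of metadata that could be surfaced on a column in the catalog, while `audit` illustrates flagging PII that shows up where it is not expected.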

That was the primary pain that Amundsen was created to solve for. It does a phenomenal job at it. My company Stemma provides a managed version of the same that you can deploy really easily, with some proprietary additions. But in general, there’s a huge opportunity for metadata-driven applications, and the metadata that’s being used for data discovery could also be [00:25:30] used in other spaces like the privacy example we saw. If you are interested in learning more about Amundsen, join us on Slack; that’s the link over there. And then I left you some URLs where you could go and learn a little more. The first one is the Amundsen website. The second one is the Amundsen GitHub repo. The third one is the state of the gap between producers and consumers of data. And the fourth one is the blog post from Square on data privacy. That’s all from Chad and me today. And we look forward to your questions.

Speaker 3: [00:26:00] Great. Thanks Mark and Chad. All right, let’s go ahead and open it up for Q and A. Remember, if you have any questions, you can use the button in the upper right to share your audio and video, and you can ask via video, and you’ll automatically be put into a queue. And if for some reason you can’t, or don’t want to do that, you can simply ask your questions in the chat. That’s fine too. And okay, I don’t see anybody in the audio and video, so let’s go straight into the questions. All right. So the first question I see here [00:26:30] is, is there a source for this breakdown on wasted time?

Mark: The source for this was a user interview survey that I did at Lyft. So it was specific to Lyft. I’m kind of curious if Chad’s done something similar there, but I don’t know of any general survey that exists across organizations.

Chad: Yeah. We’ve done internal surveys as well, and found basically a similar breakdown of pain, but I’m also not aware [00:27:00] of any standardized reports on this.

Speaker 3: I did like that phrase, shallow work, which Kevin posted in the chat. It means corporate spending more money on new technology that only further exacerbates the operational breakdown due to data and insight disconnect. It’s a nice summary there. All right. So let’s get onto the next question. Does Amundsen also extract content from tables for search?

Mark: Right. So [00:27:30] the answer is, it depends. So the big thing to remember here is that Amundsen searches metadata, not data. So it usually doesn’t touch your data, and if you want to query your data, you actually go to your data lake and you use JDBC or whatever tools you have to actually query this data. The place where sometimes the data inside a table or a column becomes metadata is when you have a column that has some limited number of values. So the countries [00:28:00] you operate in, there’s only, let’s say, 50 of those, and those are relevant. Or the operating systems of the cell phones that you deploy your application on is an example of another dimension. So if the number of distinct values in a column is less than 20, by default, we will be able to get that information and show you that in the context of the metadata. But in general, data is considered a separate thing and we [00:28:30] don’t touch it and we don’t help you query or search it.
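The rule of thumb Mark gives, that a column’s values are treated as metadata only when there are fewer than 20 distinct ones, can be written as a short Python sketch. The `enum_values` function and its `max_distinct` parameter are hypothetical names used for illustration; only the threshold of 20 comes from the talk.

```python
def enum_values(column_values, max_distinct=20):
    """Return the sorted distinct values if there are fewer than max_distinct,
    otherwise None, meaning the column is treated as plain data."""
    distinct = set()
    for value in column_values:
        distinct.add(value)
        if len(distinct) >= max_distinct:
            return None  # too many distinct values to show as metadata
    return sorted(distinct)


# A low-cardinality dimension versus a high-cardinality identifier column.
os_values = enum_values(["ios", "android", "ios", "android", "ios"])
id_values = enum_values(range(1000))
```

Stopping as soon as the threshold is reached means a high-cardinality column is abandoned after a handful of rows instead of being scanned in full.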

Speaker 3: Okay. And then we got a question here, where does Amundsen get its name?

Mark: So Amundsen is named after Roald Amundsen, who was the first person to reach the South Pole and hence the first person to get to both the poles. And we thought for a data discovery project, it was great to name it after an explorer. And so that’s how the name came about. In retrospect, it’s not [00:29:00] the most amazing name. It’s hard to say for the American palate and hard to remember, hard to spell, but it’s also very memorable. So that’s where the name comes from.

Speaker 3: Okay. Yeah. Well, as someone with a last name ending in S-E-N, I can tell you that [inaudible 00:29:16] spelled S-O-N [inaudible 00:29:17]. Okay. So here we go. Last question in the Q&A, folks. If you have any other questions, please post them. Last question here is from Lucas. What [00:29:30] kind of inputs still need to be manually put in on Amundsen, for example, via, I’m not sure what he means by documentation competition here, but just start with that first part of the question.

Mark: Yeah. So I’ll take the parts that come automatically. And then Chad has a lot of great insight into what went on top of that. So things that come automatically are a listing of tables, a listing of columns, integration with [00:30:00] ETL systems to get, like, when was this table last updated? You get usage information: who’s using this, both people as well as dashboards. And you get lineage information through query parsing in order to show what’s built from what. And I’ll let Chad speak to what human stuff needed to go on top of that.

Chad: Yeah. So the things that we asked our team to manually add into the Amundsen UI were mainly around the business logic and context of [00:30:30] data entities. So as an example, for a particular table, we wanted them to add what does a row in this table actually represent? In many cases that wasn’t documented anywhere very clearly. And the same thing goes for columns. Like from a business logic perspective, what does this column actually represent? If you have a column called, let’s say, shipment time, is that time in hours? Is it in kilometers? Is it in miles? Is it in something [00:31:00] else? We wanted to make this search experience as easy as possible for data scientists. And for that, the context is important. That does require some manual effort.

Speaker 3: Okay. Folks, we are actually running out of time here. So before we actually do get cut off here, let me just mention that I posted the links to join in Slack. If you haven’t joined already, you can use the link I posted there to sign into Slack. And then you can search for the Mark Grover and Chad Sanderson channel. Or [00:31:30] I did post the link here that I believe will take you directly to that channel. And we do have some questions here, but at this point, we’re over time. So I’m going to suggest that we take this and continue it over on Slack.

And we’ve got some good questions here. We’ve got a question about where does data discovery fall under a corporate digital transformation journey? You got another question here from Kit and from Arthur and from Nadia, but we are just out of time. So join us over in Slack, [00:32:00] folks. And both Mark and Chad will join us there as well. And we can ask additional questions there. With that, thanks Mark and Chad. Let’s go ahead and wrap this up, and that’s all the questions. I’m sorry. And I do want to remind you to answer the Slido survey questions. It’s very, very short and the next session is coming up in just five minutes. You have the expo hall open. I encourage you to check out the booths. Thanks everyone. And enjoy the rest of the conference.