Subsurface LIVE Winter 2021
Effectively Cataloging Data Lakes with Amundsen and Dremio
Similar to how Dremio is a data lake engine, Amundsen is an open source data discovery and metadata engine. With some recent contributions to the Amundsen source code to include Dremio integration, the two work together to effectively enable the discovery of data in your data lake via a usage-based recommender system.
The session will:
1. Introduce the need for cataloging
2. Provide an overview of Amundsen
3. Describe the integration with Dremio
4. Demo the new functionality
Joshua Howard, Manager, EY Consulting
Josh Howard leads multiple diverse and multidisciplinary engineering teams for EY, a global leader in the professional services industry. He is based out of Atlanta, GA and enjoys coding, coffee and his cat.
So, welcome everyone, I hope you’re having a good afternoon. Good evening, good morning, depending on where you are joining us from today.
Today we have with us Josh Howard from EY. He’s the manager of Emerging Technologies at EY, and he has a wonderful presentation for us today. Before we start, there are a couple of housekeeping items that I want to run by the audience. If you have any questions, please go ahead and post your questions in the chat [00:00:30] window, which you will find on the right-hand side of this screen. At the end of the presentation, please don’t forget to join the Slack channel for follow-up questions. And also, please go ahead and go to the Slido tab that you will see there, where you can provide feedback on the event and the session as well. We will have a live Q and A at the end of the presentation. So if you want to participate in that, we will queue you so you can activate your camera and your microphone [00:01:00] so we can make it more interactive. Without further ado, Josh, the time is yours.
Thank you. To begin, I’d like to digress maybe a little bit and start with a bit of an analogy. If you guys played with Legos as a kid, which hopefully some of you did, your pile [00:01:30] of Legos might’ve looked a little bit like what’s on the left. And you might have felt, while you were playing with your Legos and trying to build things, that it was a little frustrating to find things, because you don’t necessarily know exactly what you have. And you could probably imagine that if you were a little bit more disciplined, and had organized your Legos prior to building things, you could build things a little bit more easily, build better things, maybe build things a bit quicker. [00:02:00] And so I think for the purpose of the analogy, building things from pieces requires organization. And if you’re a data engineer, the pieces you’re building with are usually data.
And so we’re going to talk a little bit about how data catalogs can provide that organization and exactly how to go about doing that. So to give an overview, the purpose of this talk is to provide an approach to cataloging data lakes that’s efficient, scalable, and reusable. Here, [00:02:30] I’m defining efficient as: the approach requires minimal effort for infrastructure and engineers. Scalable: the approach applies to all data sizes. And reusable: the approach can be used for multiple data sources. Now here, we’re talking specifically about data lakes because Subsurface is the data lake conference, but this is still an extensible approach, and we’ll be kind of plugging that a little bit as we go along. I just wanted to put this talk [00:03:00] in context with the talk that Mark Grover gave yesterday.
He is the founder of Amundsen and gave a great talk on the motivation and the future-state roadmap of Amundsen. It’s definitely worth checking that out in addition to this one, and I think the recording should be available next week in case you missed it. But you know, to motivate this, I’m a contributor to Amundsen, and I have been working on it for [00:03:30] about eight months now. And we’ve been using this tool at EY. My team has been using it to enable a quicker time to insight for most of our data science teams.
So without further ado, we’re going to start by motivating the use of data catalogs. We’ll describe the approach to data cataloging with Amundsen and Dremio, and then we’re going to take a deep dive into what exactly that approach looks like from an end user’s perspective. And then, as mentioned, we’ll take questions at the end.
[00:04:00] So we’ll start by talking about what exactly a data catalog is. A data catalog is an application which provides context on data assets. And there’s a paper that I like called Ground, where they break up data context into three different categories that correspond to A, B and C. So first you have application context, which is essentially the context that provides things like descriptions, [00:04:30] tags and schemas. And usually that’s essentially where data cataloging stops. But they also define behavior context, which is information on how the data is used and who it was created by. And then also change context, which is information on the frequency and type of updates to the data.
And so you might be wondering, why do you need to catalog your data? And outside of the Lego analogy, it’s a great question, because [00:05:00] data cataloging seems to have only risen to prominence in the past couple of years, whereas the data renaissance has been happening for 20. And I think the key difference, and why cataloging data is so important now, is really the shift from a warehousing environment to almost a pure data lakehouse environment.
And I think that shift is really best represented by these two quotes here. So Bill Inmon, [00:05:30] who is somebody who speaks on how to organize your data warehouse, would argue, “there’s no point in bringing data into the data warehouse without integrating it”. And there’s no author for the second quote, but no doubt you’ve heard it: “just put the data in the data lake”. We may never use it again, who cares, and it’s kind of hand-wavy, but it probably gets said more than usual. And so the key thing to understand about this is whenever you’re integrating data into a data warehouse, you are [00:06:00] provisioning it for a predefined context, whereas data lakes really allow you to use more of an emergent context. And with that emergent context, it’s very important to make sure that you retain enough information to make that data useful down the road.
So last bit on the Lego analogy: if you had pre-purchased a Lego set to go and build some Yeezys, like on the left, those Legos [00:06:30] already come with a purpose. There are instructions, you put the set together and everything works, but you can also have other blocks lying around. So what we’re going to be talking about is, what is the minimal set of information that you need to provide with those blocks so that if they become useful later, you’ll know.
So hopefully you’re with me on why you would want to go about doing this. So now let’s talk a little bit more about how. So we talked [00:07:00] previously and said that a data catalog is an application for storing data context, and it’s usually based off of asset metadata. And the challenge is really that metadata can be very complex. And so the metadata for a single table is shown on the right. This is one table and it has columns, tags, users, and there’s a lot of information here, and this is only the metadata that’s one degree removed from the table. So [00:07:30] each of these other entities that aren’t the center one also have extended metadata, and the list goes on and on. So metadata is very complex and you need an application that handles that complexity for you. And so Amundsen is a full-featured and open source data catalog that does this for you, essentially.
And whenever I say full-featured and open source, full-featured is in comparison to the commercially available data catalogs like Collibra [00:08:00] and Alation, but Amundsen offers up these features essentially for free, which is great. And another feature that really sets it apart from the rest of the contenders, I would say, is the automated metadata collection aspect of it. What that means is Amundsen doesn’t rely on an end user to go and manually create table definitions. And you don’t have to go throughout your organization and define table owners in order to [00:08:30] get table entries populated. Amundsen follows an automated metadata collection method, which will allow you to go and just scrape sources for the information that you need without any manual intervention. And so we’ll talk a little bit more about that.
Dremio here, I think, is essential to the architecture because it’s a distributed query engine, and that helps you abstract away some of the complexity of metadata collection.
So [00:09:00] let’s double-click on Amundsen. It’s an application, but it’s composed of four microservices. You have the frontend service, which is essentially what the end user interacts with. You have the metadata service, which serves as a proxy for Neo4j or other graph databases. If you’re not familiar with Neo4j, it’s an application database that follows a graph model. Pictured on the previous slide is an example of one of those models. And then you also have the search service, which is a proxy for Elasticsearch. [00:09:30] The data builder service is the service which is responsible for metadata ingestion, but it’s a little bit different from the other services, because it’s something you have to create yourself. Amundsen, out of the box, doesn’t know how to go and automatically ingest your data, but it does provide a framework for doing just that. And so we’ll talk for a minute on how to build out that data builder service.
So [00:10:00] if you go to the GitHub, it will describe the data builder service as an ETL framework for Amundsen, which is highly motivated by Apache Gobblin. I’ve never worked with Gobblin, I actually didn’t know what it was until starting this, but I would like to start out by saying that it’s not necessary reading. Whenever you start out looking at this, it looks maybe a little perplexing, it looks like there’s a lot to do. But the key thing is, [00:10:30] you need to know that there’s a job API, there’s a task API. And then there are these things that reside within those APIs called extractors, transformers, loaders, and publishers. So to simplify further, if you look through the examples, you’ll see very quickly that the transformer’s rarely used, and the loader and the publisher are essentially boilerplate. So the only thing you have to do is really define what extractors you want [00:11:00] to use and implement those, and then wrap the rest of it up with the boilerplate and the task and job APIs.
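As a rough illustration of how the job, task, extractor, loader and publisher pieces described above fit together, here is a minimal, self-contained Python sketch. It deliberately does not import the real Amundsen data builder package; every class name below (`ListExtractor`, `InMemoryLoader`, `InMemoryPublisher`, `Task`, `Job`) is hypothetical and only mirrors the general shape of the pattern, not the project's actual API.

```python
from typing import Any, Dict, Iterator, List, Optional

Record = Dict[str, Any]


class ListExtractor:
    """Hypothetical extractor: returns one metadata record per call, then None."""

    def __init__(self, records: List[Record]) -> None:
        self._records: Iterator[Record] = iter(records)

    def extract(self) -> Optional[Record]:
        return next(self._records, None)


class InMemoryLoader:
    """Hypothetical loader: stages records in a list (boilerplate in practice)."""

    def __init__(self) -> None:
        self.staged: List[Record] = []

    def load(self, record: Record) -> None:
        self.staged.append(record)


class InMemoryPublisher:
    """Hypothetical publisher: writes staged records to a dict keyed by table name."""

    def __init__(self) -> None:
        self.store: Dict[str, Record] = {}

    def publish(self, staged: List[Record]) -> None:
        for record in staged:
            self.store[record["name"]] = record


class Task:
    """Pulls records from the extractor and hands each one to the loader."""

    def __init__(self, extractor: ListExtractor, loader: InMemoryLoader) -> None:
        self.extractor = extractor
        self.loader = loader

    def run(self) -> None:
        record = self.extractor.extract()
        while record is not None:
            self.loader.load(record)
            record = self.extractor.extract()


class Job:
    """Runs the task, then publishes whatever the loader staged."""

    def __init__(self, task: Task, publisher: InMemoryPublisher) -> None:
        self.task = task
        self.publisher = publisher

    def launch(self) -> None:
        self.task.run()
        self.publisher.publish(self.task.loader.staged)


# Usage: only the extractor is specific to the source; the rest is reusable.
publisher = InMemoryPublisher()
job = Job(Task(ListExtractor([{"name": "titles", "schema": "imdb"}]), InMemoryLoader()), publisher)
job.launch()
```

The point of the sketch is the division of labor: swapping the data source means swapping the extractor, while the task, job, loader and publisher stay the same.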
So if we were to go about doing this, say we wanted to do this manually, where we’re putting the tables that are sitting in our data lake into Amundsen. You could probably theorize about the way you would want to go do this. We’re going to go through the naive approach, and as [00:11:30] a disclaimer, this is definitely not the way you want to do it. But I think it’s worth going over this just so we understand. For the first option, the first thing that you could do is say, I want my table to be populated in Amundsen. So let me go pull some metadata manually, maybe using boto3, for example, if your data is in S3, where you would go and build out this table CSV file, and then you would go and pack all of this metadata [00:12:00] into that file and then you would be good to go. And the point here is that, they are fixed with texts, it’s meant to just kind of be [anichart 00:12:09] to cheer a bit, but it’s a lot.
The second thing you’d have to do is actually go repeat the same for the columns, because the columns actually have their own metadata as well. And then you would go define an extractor. This is a short code snippet, which is essentially how you define extractors in Amundsen. Again, there’s [00:12:30] more documentation on the repo, but you define an extractor, first you pass in some parameters to that extractor, namely the file names that we just created, you go wrap it in the job and task APIs, and then you run the job. And you do that however many times you need to, or with whatever frequency you need to; it’s really up to you.
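To make the manual approach above concrete, here is a small Python sketch that hand-builds a table CSV and a column CSV using only the standard library. The field names in `TABLE_FIELDS` and `COLUMN_FIELDS`, and the sample values, are assumptions for illustration; they may not match the headers that Amundsen's own CSV extractor expects, and the step of scraping S3 with boto3 is left out entirely.

```python
import csv
import io
from typing import Dict, List

# Hypothetical headers; check the Amundsen data builder samples for the real ones.
TABLE_FIELDS = ["database", "cluster", "schema", "name", "description"]
COLUMN_FIELDS = ["name", "type", "table_name", "sort_order"]


def to_csv(rows: List[Dict[str, str]], fields: List[str]) -> str:
    """Serialize hand-collected metadata rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Every table has to be described by hand...
table_csv = to_csv(
    [
        {
            "database": "dremio",
            "cluster": "lake",
            "schema": "imdb",
            "name": "titles",
            "description": "One row per movie title",
        }
    ],
    TABLE_FIELDS,
)

# ...and then every column of every table, again by hand.
column_csv = to_csv(
    [
        {"name": "title_id", "type": "VARCHAR", "table_name": "titles", "sort_order": "0"},
        {"name": "release_year", "type": "INTEGER", "table_name": "titles", "sort_order": "1"},
    ],
    COLUMN_FIELDS,
)
```

Even for one table with two columns this is a fair amount of hand-maintained text, and nothing keeps it in sync when the underlying data changes, which is the weakness of the naive approach.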
So that already sounds pretty simple, but we’re kind of glossing over a lot of the complexity of going and scraping that [00:13:00] data from S3. This isn’t particularly scalable, it’s a lot of work, and it’s not particularly resilient to changes or anything like that. The second option is really just to use Dremio. And if you’ve already got Dremio set up, you already have virtual datasets, which are configured to read from your underlying data lakes. Then all you have to do is pass in some configs so that Amundsen can connect to Dremio using this extractor and [00:13:30] then run it. And then that automatically syncs your datasets in Dremio, which sit on top of the underlying data lake. And it will push all of that into Amundsen, and you have guaranteed consistency, and it does it rather efficiently as well.
This is also the point where multiple different data sources can be used. So Dremio doesn’t just support data lakes; there are multiple other backend data sources you can use. Same [00:14:00] goes: you could swap out Dremio for another distributed query engine that has a completely different data source profile. But the idea is really, you want that automated metadata collection because it saves time and also guarantees consistency, and Amundsen with a distributed query engine can kind of help you do that. To reiterate that, the naive approach here really looks like connecting Amundsen to every individual data source. [00:14:30] And based off the sheer number of extractors that they have, you would kind of maybe think that this is what they want you to do, because there’s been an enormous amount of work on making Amundsen work in different environments. But what I’m essentially proposing is that you kind of want more of a layered approach, where you sit Amundsen on top of one of these query engines and then that actually federates out to data sources.
Another question you might be asking is, [00:15:00] why do we need this distributed query engine? Why don’t I just go to my source? Why, if I’m a SQL Server shop or whatever, don’t I just go to this one source? And so the reason for putting in this distributed query engine as an intermediary is, first, it’s really required for big data. And then most distributed query engines actually have a separation from storage, which allows them to connect to multiple backend data sources, which is beneficial. And then these query engines [00:15:30] usually follow the ANSI SQL standard, which defines something called the information schema, which is essentially what you need in order to enable this automated metadata extraction. They follow a standard which actually facilitates the automatic extraction of that data. Key thing again is that there are several of these. If you have one already in-house, you should probably just be able to hook that up. It’s not [00:16:00] absolutely necessary to have Dremio in order to make Amundsen work, but it is a great choice for querying data lakes.
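The role of the information schema in automated extraction can be sketched as follows. This is a hedged illustration, not the actual Dremio extractor: the query uses standard `INFORMATION_SCHEMA.COLUMNS` column names (identifier quoting may differ between engines), no connection is opened, and the `group_columns` helper and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Standard-SQL style query; an engine-specific client would execute this.
COLUMNS_QUERY = """
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, ORDINAL_POSITION
FROM INFORMATION_SCHEMA.COLUMNS
"""

# (schema, table, column, data type, ordinal position)
Row = Tuple[str, str, str, str, int]


def group_columns(rows: Iterable[Row]) -> Dict[str, List[Dict[str, str]]]:
    """Group flat information-schema rows into per-table column lists."""
    tables: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    # Sort so each table's columns come out in their declared order.
    for schema, table, column, data_type, _ in sorted(rows, key=lambda r: (r[0], r[1], r[4])):
        tables[f"{schema}.{table}"].append({"name": column, "type": data_type})
    return dict(tables)


# Sample rows standing in for a query result (made up for illustration).
sample_rows: List[Row] = [
    ("imdb", "titles", "release_year", "INTEGER", 2),
    ("imdb", "titles", "title_id", "VARCHAR", 1),
    ("imdb", "ratings", "title_id", "VARCHAR", 1),
]
catalog = group_columns(sample_rows)
```

Because every engine that exposes an information schema returns rows of roughly this shape, one extractor written against it can cover whatever sources the query engine federates, rather than one extractor per source.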
Now that we have this approach laid out, let’s move on over and talk about what the user experience that results from this approach looks like. And so to do this, I’m going to start by looking at a table in Dremio. The table that I’m using as an example is from IMDb, which is a movie database [00:16:30] that some of you might be familiar with. The purpose of this table is that it just stores a record of different movies, when they were created, that type of thing. And you can see here, if you’re familiar with Dremio, this is created as a virtual dataset, which is mapping over data that’s stored in S3. And you can see a little bit about the schema and also what the data looks like here.
The user journey really starts in Amundsen, because what the user wants to do is find data and then query it. And [00:17:00] so they’re going to want to start by finding data. So they navigate to the Amundsen UI, they search for the data that they want, maybe they entered a query for movies or IMDb or something like that. But either way, they found their way to the IMDb titles table. And this is the table detail view in Amundsen, which gives you all of that context that we talked about before. So there is application context in the top left with the description. You have owners and frequent users, which gives you an idea of that behavior context. [00:17:30] You have last updated, both for the table and Amundsen as a whole. So that gives you a little bit of information about the change context. And then you have the full schema definition, even things like the size of the data, monthly costs and S3 tags, all of that kind of stuff.
And the thing that I really want to double-click on is, where does the user go next? Let’s say that they saw this table and, based off of the context here, they think that this is exactly what they want to use. [00:18:00] So they have two options to explore further. The first is the preview button, which will actually pull up a view of the underlying data. And this is configurable in the frontend to work with Dremio; there’s an example in the GitHub repo. But essentially it pulls up a table that’s fetched whenever you load the table detail view, so you see an actual copy of the data that can be viewed from the application itself. And then you can actually go [00:18:30] and see if that is really what you are looking for, which I think is very useful.
Amundsen also supports things like column statistics and data quality metrics; all of that stuff can be put into the UI. But I think that requires a deeper integration, and honestly, just being able to see a snapshot of the data is usually, I think, more useful, and I tend to put a little bit more trust in that. So they can exit out of the preview page, and if they actually want to go [00:19:00] and do something with that data, then they can navigate to Dremio from the Amundsen UI, with the source data model populated. So if you click on the Dremio link in the header, then it will actually just take you right back to the table in Dremio, where you can munge the data, draft your SQL statements, and then use it for whatever further downstream application you’re trying to build.
In summary, the goal here is to give you [00:19:30] a little bit of an understanding of how to effectively catalog data lakes using Amundsen and Dremio. If you take one thing away from this, I just want to make it very clear that it’s absolutely essential to catalog your data, and also that it’s very easy to. If you are using these two tools together, you should be able to get this up and running, almost in production, within 24 hours at most. So it’s very simple to get started, and Amundsen has an [00:20:00] absolutely great community behind it. The team that is developing it at Lyft is really great, they’re very helpful, and they have a very active Slack channel and are also very responsive on GitHub. So it’s a great community to be a part of.
If you would like to reach me, my contact information is below and now we can take questions. [00:20:20] [00:20:30] Lucy I think you’re on mute.
Speaker 2: Can you hear me now? So we had a couple of members of the audience that were queued up for video, but they changed their minds. Let’s see, we have questions here in the chat window. One of them is coming from Nikhil Patel. He’s asking: any plan to support AWS Neptune as a backend database?
Josh: [00:21:00] Sorry, what was the question there?
Any plan to support AWS Neptune as a backend database?
So in this case, AWS Neptune is already to my understanding a supported backend for the Amundsen meta-data service. So that’s just the way that the app runs. So if you’re a Neptune shop, you can definitely configure that.
And the other one from Steve, he’s asking, does Amundsen support [00:21:30] data lineage?
So I don’t currently use that feature in the version that I run, but it definitely might be in the full version now; I’m not exactly sure. It was discussed yesterday, so if you could check out that talk, that would probably answer that question.
And I’m going to take that as an opportunity to let the audience know, and all of you who are asking if this is going to be available: the talk will [00:22:00] be available later. With the recording, you will be able to see this entire session along with the questions as well. So in the following days, just keep checking the Subsurface website for more information on that. John Scott is asking if the Dremio integration in Amundsen is available out of the box, or if this is something that you had to create.
I mean, actually the contribution I made was creating [00:22:30] it, but it is now available out of the box. So you can go to the Amundsen data builder repo and you can very easily find it from their page; there’s an example under the README that you can follow.
I think we have time for a couple more questions. The original one here, this person is asking if Amundsen lets you add custom metadata.
I think that’s a great question. Overall, the tool is incredibly configurable. [00:23:00] I would say it’s almost a double-edged sword, because you can make it into whatever you want. And one of the things that Mark covered yesterday was about building additional microservices on the metadata model that Amundsen supports. So it’s definitely configurable, but the further you branch away from the existing standard, the less support there’s definitely going to be for that. But it’s pretty easy to understand the way that everything works under the hood.
Excellent. And one [00:23:30] more: does Amundsen allow us to group data into domains?
There’s a facility for tagging, where you could do domain-based tagging. The Amundsen app itself doesn’t have a concept of domains, but you could definitely implement that.
And last but not least, Zach is asking if there are any development resources that you would suggest for any newbies out there.
[00:24:00] Whenever I say that Amundsen is probably the most well-documented project I’ve ever seen, it’s incredibly hard to overstate that. Their documentation’s great.
Okay. And I think we have time for one more. Does Amundsen support geospatial data?
So in this case, you’re just storing the metadata. So as long as the metadata [00:24:30] that you would want to store for the geospatial data fits with the metadata model, then it should support it. I’m not that familiar with geospatial, though.
Okay, I think that is everything that we have for today. Josh, thank you so much for such a wonderful presentation. To the rest of the audience, if you can go to the Slido tab and provide an answer to the three questions that we have in there, it will take you nothing but 10 seconds. And I hope everyone [00:25:00] continues to enjoy the conference today. And thank you for participating, and talk to you soon.
Thank you guys.