Platform-Agnostic Lineage Validates and Creates Trust in Your Data Lake

Session Abstract

If stakeholders are to make the most of your data lake, they need confidence in the data’s quality, reliability, and timeliness. Where did it originate? How does it relate to other data? What transformations happened along the way, and when?

Data lineage gives you this information, offering a complete reckoning of the journey your data took, from ingestion to where it exists today.

Join the CompilerWorks CTO as he takes a technical deep dive into data lineage and shows how CompilerWorks Lineage automatically discovers the provenance of your data using only metadata dumps and query logs, never touching your data. CompilerWorks Lineage creates trust in your data.

Video Transcript

Shevek:    Thank you very much. So, the powers that be at Subsurface and at CompilerWorks, they’ve been kind enough to grant us 25 minutes to have some sort of conversation. And the conversations are always more valuable if you get what you want out of the conversation. So, I’m actually going to encourage you wholeheartedly to throw your questions and comments into the chat and I will deal with them as I go along. I’m Shevek. I’m the tech lead at CompilerWorks. [00:00:30] I’m going to make a totally unsubstantiated claim because I have no idea how you would substantiate it, but I might have written more compilers than anybody else, full stop. No idea how one would go about substantiating that, but that actually gives us an interesting opportunity: if one simply wants to pile in and ask questions about compilers at any stage or for any platform, let’s have a talk about it. If you’ve been to music festivals or used products or [00:01:00] your kids play Minecraft, you’ve probably interacted with something that’s been run through one of my compilers.

Until the questions pile in, I’m going to talk a little bit, to get us started, about automatic lineage analysis of data processing code. So, with the background of the introduction that Titus gave, and my background and our background that goes across the team, I’m going to make some ridiculous claims. Why not? Let’s have fun with it. It’ll get the argument started, and [00:01:30] you’re thoroughly encouraged to dig into how any of this works. We take data processing code and we compile it. It’s pretty much in the name. And we use that information to answer a set of questions about data processing infrastructure. So given a bunch of code, the behavior of the system is defined by that code.

So, it is possible, for instance, if you could trace where every piece [00:02:00] of data comes from and goes to, to say, “Am I, for instance, GDPR compliant based on code analysis?” It’s possible to find out where something comes from, where something goes to, anything about this code, just by doing the code analysis. And that’s what we do at CompilerWorks. And we’ve been doing two things with this. We’ve been doing, we call it the lineage product, but really it’s the, what do I need to know about this code that ran? If you are a person running an infrastructure [00:02:30] where there’s code pulling data out of tables and pushing it into other tables and moving it across the network, and you for instance want to maintain a piece of code. Who uses it? Who relies on it? Who touches it? If I touch this thing, will I break anything in production? Am I changing the definition of a number that’s used at a sales dashboard? What is going on with this code? And that’s the sort of question and answer at a distance that makes one’s life a lot easier.

The other thing that we’ve been doing a lot of in this organization is doing machine translation [00:03:00] between languages. Taking in code in one language and emitting it to another and actually going platform to platform rather than just language to language, which can be a little bit academic. And so, there’s a bunch of questions. We’ve got all the motivating questions up on the slide in front of us. Let’s see what we end up with from this. We take in, in this particular instantiation, let’s say a database log or a set of database logs from different languages and we produce a lineage model, which has [00:03:30] an awful lot of detail in it. And that detail is accessible. We’ll jump into a demo in a couple of minutes and we’ll start to see the detail.

Underlying this, it’s sort of an interesting question as to what sort of company we are. Are we a compilers company? Yes. But compiling code for execution is a different game to compiling code for static analysis. I think that’s one of the things that one realizes about a lot of database products [00:04:00] (and we’ve got 20-something, 30-something compilers in this suite; not all of them are listed here, just some significant ones): compiling for execution is actually quite different to compiling for static analysis. Because if you’re compiling for execution, all you have to do is produce an execution plan, preferably a good one. But the process of compiling for an execution plan actually throws away an awful lot of semantic information that you need for static analysis.

And so, I think over the first couple of years of running this company, we [00:04:30] learned that you actually have to write quite a high-level algebra engine in order to be able to answer this sort of question. So we have the compilers, we’ve got the algebra engine, and the emitters then get asked one of two questions. The lineage fabric gets asked: what do I need to know about this code and about this infrastructure? And the code emitters get asked: what code do I need to emit in order to do the same business process on a target platform? [00:05:00] Neither of those questions is about how do I run this code? How do I port this code? Where does this go from there? We’ve kept the questions that we need to answer very, very high level here.
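
To make that distinction concrete, here is a minimal sketch, in Python, of what keeping a high-level, provenance-preserving algebra might look like. The node types, tables, and column names are invented for illustration; this is not CompilerWorks’ internal model, just a picture of why a representation that remembers which inputs feed each output can answer questions an execution plan cannot.

```python
# Minimal sketch of an algebra-style IR that keeps column-level provenance.
# Node names, tables, and columns are illustrative, not CompilerWorks' model.
from dataclasses import dataclass, field

@dataclass
class Scan:
    table: str
    columns: list[str]

    def provenance(self, column: str) -> set[tuple[str, str]]:
        # A scanned column traces directly back to the physical table.
        return {(self.table, column)}

@dataclass
class Project:
    child: "Scan | Project"
    # output column -> input columns it is computed from
    exprs: dict[str, list[str]] = field(default_factory=dict)

    def provenance(self, column: str) -> set[tuple[str, str]]:
        # An execution plan only needs to evaluate the expression; for static
        # analysis we keep the mapping back to every contributing input column.
        sources: set[tuple[str, str]] = set()
        for input_col in self.exprs.get(column, []):
            sources |= self.child.provenance(input_col)
        return sources

# Roughly: SELECT clicks + impressions AS total_activity FROM ad_events
plan = Project(
    child=Scan("ad_events", ["clicks", "impressions"]),
    exprs={"total_activity": ["clicks", "impressions"]},
)
print(plan.provenance("total_activity"))  # both source columns of ad_events
```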

We have three guiding principles in our product suite as we go through. The first one is accuracy, because there is no point whatsoever in doing any of this unless you can say that the answer you’ve given is correct and complete. A proof of concept for something like this says, “I’m going to give you an infrastructure, [00:05:30] and you’re going to analyze this infrastructure, and then I’m going to examine you on what it was that you learned about this infrastructure.” I’m going to ask you to validate that you managed to prove three facts that I knew were true about the infrastructure, and then you’re going to tell me something that surprises me. And that’s what a proof of concept looks like for CompilerWorks. And we go through that. And really, I’d encourage you to be brutal questioners. I haven’t seen anything in the chat yet, but please go ahead.

The really good proofs of concept are the ones where they come in and they say, “I don’t think you’re going to succeed [00:06:00] because you can’t show me how this thing up here affects that thing down there.” And we say, “Yes, we can. Here it is.” And that’s the passed proof of concept. So yes, you have to be correct and complete. The motivating thesis for correctness and completeness is that if you’re doing something like a GDPR or a CCPA analysis, there is a real risk of a significant fine or going to jail if you get it wrong. What product or what analysis do you want to run that’s going to keep you out of jail? And our job [00:06:30] here is keeping people out of jail.

Performance is the second one because we ended up dealing with very large infrastructures. You get, in some cases up to terabytes of log files that you have to analyze and you need to produce an answer from that in a small number of hours. And then usability comes in third because accuracy is the first thing.

What do we get out of it? We get a column-level or attribute-level model of everything that affected everything else anywhere in the infrastructure, and details on how it [00:07:00] affected it. And I’m going to be careful, I’m going to get out of my slide order here. So, although we’ve reduced it for the rendering, the amount of data that we have underneath (and we get to play with that somewhat in the demo) is absolutely vast. We can say what happened, where it happened, when it happened, and how it affected it, even at a distance. And the real joy is when something… There’s little red markers in here that say, “Not just that something’s unused, but that something is possibly eventually, or [00:07:30] transitively unused.” Because finding out that a column is unused is fairly easy. You can say this column never appears as a reference. But there’s a column here which is marked unused later, which means it is used, but the things that used it were never themselves used.
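
As a rough illustration of that distinction, here is a small sketch with an invented column graph (not real CompilerWorks output) showing how “unused” differs from “unused later”: a column can be referenced, yet nothing that reads it ever reaches a terminal consumer.

```python
# Rough sketch of "unused" vs "unused later"; graph and names are made up.
from collections import defaultdict

# column -> columns that read from it (direct downstream references)
reads = defaultdict(set, {
    "src.cost":            {"staging.cost"},
    "staging.cost":        {"report.cost_total"},    # a real consumer
    "src.legacy_flag":     {"staging.legacy_flag"},  # referenced once...
    "staging.legacy_flag": set(),                    # ...but never read again
    "src.scratch":         set(),                    # never referenced at all
})

TERMINAL_CONSUMERS = {"report.cost_total"}  # e.g. a dashboard column

def eventually_consumed(column: str, seen=None) -> bool:
    """True if some transitive reader of `column` is a terminal consumer."""
    seen = seen or set()
    if column in TERMINAL_CONSUMERS:
        return True
    for reader in reads[column] - seen:
        if eventually_consumed(reader, seen | {column}):
            return True
    return False

for col in ["src.cost", "src.legacy_flag", "src.scratch"]:
    if not reads[col]:
        status = "unused"
    elif not eventually_consumed(col):
        status = "unused later (transitively unused)"
    else:
        status = "used"
    print(col, "->", status)
```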

We have a question about what languages we deal with. Yes, we handle a whole suite of languages. So, the question is: what languages can we reverse engineer, and do we do proprietary ETL tools? [00:08:00] We do a lot of languages. As I said, there’s sort of 20 or something of them. We’ve done all the major databases, so you’re talking Oracle, Teradata, Vertica, Postgres; all the open source ones, so the Hives, the Prestos, the MySQLs; the clouds, so the Snowflakes, the BigQuerys, the Redshifts; and then a bunch of other tools. So we do a lot of work with Informatica, for instance, which is a complete pain in a lot of people’s backsides.

And one of the motivating factors for this is that when you’re working with a proprietary enterprise tool, [00:08:30] there is no incentive on the vendor to tell you certain things about your code. One of the touchstones is that it is not in Oracle’s interest to help you use less Oracle. The cloud vendors are changing this a little bit. If you want to know about DataStage, you should write to Viken Eldimer, our CRO, and he will put you on the beta for DataStage. [00:09:00] It’s happening and I did not announce that.

Anyway, you’re welcome. So, let’s keep going. So, yeah, actually we recently shipped a new GUI, which is a lot cleaner and works a lot better interactively. But I’m also going to dig down into the point that, because we have the complete and correct understanding, we expose all of this data via APIs, which makes it very easy to integrate with other products. [00:09:30] So, there are infrastructures where people want to say, I want the following thing never to be true, or I want the following thing always to be true. Well, that’s an API query. You could just monitor, and you could say, “This data must never affect that data,” or, “All data which is tagged critical must never have a latency or a behavior or whatever of more than the following.” And that’s why we expose the entire lineage model, with all of its painful detail, via GraphQL [00:10:00] APIs.
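
Purely as a hedged illustration of that kind of monitoring, here is what an invariant check against a lineage GraphQL endpoint might look like. The endpoint URL, query shape, and field names are hypothetical, not the actual CompilerWorks schema.

```python
# Hedged sketch of monitoring an invariant through a lineage GraphQL API.
# The endpoint URL, query shape, and field names are hypothetical; consult
# the actual GraphQL schema for the real ones.
import json
import urllib.request

ENDPOINT = "http://lineage.example.internal/graphql"  # placeholder

# Hypothetical query: does anything flow from users.email into the public export?
QUERY = """
query {
  paths(fromColumn: "crm.users.email", toTable: "exports.public_dump") {
    length
  }
}
"""

def check_invariant() -> bool:
    body = json.dumps({"query": QUERY}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # Invariant: "this data must never affect that data", i.e. no paths exist.
    return len(result["data"]["paths"]) == 0

if __name__ == "__main__":
    print("invariant holds" if check_invariant()
          else "VIOLATION: sensitive data reaches the export")
```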

One of the other points that goes to the enterprise question is that it is end to end. You haven’t done lineage correctly and completely unless you can start with an interaction in the operational data store and push it all the way through to the dashboard. And this is where we’re actually a very open-format platform. One of the things you see is that we have done all of the reverse engineering for all of the languages. This isn’t a “build it and they will come” situation. This is a “we’ve [00:10:30] done the hard work” situation. You give us Informatica, we’re there. But people also do custom things. People do machine learning things where they just don’t tell you what’s happening on the GPUs, and we can integrate that.

And so, when you’re looking at a test for a full CompilerWorks installation, it’s: it started in the operational data store, it shows up in Tableau, and I can show the absolute path for every attribute, whether it was sensitive, and what affected it. And then we’ll get into questions such as: when this attribute breaks, or when the data in this [00:11:00] attribute is missing or wrong, who do I need to talk to? What are the potential places for this error to be introduced? That’s the sort of level of detail that we have.

I referred briefly to the fact that we also do code migration. So there is a level of understanding that we have of the code which is so much deeper than just parsing FROM clauses. And that enables us to answer all of these additional questions. If you have a [00:11:30] function that computes a checksum or something like that, that’s not a GDPR-sensitive transformation. In order to be GDPR sensitive, it has to be, based on your organization’s definition of sensitivity, either an exact or a partial copy, which could be a substring.

If you’re going to be dealing with the introduction of an error (my data is missing, something’s gone wrong), then you have to be able to look at data derivations or effects. If you’ve got wrong cardinality, then you have to be able to look at [00:12:00] filters. And the ability to compile the code into the algebra model and detect all of this detail is what makes this product work. I’m going fairly high level because I haven’t been driven down into any of the technology by a presumably very technical audience, but you’re absolutely welcome and encouraged to do so.
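
To illustrate the kind of distinction being drawn, here is a deliberately crude sketch of classifying column-level transformations. The categories and the purely textual matching are invented for the example; the real analysis works on the compiled algebra, not on strings.

```python
# Illustrative (not CompilerWorks') classification of transformations, of the
# kind a policy might use to decide whether sensitivity propagates.
from enum import Enum, auto

class Kind(Enum):
    EXACT_COPY = auto()    # sensitivity clearly propagates
    PARTIAL_COPY = auto()  # e.g. a substring; usually still propagates
    DERIVATION = auto()    # arithmetic/aggregation; useful when tracing wrong values
    FILTER = auto()        # affects cardinality rather than content
    OPAQUE = auto()        # e.g. a checksum; typically not treated as a copy

def classify(expr: str) -> Kind:
    """Very rough, purely textual stand-in for real expression analysis."""
    expr = expr.lower()
    if expr.startswith(("md5(", "sha", "checksum(")):
        return Kind.OPAQUE
    if expr.startswith(("substring(", "substr(", "left(", "right(")):
        return Kind.PARTIAL_COPY
    if any(op in expr for op in ("+", "-", "*", "/", "sum(", "avg(", "count(")):
        return Kind.DERIVATION
    if expr.startswith("where "):
        return Kind.FILTER
    return Kind.EXACT_COPY

for e in ["email", "substring(email, 1, 3)", "md5(email)",
          "sum(cost)", "where country = 'DE'"]:
    print(f"{e:28s} -> {classify(e).name}")
```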

Deployment looks roughly like this. You have a database, and there is a dumper which pulls metadata out of the database. Metadata [00:12:30] only; we never touch data. We have a very different security property to many other things that want to connect to the database. Firstly, we need metadata only. And secondly, everything we do will run sandboxed and clean. And then we take the code, which might be log files, it might be exports, it might be something else. We feed that through the transpiler and we end up with a lineage database. And without any particular further ado, I am going to…
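
As a sketch only, the flow described above might be wired together like this. Every function and file name here is a placeholder, not the real CompilerWorks tooling or commands.

```python
# Structural sketch of the deployment flow: metadata dump + query logs go
# through the transpiler and come out as a lineage database. Names are made up.
from pathlib import Path

def dump_metadata(connection_string: str, out: Path) -> Path:
    # Pull table/column definitions only; no row data ever leaves the database.
    # (Placeholder body; the real dumper is a CompilerWorks tool.)
    return out

def collect_query_logs(log_source: str, out: Path) -> Path:
    # Gather the code that actually ran: query logs, exports, scheduled SQL, etc.
    return out

def run_transpiler(metadata: Path, code: Path, lineage_db: Path) -> Path:
    # Compile the code against the metadata and emit a lineage database.
    return lineage_db

if __name__ == "__main__":
    work = Path("/tmp/lineage_run")  # placeholder location
    meta = dump_metadata("redshift://analytics", work / "metadata.json")
    code = collect_query_logs("s3://audit-logs/2021/07/", work / "queries.sql")
    run_transpiler(meta, code, work / "lineage.db")
```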

[00:13:00] I do not know the definition of the term PROV-O. Somebody has asked, can the lineage tool generate PROV-O output? I will comment very briefly, because I actually don’t know the definition of that format, but I will say that we have a GraphQL data structure, and you can think of GraphQL as being a structural query interface for an arbitrarily complex data structure. What that means is [00:13:30] it basically lets you look at the lineage graph as if it were infinitely large and infinitely detailed, and you can query any sub-part of that. Standards tend to follow implementation.

So, if you read, for instance, the SQL standard, the SQL standard looks awfully like Oracle, because guess who was spending money to promote the SQL standard at the time it was being written? And I don’t think anybody is… in fact, I know nobody has ever done an implementation that can extract this level of detail [00:14:00] from code in order to produce this lineage model. And therefore, it isn’t really a surprise that there isn’t any existing standard that’s really able to talk about this level of detail. I say this in ignorance of PROV-O, which is a W3C standard, but I would be surprised to hear that there was any standard that was able to represent this level of detail. And so, our recommendation would be to write a GraphQL query, which will give you a JSON model, and then convert that JSON model down into the target format. [00:14:30] But I think you’ll find that that’s actually lossy relative to our GraphQL model, just because we have so much detail.
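
For example, assuming a hypothetical response shape, a GraphQL result could be flattened into PROV-style triples roughly like this. The field names and the choice of prov:wasDerivedFrom as the predicate are assumptions, not a specified mapping.

```python
# Hedged sketch: flatten a (hypothetical) GraphQL lineage response into
# PROV-style triples. Both the response shape and the predicate choice are
# assumptions for illustration, not a spec mapping.
lineage_response = {
    "data": {
        "column": {
            "name": "mart.ad_campaign_fact.total_clicks",
            "derivedFrom": [
                {"name": "staging.web_sessions_fact.clicks"},
                {"name": "ods.hubspot_events.click_count"},
            ],
        }
    }
}

def to_triples(response: dict) -> list[tuple[str, str, str]]:
    col = response["data"]["column"]
    target = col["name"]
    return [(target, "prov:wasDerivedFrom", src["name"])
            for src in col["derivedFrom"]]

for triple in to_triples(lineage_response):
    print(triple)
```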

So, where were we? Yeah. So we were actually going to go for the dangerous thing, which is trying to change the presentation tab. So let us go for this one. Let’s have a look at some lineage. So this is the front page of the lineage [00:15:00] exploration interface. And let’s start with a table, so let’s start with the ad campaign performance fact. So, let’s start there. So we have reverse engineered this model… basically, from the log files. We can see where this table came from, we can see where it went to. This one is a terminal table. [00:15:30] Now, I can start pulling up details and we can say, “Okay, let’s have a look around and have a look at other tables.” I’m just going to reload this one.

Okay. If I click on there, I get this sidebar up. And now we start to get interesting things because I’ve got a list of columns here in this sidebar. And as I move my mouse over this, what you can start to see is that the places where this individual column came from in the upstream path are highlighted [00:16:00] through the main diagram. So, I can see that, for instance, total clicks came from the web sessions fact, which came from interweb-something (via Stitch), which came from interweb-something-initial, and all the way down.

I can look at the detail for any particular column. And here you can see that when I go into column level, we’re still working on the old graph [inaudible 00:16:20], which we’re currently looking at replacing through the product. But now what I’m looking at is the exact detail of a single attribute as it came all the way through the infrastructure [00:16:30] from the initial. And here we’ve got the ingest of the operational data store in HubSpot, all the way through the infrastructure. I can see that it was aggregated, that it was joined or unioned on several things, and that it then came out into this target column. In fact, if I look at this, I can start to say, “What was the exact code?” I can see here that three columns went into one. I can say, “What was the exact code that joined these three together?” In fact, that’s the [inaudible 00:16:57], this is the right. And so now I can see [00:17:00] that this is a, I’ve got three CTEs and a select. So I’ve got a union there.

Let me go back into data flow and we’ll start to ask some other questions about some of these tables. So, there’s a lot of detail here. One of the things that I could do is I can actually zoom out to the business level and I can just say condense this to the business level only. And I can see the dotted lines where it says “There was more complexity here, but these are the basic [00:17:30] properties of what it is that you want to see.” I can say, “How did the web sessions fact affect the ad campaign fact?” And so, I can actually go into this edge and I can say, “Give me the detail of it.” And it says, this was a data transfer and there was an aggregation performed on this edge. I can say, “Well, show me, what was this data transfer? What was the aggregation performed on the edge?” And then I’ve got the information here and I can start to see there… Okay, so we’ve got the aggregation and it’s been highlighted here so I can see exactly what it is.

Now, I’m going [00:18:00] to jump out of this a little bit and I’m just going to go into some labeling. So I’ve labeled this table as a dashboard. And so, now if I go back to my table and I look at the data flow, and I look at some table which is upstream of this. So, let’s pick this table here. I can start to ask questions like: which columns of this table affect the dashboard table? Now, this is your key for a developer who is looking to perform maintenance on this table. This [00:18:30] is propagated information: having labeled my critical table as a critical dashboard, I now know that I can safely maintain the conversions column without affecting business-critical data, but I cannot safely maintain the cost column without being aware that this affects the production dashboard. And I can say, “How does it affect the production dashboard?” And there’s my answer. And so, what you can see now is that it’s possible to take a large and complex data infrastructure and perform [00:19:00] maintenance in it safely without having an unintended consequence or breaking somebody that you didn’t intend to break.
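
A minimal sketch of that “can I safely touch this column?” check, with invented tables, columns, and edges, might look like the following.

```python
# Sketch of the "safe to maintain?" check described above; the graph is invented.
from collections import deque

# column -> columns it feeds directly
feeds = {
    "ads.cost":                ["mart.campaign_fact.cost"],
    "ads.conversions":         ["scratch.tmp_conversions"],  # dead-ends below
    "mart.campaign_fact.cost": ["dashboard.spend_report.total_cost"],
    "scratch.tmp_conversions": [],
    "dashboard.spend_report.total_cost": [],
}

DASHBOARD_COLUMNS = {"dashboard.spend_report.total_cost"}  # user-applied label

def reaches_dashboard(column: str) -> bool:
    queue, seen = deque([column]), set()
    while queue:
        col = queue.popleft()
        if col in DASHBOARD_COLUMNS:
            return True
        if col in seen:
            continue
        seen.add(col)
        queue.extend(feeds.get(col, []))
    return False

for col in ["ads.conversions", "ads.cost"]:
    verdict = "affects the dashboard" if reaches_dashboard(col) else "safe to maintain"
    print(f"{col}: {verdict}")
```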

Now we’ve exposed the data at relation level. We’ve exposed it at column level. We can also start to look at this thing at pipeline level, and we can say: what pipelines were affecting what data? In fact, I’m going to go into this a different way. I’m going to go back to the table that I was thinking of maintaining [00:19:30] and I am going to say, “Well, who do I affect?” Let’s say that I want to maintain this table. I can see the downstream image. I can now start to see… Let’s have downstreams, downstreams all. There we go. I can now start to see, if I maintain this table, who am I going to affect? Who used it? There’s a set of users. When did they use it?

And this is all of the information that you need to know in order to safely and reliably maintain this table without [00:20:00] unintentionally affecting anyone. I can say: who wrote it? Who were the users? The last user of this table was Chris. These are the people who are going to be affected at that target if I do something to edit this table. And then I can go and call them and figure out what they were doing. But to be honest, I don’t even need to do that, because I can say, “Okay, what was Brooke doing with this table?” And I can find it out.

There is a question about integration into the environment. Does anybody want to do any more questions on the [00:20:30] demo before I flip back to the presentation on… Oh, okay. So, pulling in the question from the questions and answers: is this literally a code compiler for the code written to build ETL pipelines? Yes. Yes, it is. Yes. Three-letter answer. One word: yes, it is. This is not a framework. We write compilers. It’s in the name, and you are welcome to quiz me on the writing of compilers, on any algorithm, in any system, down to the nasty details [00:21:00] of what implicit casts are performed by Teradata when comparing strings and integers. Just go into the corners and we will prove to you that, yes, we write compilers. This is the real thing. I’ll take other questions on how to find things out, just because there’s so much in here that it’s difficult for me to cover everything that we’ve got here in one round. But you basically have all of the answers to your questions.

[00:21:30] As code changes, how is the lineage kept up to date? So you pull a new log file and you run it through the process that we ran through before. So the transpiler is our core product, and it turns code into the lineage model. And this web interface that I’m showing you is just one of the ways of viewing the lineage model. The other interesting aspect is that one of the things I showed was that there is [00:22:00] the GraphQL interface, which basically lets you query for anything and access the full underlying API. Everything that we’ve shown in the user interface is accessible through GraphQL. There is nothing in here which is purely user interface. Every single analysis that we’ve done there is available to you through the APIs. So, yes, the answer is: every day, or every hour, or however frequently you want to do it, you [00:22:30] pull down a fragment of log file. You run it through the transpiler. That will generate you a new fragment of lineage database and you then add that to the lineage model.
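
As a sketch of the incremental update, assuming an invented fragment shape (the real lineage database is far richer), merging hourly fragments could look like this.

```python
# Sketch of the incremental update: each log fragment compiles to a lineage
# fragment, and fragments are merged into the running model. Shapes are invented.
def merge(model: dict, fragment: dict) -> dict:
    model.setdefault("columns", set()).update(fragment.get("columns", set()))
    model.setdefault("edges", set()).update(fragment.get("edges", set()))
    return model

model = {"columns": set(), "edges": set()}

# e.g. one fragment per hour of query log, as produced by the transpiler
hourly_fragments = [
    {"columns": {"a.x", "b.x"}, "edges": {("a.x", "b.x")}},
    {"columns": {"b.x", "c.x"}, "edges": {("b.x", "c.x")}},
]

for fragment in hourly_fragments:
    model = merge(model, fragment)

print(len(model["columns"]), "columns,", len(model["edges"]), "edges")
```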

People in the queue, do you want to type your question or do you want to ask it verbally? Please, if you are in the queue, type the question as well, because it helps us get through a bunch. Brett, are you coming in to moderate?

Brett:    Yeah. Well, I was going to say, so if anyone has a question now, do we want to take a quick pause for questions to be answered over [00:23:00] audio real quick?

Shevek:    I’ll happily take that.

Brett:    Yep. So I’m going to start with the first one. Again, when I bring you in just share your audio and your video, and ask your question. So, Todd, you are first.

Todd:    Hello. I’m actually just relaying a question that we had earlier today and that was around integrations with data catalogs. And I think it actually plays a little bit into the GraphQL. So Shevek, [00:23:30] your approach to data catalogs, please.

Shevek:    Yes, I’m missing a slide here that shows the publication, but what we have is a process where the front end pulls in the log files and all of the code, and then the backend pushes out the data that we have in several different formats. One of them is a bulk format in Protobuf. One of them is a bulk format as a SQL database. One of them is the GraphQL API and one of them is the live user interface that you see here. For [00:24:00] small to medium-sized integrations, the GraphQL API is the very best way to integrate because, as with anything, I think the reason GraphQL has taken over the world is that you get to say what data you want and you get it back in that particular structure. If you have more than several gigabytes of log files per day, then you might want to consider doing integration using either the Protobuf data format or the SQL database data format.

Todd:    Thank you.

Shevek:    Next question, [00:24:30] please.

Brett:    Okay. We’re promoting one more to come in. He should be joining in a second. Again, if you have any additional questions, please put them in the chat and we can ping pong back and forth from those that are in the video queue to those that are asking on chat.

Shevek:    And I’m going to pick off two questions that sort of went together in the chat. Is this a framework or is this a real compiler? Yeah, this is a real compiler. How do you integrate it? Part of the reason that I put this slide up, [00:25:00] and insisted as a tech lead that we put the real commands in the slide, is that if you install our product and you type those three commands, you get from a raw log file to what I’m showing in the GUI. That’s how real the slide is. There was a question coming in.

Brett:    Yeah, there were two people in the queue that are not showing up. So, the last one, [Sheesh 00:25:21], I’m going to promote you now and please ask your question. We’ll just give it a couple seconds. If not, [00:25:30] we can continue on.

Shevek:    Some people may be fighting with microphone or camera permissions because I think it asks them at a very inconvenient moment, doesn’t it? So there was a question from [Kevin Leeeldo 00:25:44] in the channel about integration into one’s environment. I switched back to this presentation because it really starts to answer it. I will say that the different enterprise systems expose [00:26:00] their information in different ways. So for example, if you’re working with Tableau, you run a little command that dumps all the XMLs out of the repository. If you’re working with DataStage, you can export as DSX or XML. If you’re working with Informatica, you export XMLs. The databases differ. So, for instance, if you’re working with Redshift, then you pull audit log files from S3, or you pull from the STL query tables.

If you’re working with something like Oracle that has [00:26:30] persistent code, then you do the appropriate metadata exports. So what I’m doing here now is really going into the nitty-gritty of how you integrate with your environment. The answer is that for every product there is a way, some of them natural, some of them unnatural; some of them from the log files, some of them from persistent state, where you get the necessary information out of the environment, and CompilerWorks is then a suite of compilers which absorbs those native formats, [00:27:00] whatever they are. And the objective of CompilerWorks is to absorb that format basically as-is, where-is.
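
As one hedged example of getting code out of a source system, query text can be reassembled from Redshift’s STL system tables roughly as below; check the syntax against your Redshift version, and note that other platforms need entirely different dumps (Tableau repository XMLs, DataStage DSX exports, Informatica XMLs, and so on).

```python
# Reassemble full query text from Redshift's STL_QUERYTEXT, which stores
# statements in 200-character chunks keyed by query id and sequence number.
# This is one plausible extraction query; the dumps actually ingested may be
# produced differently (e.g. audit log files pulled from S3).
REDSHIFT_QUERY_TEXT_SQL = """
SELECT query,
       LISTAGG(text) WITHIN GROUP (ORDER BY sequence) AS query_text
FROM stl_querytext
GROUP BY query
ORDER BY query;
"""

if __name__ == "__main__":
    # Hand this statement to whatever export job ships query logs downstream.
    print(REDSHIFT_QUERY_TEXT_SQL)
```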

Brett:    Okay. So, we’ve got about two minutes left before the end of the session. I’m just going to say, are there any additional questions that anyone wants to ask in the chat?

Shevek:    I was going to say, if we were light on questions, I was just going to flip back to the demo and push buttons until something breaks.

Brett:    Please do. Please. Yeah, we’ve got two minutes.

Shevek:    Ask the real questions. If [00:27:30] you’re skeptical or you think this isn’t what it looks like it might be, ask the question that persuades you that it is, because I’m going to be absolutely upfront with you. And nobody’s really asked, “So how do compilers work?”, which is always entertaining.

Brett:    And there is a reminder that you can still ask questions after this session. You can go into the Subsurface Slack workspace, and I put a reminder in the chat, or you can obviously reach out to Shevek directly and ask questions there as well. So, please [00:28:00] feel free to do that. Again, we do have about one minute left.

Shevek:    Yeah. So, we’ve also got insights, and insights is one of the areas where I haven’t actually plugged anything into this to say something is critical, but we saw the propagation. So, I can start to go into something like the companies table. And I can start to say: this table is GDPR sensitive. And at the point where I have that sort of marker in there, I can then start to go into insights and it will automatically tell me whether anything which is GDPR sensitive [00:28:30] has been published outside the organization, because a propagation downwards of the fact that something is GDPR sensitive will meet a propagation upwards saying that something has been published outside the organization.
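
A small sketch of that meeting of propagations, with an invented graph and labels, might look like this: the forward closure of everything marked sensitive intersected with the backward closure of everything published externally.

```python
# Sketch of the insight described above: a downward propagation of
# "GDPR sensitive" meeting an upward propagation of "published externally".
# Graph and labels are invented for illustration.
SENSITIVE_SOURCES = {"crm.contacts.email"}
EXTERNAL_SINKS = {"exports.partner_feed.contact_email"}

# directed column-level lineage edges (source, target)
edges = {
    ("crm.contacts.email", "staging.contacts.email"),
    ("staging.contacts.email", "exports.partner_feed.contact_email"),
    ("staging.contacts.created_at", "mart.signup_fact.signup_date"),
}

def closure(start: set[str], step) -> set[str]:
    seen, frontier = set(start), set(start)
    while frontier:
        frontier = {n for col in frontier for n in step(col)} - seen
        seen |= frontier
    return seen

downstream_of_sensitive = closure(
    SENSITIVE_SOURCES, lambda c: {t for s, t in edges if s == c})
upstream_of_external = closure(
    EXTERNAL_SINKS, lambda c: {s for s, t in edges if t == c})

# Any overlap means something GDPR sensitive is being published externally.
violations = downstream_of_sensitive & upstream_of_external
print("violations:", violations or "none")
```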

And now your CDO just needs to keep an eye on this dashboard and see that all of these numbers are zero. If any of them is not zero, they get the explanation of why it happened, of which there are unfortunately none in this dataset because I’m shockingly underprepared. And this really is just the answer that you need to know. And we just need to lean on the fact [00:29:00] that it needs to be correct and it needs to be complete for this to be reliable, and that is what our business is. We’re called CompilerWorks because we sit here and we write the compilers, and we do the hard work. Brett, I think that’s time. Thank you.

Brett:    You hit the nail on the head with the time. This was a great session. This was really informative. I loved it. I had a blast. I always love seeing demos. So, thank you so much for your time. Again, just a reminder to everyone: please rate the session; in the top right there’s a tab for Slido. [00:29:30] There are questions there to rate the session and to provide feedback. But Shevek, thank you so much for your time and for your support, and have a wonderful day, and everyone have a great rest of Subsurface Summer 2021.

Shevek:    Thank you very much indeed. You are a great crew.