Root Cause Analysis for Your Data Lake

Session Abstract

From null values and duplicate rows, to modeling errors and schema changes, data pipelines can break for millions of reasons. And once “data downtime” happens, we need to know what caused it so that we can fix it – fast.

It’s one thing to talk about root cause analysis in concept, but what does it look like in practice? In this talk, we pull back the curtain on how some of the best data teams are tackling data downtime across their data lake by walking through how to root cause a real-life incident across three main channels: your code, your operational environment, and the data itself.

Video Transcript

Barr Moses:    Hi, everyone. Great to be here. Super excited about this. I’m really stoked to be chatting with everyone about how to do Root Cause Analysis for Your Data Lake. It's something that probably all of you have experienced, and you're asking yourself, can we do this in a better way? There must be a better way. Really stoked to share with you some of our thoughts on this topic. With me I have Lior Gavish. [00:00:30] I’ll introduce myself briefly and then hand over to him. I’m the CEO and co-founder of Monte Carlo. We’re a company that helps companies adopt data and become data-driven by minimizing what we call data downtime. I’ll explain a little bit more about that later. Prior to this, we’ve been working with data teams for the last decade or so. And a fun fact about me, if you all don’t already know: I’m a huge Bruce Willis fan, so any Die Hard movie is my favorite. [00:01:00] And with that, I’ll have Lior introduce himself as well.

Lior Gavish:    Hi everyone. I’m Lior, Barr’s co-founder and CTO at Monte Carlo. Before Monte Carlo, I used to build machine learning-based systems for fraud detection at Barracuda, where I ran the engineering team. A fun fact about me is that I love Formula One. I do not own a McLaren, but I am a huge McLaren fan, and I'm excited to be here today.

Barr Moses:    Awesome. [00:01:30] Rumor has it that Monte Carlo, the company, is named after Lior’s love for Formula One, but we’re also named after the simulation. Cool, so let’s dive right into it. The problem that we’re going to talk about today is what we call the good pipelines, bad data problem. As an industry, across the board, we’ve all invested a ton in setting up the best infrastructure that we can, investing in [00:02:00] the best data lake, data warehouse, BI, and machine learning models. And yet a problem that comes up time and again is that the data itself, the data powering our digital products and the decisions that we’re making, is often faulty or wrong. That problem has been around for a while, and we haven’t been able to solve it yet.

We think there’s a better way to approach that, and we’ll share a little bit today about what we mean by that. But let me actually start with something that might [00:02:30] be familiar to all of you, so next slide here, please. If you’re in the data industry, you can probably identify with one of these personas, whether you spend most of your time in spreadsheets like Rebecca, whether, like Engineering Emerson, you work with Dremio and other technologies, or maybe you’re Manager Michelle, looking at reports and dashboards [00:03:00] in Tableau all day. Regardless of who you are, or what persona you are, maybe a different one from these, you probably encounter the problem of data downtime. I personally encountered this when I was leading a team back in 2016, when we were responsible for our internal data.

This was at Gainsight, and we actually called the system Gainsight on Gainsight, or Gong for short. We were running a lot of [00:03:30] internal reporting on our customer behavior. And one of the things that came up time and again was that the data that we were looking at was actually incorrect. And there were all these different changes in the system that resulted in the data being incorrect. It just felt like we were playing this game of whack-a-mole. We would wake up on Monday morning and just have this flood of emails and requests: why are the numbers in this report wrong, why is this report suddenly all NULL values, and why [00:04:00] did a customer suddenly notice that a number somewhere else doesn’t make sense. It just felt like there were all these questions that we were getting.

And it took us a long time to figure out what the root cause was for each of these problems, and to figure out whether we could prevent them to begin with. It took a lot of time and energy. And actually, I think this problem is one that’s being felt by everyone, but also one that’s being exacerbated. Next slide, please. The reason why it’s being exacerbated is because we’re seeing, [00:04:30] I would say, three major changes in the data industry. One is that we're relying more and more on data to power mission-critical products and decisions, and data is becoming more front and center. The second is that the way that we use data, and how we build our pipelines, has become way more complex.

And finally, the last big change is that we have way more people and way more teams working with data. [00:05:00] We have data scientists, data engineers, machine learning engineers, product analysts, product managers, data product managers; a lot more people are actually both producing and consuming data. And that has led to the rise of some new architectures, like for example data mesh, which has been taking the industry by storm, and it raises questions like: how do we best manage data? Who’s responsible for making sure the data is accurate? And when data is down, who’s responsible for making sure that you can identify the root [00:05:30] cause and prevent it? Next slide, please.

I’ve mentioned the concept of data downtime a couple of times, so I’ll just formally define it here. Data downtime is a concept that we’ve coined as a corollary to application downtime. In the past, a couple of decades ago, if your app was down it maybe wasn’t a big deal, but today [00:06:00] it's unheard of for your app to be down; the entire company, or at least the folks responsible for it, will really be all over it. We measure five nines of availability, and we’ve really come a long way in terms of application downtime. We think the same thing is happening to data, so if your data was wrong a little bit here and there a few years ago, maybe you could get away with it, but today that really just doesn’t cut it anymore.

And [00:06:30] so we think that by introducing language for this problem, calling it data downtime, and actually measuring data downtime in a thoughtful way, we’ll be able to address this better as an industry and get ahead of it. I actually think this is perhaps one of the most important problems that we have to tackle as an industry. And with that in mind, I’m going to hand over to Lior, who will walk you through some novel ideas for how to work through data downtime issues and how to root cause them, particularly when they’re [00:07:00] in your data lake. And hopefully you all can leave with a few tips and tricks that you can implement with your infrastructure tomorrow.

Lior Gavish:    Hi everyone. What we will focus on today is how to think about Root Cause Analysis for Your Data Lake. Like Barr mentioned, data downtime is something we’ve probably all experienced; maybe you’ve gotten angry [00:07:30] Slack messages or emails about a dashboard that looks wrong, or a machine learning model that doesn’t perform as well as you expected. Root cause analysis is perhaps what you’ll do right after you’ve learned that there’s an issue, right? It’s the process by which you find out what the issue was, figure out how to fix it, and ideally also learn for the future how you could avoid [00:08:00] similar problems, right?

And the concept of root cause analysis originates, or at least we’re taking inspiration, from DevOps and site reliability engineering. Like Barr mentioned, application and infrastructure downtime is a problem that’s been dealt with for a long time now by software engineering teams. And broadly speaking, the process of root cause analysis is where you start looking at the problem, [00:08:30] coming up with hypotheses about what might have caused it, breaking it down, and going hypothesis by hypothesis, trying to either validate or disprove each one using information that you have about the system.

For example, a very simple example: if you have a certain dashboard that relies on a certain table in your data lake, and the dashboard [00:09:00] seems to be out of date or stale, maybe your first hypothesis might be that your ETL did not run and did not update this table today, and therefore the data is wrong. If that’s your hypothesis, you might go and look at your ETL logs to find out whether that was the case, and quickly learn either that this was indeed the problem or that you have to look somewhere else. And this process, as I mentioned, serves two [00:09:30] purposes: it allows you to understand how to fix the problem, maybe you need to rerun your ETL, and it also helps you understand how to avoid the problem in the future, because maybe you will learn how to make your processes and systems better, so ETLs don’t fail.
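
As an illustration of the hypothesis-testing loop Lior describes, here is a minimal Python sketch of checking the "my ETL did not run" hypothesis with a freshness check. The table name and the fetch_max_loaded_at helper are hypothetical stand-ins for a query against your own lake:

```python
from datetime import datetime, timedelta, timezone

def fetch_max_loaded_at(table: str) -> datetime:
    # Hypothetical helper: wrap whatever query engine you use (Dremio,
    # Spark SQL, Presto, ...) and return the newest load timestamp,
    # e.g. the result of `SELECT MAX(loaded_at) FROM <table>`.
    raise NotImplementedError("replace with a query against your lake")

def freshness_check(table: str, max_staleness: timedelta) -> bool:
    """Validate (or disprove) the 'my ETL did not run' hypothesis."""
    last_load = fetch_max_loaded_at(table)        # expects a tz-aware timestamp
    age = datetime.now(timezone.utc) - last_load
    if age > max_staleness:
        print(f"{table} is stale (last loaded {age} ago) -> check the ETL logs")
        return False
    print(f"{table} looks fresh (last loaded {age} ago) -> look elsewhere")
    return True

# Example: freshness_check("analytics.events", max_staleness=timedelta(hours=24))
```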

Now in DevOps traditionally, what you’d have to look at to understand various types of incidents is your code and [00:10:00] your operational environment, right? On one hand you look at the logic that’s running in your application and in your services, and you try to see what changed there, or what in the code might have caused the issue that you’re looking at. And then from an operational perspective, you’d be looking at your infrastructure: you’re looking at your servers, at your network, at your configurations, and trying to see whether there’s something [00:10:30] there that might have been causing the issue. Data systems, and specifically data lakes, introduce a new factor here: data, right? On top of these being software systems, where you have code running and you have an operational environment that’s running your code, you also have data feeding into the system.

And for a lot of teams, that data is not always in your control. You do not have [00:11:00] full control of what’s coming in, because you’re maybe pulling from sources that are created by other teams or maybe even other companies. And so that’s just a factor that adds to the complexity of troubleshooting problems and root causing things in a data lake environment. What I’ll offer today is a framework for how to think about these three factors, and what information might be helpful to you [00:11:30] as you go through them and try to identify the issues that you might have with your data. Let’s start with the data, which is the obvious culprit in a lot of cases.

When you look at data, if you’re lucky, you might have some checks or some anomaly detection systems in place to alert you about changes in the data that might be causing issues, right? In the example that [00:12:00] I’ve put here on the slide, we have a table, and a couple of fields within it are starting to get NULL values that weren’t there before, right? And we all know that if you expect a value, and maybe you’re even joining on it or incorporating it into downstream transformations in various ways, and you’re suddenly getting NULLs where you didn’t expect them, that might break the assumptions [00:12:30] of the code that’s transforming the data and might cause data issues, might cause data downtime, right? So when you see an issue, the first thing you might look at in your data is what changed about it, right?

Is it getting more NULLs? Is it getting duplicates? Have things changed from cents to dollars, or from Pacific to UTC time zones, or a million other things that can happen [00:13:00] to your data, right? And when you look at these things, you might get a sense of what might be causing the issue that you’re looking at. The next step is probably trying to go one level deeper, right? In root cause analysis in DevOps, there’s a famous 5 Whys framework where you keep asking why, and why, and why, until you get to the real underlying reason why something happened. And [00:13:30] so what you typically do when you do find a data issue is start looking at what’s causing it. Why am I seeing NULLs? And there are two things we find extremely helpful in trying to identify that root cause, that deeper level of why. One is actually a statistical analysis of the data, [00:14:00] right?
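
As a concrete illustration of those checks, here is a minimal profiling sketch in Python with pandas; the DataFrame and its column names are hypothetical, and in practice you would compare these numbers against a historical baseline rather than eyeballing a single snapshot:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column NULL rate and cardinality, plus the duplicate-row share."""
    stats = pd.DataFrame({
        "null_rate": df.isna().mean(),        # share of NULLs in each column
        "distinct": df.nunique(dropna=True),  # cardinality of each column
    })
    # Extra row: what fraction of entire rows are exact duplicates.
    stats.loc["__duplicate_rows__"] = [df.duplicated().mean(), None]
    return stats

# Hypothetical extract of the affected table.
df = pd.DataFrame({
    "campaign_id": [1, 2, 2, 3],
    "source": ["twitter", "facebook", "facebook", None],
    "spend_usd": [10.0, None, None, 5.0],
})
print(profile(df))
```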

And the way to think about it is: perhaps one of the data sources that’s feeding this table today is broken in one way or another, or maybe the problem is correlated with a certain type of user in our system, or with users that have done a specific transaction on our system, right? And so it’s extremely helpful to look at the data and really break it down [00:14:30] by, to use the technical term, low-cardinality dimensions, essentially fields that describe the data in a low-dimensional manner and help us identify the issue. In this example, you’re seeing a breakdown of the rows that are experiencing those NULL issues that I showed on the previous slide. They’re broken down by source, in this case the campaign source, [00:15:00] so where those campaigns were run: Twitter, Facebook, and Google, right?

And if you look at this and you compare the breakdown of what happened before the anomaly and after the anomaly, it might give you a clue. In this case, it looks like Facebook is giving us a higher rate of NULLs, right? And so by going through this breakdown and [00:15:30] grouping your data by various dimensions, you might be able to get clues into what’s causing the phenomenon that you’re seeing, right? The NULLs, or the duplicates, or whatever else you’re seeing in your data. The other thing that might help you in identifying what’s happening with your data is lineage, right? And this is a very powerful tool we’ve noticed for root cause analysis. If [00:16:00] you happen to know that you have an issue with a particular table in your lake, or even a particular report that someone’s using in your BI tools, then to focus your efforts it’s extremely helpful to understand what’s upstream of that, right?
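
A minimal Python sketch of that before-and-after breakdown, grouping the NULL rate of the affected field by a low-cardinality dimension on either side of the anomaly; the rows, column names, and campaign sources are all hypothetical:

```python
import pandas as pd

# Hypothetical rows from the affected table: an event date, the
# low-cardinality dimension to break down by (`source`), and the field
# that started going NULL.
df = pd.DataFrame({
    "event_date": pd.to_datetime(
        ["2021-07-01", "2021-07-01", "2021-07-01",
         "2021-07-02", "2021-07-02", "2021-07-02"]),
    "source": ["twitter", "facebook", "google",
               "twitter", "facebook", "google"],
    "conversion_rate": [0.10, 0.20, 0.30, 0.11, None, 0.29],
})

anomaly_start = pd.Timestamp("2021-07-02")
df["period"] = df["event_date"].map(
    lambda t: "after" if t >= anomaly_start else "before")

# NULL rate of the affected field, grouped by source and period: a jump
# in a single cell points at the upstream source to investigate first.
breakdown = (
    df.groupby(["source", "period"])["conversion_rate"]
      .apply(lambda s: s.isna().mean())
      .unstack("period")
)
print(breakdown)
```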

And this is something that a lot of teams carry around as tribal knowledge, but actually codifying it and having a central repository of the dependencies [00:16:30] in your lake can really help someone who’s looking at a data problem narrow down the options, eliminate a lot of possibilities, and understand which two or three tables they need to analyze and look at in order to potentially find the root cause of the issue at hand. Lineage is a very, very powerful tool here to understand data-related problems. [00:17:00] Moving on to the next factor, let’s look at code. Data lake code can take many forms: you might be running SQL queries, you might be running Spark jobs, and you might be running other types of code that manipulate your data lake.
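
Before digging into the code factor, here is a minimal sketch of the lineage lookup Lior describes: a codified dependency map you can walk upstream from the broken table or dashboard. The table names and the map itself are hypothetical; in a real setup the map would come from query-log parsing or a catalog/lineage tool:

```python
from collections import deque

# Hypothetical, hand-codified lineage map: table -> the tables it reads
# from. In practice this would come from parsing query logs or from a
# catalog / lineage tool rather than being written by hand.
LINEAGE = {
    "dashboards.marketing_kpis": ["analytics.campaign_daily"],
    "analytics.campaign_daily": ["raw.facebook_ads", "raw.twitter_ads", "raw.google_ads"],
    "raw.facebook_ads": [],
    "raw.twitter_ads": [],
    "raw.google_ads": [],
}

def upstream(table):
    """Breadth-first walk over everything the broken asset depends on."""
    seen, order, queue = set(), [], deque([table])
    while queue:
        for parent in LINEAGE.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

# The handful of tables worth checking first for a broken dashboard:
print(upstream("dashboards.marketing_kpis"))
```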

When you see an issue, it can be related to [00:17:30] a code change, right? It can be related to an innocent change to a particular query or a particular transformation that had unintended consequences, right? And so we always recommend, when you’re looking at a data issue, to take a look at the queries that generated that data recently, right? To understand what queries ran today and modified or wrote this data, what [00:18:00] logic they represent, and more importantly, what changes have been made. You want to understand whether a code change has happened, whether a code change was pushed. And something that’s very specific to data systems: you may in some cases not have consistent code running in your system, right? With software, when you ship a new version of your code, [00:18:30] it’s very likely that you have the same version of the code running throughout your system.

However, with data it’s not uncommon to see ad hoc operations like backfills, right? Someone may have been backfilling the data and caused an unintended consequence there. Or, God forbid, someone may have been making manual modifications to the data, trying to correct a certain problem or fix a certain issue. [00:19:00] Having the visibility to know that these things may have happened, and quickly being able to spot them and correlate them with the issue that you’re looking at, is extremely powerful, and it’s something that has helped a lot of the teams that we work with identify data issues very, very quickly, sometimes within minutes, when the issue is related to code.
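
A minimal Python sketch of that kind of review: scanning recent write queries against the affected table for ad hoc backfills or manual fixes. The query-history records, users, and table name are all made up; in practice they would come from your engine's query or audit log:

```python
from datetime import datetime

# Hypothetical query-history records, e.g. exported from your engine's
# audit log; the users, timestamps, and SQL here are made up.
query_history = [
    {"ts": datetime(2021, 7, 21, 2, 0), "user": "airflow",
     "sql": "INSERT INTO analytics.campaign_daily SELECT ..."},
    {"ts": datetime(2021, 7, 21, 14, 30), "user": "jane.doe",
     "sql": "UPDATE analytics.campaign_daily SET spend_usd = NULL WHERE ..."},
]

# Service accounts that normally write this table on a schedule.
EXPECTED_WRITERS = {"airflow"}

def suspicious_writes(history, table):
    """Flag ad hoc writes (manual fixes, backfills) against a table."""
    for q in history:
        if table in q["sql"] and q["user"] not in EXPECTED_WRITERS:
            yield q

for q in suspicious_writes(query_history, "analytics.campaign_daily"):
    print(f"{q['ts']} {q['user']}: {q['sql'][:60]}")
```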

Now, the third [00:19:30] piece is your operational environment. And it’s amazing to see how many times data-related issues are actually related to operational issues. I’m sure you’ve all experienced some of these things happening. It could be a silent error in one of your ETLs, in one of your Airflow DAGs, that’s causing an issue. It might be permission changes [00:20:00] to one of the source systems that are preventing you from ingesting new data. It might be network changes that are preventing certain processes from running. Or it could even be an intentional scheduling change that made something run in a different order or at a different time and caused data issues. And finally, of course, performance issues can cause data issues if your systems are running out of sync. [00:20:30] And so looking at that operational environment, understanding what changes happened there, and understanding any errors or logs that you may have from those systems can really help explain data issues in many, many cases.
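
A minimal sketch of what checking that operational environment might look like, assuming you can export pipeline run metadata from your scheduler; the CSV of DAG runs below is made up for illustration:

```python
import csv
from io import StringIO

# Hypothetical export of pipeline run metadata (e.g. from your
# scheduler's metadata database); the DAG, dates, and numbers are made up.
RUN_LOG = """dag_id,run_date,state,duration_s
campaign_daily,2021-07-19,success,610
campaign_daily,2021-07-20,success,595
campaign_daily,2021-07-21,failed,42
"""

def recent_problems(run_log_csv):
    """Surface failed runs and suspiciously fast 'silent' runs."""
    for row in csv.DictReader(StringIO(run_log_csv)):
        if row["state"] != "success":
            yield f"{row['dag_id']} {row['run_date']}: state={row['state']}"
        elif int(row["duration_s"]) < 60:
            yield f"{row['dag_id']} {row['run_date']}: ran suspiciously fast"

for problem in recent_problems(RUN_LOG):
    print(problem)
```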

And to wrap all of this up, I want to remind you that one of the most powerful things [00:21:00] you can do throughout this process, and it spans data, code, and operational issues, is to use your peers. If you’re working in a large enough company, people around you probably have some relevant information that might help you understand the issue. Maybe they’ve worked with this table or pipeline before, maybe [00:21:30] they wrote it, maybe they’ve troubleshot a similar problem and have some insight about how to solve it. And so if you keep better metadata and documentation, you might benefit from that, right? If you can go to a table and quickly understand: who owns it? Who’s been writing the pipeline for it?

Who’s been working with it recently? What did they find [00:22:00] out in previous incidents that they may have handled on your data lake? If so, you might be able to get to the right person quickly and efficiently, ask the right questions, or sometimes even reach a conclusion without ever talking to them, if they’ve been diligent about documentation. And so having a central place where you can go and understand who the people involved with a particular pipeline or table are, and what they found [00:22:30] out in the past when doing root cause analysis, can really accelerate the process and improve overall collaboration around these things. And with that in mind, I’m going to hand it back to Barr.
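
As a tiny illustration of the ownership-and-incident metadata Lior describes, here is a hypothetical in-code catalog; in practice this information would live in a data catalog or documentation tool rather than in a script:

```python
# A tiny, hypothetical "who do I ask?" catalog: table -> ownership,
# pipeline location, and notes from past incidents.
CATALOG = {
    "analytics.campaign_daily": {
        "owner": "growth-data-team",
        "pipeline": "dags/campaign_daily.py",
        "incident_notes": [
            "2021-05-03: NULL spend traced to an upstream API permission change",
        ],
    },
}

def who_to_ask(table):
    entry = CATALOG.get(table)
    if entry is None:
        return f"No ownership metadata recorded for {table}"
    notes = "; ".join(entry["incident_notes"]) or "no prior incidents recorded"
    return f"Owner: {entry['owner']} | Pipeline: {entry['pipeline']} | History: {notes}"

print(who_to_ask("analytics.campaign_daily"))
```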

Barr Moses:    Hey Lior, thanks for going through that. I will just mention two additional tools that have been really helpful for folks that we work with, and then I think we’ll have just a few minutes for Q&A here. I think there are some good questions coming [00:23:00] in. If anyone has questions, please feel free to drop them in and we’ll take them shortly. I’ll mention here the concept of setting SLAs and SLOs, which again is very familiar in other industries, and we’re starting to see more and more folks adopt it in data in particular. For folks not familiar, SLA stands for Service Level Agreement, which basically creates a sort of contract between different teams on what you’d expect and what [00:23:30] you’re aligning on in terms of the data, so this is something that’s been very helpful. And then just the next slide, Lior, if you can. Perfect.

The other thing that we’re seeing that’s quite revolutionary for teams is actually starting to think about metrics for root cause analysis, and for data downtime in general: actually measuring time to detection, literally how long until you [00:24:00] detected an issue, and time to resolution, meaning how quickly you’ve identified the problem and resolved it, and having teams use the tools that Lior mentioned today. We’re seeing folks able to reduce time to detection and time to resolution significantly with these. And then finally, the last thing I’ll mention here is that if you track the number of incidents, time to detection, and time to resolution, you can get an accurate view of what we would call data [00:24:30] downtime.
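
As a worked illustration of those metrics, here is a minimal Python sketch that turns hypothetical incident records into time to detection, time to resolution, and total data downtime:

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when the data broke, when the team
# detected it, and when it was resolved.
incidents = [
    {"started": datetime(2021, 7, 1, 2, 0),
     "detected": datetime(2021, 7, 1, 9, 30),
     "resolved": datetime(2021, 7, 1, 15, 0)},
    {"started": datetime(2021, 7, 12, 4, 0),
     "detected": datetime(2021, 7, 12, 4, 45),
     "resolved": datetime(2021, 7, 12, 6, 0)},
]

ttd = [i["detected"] - i["started"] for i in incidents]   # time to detection
ttr = [i["resolved"] - i["detected"] for i in incidents]  # time to resolution

total_ttd = sum(ttd, timedelta())
total_ttr = sum(ttr, timedelta())

print(f"incidents:                  {len(incidents)}")
print(f"average time to detection:  {total_ttd / len(incidents)}")
print(f"average time to resolution: {total_ttr / len(incidents)}")
# Downtime per incident spans from breakage to resolution, so the total
# is simply the sum of detection and resolution time across incidents.
print(f"total data downtime:        {total_ttd + total_ttr}")
```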

It’s something that I think is pretty new to start measuring. And I don’t think that we can ever aspire to have zero data downtime; I don’t think anyone can say, my data’s perfect, it’s always accurate. But if we can actually start measuring it and start developing the tools to improve it, then we can see ourselves reducing this, maybe from months and weeks to minutes and hours. We’ll wrap up with that. Lior, if we [00:25:00] could move on to the next slide, please. And I think there are a few questions here that we can start taking. Let’s see. Lior, let me ask you this question. Ben, is it okay if I just start taking questions from the Q&A?

Ben:    Absolutely, go ahead.

Barr Moses:    Awesome. Lior what are some best practices for staying proactive about handling data downtime? [00:25:30] Such a great question.

Lior Gavish:    Oh yeah. It’s a great question. And I think it’s tied to what Barr presented in the most recent slide, but I think there are two things you can do to be proactive. One is to find out about issues, right? If you don’t know that you’re getting data downtime and you’re hoping that the downstream consumers will let you know, [00:26:00] then you’re probably going to be in reactive mode, doing fire drills, and providing a relatively low level of service to your consumers, right? And so the first thing you can do is make sure that you have monitoring in place, and testing as well, right? Make sure you test your data and you’re monitoring it, so that when data downtime issues happen, you know about [00:26:30] it before others do and you have the time to properly address it.

The other thing that you can do is track it over time and create visibility in the team around it, right? If there’s a data downtime issue and there’s just one person that knows about it, and maybe they have a hundred priorities competing for their attention, it might fall through the cracks; it might linger for a long time. But if the entire team understands, these are the issues that [00:27:00] we have right now, this is what we need to cover, we’re accountable for it, and we want to meet a certain level of service, a certain SLA, I think that [inaudible 00:27:10] can create a lot of momentum in the team and in the company around actually fixing those issues, being proactive about them, and potentially even doing the work to prevent them in the future. I don’t know, Barr, if you have anything to add around that.

Barr Moses:    Yeah, I think that’s spot on. I think we have [00:27:30] time for probably one more just before the clock runs out here. Another question that seems to be voted pretty high is how can I monitor data traffic?

Lior Gavish:    Data traffic. That’s a great question, and I’ll say it’s one of the many questions you need to answer in order to fully understand your data [00:28:00] health, right? We didn’t spend a lot of time on it today, but broadly speaking, there’s this idea of data observability: how do you understand the overall health of your data, end to end, throughout your lake and all the downstream usages of it. Part of the work we’re doing at Monte Carlo is trying to define how that happens and what the things are that we need to track in order to understand your data [00:28:30] health, and data traffic is actually one of the biggest pillars there. There are many ways you can slice and dice it, but the way we like to think about it is to actually track data volumes throughout the lake, both the data volumes and the throughput, right?

One way to go about it is to actually measure and track the volume of all of the data assets that [00:29:00] you have and how it changes over time. And if you have that visibility and that monitoring, you can actually detect particular increases or decreases in the data that’s flowing through the system, which can indicate issues or help you optimize your data lake deployment. And there are ways to do it at scale. If you do proper instrumentation on your underlying storage and on your [00:29:30] ETLs, you can actually track that and have good visibility into it.
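
A minimal sketch of that volume tracking, assuming daily row counts per table are already being collected (the counts below are made up); even a simple deviation check like this can flag sudden drops or spikes in traffic:

```python
import statistics

# Hypothetical daily row counts for one table, e.g. collected by a
# nightly `SELECT COUNT(*)` or read from partition metadata on the lake.
daily_row_counts = [1_020_000, 1_050_000, 990_000, 1_010_000, 1_030_000, 412_000]

def volume_anomaly(counts, z_threshold=3.0):
    """Flag today's volume if it deviates strongly from recent history."""
    history, today = counts[:-1], counts[-1]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = abs(today - mean) / stdev if stdev else float("inf")
    return z > z_threshold

if volume_anomaly(daily_row_counts):
    print("Data volume looks anomalous today -> investigate upstream ingestion")
```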

Barr Moses:    Fantastic. Thanks for covering that. Next slide, Lior, please. There are actually a bunch more questions here that unfortunately I don’t think we’re going to have time to get through, but we would love to continue this conversation offline. Feel free to reach out to us, either myself or Lior. This is probably our favorite topic in the world to talk about. And [00:30:00] I hope this was helpful to you all in terms of identifying ways that you can actually use this in practice. And again, please feel free to reach out. We have a blog where we write a lot about these topics and try to find ways to be helpful, so thanks again for the great questions. It was awesome to be here.

Ben:    Thanks everybody. That’s all the questions we have time for. And thank you so much, Barr and Lior, for taking the time to present here. [00:30:30] If you didn’t get your question answered, please reach out to Barr or Lior through the channels they mentioned just now; we also have the Subsurface Slack channel. We’d also appreciate it if you fill out the super short Slido session survey on the top right before you leave. The next session will be in five minutes. Another great place to go is the Expo hall, where we have a bunch of cool booths. Monte Carlo also has a booth open. I think you can wear an Oculus, right? That’s pretty awesome, so check [00:31:00] out the latest tech and awesome giveaways, and I hope everyone enjoys the rest of the conference. Thank you so much.

Lior Gavish:    Thanks everyone.

Ben:    Bye everybody.

Barr Moses:    See you soon.