Subsurface LIVE Winter 2021

Data Observability for Data Lakes: The Next Frontier of Data Engineering

Session Abstract

Ever had your CEO look at a report and say the numbers look way off? Has a customer ever called out incorrect data in one of your product dashboards? If this sounds familiar, data reliability should be the cornerstone of your data engineering strategy.

This talk will introduce the concept of “data downtime”—periods of time when data is partial, erroneous, missing or otherwise inaccurate—and how to eliminate it in your data lake, as well as the rest of your data ecosystem. Data downtime is costly for organizations, yet is often addressed ad hoc. This session will discuss why data downtime matters to building a better data lake and tactics best-in-class organizations use to address it—including org structure, culture and technology.

Presented By

Barr Moses, CEO & Co-Founder, Monte Carlo Data

Barr Moses is CEO and Co-founder of Monte Carlo, a data/analytics startup backed by Accel and other top Silicon Valley investors. Previously, she was VP of Customer Operations at Gainsight (an enterprise customer data platform), where she helped scale the company 10x in revenue and, among other functions, built the data/analytics team. Prior to that, she was a management consultant at Bain & Company and a research assistant in the Statistics Department at Stanford. She also served in the Israeli Air Force as a commander of an intelligence data analyst unit. Barr graduated from Stanford with a B.S. in mathematical and computational science.


Webinar Transcript

Lucio:

We have with us Barr Moses, the CEO and co-founder of Monte Carlo. Before we start her session, I want to remind the audience that you can ask any questions you may have in the chat window; just click on the chat tab in your browser. Also, at the end of the presentation, please provide your feedback on the Slido tab, where you will [00:00:30] have a couple of questions to give feedback on this session and on the event in general. All right, so without further ado, Barr, the time is yours.

Barr Moses:

Thanks [Lucio 00:00:40], see you a little bit later. Hi everyone. Welcome. I’m excited to spend this morning with you. I hope you’ve been enjoying the conference so far. I’ve personally been enjoying it; there’s some great content here. I’m looking forward to speaking with you all today about data downtime and data observability. The concept of [00:01:00] observability has been very well understood in software engineering for many years now, but it is really not well understood in the context of data, and in particular data lakes. I’m excited to share with you some of the new thoughts and ideas in this space: What does data downtime mean? What does data observability mean? And some ideas for how you can implement this, along with some practical tips for what you can start doing today to strengthen your data lake strategy.

[00:01:30] Just a little bit about myself. My name is Barr, and as I mentioned, I’m the CEO and co-founder of Monte Carlo; we’re a data reliability company. We help enterprises trust their data by helping them minimize data downtime. Prior to this, I was actually a VP at a company called Gainsight, which is a customer success company. I worked with many organizations on their customer data strategy and needs, and my background is in math and statistics. Maybe the most important [00:02:00] or interesting fun fact about myself is that I’m a huge fan of Bruce Willis movies. Can’t wait for another one, if there ever is one. In particular, Bruce Willis is also a numbers guy. If you’ve ever seen Die Hard 3, you know that in that movie he solved a very important math problem with jugs and saved a lot of lives as a result, so I’m a big fan of that.

With that out of the way, let’s jump directly [00:02:30] into what data downtime is. As I mentioned, I believe data downtime is the biggest challenge that we’re facing today as a data industry, and one that will only increase in importance in the future. Let me start by explaining a little bit of the context here and how you might relate to this problem. If you’re attending this conference today, you probably embody one of these personas, or maybe a few others, but one of these main categories. [00:03:00] You might feel like you are Rebecca here on the left, a data user who relies on real-time insights to develop punchy marketing strategies, working primarily with the marketing or product team.

You might see yourself in Anna, an analyst who’s widely referred to as the SQL queen of her company for her ability to wrangle data sets in magical ways, working mostly [00:03:30] with SQL. Or you may see yourself in Emerson, a data engineer who’s responsible for building and maintaining your company’s data platform. Perhaps Emerson is starting a new data platform team and using Dremio, Snowflake or BigQuery. Or perhaps you might see yourself in Michelle here on the right. Michelle is a manager who relies on dashboards and reports in Tableau or Looker [00:04:00] for decision-making, and for showing how killer her team is. Well, regardless of which persona you see yourself in, there’s probably one thing common to everyone who’s been in data, no matter where you are or what tech stack you work with… Whoops, sorry about that. See if this works. Here we go.

No matter where you are, no matter [00:04:30] what tech stack you work with, no matter what kind of data you work with, you’ve probably all experienced one thing in common, and that is when someone downstream tells you that the data is wrong. I don’t know how many times this has happened to you all, but it’s something that I’ve personally experienced a lot, so I can share some of my experiences. As I mentioned, when I was at Gainsight, I was responsible for the team that organized all of our customer data and used it for reporting [00:05:00] on product decisions and internal decision-making. It felt like basically every Monday morning I would come into the office, back when we had offices, and I would just get this flood of emails from our CEO and others, saying, “What’s wrong with the dashboard here? What’s going on with the report here?”

Getting that was really frustrating because we were oftentimes the last to know about data problems. That’s something that I see today across [00:05:30] the industry. Very, very often data organizations are the very last to know when data breaks, but this is something that happens to everyone, ranging from very small startups to large organizations like Netflix and Uber and Facebook; everyone is experiencing this kind of problem. Furthermore, when these issues arise, they’re very impactful for the company itself. These are all statistics that I’m sure you’re familiar with, but bad [00:06:00] data issues account for up to 80% of a data practitioner’s time. That’s an insane amount of time that’s spent on bad data. Also, occurrences of bad data put the company at serious compliance risk, and as compliance and regulation become more of a day-to-day reality for us, that’s something that significantly impacts our companies.

Then finally, a statistic that I actually learned recently was that one out of five companies lost a customer due [00:06:30] to using incomplete or inaccurate data about them. How unfortunate is that? Data is supposed to create trust for us and create better ways for us to serve our customers, and yet when the data is wrong, we lose that trust, and we lose the customer, right? I was talking to an e-commerce head of data and he mentioned to me, “Having the data [00:07:00] on the website be wrong is actually worse than the website itself being down. I’d rather have the website be down than have the website up with the wrong data.” I think this impacts companies across the data industry. No one is spared.

But here’s maybe some of the good news: since this is an industry-wide problem, it’s potentially something that we want to start [00:07:30] thinking about in a more methodical or more systematic way. Because of that, we actually coined the term data downtime. What does data downtime actually mean? It’s periods of time when your data is partial, erroneous, missing or otherwise inaccurate. Where does this term come from, if you will? Why do we call it data downtime? It’s really a corollary to application downtime. If you think about it, a couple of decades ago, if your website was [00:08:00] down, probably nobody noticed, mostly because no one was looking at your website, so it wasn’t such a big deal if it went down. But in today’s world, if your app is down, if your website is down, it’s a really big deal, and there are entire teams, DevOps teams, formed around the need to keep your applications up and running.

I think the same thing is happening to data. Maybe three to five years ago, it was perfectly fine if your data was down [00:08:30] for a little bit because, honestly, not many folks were using it; maybe it wasn’t really driving your product or your decision-making as a company. But today that’s no longer the case. Today, many of our digital products are actually driven by data. In fact, many of our applications are data-driven instead of logic-driven, so as a result, we need to get more rigorous in how we think about data downtime.

When you think about [00:09:00] data downtime and how it actually manifests in the organization, just to give you a few examples here, this has potentially happened to all of you. Someone pings you on Slack and says, “Hey, the dashboard looks really wrong. Can you please check what’s going on here?” That’s an annoyed analyst looking at things downstream. Or potentially it’s someone on the BI team who’s asking, “Hey, this field that someone [00:09:30] upstream is using or looking at, where are all the different places in which this field is being used? How do I know that?” Or, for example, an exasperated exec who says, “Hey, the dashboard is down. I need this fixed ASAP,” right? All of these are instances of data downtime, and I think the bottom line here is that one way or another, we are all paying the high cost of data downtime, whether we acknowledge it or not. There are some ways that we can start measuring [00:10:00] the impact of this on an organization. I’ll just give a few examples here, to give you a sense.

For example, I mentioned that data organizations spend up to 80% of their time… If we take a conservative assumption that folks spend 30% or so of their time on this, you can actually calculate the labor cost of data downtime: the number of engineers, times their salary, times the percentage of their time that they spend on it. There’s also the [00:10:30] compliance risk that we discussed; with GDPR, for example, a fine can be up to 4% of your revenue in a previous year, so that’s the cost of the compliance risk due to data downtime.
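
As a rough illustration of that back-of-the-envelope math, here is a small Python sketch; every number in it is an assumption for illustration only, not a figure from the talk.

```python
# Back-of-the-envelope estimate of the cost of data downtime.
# All numbers below are illustrative assumptions.

num_engineers = 10            # data engineers on the team (assumption)
avg_salary = 150_000          # fully loaded annual salary per engineer, USD (assumption)
pct_time_on_bad_data = 0.30   # conservative share of time spent firefighting bad data

labor_cost = num_engineers * avg_salary * pct_time_on_bad_data

# Compliance exposure: GDPR fines can reach up to 4% of annual revenue.
annual_revenue = 50_000_000   # illustrative assumption
gdpr_exposure = annual_revenue * 0.04

print(f"Estimated annual labor cost of data downtime: ${labor_cost:,.0f}")
print(f"Potential GDPR exposure: ${gdpr_exposure:,.0f}")
```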

Then finally, and maybe the most important in my mind, is the opportunity cost: everything that you could have done if you weren’t spending time on data downtime. It’s all those new products that you could have shipped. It’s those customers, it’s that new go-to-market [00:11:00] strategy that you could’ve launched. All of that opportunity cost is what we’re not doing because we are spending time and essentially paying this high cost of data downtime.

Hopefully the concept of data downtime makes sense. Now, the question is, what can we do about it? In a way, it feels like we’ve become a little bit complacent about data downtime. As I mentioned, this is the reality for you if you’re in a data organization, but I think we can no longer be complacent. [00:11:30] As more of the organization relies on us to deliver high-quality, trusted data, we need to come up with better ways to manage data downtime. I think the good news is that there’s actually a path toward this, and the path comes from software engineering. In software engineering, the concept of observability is incredibly well understood. Engineers leverage these principles of reliability and observability to make sure that their applications are performing as expected.

How can we take those [00:12:00] best practices of observability and DevOps and bring them to data? I’ll talk a little bit about what data observability actually is and how we think about it. As I mentioned, as organizations generally grow… Sorry about that, okay. As organizations grow and the underlying tech powering them becomes more complicated, it’s very important for [00:12:30] DevOps teams to maintain a pulse on the health of their systems. What does that look like in engineering? In engineering, this refers to the monitoring, tracking and triaging of incidents to prevent application downtime. As a result of this kind of shift that we’re seeing to distributed systems, observability has actually emerged as a fast-growing engineering discipline.

At its core, [00:13:00] observability in engineering is broken into three major pillars, as you can see on my slide here: metrics, logs and traces. Metrics refer to a numeric representation of data that’s measured over time. Logs are basically a record of an event that took place at a given timestamp. And traces represent events that are related, typically in a distributed environment. Now, every engineering [00:13:30] team that respects itself has something like New Relic or Datadog or PagerDuty to manage this. Then the question is, why are data teams flying blind? Why don’t we have something like this that we can apply to data as well?

Well, why don’t we start doing that? We can apply those same best practices of DevOps observability to data pipelines, and just like DevOps, use monitoring and triaging to identify and evaluate data quality and discoverability issues. [00:14:00] So the definition of data observability is basically an organization’s ability to understand the health of the data in their system, thereby eliminating data downtime. Okay, great. That’s awesome as a theoretical definition, but how do we break that down? As we mentioned, in DevOps observability it’s very clear what we have to measure, but what does data observability actually consist of? Well, we’ve actually spoken to hundreds of organizations, ranging from small startups [00:14:30] that are data-driven to larger organizations that have spent years investing in this. And what we’ve done is we’ve actually asked them, “Tell us about your data downtime stories. Tell us about your incidents. Tell us what happened. What were the symptoms? How did you catch those issues? Did you even catch them, and what was the root cause?”

We took that and consolidated it into a repository of information that we use to basically define how people tackle [00:15:00] data downtime, what they are building, and how they are thinking about this. Using that, we’ve consolidated this into these five pillars, which together, we believe, give you a good view into the health of your data. These five pillars are freshness, distribution, volume, schema, and lineage. Again, together they can give you a holistic sense of the health of your data. Now you might say, “Well, data can break for a million different reasons and my data [00:15:30] is different than your data. And I’m a snowflake and you’re a snowflake and we’re all so different. How can we align on a holistic framework?”

Actually, what we’ve been surprised to learn is that this does follow some of the best practices that we see in software engineering observability. In speaking to hundreds of organizations, there are actually patterns that we can draw out and learn from. Of course every business is very different, and data sets and the usage and application of data are very different, but there are some common methodologies [00:16:00] that we can and should use if we want to solve this in a scalable way. Let’s walk quickly through each of these and I’ll explain a little bit of what they mean.

Starting with freshness: freshness is a whole slew of metrics around the timeliness, up-to-date-ness, or availability of your data. This is an example of a particular table that’s being monitored for how often it’s being updated. You can see that each of the lines here is a [00:16:30] tick for when the table was updated. So it’s getting updated semi-regularly, and then there’s this period of about three to four days where the table is not getting updated at all, which is obviously an irregularity. In this case, this is a type of abnormality that your team would want to know about; potentially this table is one that’s actually being used downstream, and you’d want to know about that. So this is an example of a freshness metric that you can start thinking about applying.
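
To make the freshness pillar concrete, here is a minimal sketch of the kind of check a team might run; the table name, expected cadence, and metadata lookup are hypothetical placeholders, not part of any particular product.

```python
from datetime import datetime, timedelta, timezone

# Minimal freshness check: alert if a table hasn't been updated within the
# interval we normally expect.

EXPECTED_UPDATE_INTERVAL = timedelta(hours=24)  # assumed normal cadence

def get_last_updated(table: str) -> datetime:
    # Placeholder: replace with a real metadata lookup for your warehouse
    # or lake engine (e.g. an information_schema query).
    return datetime(2021, 1, 25, 6, 0, tzinfo=timezone.utc)

def check_freshness(table: str) -> None:
    last_updated = get_last_updated(table)
    age = datetime.now(timezone.utc) - last_updated
    if age > EXPECTED_UPDATE_INTERVAL:
        print(f"FRESHNESS ALERT: {table} last updated {age} ago "
              f"(expected within {EXPECTED_UPDATE_INTERVAL}).")
    else:
        print(f"{table} is fresh (updated {age} ago).")

check_freshness("analytics.daily_orders")  # hypothetical table name
```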

The second [00:17:00] pillar is around distribution. This is really at the field level: what is the distribution of a field, and forecasting what it should be based on some baseline that we’ve developed. In this example, you can actually see anomalous data detected. There’s a field here called count pets, and the metric that we’re looking at is the percentage null rate. You can see that historically this has been 0.6%, and today it’s actually 99%, so obviously in this case [00:17:30] this is a big change. You can look at a whole slew of different metrics under distribution. I’m showing null values here, but it could be uniqueness, it could be negative rates, et cetera.
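
Here is a similarly minimal sketch of a distribution check on a field’s null rate; the data, the threshold, and the field name are illustrative, loosely based on the slide’s example.

```python
# Minimal distribution check: compare today's null rate for a field against a
# historical baseline and flag a large jump.

def null_rate(values) -> float:
    return sum(v is None for v in values) / len(values)

historical_null_rate = 0.006   # ~0.6%, the historical baseline from the example
threshold = 0.05               # flag moves of more than 5 points (assumption)

todays_values = [None] * 99 + [3]   # toy batch where ~99% of the field is null
todays_null_rate = null_rate(todays_values)

if abs(todays_null_rate - historical_null_rate) > threshold:
    print(f"DISTRIBUTION ALERT: null rate is {todays_null_rate:.1%}, "
          f"baseline is {historical_null_rate:.1%}.")
```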

Moving to the third one, volume. What you see here is actually the number of rows removed from a particular table. Volume basically answers the question, “Is my data complete? Do I have all the data that I should be using and looking at?” [00:18:00] In this case, you can see how the number of rows that have been removed stays within a relatively similar range, and then at a certain point here, on November 14th, it dropped to an unusual number. This is an example of a volume issue that we’d want to identify.
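
A volume check can be sketched the same way; the row counts and the three-standard-deviation rule below are illustrative assumptions, not a prescribed method.

```python
import statistics

# Minimal volume check: flag when a daily row count (or rows-removed count)
# falls far outside its recent range. The counts are made up for illustration.

recent_daily_rows = [10_200, 9_950, 10_480, 10_010, 10_320, 9_870, 10_150]
todays_rows = 1_200   # the kind of sudden drop you'd want to catch

mean = statistics.mean(recent_daily_rows)
stdev = statistics.stdev(recent_daily_rows)

# Flag anything more than 3 standard deviations from the recent mean.
if abs(todays_rows - mean) > 3 * stdev:
    print(f"VOLUME ALERT: {todays_rows:,} rows today vs. recent mean {mean:,.0f}.")
```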

The fourth pillar is schema. Schema, as you all know, is the pillar around understanding changes in the structure of your data. Potentially these [00:18:30] fields were deleted, or this field changed from float to string. Any of these changes in schema can, one, cause data downtime, but more importantly, knowing about them can help us prevent data downtime to begin with. So schema is a very important pillar.
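
A schema check can be as simple as diffing yesterday’s column types against today’s; the schemas below are made up for illustration.

```python
# Minimal schema check: diff yesterday's column types against today's and
# report added, removed, or retyped fields.

yesterday = {"order_id": "integer", "amount": "float", "customer_email": "string"}
today     = {"order_id": "integer", "amount": "string"}   # field dropped, type changed

removed = set(yesterday) - set(today)
added = set(today) - set(yesterday)
retyped = {c for c in set(yesterday) & set(today) if yesterday[c] != today[c]}

for col in removed:
    print(f"SCHEMA ALERT: column {col!r} was removed.")
for col in added:
    print(f"SCHEMA ALERT: column {col!r} was added.")
for col in retyped:
    print(f"SCHEMA ALERT: column {col!r} changed from {yesterday[col]} to {today[col]}.")
```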

Then the last pillar here is lineage. This is table-level lineage in particular, where you’re seeing a combination of views, external tables, tables, et cetera. What lineage does is help [00:19:00] us tie all of this together, right? It helps us understand, okay, what are the dependencies of a particular asset? If a table has a freshness problem, is there anything downstream that is impacted by it, where my downstream consumers need to know about it?
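
To make the lineage pillar concrete, here is a minimal sketch of answering the downstream question, “if this table has a problem, what is affected?”; the tables and edges below are hypothetical, not from the talk.

```python
from collections import defaultdict, deque

# Minimal lineage sketch: given upstream -> downstream edges between assets,
# find everything affected when one table has an incident.

edges = [
    ("raw.web_events", "staging.sessions"),
    ("staging.sessions", "analytics.daily_orders"),
    ("analytics.daily_orders", "dashboards.marketing_roi"),
]

downstream = defaultdict(list)
for upstream_table, downstream_table in edges:
    downstream[upstream_table].append(downstream_table)

def impacted_by(table: str) -> set:
    """Return every asset downstream of `table` (breadth-first traversal)."""
    seen, queue = set(), deque([table])
    while queue:
        for child in downstream[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Prints the three downstream assets of raw.web_events (set order may vary).
print(impacted_by("raw.web_events"))
```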

And the inverse of that: if there’s a particular problem with a table, what upstream might be causing that table to have that issue? So lineage is an important final component to tie this together and to understand the different data downtime [00:19:30] incidents that we are seeing. Another important tool in your data observability toolbox that I just want to touch on briefly is metadata. We have our five pillars of data observability, but then we have all this metadata at our fingertips, and one interesting thing is that in the last few years, or the last decade or so, we’ve gotten really good at collecting a lot of data, and [00:20:00] we don’t always use it. We don’t always draw insights from it, but we sure do collect and store a lot of data.

I think we’re a little bit at risk of doing that with metadata. Metadata is a very hot topic these days; the automation of metadata collection and how to use it is widely talked about, which is awesome, but I think we’re at risk of doing the same thing here and just collecting a lot of metadata without actually using it. One of the important things about metadata is that [00:20:30] it is incredibly powerful when it is applied to the right use case. Metadata can help us understand the purpose of the data, its source, who created it, and where it’s located. When you use metadata in conjunction with the data observability pillars that we mentioned, together you can actually draw powerful insights about the health of your data. So when you think about all the different components of data observability, metadata is obviously a very important one as well.

I wanted [00:21:00] to share just a couple of examples of what this looks like in practice, bringing all of the tools in our toolbox together. We won’t go into too much detail. Example number one: in very active organizations, what could happen is a situation in which someone makes a change somewhere upstream, potentially, as we mentioned, [00:21:30] changing a field type from integer to string, or maybe actually deleting the field. Without anyone knowing it, that results downstream in missing or partial data that, say, your marketing team has been relying on. So maybe the change in the field was a change somewhere in your website, and your marketing team is using that data to drive campaigns on Facebook. [00:22:00] When someone makes a change to that field, the resulting data could actually significantly impact the ROI of that marketing campaign. Without knowing these connections, you might be inadvertently losing revenue due to this issue.

Now, in the instance where you actually have data observability practices in place, you should have visibility into all of the dependencies of a particular field, so [00:22:30] that when you are making that change, you can make it faster with minimal damage, meaning you actually know what’s going to be impacted. You can give the team downstream a heads-up, or they can get an automated heads-up that this change has happened, and they can make the necessary changes on their end. The benefits of that are, one, you can definitely move faster as an organization, and two, obviously you’re not compromising the ROI of your campaigns in this example, or anything else that relies on your data.

[00:23:00] Here’s another example. Potentially there’s a Looker dashboard that you’ve been using, and it hasn’t been updated, and you just weren’t aware of it. So you go on and use that data later on, maybe to feed it into decision-making or some other process that you have internally. Potentially you’re not aware of that, and potentially the folks who are actually using those dashboards are not aware of it either. [00:23:30] With data observability, if you have machine learning or automated ways to detect freshness issues, you can actually know that this dashboard typically gets updated every day at 6:00 AM, that our product team uses it every morning at 8:00 AM to do a product review, and that today it wasn’t updated. You can get alerted in near real time about this instead of having the back-and-forth with the team on whether it got updated [00:24:00] or not.

Finally, the third example: oftentimes when these issues happen, there’s just a flurry of folks trying to understand what the problem is. If something breaks downstream, you then have to spend a lot of time, days, weeks, or maybe months, going upstream to try to figure out what exactly broke or why. Now imagine a world where you have data observability, where the minute you know that there’s a problem, you also understand the root cause of it. And so troubleshooting [00:24:30] time goes down, and the miscommunication between different folks is drastically reduced as well.

The final part here is, okay, there are so many kinds of folks involved; we mentioned the different personas involved with data observability. One of the most common questions that I get is, “Wait, but hold on, who owns data observability in a company? Who’s responsible for this at the end of the day? [00:25:00] Who should be working on this, turning us from a reactive to a proactive organization?” Well, what I’m showing here, and I won’t spend a ton of time on this, is just to plant the seed. One thing that’s been very helpful for folks is what we call a RACI matrix.

A RACI is a framework that helps bring different teams together under a definition of who does what. R stands for responsible; that’s the person who’s actually [00:25:30] delivering on the issue and executing on it. A is for accountable; accountable means your neck is on the line, you are on the hook to deliver this. C is for consulted, meaning your opinion matters, but you might not necessarily have an active part in the process. And then I is for informed, meaning you want to be aware, you need to be aware, of the decision-making [00:26:00] or the different processes here, but you might also not have an active part in the process.

So what you see here is basically a proposal. On the left are the different kinds of responsibilities, whether it’s facilitating data accessibility or driving insights, and then at the bottom here we have maintaining high data quality and delivering on data reliability. Then there are all the different organizations or types of teams that might [00:26:30] be involved in data, whether it’s the chief data officer’s team, BI, data engineering, data governance, or even the data product manager. And you see in this sort of proposal that there’s always only one person who is accountable, but there can be multiple people who are responsible for working on a particular issue. Oftentimes, what we see when it comes to data observability and data reliability is that the person accountable is obviously the [00:27:00] person leading the data organization. The data engineers and data product managers are often the ones responsible for actually owning data observability, and then a few others are consulted and informed.

One of the other things that we’re seeing teams do, since this is a multi-team problem, is that folks are actually starting to set SLAs and SLOs for data. This is an example of a data reliability dashboard [00:27:30] that literally tracks what the data downtime is, and, for a particular data analytics pipeline, what particular SLAs we want to hold. You can actually have an SLA for each of those five pillars that we talked about. So more and more teams are starting to use these to better facilitate working on these problems across different teams: engineering, data, product, data science, et cetera.

[00:28:00] If you want to get started, a good starting place for measuring data downtime is to track the number of incidents you have, the number of data downtime incidents. Then look at time to detection, so how quickly did we identify those? And then time to resolution, so how quickly did we resolve those? Just starting to track these should give you very good insight into how you’re doing and whether you’re even progressing.
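
As a sketch of those three starter metrics, here is a small example that assumes you log each incident with when it started, when it was detected, and when it was resolved; the incident records below are illustrative.

```python
from datetime import datetime, timedelta

# Minimal sketch of the three starter metrics: number of incidents, time to
# detection (detected - started), and time to resolution (resolved - detected).

incidents = [
    {"started": datetime(2021, 1, 10, 2, 0), "detected": datetime(2021, 1, 10, 9, 30),
     "resolved": datetime(2021, 1, 11, 14, 0)},
    {"started": datetime(2021, 1, 18, 6, 0), "detected": datetime(2021, 1, 18, 6, 15),
     "resolved": datetime(2021, 1, 18, 11, 0)},
]

ttd = [i["detected"] - i["started"] for i in incidents]
ttr = [i["resolved"] - i["detected"] for i in incidents]

def avg(deltas):
    return sum(deltas, timedelta()) / len(deltas)

print(f"Incidents: {len(incidents)}")
print(f"Average time to detection: {avg(ttd)}")
print(f"Average time to resolution: {avg(ttr)}")
```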

[00:28:30] Let’s quickly wrap up with some key takeaways. When you think about your organization, there are a couple of questions to ask yourself. One, are we experiencing data downtime? Is this an important issue for us? If you’re not already experiencing it today, I can assure you that you will experience it soon, the more that you use and rely on data. The question is, what do you do about it? I think the way for us to solve this is to focus on an end-to-end approach to data observability. [00:29:00] That includes defining what good data is for you right now. It might not be a perfect definition, but we want to start somewhere.

You want to use the five pillars of data observability to start thinking about what the different things are that you need to instrument, monitor and measure. You want to augment that with metadata in order to make that information useful and powerful and to give it context. Then finally, you want to use lineage to actually assess the impact of data downtime. If something happens, but nobody’s using that particular data for modeling or for [00:29:30] a machine learning algorithm, then maybe it doesn’t matter. But if there’s a data downtime incident that is particularly important for your organization, and lots of people rely on that data for important use cases, you want to be on top of it.

Just to wrap up, as I mentioned, I think data downtime is the biggest challenge our industry is facing. It’s one that we need to address very seriously. The good news is that we can apply best practices of software engineering, and we can alleviate this pain for data ecosystems. [00:30:00] Thank you so much.

Lucio:

Excellent, thank you so much Barr, that was a wonderful presentation. I think we are right at the top of the hour, but we can dedicate a couple of minutes to a Q&A. I’m going to start reading from the earlier questions that came in.

Barr Moses:

Please.

Lucio:

There is one that says, “What are the criteria for good observability tools?”

Barr Moses:

Great question.

Lucio:

“And who benefits from observability?” [00:30:30] From the same person.

Barr Moses:

Yeah, so what are the criteria for a good observability tool? There are a few. I’ll highlight some that I already talked about and some new ones. I’d say the first one is that it has to be end to end, and it has to work with your existing stack. As you all know, data can break anywhere. I think in the old world we only had one point at which we ingested data, we had one place where we stored data, and so we just needed to make sure that the data coming in [00:31:00] was clean and that’s it.

But today, data can break anywhere: in your source data, your data lake, your data warehouse, your BI, your machine learning models. So a strong data observability solution actually works end to end with all of these systems. It should do the instrumentation and monitoring for your data at rest, so you don’t actually need to take the data out. And I’d say maybe the final thing is that it has to be a security-first architecture, right? Data is really at the core of your business, and you want to make sure that you’re working [00:31:30] with a solution that has high standards for security and auditability. Oh, Lucio, I can’t hear you.

Lucio:

There you are. Can you hear me now?

Barr Moses:

Yes.

Lucio:

All right. So I cued in [Padma Ayala 00:31:45].

Barr Moses:

Oh awesome, hi Padma.

Padma Ayala:

Hi. Sorry, I was trying to get the recording of this session. That’s why I [crosstalk 00:31:57] I don’t have any particular, I was just listening-

Lucio:

All right. Did you have any questions [00:32:00] for Barr?

Padma Ayala:

No, no, I’m good. I’m just listening [crosstalk 00:32:03].

Barr Moses:

[crosstalk 00:32:03] we’ll share the recording. Yeah, no problem.

Padma Ayala:

Thank you so much.

Lucio:

Thank you. All right, I think we’re right on time. Barr, thank you so much for the wonderful presentation, and thank you to the audience, too, for being around. I hope everybody continues enjoying the conference and has a wonderful day. Stay safe and healthy. Bye bye.

Barr Moses:

Thank you. See you later.