March 1, 2023

10:10 am - 10:35 am PST

How to Strengthen DataOps with Continuous Data Observability

It takes a lot of discipline to achieve a sublime DataOps practice. DataOps teams have a lot on their plate to collaborate across their organization to drive agility, speed, and new data initiatives at scale. But what happens when you deliver bad-quality data at high speed? Worse, what if you don’t know you’re delivering bad data? That’s where data observability comes in to help. Join this session to learn how adding data observability to your DataOps practices can help you deliver trusted data at scale. In this session, you will learn:

  • Common challenges most data engineers face with data quality
  • How data pipeline observability strengthens DataOps practices
  • Live demo of Databand in action with tools like Apache Airflow, dbt, & Spark

Sign up to watch all Subsurface 2023 sessions


Note: This transcript was created using speech recognition software. It may contain errors.

Ryan Yackel:

Thanks for joining this session today. We’re gonna be talking about DataOps and data observability, and we’re gonna be going over some demos of how observability can play into an example DataOps workflow. It may not be the exact one you’re dealing with today, but we’re gonna go through that. But before we get going, I also want to do some polls. Ariel’s got some polls lined up, I think. The first poll is: what is your top data quality problem within your warehouse? Is this an issue for you all, especially if you’re using dbt for any type of transformation in your warehouse? Just give a little thought to what’s actually one of your biggest problems, because we’ll be talking about some data quality problems today as well. So we’ll give a little bit of time for people to answer that.

And we have another poll too, coming up. Let’s see about that one as well. All right, let’s see here. Okay, well, it doesn’t look like anyone’s answered. Ariel, if you wanna go to the next poll, or see if there are any results here yet. I don’t know if we have any results yet. Okay, so let’s maybe go to this poll. Let’s see what we have here. So on average, how long does it take you to resolve data quality issues once they’re detected? Is it one to four hours? Is it four to eight? Is it more than eight hours? Is it around one day, two days, weeks? How long does it actually take you to resolve data quality issues when they’re detected?

Maybe people are coming from different sessions to this one. Okay, so it looks like we have a shy audience here. We’ll go ahead and stop doing the polls as we get going. So let me go ahead and jump into our presentation today. I’ll go ahead and share my entire screen and we’ll jump into the presentation. So, as Ariel kind of mentioned, my name’s Ryan. I’m a product leader over at IBM Databand, specifically under the data and AI space with IBM Databand for observability. I’m kind of a startup junkie. I was at a couple of companies that you see here on the side, from the DevOps space all the way to the cybersecurity space. I’ve got one wife, three girls, and one dog. And I love coffee and superheroes. So if you’re ever in Atlanta, Georgia, hit me up. Happy to hang out with you.

So I’m gonna go over a quick overview of what DataOps is versus DevOps. We’ll talk about some data quality challenges we’re seeing in the space. I’ll get into DataOps plus observability and what we can see play out in an actual scenario. Then we’ll get into an actual product demo, which will be sort of the main event for today. And then we’ll end with some Q&A.

So when people ask questions, they usually get to this part: what is DataOps, what is data observability, what is DevOps, where does all this come together? It’s kind of confusing, right? Well, it’s not that confusing if you look at what DevOps is. Some of you on the call today are probably either in a DataOps or DevOps environment, or you have a combination of software engineering skills that you’ve transferred over to the data space. But really, what you’re going for in the DevOps space is application and software development.

And that’s a no-brainer. But it’s really about this development and testing that’s across this value chain, and making sure you are delivering the applications you have in a continuous way, at a pace where you’re not also destroying the quality of the product when it’s actually delivered. And when it comes to the observing, or the learnings that you get from DevOps, it’s really around learning and monitoring within the space of software-driven innovation. If you take that same thing and apply it to DataOps, it’s kind of similar. What you’re doing is not just application software development. You’re looking at producing high-quality data that’s readily available for you to use, fast. And there’s another word there that we’ll be talking about today, which is continuous.

This is really about continuous learnings by monitoring and optimizing data pipelines. And that’s important because if you look at the full DataOps information architecture, the main goal is really to deliver this trusted, high-quality data. There are a lot of things in here we could talk about: governance, data storage, infrastructure, version control. I don’t have time for that today because I’ve only got like 20 minutes. We’re really gonna be talking about data quality and observability, specifically in this testing and monitoring space. So why is this a problem? Well, if you look at what people are saying, these are all from executives. Capgemini did a report, and Corinium Intelligence also did a report back in 2021. This problem of data quality is really prevalent within a DataOps workflow. 80% say that they have data quality concerns that represent barriers to entry for data products.

And then 70% said that they’re unhappy with their data quality. This is a problem within the actual space. So why is this so complex? Well, DataOps is a complex assembly line. If you look at going from your sources all the way through your data lakes and data warehouses, and then to the access layer, there are a lot of people involved with this, from data engineering teams to analyst teams, and that’s being consumed by a business user, right? So all this is going on, and that would be fine if it were a very simple process, but it’s not. You have a constant assembly line, you’d say, of orchestration, transformation, development, and testing that hits all these different areas. And when data breaks, it really comes out of nowhere. It hits you like this a lot of times: oh, there was a schema problem.

Oh, there was a pipeline problem. All of that leads us to go, okay, you know what? We really need to get a handle on this and monitor this DataOps life cycle, to make sure that when there is an issue with our Airflow jobs or dbt jobs, or an issue in Snowflake or Dremio, or an issue at the access layer, we’re able to catch that issue right away, because it’s very complex. And even though it’s a complex issue, that doesn’t mean we can’t have some reporting and analysis baked into it to make it a lot easier. So what does all this lead to? Data observability. Well, when you think about what data observability is: without it, it’s kind of this reactive approach to data issues.

You have issues that are found and resolved way too late in the process. You have this explosion of noise that can happen, where people are telling you, hey, there’s an issue, but we don’t actually know where it is. A lot of times issues are reported by data consumers or, worse, never really discovered. So this is a problem: without data observability, this DataOps process is very, very reactive. And recently there was a report that came out around data observability and how it strengthens DataOps. If you look at this report from Eckerson Group, it really boils down to two areas: data quality and data pipeline quality. Data quality is looking at the datasets and tables where the data is landing, and data pipeline quality is looking at the data as it’s actually in transit. And the whole point of this is really to have a layer of observability and monitoring over this DataOps process, so that you can basically catch bad things before they impact your business.

And that’s exactly what data observability is meant to do. So you move from this reactive approach to this continuous detection and resolution of issues, by being able to notify on issues in real time, by collecting standard metadata, and by getting visibility into all of the data, either in motion or at rest. So that as the end result, you’re able to detect earlier, resolve faster, and deliver trustworthy data to your business. And that’s really where Databand comes in. That’s what we’re gonna be talking about today, and how this actually helps you deliver more reliable and trusted data with a continuous approach. And I always like to explain it this way before I actually get to the demo: the way Databand works is it’s basically a continuous monitoring solution within your data pipeline.

It’s very similar to how things go in Formula One. The whole goal in Formula One is to drive that car as fast as you can and also not crash it, to win the race. Right? That’s pretty much a no-brainer. That’s what you’re supposed to be able to do, right? Well, in Formula One, you have all these different alerting systems going on within the car that radio back to the cockpit and tell the driver and the crew: hey, you need to slow down. Hey, you need to take that curve a little tighter next time. Hey, you’re running low on fuel. Hey, you need to pull over, we need to fix your tires. All these things are continuously going on within that car and alerting when there’s an issue. So they know exactly when they need to slow down and fix something.

And that’s exactly what Databand is doing. It’s similar to that Formula One racing crew. We want engineering teams and data teams to go as fast as they can, but by connecting into your different platforms, we’ll be able to tell you exactly when an issue occurs and where you need to go fix it, and then you have more confidence in the actual data quality in your system. So, how Databand works: we automatically collect your metadata. So if you’re using something like dbt or Airflow for orchestration or transformation, or you’re sending stuff over into Dremio, or maybe you’re using Snowflake right now, we collect all that data and then we profile it against a historical baseline to say, hey, you know what? You have some issues with your pipeline; it shouldn’t be acting this way.
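As a rough illustration of the profiling step described above, the sketch below compares one run metric against its historical baseline. It is not Databand’s actual method: the z-score rule, the `is_anomalous` function, the `z_threshold` parameter, and the sample durations are all assumptions made for this example.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Return True when `current` deviates strongly from the historical baseline.

    history: past values of one run metric (for example, duration in seconds).
    z_threshold: hypothetical cutoff, measured in standard deviations.
    """
    if len(history) < 2:
        return False  # not enough history to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past run was identical, so any difference is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > z_threshold


# Hypothetical durations (seconds) of previous runs of one pipeline.
past_durations = [310, 295, 305, 300, 290]
print(is_anomalous(past_durations, 302))  # close to the usual duration
print(is_anomalous(past_durations, 900))  # far outside the baseline
```

A real system would likely track many metrics per run (row counts, null rates, schema) and use a more robust baseline, but the shape of the check is the same: collect, compare to history, flag the outlier.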

We alert you on those anomalies, and then at the end of the day we tell you the root cause and where to go fix it. So that’s basically what’s going on. And what I’m gonna show in the demo today is an Airflow job that’s gonna be interacting with a dbt job that’s also moving data in a Redshift environment. So these are the issues we’re gonna be looking at today, and I’ll show you the alerting and resolution capabilities of Databand in action. So let’s go ahead and hop over to the demo. Let me exit out here. Okay, so this is Databand, and what you’ll notice is this is a job that I have lined up here. I’m not gonna go into this right now because I’m gonna come back to it after I look at some of the alerts we have in here.

So like I was saying, if you have a very complex environment with lots of different tools, maybe, like we talked about earlier, you’ve got Airflow set up, you’ve got dbt, you may have other coded ETL pipeline jobs, there’s a lot of stuff going on within your modern data stack, right? Well, in Databand, after you’ve collected that metadata, you can actually go in here and add all these different alert types that are gonna tell you when something goes wrong. So over here I’ve got 18 alerts that are set up. Some are data quality alerts, some are based off the dbt test alert that we’ll be going into. Some are run duration alerts. So for example, if your pipeline was running longer than it should have, or if it never actually reached the warehouse by a certain time, you can set up all these alerts within Databand and associate them with receivers like Slack or email, so you can alert people right away when there is an issue.
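The two examples just mentioned, a run that takes longer than it should and data that never reaches the warehouse by a certain time, can be sketched as simple rules that route to a receiver. This is only an illustration under stated assumptions: `RunRecord`, `Alert`, `check_run`, and the receiver string are hypothetical names, not Databand’s API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class RunRecord:
    pipeline: str
    started_at: datetime
    finished_at: Optional[datetime]  # None while the run has not landed yet


@dataclass
class Alert:
    receiver: str  # hypothetical target, e.g. a Slack channel or email address
    message: str


def check_run(run: RunRecord, now: datetime, max_duration: timedelta,
              deadline: datetime, receiver: str) -> List[Alert]:
    """Evaluate a run-duration rule and a delivery-deadline rule for one run."""
    alerts = []
    end = run.finished_at or now
    if end - run.started_at > max_duration:
        alerts.append(Alert(receiver, f"{run.pipeline}: ran longer than {max_duration}"))
    if run.finished_at is None and now > deadline:
        alerts.append(Alert(receiver, f"{run.pipeline}: data not delivered by {deadline:%H:%M}"))
    return alerts


# Hypothetical run that started at 02:00 and still has not finished at 06:30.
stuck_run = RunRecord("daily_orders", datetime(2023, 3, 1, 2, 0), None)
stuck_alerts = check_run(
    stuck_run,
    now=datetime(2023, 3, 1, 6, 30),
    max_duration=timedelta(hours=1),
    deadline=datetime(2023, 3, 1, 6, 0),
    receiver="slack:#data-alerts",
)
for alert in stuck_alerts:
    print(alert.receiver, "|", alert.message)
```

Keeping the receiver on each rule is what lets alerts go only to the people who care about that pipeline, which is the noise reduction the talk goes on to describe.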

So you can remove all the noise in your process today of being alerted on stuff you don’t care about, and isolate it to the areas you really, really care about. So just a quick overview: these are all alerts you can set up within Databand. We also have views into all your different pipelines and runs of those pipelines to tell you what’s failing, what’s succeeding, and all the alerts associated with those things. So that’s a quick high-level overview of the alerting capabilities, and of how we can pull all this metadata into Databand to alert you on any data incident that’s going on within your environment. So now let’s go back to the example that we have today. In this example, this is the DAG view that I’m pulling in here.

And what you’re seeing is an Airflow job that’s connected into a dbt process. So again, Airflow is one of the most popular orchestration tools out there today, and most commonly we have customers that use it with dbt. So what’s going on here is that we have an Airflow job that’s gonna be kicking off a dbt job. And in that dbt job, you have basically two things going on: you have runs and you also have tests. So if I’m an analytics engineer or data engineer, I’m reliant upon making sure that an orchestration job is kicked off, the model runs successfully, and then we have tests that are going to kick off after that, so that as the end result, I can actually deliver a report over into Redshift.

I wanna make sure that all this is going well, and that I don’t just blindly commit this over to a consumer who’s going to be using the report. So in here, what you can see is we’re pulling in all of the data interactions and the code that’s behind this, coming from dbt. This is great because if you look over in dbt, a lot of times it’s a little bit harder to understand exactly where certain issues are occurring. We can see right away that we have a successful run that was executed as part of the actual model here. But if you go to the tests, what you can see is that we have a few tests that have run, but then we have some failures.

And what’s great about this is I can click on each one of these tests and see the SQL that’s behind it. So I can see that here’s a test that was making sure a unique key was not null. And then we also have a bunch of other tests being executed outside of that, to make sure that the model being produced is actually the model we want pushed over into that report. I can also see the schema behind this. And if anything went wrong with the schema, I could set up alerts around that schema to tell me if, let’s say, a column was removed or added, or something else changed within that schema. I could have alerts set up for that as well.
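The schema alert idea above, detecting that a column was removed or added, comes down to comparing two snapshots of a table’s columns. Here is a minimal sketch, assuming each snapshot is a plain mapping of column name to type; `diff_schema` and the sample columns are hypothetical and do not reflect how Databand stores schemas.

```python
def diff_schema(previous, current):
    """Compare two {column_name: type} mappings and describe what changed."""
    changes = []
    for column in sorted(previous.keys() - current.keys()):
        changes.append(f"removed column: {column}")
    for column in sorted(current.keys() - previous.keys()):
        changes.append(f"added column: {column}")
    for column in sorted(previous.keys() & current.keys()):
        if previous[column] != current[column]:
            changes.append(
                f"type changed: {column} {previous[column]} -> {current[column]}"
            )
    return changes


# Hypothetical snapshots of one model's schema from two consecutive runs.
yesterday = {"order_id": "integer", "amount": "numeric", "status": "varchar"}
today = {"order_id": "integer", "amount": "varchar", "created_at": "timestamp"}

schema_changes = diff_schema(yesterday, today)
for change in schema_changes:
    print(change)
```

An empty result means the schema is unchanged; any non-empty result is something an alert could fire on.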

And then what you’ll also notice is, if I look over in the log tab here, sorry, let me go to these guys, right there, you can see that we have all the same logs that we have over in our dbt Cloud environment. So the same view that you see in dbt, which is right over here, this is our example that we have set up, is the exact same information you’re gonna see over in Databand. So if I’m a data engineer or an analytics engineer, I can see exactly where the problem is in my dbt environment from within Databand. And I’m also seeing it alongside all of my other orchestration jobs like Airflow, which is great. So going back over here, let me go back to this dbt job process.

I can also see that for this test here that was actually broken, it also captures all the interactions. So here I can see that there was a null test associated with this particular column, or sorry, this particular model, and the test failed. I can see exactly what the warehouse was, which is Redshift, and I can also see the inputs and outputs. So all the information is right there for me to see. And all of this was tied to an alert that we had set up within dbt. So this is basically the overall view of what’s going on in our process. You can see it here, debug it, see the SQL behind it, and quickly resolve it. But all of this is basically useless without getting an alert.

So if I go back over to the alert definition screen and search by, let’s say, dbt test, I can see that same alert that we just talked about was fired at a certain time, on a certain date. And it was actually sent over to the exact person that I want to take a look at it. So this is more of an alert overview screen that people will at times be brought to right away. You get an alert that your test failed, and you can jump right in here. You can see the schema that’s behind it, just like we showed earlier. I can see all the code that’s behind this particular dbt job. And I can also see the test query that’s here as well.

So that same information I just showed you on the other screen is also here. We can see the query, we can see the error message, and then again, those same logs down below. This is all really powerful information, because rather than jumping around different systems, within Databand I’m able to see everything that I want to know about this particular dbt job process, the runs and also the tests, all together in one spot, and also aligned back to the orchestration job within Airflow. So this is a really awesome viewpoint where you can see all of that. And getting back to the example we gave of a DataOps workflow, all of the data quality monitoring and testing that you’re doing is all together in one place now. So you don’t have to jump back and forth between different systems to understand exactly what’s going on and figure out how to resolve it.

So that’s a quick demonstration that I walked through just to show you all. If you want more information or want to see more video demonstrations with different technologies, go on over to Databand and then go to the demo center over here. You can see a bunch of different videos that we have on integrating with dbt, Airflow, Redshift, Snowflake, all the tools you may be interested in today. That’s all over here, where you can go and see exactly how Databand is set up.