Eliminate Data Pipeline Downtime with Reliable Data Processing, Quality and Consistency

Session Abstract

Data pipelines are modern supply chains for digital information. When they break, business grinds to a halt. Avoid broken data pipelines with data observability capabilities that analyze information across compute, data, and pipeline layers to resolve issues that break production analytics and AI workloads. For example, at the compute layer, identify performance trends that warn of future outages. At the data layer, automatically detect anomalies in quality and consistency, both at rest and in motion, and use data drift monitoring to avoid impacting the accuracy of models.

Video Transcript

Mark Burnette:    I’m Director of Solution Engineering at Acceldata. My goal is to help organizations improve their data operations using our data observability platform to support enterprise analytics and AI. I’ll begin with just a quick introduction to data observability and the goals that we’re looking to achieve around that. Then we can dive into the capabilities and have a [00:00:30] quick product demo as well.

First off, we’ve seen this trend over time around digital transformation of shifting and applying more and more analytics in an operational context. The traditional role of data warehousing and analytics, supporting insights for management decisions, hasn’t gone away, but of course today more and more analytics is being driven into machine-to-machine processes and into applications to make them smarter. [00:01:00] Now, of course, application performance monitoring tools have really helped ensure that those applications run reliably and are performant, but the challenge for many organizations is that they have gaps when it comes to the data and analytics that are driving smart apps and smart processes. You might call this performance monitoring for data, and that’s really an area where we focus.

Another aspect of this shift is that things [00:01:30] are just a lot more complex. You have a lot more people being served: employees, customers, suppliers, and partners who all want to take advantage of digital transformation. They have many use cases, and addressing those often introduces a lot of new technology. There are lots of processes that pull data in near real time to serve those folks in their business activity and operations. Of course, that’s pulling in a lot of data, and a lot of variety, as [00:02:00] well. This isn’t really anything new, but as things get more and more complex it presents a number of challenges.

On the one hand, it can be hard to manage all of this complexity reliably, and yet reliability is even more critical because if things break then your business operations might break as well. The time and talent required to handle all of this tech and these use cases can be considerable, [00:02:30] and if not done efficiently it can get very, very expensive. Our goal is to use a data observability approach to help organizations be successful with these complex environments.

The goal is to mature their data operations to get a higher return on investment and to really address those challenges; to help organizations be more reliable in terms of dependable delivery of quality data, processing, and pipelines at scale, to eliminate [00:03:00] friction points at the design, development, and scale phases when you’re developing new solutions, and to be optimized to identify ways to get the right performance and capabilities at the lowest cost.

How do we do that? Well, first off, it’s a combination of monitoring, analytics, and automation. We monitor your data processing engines, your data stores, your data pipelines. We capture all sorts of [00:03:30] operational data around that: your performance metrics, metadata, utilization, and many others. We apply analytics to gain insights to help you meet your SLAs, improve data quality, get better price performance, and address a number of other use cases. You could think of it as operational intelligence for data.

This is a quick view into our product offerings on the left and really how we help serve those goals of being more reliable and stable, [00:04:00] being able to scale transformation and innovation, and being more efficient and optimized. We address three layers of the stack with three different products. First, at the bottom, there’s Pulse. Acceldata Pulse is aimed at improving your data processing. Torch is aimed at improving data management. Flow, which is currently in beta and getting rolled out at a couple of sites, is aimed at improving management of your data pipelines.

If we run through this at [00:04:30] a high level, when it comes to your data processing we help organizations transition from being reactive and fire-fighting to being proactive, using trend analysis to predict and prevent issues before they occur, and that’s really how you achieve reliability and stability. We help identify bottlenecks and other things that inhibit your ability to scale your data processing, whether it’s in R&D, developing new machine [00:05:00] learning models, or when you’re rolling into production at scale. Lastly, we identify areas to improve efficiency and to reduce the amount of resources consumed, and therefore your cost.

When we move up to the next level, we provide [00:05:15] insight and observability into the data itself. The goal is to improve on a variety of aspects around quality and reliability, to help make data easier to access, leverage, and scale, [00:05:30] and, again, to identify areas of opportunity to improve efficiency in data management, whether that’s eliminating redundancy or similar things that lower your costs and improve productivity. At the pipeline layer, we give you a view of an end-to-end pipeline and the ability to identify issues and then drill down into areas of concern, whether it’s on the data quality side, or reliability, or around the processing, et cetera.

We allow you to monitor the timing across the stages within [00:06:00] your pipeline to identify bottlenecks and improve throughput, and we again provide ways to identify the costs of your data pipelines and align them to the business benefits to help inform your strategy. There’s quite a bit of capability we bring here, but essentially we want to help you be more reliable, scale innovation, and be more efficient, and we have products that address that at three different layers of the stack: the processing layer, the data itself, and the end-to-end pipeline.

[00:06:30] Let’s drill in a little further around the data processing capabilities. First and foremost, to get from being reactive and fire-fighting to proactive and predicting and preventing incidents, we provide some trending analysis capabilities. The idea is we track the consumption or the timing of resources over time, and if things are taking longer to run or consuming more resources we can surface a variance [00:07:00] score that allows you to identify jobs that might be running successfully and within your SLA today, but if the trend continues you may run into trouble down the road. Getting ahead of that is something we can help with, with our trending analysis.
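
To make the variance-score idea concrete, here is a minimal sketch of how such a trend signal could be computed from a job’s run-duration history. It is an illustration only, not Acceldata’s actual scoring method; the windows, threshold, and durations are hypothetical.

```python
# Illustrative sketch: surface a "variance score" for a job by comparing its
# recent run durations against a historical baseline. Not Acceldata's
# implementation; windows, threshold, and durations are hypothetical.
from statistics import mean

def variance_score(durations_sec, baseline_window=14, recent_window=3):
    """Percent change of the recent average run time versus the baseline average."""
    if len(durations_sec) < baseline_window + recent_window:
        return 0.0  # not enough history to establish a trend
    baseline = mean(durations_sec[-(baseline_window + recent_window):-recent_window])
    recent = mean(durations_sec[-recent_window:])
    return (recent - baseline) / baseline * 100.0

# Example: a job that still succeeds and meets its SLA today,
# but whose recent runs are trending noticeably slower.
history = [300, 305, 298, 310, 302, 299, 307, 301, 304, 300,
           306, 303, 299, 305, 372, 385, 396]   # seconds per run
score = variance_score(history)
if score > 20:   # hypothetical alerting threshold
    print(f"Job is trending {score:.0f}% slower than its baseline; review before the SLA is breached")
```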

Then, the best alert, I’d say, is no alert at all, so we can monitor for conditions and then automate actions. We have a couple dozen auto actions that we can take, and it’s in a framework that allows you to extend [00:07:30] that. As conditions of concern emerge, we can do things like automatically perform cleanup, configuration, or provisioning to self-tune and self-heal over time. Then, when all else fails and you do have an incident, we provide a number of analytics to help you get to the root cause, whether you’re looking at resource contention, the overall environment, or a historical comparison of what’s changed. We give you deeper insights to [00:08:00] address and understand some of the more complex challenges.
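
As a minimal sketch of the condition-to-action idea, assuming a hypothetical condition (a nearly full temp disk) and a hypothetical remediation (cleaning up old Spark event logs), an auto-action loop can look like the following. It is illustrative only, not the product’s built-in auto actions.

```python
# Minimal condition -> automated action loop, illustrating "the best alert is
# no alert at all": detect a concerning condition and remediate it directly.
# The condition, threshold, path, and action are hypothetical placeholders.
import shutil
import subprocess

def disk_usage_pct(path="/tmp"):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

AUTO_ACTIONS = [
    # (condition callable, action callable, description)
    (lambda: disk_usage_pct("/tmp") > 85,
     lambda: subprocess.run(["find", "/tmp/spark-events", "-mtime", "+7", "-delete"]),
     "clean up Spark event logs older than 7 days"),
]

def run_auto_actions():
    for condition, action, description in AUTO_ACTIONS:
        if condition():
            print(f"Condition met, executing auto action: {description}")
            action()

run_auto_actions()
```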

As we transition from reliability to scale, the idea here is we provide three levels of depth in helping you to scale your solutions. The first is out of the box configuration recommendations. This helps identify whether you should add a little more memory, more compute or other types of configurations that can help you achieve better performance. That may be all you need. [00:08:30] For a deeper level of tuning, we provide a simulator. The simulator can tell you things like how fast will it run on a minimal configuration. If I want to meet an SLA of, say, under a minute or what have you, what should I configure to get to that? Or how fast can this run on the environment as configured, and where would adding more resources not really improve performance and just be a waste of money? That simulator really helps you dial in on getting the right performance for [00:09:00] your requirements without wasting money.

Then the third level of depth is around workload analysis. The goal here is to identify bottlenecks and unnecessary overhead, and to understand how execution occurs. This gives you insights into improving your job, your query, or your data engineering. Where the first two help you dial in on the right configuration, this third one gives you insights into improving the solution itself. The last aspect of data processing [00:09:30] is really about efficiency. Now, here the idea is being able to look across the entire environment at things like capacity, scheduling, and where you have bursty loads, things that help you determine what and where you should apply new solutions to balance load, to get the right hybrid strategy, and to inform your capacity planning and your chargeback. All of those sorts of aspects are informed by this high-level [00:10:00] view of your entire environment.

The next is around optimizing the data processing. For those same efficiencies that we talked about on the previous slide, we look across all the jobs that are running and flag those where there could be improvements around memory, bottlenecks, overhead, or IO, things along those lines, so you can deliver the same capabilities at a lower cost. Then there are aspects of data engineering, so looking at areas where you perhaps have data that’s not being used, [00:10:30] or redundancies, or hotspots, or other areas where you could take these insights and improve the way you’re engineering your data.

With that, let me give you a quick overview of some of the capabilities in the product to kind of bring these things to life. The first thing I’d like to call out with our data processing tool, Pulse, is that we cover a lot of different technologies. Some of the ones that are of greatest interest are things [00:11:00] like Spark and Kafka running in a variety of environments, but let’s drill into the Application Explorer. The idea behind the Application Explorer is that I may want to start by, of course, looking at any issues that I have with jobs failing.

From here, I’m a click away from getting at the logs that might indicate what the error is, but really to get better insight I want to look at the history to see when this has failed before [00:11:30] or what the resource consumption has been over time. I might want to compare different runs to see what the difference is between them, and I may want to look at this in the context of the overall environment. Here I can go in and see, from various perspectives, how the resources are being allocated and whether there is contention or strain in the overall system that could be affecting this particular job. These are just a few examples of [00:12:00] insights to get to root cause that are challenging for many organizations.

Really, I want to shift from being reactive to being proactive, so I’m going to actually filter on jobs that are running successfully today and take a look at our variance score. In this case, I can tell, perhaps over the last week or so, which jobs are taking longer to run. Maybe this job is taking 27% longer to run than it did a week ago, and if that trend continues I’m going to have a problem. [00:12:30] From here, I can do a deeper level of analysis. The first thing I can do is take a look at configurations. We provide a recommendation engine that provides configuration recommendations out of the box. That might be all I need to tune this particular job to get it back on track.

If I want to go further, I can take a look at the simulator. In the simulator, I can tell things like how long this takes to run at different configurations [00:13:00] or different allocations of resources. I can tell where adding more resources doesn’t create much improvement in performance and would be over-provisioning, and where I need to provision to meet whatever target I have for this particular job. If I want to go even further, I can get additional insights into this job, so we flag inefficiencies, jobs that perhaps have stages that spin off lots of sub-tasks or that have a high overhead. [00:13:30] I may want to take a look at whether certain parts of my code are running in parallel, and I can see that they’re getting distributed and they’re not single-threaded, or I may want to look at the timing of different parts or stages within my job to see where the bottlenecks are. These are just a few examples of ways I can get deeper insight into performance quickly and easily so I can optimize and get the throughput that I need.

The last thing I’ll call out is, [00:14:00] we do provide recommendations that allow you to identify which of the jobs, if you have many, many jobs running, would be targets for optimization and how you could drill in to perform those optimizations. At a high level, that’s a quick run-through of some features around how we can help you identify issues, predict and prevent issues, as well as scale up new solutions and be more efficient in the process.

Let me pivot here and touch a little bit on [00:14:30] the capabilities to improve data management. First off, there’s the notion of data reliability. Data reliability includes what I would call some of the classic data quality challenges: missing data, data out of range or in an incorrect format, et cetera. We provide capabilities out of the box for all of those things. We can go beyond that by allowing you to define your own custom patterns, your own business rules, et cetera, to really address [00:15:00] your specific requirements. Beyond classic data quality, there’s the fact that data is in motion, especially in real-time use cases, and with so much more technology, data is constantly moving from point A to point B. We provide an easy way to do reconciliation, and that is to ensure that data gets from point A to point B as expected, even if some transformation might occur along the way.
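
As an illustration of what a source-to-target reconciliation check can look like in practice, here is a minimal PySpark sketch that compares row counts and an aggregate of a business measure. It is a generic pattern under assumed table names (raw.orders, analytics.orders), not the product’s actual reconciliation mechanism.

```python
# Hypothetical sketch of a source-to-target reconciliation check in PySpark.
# Table names and the tolerance are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reconciliation-sketch").getOrCreate()

source = spark.table("raw.orders")          # point A (placeholder)
target = spark.table("analytics.orders")    # point B (placeholder)

# 1. Row-count parity: did every record arrive?
src_count, tgt_count = source.count(), target.count()

# 2. Aggregate parity on a key business measure, which still holds even if
#    individual rows were reshaped or enriched in transit.
src_total = source.agg(F.sum("order_amount")).first()[0]
tgt_total = target.agg(F.sum("order_amount")).first()[0]

if src_count != tgt_count or abs(src_total - tgt_total) > 0.01:
    print(f"Reconciliation failed: counts {src_count} vs {tgt_count}, "
          f"totals {src_total} vs {tgt_total}")
else:
    print("Source and target reconcile")
```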

Then the last area is really about drift, or really change [00:15:30] and change management. There are kind of two flavors here; one is schema drift. The idea is, did someone change a source system, and would that potentially break a data pipeline or a process? You want to be able to detect that as early as possible, perhaps in your dev or QA phase, so that you can get ahead of it and avoid the business impact. Then the other one is data drift. This really has to do with spotting trends in the data itself. [00:16:00] You may see a new pattern emerge from changes in the world, and it might meet all of your data quality rules. The data might arrive as expected, there’s no structural change, but you may want to have visibility or awareness of those types of trends.

It could be an indication there’s a problem with the data. It could be an indication of a trend that a business user would simply want to see or know about, or it could be an indication that it’s time to [00:16:30] re-tune, re-train, or test your AI and machine learning models so that they can continue to be accurate, given new events or new trends that are showing up in the data. At a high level, being able to address classic data quality issues more efficiently, being able to ensure data in motion arrives as expected, and being able to handle different types of change, whether it’s structural change or trends, is something that we bundle under the higher-level term [00:17:00] of data reliability.
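
To illustrate the data drift idea, here is a small, generic sketch that compares a recent window of values against a baseline window using the Population Stability Index, a common drift measure. This is not Acceldata’s specific algorithm; the column, windows, and threshold are example assumptions.

```python
# Generic data drift check: compare a recent window of a column's values against
# a baseline window using the Population Stability Index (PSI). Illustrative
# only; the data, bins, and threshold are example assumptions.
import numpy as np

def population_stability_index(baseline, recent, bins=10):
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b_counts, _ = np.histogram(baseline, bins=edges)
    r_counts, _ = np.histogram(recent, bins=edges)
    b_pct = np.clip(b_counts / b_counts.sum(), 1e-6, None)
    r_pct = np.clip(r_counts / r_counts.sum(), 1e-6, None)
    return float(np.sum((r_pct - b_pct) * np.log(r_pct / b_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(100, 10, 5000)   # e.g. last month's transaction amounts
recent = rng.normal(115, 12, 1000)     # this week's values: distribution has shifted

psi = population_stability_index(baseline, recent)
if psi > 0.2:  # a commonly used rule of thumb for significant drift
    print(f"Data drift detected (PSI={psi:.2f}); consider re-testing or re-training downstream models")
```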

Now, under the hood we do have a catalog that’s used to inventory these assets. In some cases, we’ll tie into customers’ existing catalogs. In other cases, this provides a great way for organizations to manage their data asset inventory and to collaborate with those who are consuming that data. Lastly, there’s the notion of data economy. In many cases, organizations find themselves with dark data, unused data, [00:17:30] data that’s redundant, or processes that are redundant. Being able to identify these things can help from an expense perspective as well as a usability perspective, because dark, redundant, and cold data are expensive, they’re in the way, and they’re not yielding much in terms of value. We provide capabilities to look across your environment to help you spot these things to improve efficiency and usability and lower your costs.

One thing I’d call [00:18:00] out is that in many respects customers come to us and they describe something that is a bit of a death by a thousand cuts. There’s so many things that can go wrong, so many aspects of governance, and so much data, that it’s very difficult, even with great tools, because there’s so many steps involved. We apply a number of things, a number of machine learning algorithms and automation capabilities to help reduce that level of effort so that you can cover more ground, [00:18:30] more accurately, and overall improve data operations.

The first is around the ability to identify and tag your information. This might be just, first off, identifying what it is, but then also looking at relationships, identifying similar data assets or assets that relate to each other. It helps both from a usability perspective and from a management perspective. As we profile, identify, and relate these assets, we can then apply [00:19:00] data quality rule recommendations. This provides several advantages. First off, there are always going to be rules that are custom to your organization, but the vast majority of rules should be things that you can apply quickly, without having to pore through your data field by field, table by table.

We surface those standard recommendations to allow you to do that quickly, and easily, and broadly across your data assets as well as with greater precision [00:19:30] by being able to profile your information and determine what the right thresholds are for validating your data. Then when it comes to process automation and management, we help organize multiple rules into policies. We allow you to call out to Acceldata to perform those validations, and then we can feed the results back to your overall workflow so that you can focus on your engineering efforts where needed and then provide a simpler approach using data observability [00:20:00] when it comes to validating that engineering. Because we’re built on Spark we’re highly scalable and we also allow you to tap into all of the capabilities that come with Spark, whether you’re looking to do sophisticated things with machine learning or streaming or what have you.
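
Since the scanning is built on Spark, as mentioned, a handful of declarative rules can be evaluated against a full table in a distributed way. The sketch below is a hedged illustration of that pattern in PySpark, not the Torch rule API; the table name, rules, and scoring are hypothetical.

```python
# Hedged sketch: evaluating a few declarative data quality rules at scale with
# PySpark. The table and rules are hypothetical examples, not the Torch rule API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-rules-sketch").getOrCreate()
df = spark.table("analytics.customers")   # placeholder data asset

# Each rule is a name plus a boolean column expression that flags violations.
rules = {
    "email_not_null":   F.col("email").isNull(),
    "age_in_range":     ~F.col("age").between(0, 120),
    "zip_format_valid": ~F.col("zip_code").rlike(r"^\d{5}$"),
}

total = df.count()
for name, violation in rules.items():
    bad = df.filter(violation).count()
    score = 100.0 * (total - bad) / total if total else 100.0
    print(f"{name}: {score:.1f}% passing ({bad} violations)")
```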

We provide some incident management capabilities but also the ability to tie into existing ticketing systems. Let me quickly show you a few examples of this within the product to give [00:20:30] you a sense of how it works. Here, for example, I’ve got an asset, a table that I’ve inventoried. This could be an asset from a streaming source, from a dashboard, or from a variety of different tools. We give you quick visibility into what the data looks like. Then from there you can begin to apply policies. Whether it’s schema drift monitoring, data drift monitoring, or data quality, policies can be added, [00:21:00] and here we provide recommendations.

These recommendations can be added with just a few clicks, and this is based on our machine learning recommendation engine. Of course, you can add your own rules as needed and even do things like check that you’re receiving rows of data within expected ranges. We can even do validation checks that, from the source to the target, the data is arriving as expected. Of course, you can define how you want to incrementally scan for these things [00:21:30] and how you configure scheduling or workflow automation. We have a number of channels for communicating notifications, email, Slack, Teams, and many others, and the ability to hook into other services for downstream processing.

This is a quick example of how you can click your way through to applying data quality. We have the ability to import rules so if you have data assets that are repeated [00:22:00] you have the ability to call out to other rules that have been created and re-use them without having to go through the process over and over again. We do provide the ability to monitor all of this at a high level so you can view all of your data quality policies or other reconciliation policies. You can see the trend of how you’ve been scoring over time. You can drill down on a specific execution. You can look at the specific rules and see the metrics, [00:22:30] drill down on a particular rule, see where there’s been violations.

We can, of course, mask or hide the details depending on your role, but this makes it very easy to manage data quality at scale across your environment without having to look inside each and every ETL job, for example. That’s just a quick overview of some of the capabilities we bring around improving data governance and data management. [00:23:00] From here, let me call out a couple of things that we can do to support you. First of all, we have a range of professional services, product support, consulting engagements, and managed services, to give you whatever level of support is appropriate for you. I’d like to call out a couple of examples of where we’ve helped customers improve on these metrics.

First off, at GE we helped lower their costs. They were experiencing [00:23:30] poor performance and system instability. We helped them troubleshoot those issues and optimize their environment, and that actually lowered their infrastructure costs, the resources they needed for their environment, by 40%. At True Digital, they had a problem with being able to scale. They were only able to process 50% of the data that they were ingesting, and they were looking at a very expensive upgrade to double their infrastructure. [00:24:00] We were able to identify inefficiencies and configuration recommendations that allowed them to double their performance on their existing infrastructure, letting them process all of their data without having to spend more money on infrastructure.

At PhonePe, which is a Walmart company that provides an app-based payment solution, they were able to improve their reliability [00:24:30] as well as scale their solution. When we first engaged with them, 60% of their talented engineers were spending their time fire-fighting instead of innovating and expanding the organization’s solution. We were able to not only get to the root cause but also apply those predictive trending analysis capabilities so that they could get ahead of issues before they occur. They’ve now gone over 12 months and counting without a single [00:25:00] issue. These are just a few examples of how data observability is lowering costs and improving performance and reliability at scale.

Just to recap, a data observability approach allows organizations to get a better return on their data investments. This is really aimed at three primary areas: improving reliability in terms of dependable delivery of quality data, [00:25:30] processing, and pipelines at scale; eliminating friction points when it comes to designing, developing, and scaling new solutions; and implementing efficiencies and optimizations to allow you to get the right performance and capabilities at the lowest cost.

With that, I’d love to open it up to some questions. We’ve got a few minutes left and I’m very curious to get your feedback and address any questions that you might have.

Speaker 2:    [00:26:00] Yeah. Great session, Mark, that was super awesome, amazing. Love the interaction and kind of getting into the product. We do have one question in the chat right now and I’ve also promoted one person who had a live question but I’m waiting for that person to join. Julia asks, can you speak specifically about monitoring and optimizing within hybrid distributed environments?

Mark Burnette:    That’s right, yeah. One of the challenges is around [00:26:30] having to pull the data into a separate environment in order to perform your quality checks. One of the things I would call out with our architecture is that we’re able to centralize the metadata, and that’s a very small amount of data. When it comes to monitoring things that you might have on Azure or Google or Amazon or on-prem, we can leverage those environments [00:27:00] to actually perform the scans. For example, all of our scanning is done through Spark, so we can deploy the actual Spark jobs within Google or Azure or Amazon or on-premises to be close to the data and to be able to do that at scale. Then we can pass the results back to our centralized repository, which is a small amount of data.

Now, the second aspect of being able to [00:27:30] handle things in a distributed environment is that, regardless of what environment you’re operating in, because all of that metadata is centralized you don’t need to manage a separate solution in each of those separate environments. The last thing I would say is, in many of those environments you have very large amounts of data and, as a result, the Spark architecture allows you to do distributed processing within that environment [00:28:00] to handle as much data as you need to scan, and to do that as quickly as necessary to meet your SLAs, so that the scans, the quality checks, and the validation are done quickly and get you further down the pipeline.
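
As a hedged illustration of the pattern Mark describes, scanning close to the data and shipping only small summary metrics back to a central repository, here is a minimal PySpark sketch. The table, metrics, and endpoint are hypothetical placeholders, not Acceldata’s actual interfaces.

```python
# Sketch of the pattern above: run the scan as a Spark job inside the data's
# own environment, then send back only small summary metrics (metadata) to a
# central repository. Table, metrics, and endpoint are hypothetical placeholders.
import json
import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("local-scan-sketch").getOrCreate()
df = spark.table("lake.events")   # scanned in place, close to the data

summary = {
    "asset": "lake.events",
    "row_count": df.count(),
    "null_user_ids": df.filter(F.col("user_id").isNull()).count(),
    "max_event_time": str(df.agg(F.max("event_time")).first()[0]),
}

# Only this small JSON payload leaves the environment, not the data itself.
req = urllib.request.Request(
    "https://observability.example.com/api/scan-results",   # placeholder endpoint
    data=json.dumps(summary).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```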

Speaker 2:    Great. Jeff asks a question, do your three products work together or do they work independently?

Mark Burnette:    Yeah, so these products are designed to work together, but many customers will purchase them independently. An example of where [00:28:30] two products could help you, working together, is if you have a complex data environment. Let’s face it, many organizations have a data swamp. To address that data swamp, you may use the Pulse product to identify where you have large volumes of cold data or inactive data. Then you can use the Torch product to look inside that data and determine, from a business perspective, what’s the best way to handle it: to tear it out, to archive it, or to keep it online. When [00:29:00] it comes to cleaning up the data swamp, there’s an infrastructure aspect and a data aspect that together allow you to improve the data lake.

Speaker 2:    Great. Well, this has been awesome. Mark, thank you so much for your support, for your session, it was very informative. We don’t have any more time for questions so if you didn’t get to ask your question there’s still an opportunity to. You can join our Slack community, which is the Subsurface workspace, and search [00:29:30] for Mark’s name. You can ask him a question there or you can send him a direct message through the Subsurface conference platform.

I do ask that everyone take a minute and fill out the session survey in the Slido tab at the top right of your screen. Again, the [inaudible 00:29:48] wall is open as well so please check out all of our partners, all of the sponsors at Subsurface, and all the amazing things that they’re doing as well. With that being said, thank you so much everyone for attending. Mark, thank you so much for the great presentation and have a wonderful [00:30:00] rest of your Subsurface.

Mark Burnette:    Thank you.