May 2, 2024

Unleashing Data Agility with Virtual Data Marts and ZeroETL: The End of ETL as We Know It

Traditional ETL processes bog down efficiency with complexity and cost. Learn how Dremio’s Virtual Data Marts simplify data pipeline creation and management. We’ll examine the challenges and costs associated with excessive data movement and talk about how Dremio’s approach to data virtualization and its innovative Reflections query acceleration capabilities can make performant, self-service analytics a reality. Learn how you can minimize data movement through a broad connector ecosystem to diverse data sources. We will also review the strategy behind Virtual Data Marts and deep dive into how an intuitive semantic layer lets users access curated data through logical view layers for easy-to-understand analytic access.

Topics Covered

DataOps and ELT/ETL
Governance and Management
Lakehouse Analytics
Performance and Cost Optimization

Sign up to watch all Subsurface 2024 sessions


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Alex Merced:

Hey, everybody. Be excited. Be, be excited. OK. That’s always like, you know, I like doing that chant, but when I think about it, it’s like, wait, that’s like the wrong movie to reference when you want to get people happy. It’s a Requiem for a Dream reference for those who haven’t seen that movie. It’s very depressing. But anyways, what I’m going to be talking about today is unleashing data agility with Virtual Data Marts and Zero ETL, the end of ETL as we know it. Bold words, but let’s talk about it. Let’s explore what I mean by that.

So bottom line, let’s take a look at the world of today. So you’ve probably seen a version of this chart in the keynote at the beginning of the day, where basically, hey, we’ve got all our data sources. And then we have all these data pipelines that– and again, it’s not just like the pipeline itself. It’s the fact that I have to write the code. I got to test it out. Then I got to deploy it. And then every time I change it, I got to test it again, redeploy it. Oh, that takes time. Takes compute, whatnot. And then again, it’s not just one pipeline. Oftentimes, this pipeline’s dependent on this pipeline. We’ve all written those DAGs to make sure that everything runs in an orderly fashion. And it gets a lot of fun. But that’s just half the story. We just land the data in the data lake. And then we do the whole story again into our data warehouse. And then even then, we still have to deliver that data. And again, we have all sorts of complex things that go on within our tools, whether we’re creating BI extracts in our favorite BI tool and doing other sort of machinations in order to make everything work fast enough. And then even then, we’re still serving some of the data from our data lake to our clients, not just from the data warehouse. So this all ends up taking a lot of time. And then every time, like, hey, one of your consumers says, hey, I need an extra column, or can you rename this column? Well, now I got to go back and take a look at this big chain of pipelines and figure out, hey, where are all the places I need to make a change, test it out, redeploy it, and all that stuff. So it slows your time to insight, because your consumers aren’t getting the changes they need fast enough. It’s not as quick to iterate. So while this all works well at a certain scale, you get to a point where this just becomes too brittle, too fast.
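To make the "where are all the places I need to make a change" problem concrete, here is a minimal Python sketch that walks a pipeline DAG and lists everything downstream of one changed step. The pipeline names and the `DOWNSTREAM` mapping are hypothetical, invented only for illustration.

```python
from collections import deque

# Hypothetical pipeline DAG: each key feeds the pipelines listed as its value.
DOWNSTREAM = {
    "ingest_orders": ["clean_orders"],
    "ingest_customers": ["clean_customers"],
    "clean_orders": ["orders_enriched"],
    "clean_customers": ["orders_enriched"],
    "orders_enriched": ["warehouse_load", "bi_extract"],
    "warehouse_load": [],
    "bi_extract": [],
}


def affected_pipelines(changed: str, downstream: dict) -> list:
    """Return every pipeline downstream of `changed`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque(downstream.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(downstream.get(node, []))
    return order


# Every pipeline listed here must be re-tested and redeployed for one column change.
print(affected_pipelines("clean_orders", DOWNSTREAM))
```

Even in this tiny graph, a change near the source touches most of the chain, which is the brittleness described above.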

What’s the Problem? The Cost of These Movements

So we want to kind of move away from that, because what happens is that you end up in a situation where pipelines break, and bad data gets ingested. And guess what? Now you’re spending a weekend backfilling data, because that data needs to be ready come Monday. Or basically, it took so long to make a change, you have angry consumers because of late data or inconsistent data. And they’re screaming at you, because they don’t realize how complicated this whole thing is. They’re just like, hey, I asked for an extra column. Where is it? And there’s also the cost of it.

There are the storage costs. Again, every time you’re creating another pipeline, that creates another duplicate of the data, even if it’s a transformed duplicate, that’s more storage costs. That’s more compute. Every pipeline, that’s compute that has to run, and you’re paying for that. And there’s networking egress costs. Every time you’re making a request to S3 or your favorite object storage source, you’re paying for those requests. And then when you move data out of the cloud network, you’re paying those egress fees that can really add up and surprise people, because they’re like, I was just moving the data to Snowflake. Where did this cost come from? Because you calculate storage, you calculate compute when you’re doing that initial assessment, but everyone always forgets the egress part. And they’re like, oh, I didn’t think about that one.
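A back-of-the-envelope sketch of the three cost buckets just described (storage, compute, egress) might look like the following. The unit prices are placeholders chosen for easy arithmetic, not real vendor pricing, and `PipelineCopy` is a hypothetical type.

```python
from dataclasses import dataclass


@dataclass
class PipelineCopy:
    size_gb: float        # size of the copy this pipeline produces
    compute_hours: float  # monthly compute needed to keep the copy fresh
    leaves_cloud: bool    # True if the copy is moved out of the cloud network


# Placeholder unit prices (assumptions, NOT real vendor pricing).
STORAGE_PER_GB = 0.02
COMPUTE_PER_HOUR = 2.00
EGRESS_PER_GB = 0.10


def monthly_cost(copies):
    """Sum storage, compute, and egress costs across all pipeline copies."""
    storage = sum(c.size_gb for c in copies) * STORAGE_PER_GB
    compute = sum(c.compute_hours for c in copies) * COMPUTE_PER_HOUR
    egress = sum(c.size_gb for c in copies if c.leaves_cloud) * EGRESS_PER_GB
    return {
        "storage": storage,
        "compute": compute,
        "egress": egress,
        "total": storage + compute + egress,
    }


copies = [
    PipelineCopy(size_gb=1000, compute_hours=10, leaves_cloud=False),
    PipelineCopy(size_gb=500, compute_hours=20, leaves_cloud=True),
]
print(monthly_cost(copies))
```

With these made-up numbers, the single copy that leaves the cloud contributes an egress line larger than the storage line for both copies combined, which is the surprise the talk describes.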

And then you have the lost productivity, because the more time people have to wait for their data– that’s stuff they can’t do– and also, hey, the more time you’re spending working on pipelines you already have, just updating them, that’s time you’re not spending creating new pipelines, access to new data, expanding what you already have. Regulatory fees– oops, some PII got stuck somewhere, because someone decided to build the wrong extract in the wrong place, and you didn’t have the visibility into it to see the PII was there, because it was out of your purview. Oops, fees. Data model drift, because there’s just so many changes in the data, things kind of move away from their original modeling. And then the cost of bad decisions made, because of all these things that can go wrong. So it can be really costly. So doing this in a better way not only will get the data faster, but it will also eliminate all these kind of costs.

The Dremio Approach

So this is where we get to the Dremio approach, where we just say, hey, take your data sources and connect them to Dremio. Either you can directly connect them to Dremio, so you can connect MongoDB, Postgres, DB2. There’s all sorts of different connectors that we have. And right there, you have access to the data. So at that point, you’re at zero ETL. You haven’t moved any data. Or you can maybe land the data as an Iceberg table in your data lake, which has a lot of advantages. But that still would just be a one-time movement. So that would be low ETL. You’re really just doing a very minimal amount of ETL. And then basically, what you do is that on your data lakehouse, you can just model the rest and then deliver it to your clients. So basically, the pipelines become a lot simpler, because it’s just about either connecting the data or landing the raw data in Iceberg. And then again, everything gets virtually modeled on top of that.

So again, Dremio allows you to federate those data sources, so you can directly connect to Dremio. Again, that’s the zero ETL. Or again, you land it in Iceberg, so low ETL. That’s very minimal data movement. You use Dremio’s semantic layer to virtually model on top of that raw data. So instead of making duplicates, you’re just creating those diff changes through SQL views. And then you use data reflections to kind of clear up any bottlenecks. So anywhere you need a little bit faster, you just zap that part and let Dremio– essentially, basically, what a reflection is, it goes beyond– like again, Sendur mentioned this morning, a lot of times, people will think of this like a materialized view. Well, really what it is, it’s like an automated pipeline that otherwise you would have had to write. But instead, Dremio’s thinking, OK, hey, you want this thing to be materialized. We’ll handle when to update it, how to update it, when to substitute it. All that stuff is kind of done for you. So that way, a flip of a switch, it’s just faster. So it’s easier for you to execute, but also easier for the analysts to use, making it very, very powerful. And it eliminates the need for things like materialized views, cubes, extracts. You just have one mechanism, one switch you have to turn. And then again, now we’ll suggest to you when to use it. So you don’t have to even think of, hey, what is the most strategic place to use this? We’ll tell you where those places are.
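The idea of modeling with layered SQL views instead of copies can be sketched with Python's built-in `sqlite3` module. SQLite only stands in for the query engine here; the table and view names are hypothetical, and this is plain SQL, not Dremio-specific syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The one physical table: raw data landed once.
CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT);
INSERT INTO raw_orders VALUES
  (1, 'acme',   1250, 'shipped'),
  (2, 'acme',    800, 'cancelled'),
  (3, 'globex', 4300, 'shipped');

-- Layer 1: a cleaned view. No data is copied.
CREATE VIEW orders_clean AS
  SELECT order_id, customer, amount_cents / 100.0 AS amount_usd
  FROM raw_orders
  WHERE status = 'shipped';

-- Layer 2: a business view built on the cleaned view.
CREATE VIEW revenue_by_customer AS
  SELECT customer, SUM(amount_usd) AS revenue_usd
  FROM orders_clean
  GROUP BY customer;
""")


def revenue(connection):
    """Query the top-level view; the ordering is applied in the outer query."""
    return connection.execute(
        "SELECT customer, revenue_usd FROM revenue_by_customer ORDER BY customer"
    ).fetchall()


before = revenue(conn)
# New raw data shows up in every view layer without re-running any pipeline.
conn.execute("INSERT INTO raw_orders VALUES (4, 'acme', 750, 'shipped')")
after = revenue(conn)
print(before)
print(after)
```

Because each layer is only a definition, a request like "rename this column" becomes an edit to one view rather than a change rippling through several materialized copies.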

Benefits of the Dremio Approach

So what’s the benefits of this approach? Less pipelines, which means they’re going to break less. Landing a bunch of raw data in Iceberg is going to be a lot easier than chains and chains and chains of data pipelines that can easily break. Less backfilling, fresher data sooner. And then we get to be happy like this AI-generated picture. If anyone follows me on LinkedIn, you know I love my AI-generated images. And these happy AI people– and you can tell it’s AI because there’s always the hands, right? But these data consumers getting fresh, correct data. Look how happy they are. That new column showed up just on time. They’re happy. We want this, the Dremio approach. But again, also it’s cost savings. Again, it’s not going to just make your life easier as a data engineer, make the data analyst’s life easier to access that data. But you can also go to your business leaders and tell them, hey, guess what? I’ve also saved you enough money, so can we add to the hiring headcount so we can get to those projects that we wanted to get to?

Because you’re making less copies, so less storage. Less pipelines, so less compute costs. Less data movement, so less network costs. Because Dremio is actually caching a lot of those requests. If you access the same Parquet file like 10 times, Dremio is going to cache that. So that way, you’re not requesting that file all the time from S3 or whatever object storage, which is going to speed up the query but also reduce those network costs. And because you’re not moving the data outside of your cloud, you’re not going to get those egress fees. So overall, you’re going to really compound and really reduce your total cost of ownership. Less copies, that means there’s less things that can go wrong. Much easier to make sure that you’ve covered your PII. Instead, you can just do that with Dremio. You can create row and column-based access rules to hide PII and other things and create like masked views and whatnot. All sorts of different options without having to duplicate the data. 
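As a rough illustration of hiding PII without duplicating data, the sketch below defines a view that masks one column and filters rows. It uses Python's built-in `sqlite3` and generic SQL as a stand-in; it is not Dremio's access-policy syntax, and the table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT, email TEXT, region TEXT);
INSERT INTO customers VALUES
  (1, 'Ada',   'ada@example.com',   'EU'),
  (2, 'Grace', 'grace@example.com', 'US');

-- Column masking: keep only the email domain.
-- Row filtering: expose only EU customers through this view.
CREATE VIEW customers_masked_eu AS
  SELECT id,
         name,
         '***@' || substr(email, instr(email, '@') + 1) AS email,
         region
  FROM customers
  WHERE region = 'EU';
""")

masked = conn.execute("SELECT * FROM customers_masked_eu ORDER BY id").fetchall()
print(masked)
```

The underlying table still holds the full email addresses exactly once; consumers who are only granted the view never see them.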

And fresher data means better insights. And again, with things like reflections, people don’t have to create local copies or extracts. Basically, acceleration will exist at the engine level. So that way, no matter what consumption tool they use, whether it’s Power BI, Tableau, Superset, they’ll all benefit from the acceleration you’ve already created at that layer. So you get much more bang for your buck on all these levels. OK.

How Does Dremio Make This Possible at Scale?

Now, how does Dremio make this possible, though? Because it sounds really good. But the question is, technologically, in the past, we used to have– because data virtualization isn’t new. This whole idea of connecting a bunch of data sources in one place, there’s been people who’ve done that before. The problem was, back in the day, you would just connect the data. And what would happen? Every time you did a query, you would be pushing down to all these systems. And that’s great to a point. But what happens when you connect to my SQL Server that I’m using for operational purposes, and you’re pushing down all these analytical queries, and they’re all competing over the same system, the same resources? Then you start hitting bottlenecks. OK.

But Dremio has kind of cracked that high performance virtualization at scale. And really, reflections are the key to that, right? Because using that example of SQL Server, if I connect a SQL Server to Dremio, I could use reflections on some of the tables that people query. So now, when people query that SQL Server table, to them, they’re just accessing the table in SQL Server. But Dremio is going to use the reflection that’s on your data lake. So in that case, that query doesn’t get pushed down to SQL Server. So you’re not competing over those resources. And you don’t hit that bottleneck or that scaling issue with your underlying database system. OK. So it basically enables you to be able to really do this at scale and make it feel like a virtualized system without the bottlenecks. But Dremio also, even without reflections, Dremio is really fast. OK. Many of you have already run queries in Dremio and already realized this. Part of it is because of Apache Arrow.

Now Apache Arrow is a little bit of everywhere. Everybody’s switching to Apache Arrow because it’s fast. And since it’s a standard format, it allows systems to communicate with each other and send the data to each other faster with Apache Arrow Flight. So it’s a very fast way to send data around, to process data. So that’s a big key part to it. And actually, that started– that originally, in the earliest stages of Apache Arrow, before it was Apache Arrow, it was the in-memory format for Dremio. Basically, that all started here. And then one of our co-founders, Jacques Nadeau, got together with Wes McKinney. And then that built up into what’s now known as Arrow. So we’ve been working with that since the beginning.

Then there’s a reflection story. The old way, again, you would make all those copies. And then you would move it to the data warehouse, where you’d create all these data marts. And again, more and more copies. But with Dremio, I can just connect my data sources directly or ETL it to Iceberg. Either way. I have my two physical tables. And then, again, here I have my virtual layers. And then again, if I want to accelerate both of these, I could just put a reflection on the view, on this view right here. And guess what? That’ll be substituted when you query view A or view B. Because the way the Dremio query engine works is that when I say, hey, I want to query on view A, since the semantic layer is built into Dremio, it knows that, hey, this view is built from this view. And this view has a reflection on it. So that means that reflection is a candidate for accelerating a query on this object, which is very different than a lot of other systems. Because a lot of other systems, the acceleration is tied to that specific table. Like you create a materialized view, it’s only for that table, only for that namespace. 

So you’d have to create a lot more materializations than you do with Dremio. And again, you have to make sure you know what the right name is to query. Here, I create one reflection. And now all three of these are going to get accelerated because of the way they relate to each other, because Dremio has that awareness of the context of the objects in your semantic layer. So that’s what really makes that virtualization at scale possible, and makes working with virtual data sets viable at very high scale: because of that context awareness, because of the way reflections work, you’re not having to materialize every layer of your data to have that same performance and have that same accessibility. So that’s what makes it pretty exciting. So you can reduce a lot of that ETL work and just deliver the data.
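The "one reflection accelerates all three" idea can be sketched as a walk over view lineage: a reflection on a view is a candidate for any query on that view or on views built from it. This is a toy model under stated assumptions, not Dremio's planner, which matches on query plans rather than object names. The `PARENTS` and `REFLECTIONS` mappings and all object names are hypothetical.

```python
# Hypothetical lineage: each object maps to the objects it is built from.
PARENTS = {
    "view_a": ["base_view"],
    "view_b": ["base_view"],
    "base_view": ["table_1", "table_2"],
    "table_1": [],
    "table_2": [],
}

# A single reflection, defined on the shared base view.
REFLECTIONS = {"base_view": "reflection_on_base_view"}


def candidate_reflections(target, parents, reflections):
    """Collect reflections defined on `target` or on anything it is built from."""
    found = []
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in reflections:
            found.append(reflections[node])
        stack.extend(parents.get(node, []))
    return found


for name in ("view_a", "view_b", "base_view", "table_1"):
    print(name, candidate_reflections(name, PARENTS, REFLECTIONS))
```

In a system where acceleration is tied to one named object, only a query on `base_view` itself would benefit; here the lineage makes the same reflection a candidate for `view_a` and `view_b` as well.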

Aggregate Reflections to Eliminate Extracts/Cubes

Now, focusing on the aggregation side of it, basically, usually, you would accelerate stuff for BI in two different ways. You would either, one, well, you would use BI extracts and cubes. Usually, BI extracts would be at the BI tool layer, where you’d be doing extracts within your BI tool. Or you’d be doing these big pre-computed data sets, cubes, that you’d have to redo every once in a while. The problem with them is that oftentimes, they’re static. You generate them, and then later on, you have to generate them again when the data changes. So then you have to manage that pipeline, because it gets stale. Then you would have data storage maintenance. Basically, the tool itself, depending on what tool it is and how you did it, may not be aware of how to update it. So you have to be the one who has to update the data. You have to recreate the data. And also, you might end up with out-of-memory issues, because you might have this huge cube with lots of aggregates over a really large data set over many dimensions and measures, and the tool cannot fit that into the memory in order to manage that.

Now, when Dremio does it, well, we have a special type of reflections called aggregate reflections. So again, what it’s basically doing is creating an Iceberg table with all those aggregation calculations already. But it’s created in a way that Dremio understands. So Dremio knows how to manage the memory. So you never run into these massive out-of-memory issues. Dremio updates it for you on a cadence. And again, new features, now you have incremental reflections. So those will update, not do full updates. They can actually just do the additions for Iceberg tables. And then three, you have reflection recommender, so you know which ones to do. And there’s now a reflection scheduler. So you can actually schedule when those refreshes happen much more easily. But it does it over. So basically, all I would do is I would say, hey, this particular table, I want to optimize for these dimensions and measures. And guess what? Dremio is going to automatically update it, so you don’t have to create pipelines to keep this updated. Dremio is going to know to handle that in memory to be able to process those reflections. Yeah, basically, once you turn it on, it’s done. You just let Dremio do the work. And it makes life a lot easier. So again, it’s allowing you to just basically connect the data to Dremio and be able to handle huge-scale workloads, vital. But again, it’s because of these other abstractions that Dremio has that are fairly unique to Dremio. Mood lighting. Oh, good. Oh, good, you set the mood. It’s good. [HUMMING] OK, so again, reflections are a big part of the story. Like, it’s really– you can tell I really love reflections. They’re really cool.
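To show the difference between a full rebuild and an incremental refresh of pre-computed aggregates, here is a minimal sketch. It assumes an append-only source and tracks only SUM and COUNT per dimension value; `AggregateSketch` is a hypothetical class, not how Dremio stores or refreshes aggregate reflections.

```python
from collections import defaultdict


class AggregateSketch:
    """Toy pre-aggregated table: SUM and COUNT of one measure per dimension value."""

    def __init__(self):
        self.totals = defaultdict(lambda: [0.0, 0])  # dimension -> [sum, count]
        self.rows_seen = 0  # how many source rows are already folded in

    def refresh(self, rows):
        """Incremental refresh: fold in only rows appended since the last refresh.

        Assumes `rows` is append-only. Returns the number of rows processed.
        """
        new_rows = rows[self.rows_seen:]
        for dimension, measure in new_rows:
            entry = self.totals[dimension]
            entry[0] += measure
            entry[1] += 1
        self.rows_seen = len(rows)
        return len(new_rows)

    def query(self, dimension):
        """Answer SUM and COUNT from the aggregate without scanning source rows."""
        total, count = self.totals.get(dimension, (0.0, 0))
        return total, count


source = [("EU", 10.0), ("US", 5.0)]
agg = AggregateSketch()
first = agg.refresh(source)   # processes both initial rows
source.append(("EU", 2.5))
second = agg.refresh(source)  # processes only the appended row
print(first, second, agg.query("EU"))
```

The second refresh touches one row instead of three, which is the saving an incremental update offers over regenerating a cube from scratch.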

Columnar Cloud Cache (C3)

But that’s the deal. And again, we also have the columnar cloud cache. So this is the other thing that’s going to save you money and performance. So basically, what this does– this is more specifically for when you’re operating off the data lake. As you’re making a request to, let’s say, S3 or ADLS, Dremio is keeping track of which files, which portion– sometimes even which columns you’re accessing a lot. And so let’s say you may be accessing this one file not that often, but you’re accessing this one column a lot, specifically. Instead of caching the whole file, it’ll cache just that column. But basically, it’s caching all this stuff on the nodes. So that way, when those queries that repeat over and over again get hit, you’re not requesting all those objects again and hitting all those network fees for accessing those data sets. So that’s going to, one, speed up, because basically, that node can already access that data right there from the node itself, from its memory. But two, you’re going to notice a cheaper cloud bill. So Dremio is reducing your cloud costs from different angles. Because at the end of the day, we want you to keep using us, right? So we want to make sure it’s affordable, right? That’s a good way to make you happy. And we like to make you happy. 
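The column-level caching described above can be sketched as a small cache keyed by (file, column) that counts how many reads actually go to object storage. The least-recently-used eviction policy, the `ColumnCache` class, and the `fake_fetch` function are assumptions made for this illustration; the talk does not specify how C3 decides what to keep.

```python
from collections import OrderedDict


class ColumnCache:
    """Toy cache keyed by (file, column), with least-recently-used eviction."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch        # callable(file, column); stands in for an object-store read
        self.entries = OrderedDict()
        self.remote_reads = 0     # each one would be a billed request in the cloud

    def read(self, file, column):
        key = (file, column)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.remote_reads += 1
        data = self.fetch(file, column)
        self.entries[key] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return data


def fake_fetch(file, column):
    """Hypothetical stand-in for reading one column chunk from S3 or ADLS."""
    return f"{file}:{column}"


cache = ColumnCache(capacity=2, fetch=fake_fetch)
for _ in range(10):
    cache.read("a.parquet", "price")
print(cache.remote_reads)
```

Ten reads of the same column result in a single remote request, which is both the speed-up and the smaller cloud bill mentioned in the talk.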

And then, last bit, so they made the announcement today that this is now going to be available in software, which is really cool. But in Dremio Cloud specifically, you have these t-shirt-sized engines. You guys might be familiar with that, like Snowflake and other cloud vendors who do that whole extra-large engine, small engine, whatnot. But in Dremio, what you could do is you could do auto-scaling. So in this case, I could sit there and actually say, hey, I have this one engine. And I want there to be up to five replicas at any given time to handle concurrency. And I can actually have multiple engines. And I could sit there and say, hey, queries by these users go to this engine, because those are low-priority queries. I don’t want to spend a lot of money on those queries. And I can say, hey, these queries go to this engine over here. And that allows me to manage my costs and manage what compute I’m using for what. So that way, the stuff that needs the more powerful compute always gets it, and the stuff that doesn’t, doesn’t. So I’m not getting banged up financially for low-priority queries. So you can do that. And that’s now going to be something you can do in software, which is going to be really cool, because that really makes– that’s just a really cool thing to be able to do on-prem. Because, again, it’s not something that many on-prem solutions really have.
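Routing queries to different engines and scaling replicas up to a cap can be sketched as follows. The engine names, user groups, and the queries-per-replica figure are all hypothetical; this is a toy model of the idea, not Dremio's engine configuration.

```python
import math
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    max_replicas: int         # upper bound on auto-scaling
    queries_per_replica: int  # assumed concurrency one replica can handle

    def replicas_needed(self, concurrent_queries: int) -> int:
        """Scale out with load, but never beyond max_replicas."""
        if concurrent_queries <= 0:
            return 0
        wanted = math.ceil(concurrent_queries / self.queries_per_replica)
        return min(wanted, self.max_replicas)


# Hypothetical engines: a capped cheap one and one allowed up to five replicas.
ENGINES = {
    "low_priority": Engine("low_priority", max_replicas=1, queries_per_replica=10),
    "high_priority": Engine("high_priority", max_replicas=5, queries_per_replica=10),
}

# Hypothetical routing rules: which user group lands on which engine.
ROUTING_RULES = {"interns": "low_priority", "finance": "high_priority"}


def route(user_group: str) -> Engine:
    """Send unknown groups to the cheapest engine by default."""
    return ENGINES[ROUTING_RULES.get(user_group, "low_priority")]


print(route("finance").replicas_needed(35))
print(route("interns").replicas_needed(35))
```

The cap on the low-priority engine is what keeps those queries from driving up spend, while the high-priority engine can grow to meet concurrency.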

But that’s another level of ways that Dremio is able to give you that performance. And the really cool thing about this, when you really think about it, is that, again, you’re separating the computing power from the coordinating of everything. So going back to that SQL Server example, when you’re using traditional databases, there’s vertical scaling limits. So what happens– or horizontal scaling limits. You can only get so big of a machine. And that SQL Server eventually gets so big that you now need two SQL Servers. And they have different data in it. And it becomes hard to unify. But with this, I could set up five concurrent engines. And then, again, they can all process that reflection on that SQL Server table. I really don’t have a limit to the scaling on that SQL Server table now. Because, again, it’s not using the resources of that SQL Server. It’s using the reflection on the data lake. And, again, all that is infinitely scalable. So you can actually get more juice out of the systems you already have before you have to really do big, expensive upgrades and things like that. So there’s a lot of ways this can really be used to your advantage. And, again, reducing the amount of ETL work you have to do, reducing a lot of the other cool stuff. 

Now, I think a lot of people here have already been hands-on with Dremio. But if you haven’t, I recommend that you do so. And a good way to do so is to scan this QR code where I have an exercise where you can literally spin up Dremio on your laptop, along with MinIO and Nessie. And you actually get to see a lot of the stuff that we’ve been talking about and use it on your laptop. And the great thing about that is that it’s zero cost, right? Because it’s on your laptop. The cheapest compute, your laptop. You can actually get hands-on, play with it as much as you want. You can actually even connect it to your actual data sources and query stuff from your laptop. So assuming that– now, if it’s a really huge data set, your laptop might take a while. OK, I’ve tried it out. I’ve definitely– the laptop’s only so powerful. But it’s still pretty fast. At the end of the day, if you query some other data sets that are probably more within the right level for your laptop, you’ll still feel like, hey, this is pretty fast. So I recommend trying that out. It’s a lot of fun, too. And then the cool thing is sometimes I just have Dremio running for a week on my laptop, just for small work. Like, I’ll just be doing– the other day, what did I have to do? I was doing some number crunching.

So you see those pie charts where we talk about all the different contributors to Iceberg and Delta? I have to do those calculations. So what I do is actually I scrape all the data from GitHub, and I load it into– this time around, what I did is I loaded it up into a Parquet file. I uploaded that Parquet file to Dremio running on my laptop and just did all the number crunching that way. And it was really easy, because I was just able to run SQL on those Parquet files pretty easily, just doing that without having to open up a notebook or anything like that, because Dremio was already running on my laptop from earlier in the week. So it’s a cool thing. I love it. But I highly recommend doing that exercise.