Gnarly Data Waves


Episode 41 | December 5, 2023

ZeroETL & Virtual Data Marts: The Cutting Edge of Lakehouse Architecture

Join us for an enlightening talk on "ZeroETL & Virtual Data Marts: The Cutting Edge of Lakehouse Architecture." In this session, we'll explore the pains of data engineering, unveil innovative solutions, and guide you through practical implementation with tools like Dremio and DBT. Transform your data landscape!

Embark on a transformative journey with our insightful presentation, “ZeroETL & Virtual Data Marts: The Cutting Edge of Lakehouse Architecture.” In this engaging session, we’ll delve into the intricacies of modern data engineering and how it has evolved to address key pain points in the realm of data processing.

Our presentation illuminates the challenges data engineers face, from the complexities of backfilling and brittle pipelines to the frustration of sluggish data delivery. We’ll introduce you to the high-impact concepts of ZeroETL and Virtual Data Marts, demonstrating how these innovative patterns can dramatically alleviate these common pains. By reducing the need for manual data movement and preparation pipelines, you’ll discover a more efficient, agile, and responsive data ecosystem.

Join us for a practical guide to implementing these transformative patterns. We’ll walk you through the steps to bring the power of ZeroETL and Virtual Data Marts into your own data landscape. Leveraging cutting-edge tools like Dremio, DBT, and more, you’ll gain hands-on experience in designing and deploying these patterns to streamline your data workflows and supercharge your analytics capabilities.

Don’t miss this opportunity to stay at the forefront of data architecture, enabling your organization to harness data’s full potential while reducing complexity and overhead. Join us for an exploration of the future of data engineering – a future where ZeroETL and Virtual Data Marts pave the way for data agility, speed, and innovation.


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Opening

Alex Merced: 

Hey, everybody! This is Alex Merced, and we’re going to be doing a presentation on ZeroETL and virtual data marts––basically talking about the main pain points in building out your data infrastructure, and how we can get rid of those frictions. Because what do I think of when I hear the words ZeroETL and virtual data marts? I’m thinking no copies, and that’s where we’re going to be going. 

Part 1 - The Vision

Traditional Chain of Data Movement Pipelines

Alex Merced: 

So first, let's just talk about that pain and that vision of where we're going. If you take a look at the world of data today, and where things have been, you have all your data sources. The bottom line is we have an application or a business, and lots of data systems that are meant for operations: to look up customer records, for people to log into their accounts, all sorts of things. Whether it's SaaS apps, on-prem applications, custom applications, enterprise business applications, IoT, or third-party data, you have applications that engage databases for your operations. 

Now, you want to start using that data for analytics, so you can discover insights––helping you predict what you should be doing as far as purchasing inventory, how you should be investing your money as far as growing your business, predicting the best places to go and promote your business, where there are areas for growth. You need data to develop these types of insights. So you want to move your data into data analytics. And that requires lots of different systems, such as data lakes––a data lake being this big catch-all repository of data that could be across clouds or on-prem, using things like Hadoop, or in the cloud, using something like AWS S3 or Azure Data Lake Storage. All of these are great, and you're able to put your structured and unstructured data all there in one place. But oftentimes to do that, you have to write ETL pipelines, a.k.a. code that you run that extracts (E) from your data sources––basically it pulls the data out––transforms it (T), making some changes to the data so it's more ready for how you're going to use it on the other side, and then loads (L), or lands, the data in that target destination, in this case your data lake. 

So you're going to have all these ETL pipelines, and oftentimes it's not as straightforward as point A to point B; oftentimes there are many intermediate steps. You may land the raw data, then you have a pipeline to clean up that data, and once you have the cleaned-up data, you may have other pipelines to join or denormalize that data. You go through all these layers of changes, and each time you go through one of these layers, you're oftentimes making physical copies. And to get that final result, each of the previous pipelines has to play out, so it becomes this unwinding chain. If you've ever seen something called a DAG, a directed acyclic graph, you can get an idea of how complex these chains can get, where you're trying to track which change should happen first, and in what order. 

And the thing is that once you get all that data curated in the data lake, some of that data is going to move to a data warehouse, because oftentimes that's going to be your front-line data analytics. So you take that sub-portion of the data that you've curated in the data lake, and then you make more data pipelines, more ETL pipelines, that extract the data from the data lake, transform it again, and load it into the data warehouse. And then within the data warehouse, you're going to create a whole bunch of data marts––little mini-warehouses in your warehouse. There's the central data warehouse, and then all the little mini-warehouses for your different business lines. And there, you're going to go through other layers of transformations for specific use cases––again, more copies––and everything becomes dependent on this big, long chain of pipelines across all these stops. All this data is going to be used to fuel BI dashboards, custom applications for data analytics, self-service data visualizations, data extracts––you're serving all this stuff, and some of it ends up also pulling data from the data lake. So basically, you have data moving in all directions, all sorts of copies that are interdependent on each other, and this creates a lot of headaches, because you have to write, maintain, and deploy the code for this long chain of steps and make sure it always stays consistent––and that's a lot easier said than done. On top of that, you run the risk of having so many copies that you can't track what changes are being made, or whether things are adhering to your data contracts, your business models, the things that you expect, and so forth. 

Broken Pipelines Require Tedious Backfilling

Alex Merced: 

So what's the problem? When one pipeline breaks, it causes every other pipeline in the chain to break, and that means you have to go back and fix it. That means removing the bad data and replacing it with the right data, a.k.a. backfilling, which can be very tedious when you don't have the right systems in place to make that disaster recovery easier. Bottom line, it takes time for each pipeline to play out. So the more complex this long chain is, the more copies have to be created, and the longer it takes for the data to get to that end user, that data consumer. And they get angry––they get angry because they're not getting the latest data to base accurate decisions on, and a lot of times that data is wrong, because it's in so many different places, with so many points of failure along the way. 

Cost of These Movements

Alex Merced: 

And on top of all that––on top of the tedious backfilling and the angry consumers––there's also just the cost of it all. For every copy, you're paying for more storage. Every time you run a different pipeline, there's more compute that has to be used, and you pay for those cloud compute instances, so the longer they run, the more you pay. And then, when you're moving data outside of a particular cloud, oftentimes you're paying egress costs. So to move the data from, let's say, your data lake into your data warehouse, on top of the compute for those ETL jobs and the storage for landing data in the warehouse that you already had in the data lake, you're also paying the egress cost of moving data out of one system and the cost of moving data within the other. Then there's lost productivity: while that data is propagating, there's work you can't do, because you're waiting for the data to get there, plus the time you spend building all these pipelines that you could be spending curating new layers of data and new use cases. There are the regulatory fines––oh no, I forgot to mask the PII on this one view of the data, and now I get fined––and so many security risks in making sure that access to all these data sets in all these different places is governed. All of this adds up. And then there's the situation where your data models drift: so many transformations to these columns and rows that things move away from what the data was originally meant to represent, resulting in bad insights. And then there's the cost of making bad decisions based on the wrong data. There are lots of things that can go wrong in this setup. 

The Dremio Approach (ZeroETL & Virtual Data Marts)

Alex Merced: 

I can minimize a lot of this if I can minimize the number of pipelines I have to build, maintain, and deploy. One, it's going to reduce the amount of time I spend maintaining what we already have instead of building what could be. But it's also going to reduce my costs and make my end consumers happier. Now, how do we do that? This is where we get to the Dremio approach, the ZeroETL and virtual data mart approach. ETL is the movement of data––creating these pipelines, writing the code, deploying the code, and maintaining the code through updates over time. ZeroETL means we minimize your need to do that as much as possible––reducing the number of pipelines you have to write as much as possible. In an ideal world, zero pipelines, or as close to that as possible.

Virtual data marts––so basically, once the data gets to your data warehouse, you build all these data marts where you model your data through layers of physical copies. Virtual data marts would mean we achieve the same thing without layers of copies, but just through layers of logical views of the data.  So if I do these two things, if I model my data virtually and minimize the movement of the data, technically, there's very little in the way of pipelines I have to write, maintain, or deploy.

How is that possible? Let's first establish what this is. The bottom line is that you take your data sources, and at most you might have extract-and-load pipelines, meaning you just land your raw data in a data lakehouse. The difference with a data lakehouse is that you have no data warehouse––you just have your data lake. You store your data on the data lake, and you use tools like Dremio, Spark, and Flink to interact with that data in your data lake, to fill the gap the warehouse used to fill. You offload more and more of what you do in the data warehouse onto this data lakehouse pattern––the idea of treating your data lake as your data warehouse, particularly enabled through table formats like Apache Iceberg and Delta Lake that allow tools like Dremio to interact with your data lake like it's a database.

So basically, you can either connect your data sources directly to a platform like Dremio, or land your data in a data lakehouse and connect that to Dremio. And then that delivers your data directly to the clients. So basically, you're just moving the data once––you're just moving it to your data lake, and that's if you even do that move at all, because Dremio can connect directly to your operational sources if need be, and then you just deliver that data directly to your clients. You don't have these layers and layers of copies that you have to build out, because all of that is done through Dremio. Again, we connect all that data, we federate it. So you could connect your data lake, your databases, and your data warehouses directly to a platform like Dremio. In that case, you have very little ETL: either you have zero ETL, because you didn't move any data, you just connected the sources, or maybe you did one movement of your raw data into a data lake, so you have low ETL, and you allow everything else to be handled logically, through logical views. Then you use the Dremio semantic layer, where you can create all these layers of logical views and model your data like you would in a data warehouse. It's the same way you would build data marts––but instead of making physical copies of data, you're just creating more layers of logical views on the data, and Dremio is very capable of handling very complex layers of views.
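
To make that concrete, here is a minimal sketch of what that kind of virtual modeling could look like in Dremio SQL. The source, space, and view names are all hypothetical, so treat it as an illustration of the pattern rather than exact syntax.

```sql
-- Federate: join an operational Postgres source with raw files in the data lake,
-- without copying either one (source and object names are hypothetical).
CREATE VIEW marketing.bronze.customer_orders AS
SELECT c.customer_id,
       c.region,
       o.order_id,
       o.order_total,
       o.order_date
FROM   postgres.public.customers c
JOIN   lake.raw.orders           o
  ON   o.customer_id = c.customer_id;

-- Layer another logical view on top of the first one; still no physical copy.
CREATE VIEW marketing.silver.orders_2023 AS
SELECT customer_id, region, order_total, order_date
FROM   marketing.bronze.customer_orders
WHERE  order_date >= DATE '2023-01-01';
```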

Now, to make that practical––usually where people get caught up is that Dremio is not the only tool in the world that allows you to make logical views. So why not just use any tool and make logical views? Well, where you usually get caught up is performance. You want logical views: one, they're much more lightweight, and two, they're much easier for self-service. But oftentimes you're still operating on those same original data sets, and if everyone is querying those same original data sets, doing those full table scans, that's not necessarily the most performant thing in the world. And this is why you end up with all these chains of copies. We'll have a quick example of that a little bit later on, but it becomes very messy. 

But Dremio unlocks the key to making this practical through features like data reflections. Reflections eliminate the need for materialized views, cubes, and extracts: all these complex webs of copies that you would have to manually curate and maintain get thrown out the window with the flip of a switch, making life a lot easier.

Cost Benefits of this Approach

Alex Merced: 

What are the benefits of this approach? One, the fewer pipelines you have to write, the fewer pipelines that are going to break, so there's less backfilling. At the end of the day, Dremio is going to abstract away a lot of the pipelines you would have written, and in the places where some acceleration is needed, Dremio abstracts that away too––you just flip a switch and enjoy the performance, and enjoy that your data is getting where it needs to be. You just land that data, model it logically, and let Dremio handle the rest. That data is going to land there quicker because you're not going through as many hops of physical copies, which means you're going to have to do a lot less fixing and curation. Your users are going to get that data faster and fresher, and it's going to be curated more correctly, because it's much easier to think through doing things right when there are fewer steps than when there are more steps. And on top of it, you're going to save a lot of money: fewer copies means less storage cost; fewer pipelines means less compute, less data movement, and lower network cost––you're not paying all that egress, because you're not moving data into a data warehouse––and with fewer pipelines the data is more likely to stay fresh, because you don't have so many versions of the data that need to be updated all the time. If you have fewer copies, in only one place, then it's easier to comply with regulations, you're less likely to accidentally have PII exposed somewhere, and your data models are less likely to drift. Your data is going to be fresher. You get all these benefits because you're simplifying––that's the key here.

But the question is, how do you simplify and still maintain performance? And this is where Dremio unlocks the key. It allows you to simplify at scale, because Dremio handles the performance piece, and that's going to let you get all these benefits when everything is said and done.

How does Dremio Make this Possible at Scale? - High-Performance Virtualization

Alex Merced: 

Again, how does Dremio make this all possible at scale? It comes down to two things: high performance and intelligence. Dremio is designed to find ways to deliver performance intelligently––the key words are intelligent performance. It doesn't expect you to turn all the knobs and set all the dials to get performance. It's going to handle as much of that intelligently as it can for you, so that you can focus on modeling through virtualization and, when it can, working with the original data source. That way, you don't have this complex web of physical copies. Through high-performance virtualization––anchoring on the idea that we want to virtualize as much as possible, coming up with the abstractions, and providing performance––we make this scalable. Because a lot of other platforms say, hey, we can federate a lot of data sources, you can virtualize your warehouse through logical views, but you get stuck on performance, because going back and sending those queries to the federated sources for every view on the data ends up not working as well as you'd like at scale. 

Fast Query Engine Leveraging Apache Arrow and Apache Calcite

Alex Merced: 

But Dremio addresses that pain point. One way it does that is that Dremio is just fast. Dremio uses Apache Arrow for processing. So what is Apache Arrow? It's an in-memory format, meaning that when data is loaded from whatever source into memory and processed by Dremio, it's in this standard open-source format. The Apache Arrow format, in its earliest stages, was actually the in-memory format for Dremio, so it is a project that very much came out of Dremio's efforts. By creating this standard format––which allows you to represent more data in less space, and do it in a columnar way that lends itself to analytical processing––you get faster in-memory processing. And through things like Gandiva, which allows a lot of that processing to be compiled down to efficient native code, and fast transport of the data through things like Apache Arrow Flight, that data can be processed and moved very quickly. So even out of the box, with federated sources, Dremio can probably outperform most other engines just based on pure engine power alone.  

Another open-source project that Dremio leverages is Apache Calcite, which is generally used for SQL parsing. It allows for cost-based optimization: it can sit there and come up with different SQL query plans, figure out which one is going to be the most performant and reduce your cost the most, and apply custom rules so that it constantly improves your SQL queries and optimizes them for you, doing a lot of that work for you. That way, even if you don't write the best SQL queries in the world, they can still perform like they are, to the extent that they can be optimized.

Raw Data Reflections to Enable Performant Virtualization

Alex Merced: 

And then there are reflections. Reflections are like a secret sauce in Dremio.  Basically, they make the need to create this web of physical copies less necessary. There are two categories of reflections in the Dremio world. There are raw reflections, and raw reflections basically fulfill the role that you typically see with materialized views. So with materialized views, you would create a physical copy of a query result. So you'd query a table, and you'd get a subset, and you would materialize that, a.k.a., make a physical copy of that. Now that physical copy wouldn’t auto-update, so you would have to figure out how to keep that up-to-date and fresh. You would have to write that pipeline, maintain that pipeline, and deploy that pipeline.  Also, that physical copy would be under a different namespace, so your users would have to know to change their queries to query that other name, which can be tricky. 

But with raw reflections, instead of having this big web of physical copies, you just have your tables, and you create logical views on those tables. In this case, the join might be a good thing to materialize, because joins can be very intensive. If I want to query View B, instead of having to redo the join and then this subquery every single time, it'd be nice if the join was already precomputed. So by just flipping a switch––the reflection switch––that's going to be precomputed. You end up having that materialization of the join in an Iceberg format, which allows for more advanced benefits like partitioning, sorting, and some indexing: a nice, optimized version of that subset of the results. And it's going to be auto-maintained, so Dremio will keep it updated. But now, when I query View B, I don't have to query a different name; if I want to query this joined view, I don't need to query another namespace. I just query it. Dremio is aware that this precomputed summary, this materialization, exists. Dremio tracks these, and not only can you have Dremio make them, you're able to bring your own to the table: you can create your own materializations and register them with Dremio, so Dremio can be aware of them.

And when Dremio knows that you're querying these data sets, it can swap things out and say, don't use these physical tables––let's not go through that whole process again, let's just use this materialization. The end user doesn't even have to be aware that this is happening. Dremio will just see the query, be aware of the physical pieces it can use to optimize that query, and intelligently do that for you. So Dremio is constantly, intelligently finding ways to speed up your query, so that you don't have to spend as much time thinking about that and can spend more time thinking about how you model your data and the questions you are asking of your data. So again, it eliminates a lot of the headaches you have there. 
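
As an illustration, here is roughly what flipping that switch could look like in SQL. The dataset and reflection names are hypothetical, and the exact reflection DDL varies by Dremio version, so treat this as a sketch rather than copy-paste syntax.

```sql
-- Assume "View B" is a logical view built on a join of two physical tables,
-- created with something like this (names are hypothetical):
--   CREATE VIEW demo.views.view_b AS
--   SELECT t1.id, t1.amount, t2.category
--   FROM   lake.raw.table_a t1 JOIN lake.raw.table_b t2 ON t2.id = t1.id;

-- "Flipping the switch": ask Dremio to maintain a raw reflection on that view.
-- Approximate Dremio DDL.
ALTER DATASET demo.views.view_b
  CREATE RAW REFLECTION view_b_raw
  USING DISPLAY (id, amount, category);

-- Queries keep targeting the view by its original name; Dremio substitutes
-- the precomputed reflection automatically when it can satisfy the query.
SELECT category, SUM(amount) AS total FROM demo.views.view_b GROUP BY category;
```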

Aggregate Reflections to Eliminate Extracts/Cubes

Alex Merced: 

But then there are also aggregate reflections. Aggregations are typical for BI dashboards: when you're creating a dashboard, usually what you're seeing in those charts are the results of sums, maximums, and other aggregate calculations. So typically, what would happen in the past is that you would precompute these aggregates by manually creating BI extracts and cubes. These would be data structures where you precomputed all these numbers, but you did it yourself. Again, you have to write, maintain, and deploy these pipelines yourself––that's more work for you. They don't auto-update, so you have to build that logic, and oftentimes when you rerun it, you're making an additional copy. Every time you rebuild that cube or that extract, you're making another copy, so then you have to make sure you swap the dashboard over to point to the new version of the cube instead of the old one. You end up with this ever-growing store of copies of the cube, and you're doing all this manual work to keep all your BI dashboards pointing to the right version––it's just a lot of work. And oftentimes these things can get so large that they cause memory issues you would have to figure out on your own. 

Dremio takes care of all of this for you. Dremio will create these precomputed aggregations as a reflection. Again, this will be an Iceberg data structure––Parquet files with Iceberg metadata––that has all these aggregated statistics precomputed, and Dremio will swap it in when it sees those types of queries against that data set. So in that case, instead of having to build a BI dashboard specifically against a cube, you just build it on the relevant table, and if Dremio is aware of these reflections, it'll swap them in. Dremio will make sure those reflections stay up to date––you don't have to write any pipelines to maintain or deploy them. And Dremio handles the data structure itself, how to manage it and how to read it, avoiding the memory issues, because it's the Dremio query engine that is processing these reflections.
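
Here's a rough sketch of what enabling an aggregate reflection could look like. Again, the dataset name is hypothetical and the DDL is approximate, so take it as an illustration of the idea.

```sql
-- Ask Dremio to maintain precomputed aggregates on a dataset that feeds
-- BI dashboards (approximate Dremio DDL; names are hypothetical).
ALTER DATASET sales.gold.orders_by_region
  CREATE AGGREGATE REFLECTION orders_by_region_agg
  USING DIMENSIONS (region, order_date)
        MEASURES   (order_total (SUM, COUNT));

-- A dashboard query like this can then be answered from the reflection instead
-- of re-scanning and re-aggregating the underlying data on every refresh.
SELECT region, SUM(order_total) AS revenue
FROM   sales.gold.orders_by_region
GROUP  BY region;
```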

So raw reflections replace the need for those manual materializations, and aggregate reflections replace the need for BI extracts and cubes. It just comes down to flipping a switch on data sets when you need these, and Dremio is essentially building, for you, the pipelines you would have built yourself. So it's not that the pipelines completely don't exist or that there are absolutely no copies at all, but you build fewer copies, and only where you need them. 

How do you know where you need them? This is where Dremio also helps, with its new reflection recommendation feature. Dremio monitors the performance of your Dremio cluster and the queries that are coming in, and if it notices queries that could or should be faster––queries that could be enhanced through a reflection––not only will it tell you, hey, you should have a reflection here, it'll give you the SQL to run to create that reflection.

So basically, it'll for the most part spoon-feed you what you need to do to keep everything nice, fast, and fresh. Again, the idea is that Dremio is going to be as intelligent as possible and abstract as much as it can away from you, so that you can focus on modeling, you can focus on what your data models should be, and your analysts can focus on what questions they should be asking. But there's also the performance side, because we mentioned that one of the other costs was egress. Imagine that every time you ran a query, Dremio had to go fetch that data from your cloud storage––you could be hitting all sorts of network costs and egress costs moving that data back and forth.

Columnar Cloud Cache (C3): Boosting Performance and Reducing Network Calls

Alex Merced: 

Well, if certain things in your object storage are accessed very frequently, the nodes in the Dremio cluster have the C3 cache, the columnar cloud cache. What it's doing is caching different Parquet files, different row groups, whatever it sees being accessed frequently enough, so that on future queries it doesn't have to make that trip to your object storage. That saves you API calls, which can reduce your cloud costs quite a bit, and it also increases the performance of your queries.

So you have the reflections, which are generally going to be materialized in your data lake, and then you have the columnar cloud cache on your Dremio cluster, which prevents the need to constantly go to your data lake when it's not necessary. All of this can then be used to accelerate performance and make that virtualization at scale possible, because it takes things you would have had to do otherwise, takes all the difficulty out of them, and makes them invisible. Basically, you don't have to be thinking about, oh, there are these ten different versions of this data set––there's just the table. And if there's a reflection that would better serve your query, Dremio will know and swap it in; if there's something cached that can help speed up that query, Dremio will know, and Dremio will handle it. You just focus on your query.

Unlimited Concurrency and Auto-Scaling

Alex Merced: 

On top of that, you have unlimited concurrency, in the sense that you can scale your Dremio clusters. If you're using Dremio Cloud, you can create rules that say how many replicas of a cluster can exist at the same time and how long they exist, so that you can always support the number of queries you expect to have. You can create different levels of clusters––smaller clusters, more powerful clusters––and then route your queries, so that high-priority queries go to your more powerful clusters and more powerful compute, while your lower-priority queries go to cheaper, less performant compute. That way, you allocate based on your priorities, instead of everyone fighting over the same compute and paying top dollar for queries that don't need top dollar.

And actually, one more note about scaling before the summary: if you're using Dremio software with the Kubernetes deployment, there is, in the Helm chart, the ability to scale the nodes, so you can create rules where it will add and subtract nodes to scale your cluster up and down. So whether you're using Dremio software or Dremio Cloud, you have options for scaling.

In Summary…

Alex Merced: 

Because of all the things I've talked about, you're eliminating most of those ETL pipelines, because most of the pipelines you would use to create these layers and layers of copies get replaced with virtual modeling––layers of logical views––and with reflections that create the physical summaries and caches needed to make that performant, without making the use of them more complicated. It eliminates a lot of maintenance, because Dremio is maintaining a lot of this and updating it automatically for you. You have the semantic layer, which lets you create those layers of logical views across your different business lines, the same way you would model data marts in a data warehouse, but you can also govern it with tight governance rules. You have all sorts of things lending to performance––the high-performance query engine, the columnar cloud cache, the data reflections––all allowing you to have that high-performance virtualization, along with that unlimited concurrency. Bottom line: fewer pipelines, less cost. It's less work you have to do building, maintaining, and deploying pipelines, and the less work you do there, the more time and resources you have to focus on modeling and expanding the reach of your data. So think about it: if you had an extra X number of dollars to spend, what could you do as far as investments in your data infrastructure or your headcount? If you had more time, what projects could you be spending it on that would provide new business value, versus just maintaining what you already have? These are the transformations that adopting Dremio can give you.

Understanding Why Dremio is the “Intelligent Query Engine”

Modeling with Logical Views Only

Alex Merced: 

Now, just to drive this point home, let's take a moment and look at the whole story from a bird's-eye view to understand why Dremio is the intelligent query engine. So again, in an ideal world we'd like to model with logical views only. One, it's going to be better for self-service. It's much easier, because you're going to have a lot fewer data sets to keep track of, and those views are going to be much more semantic for your end users. They're going to understand, like in this example here, that we have voter data that we're enriching with some third-party survey data. We have the join, and then we break that out into an Ohio view and an Illinois view. Very logical. If I were an analyst, I could go in here and know exactly what that Ohio voters data is, and what that Illinois voters data is.
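
As a rough sketch of that view-only model (the table and view names here are invented for illustration):

```sql
-- Enrich raw voter data with third-party survey data; a logical view, no copy.
CREATE VIEW analytics.enriched_voter_data AS
SELECT v.voter_id, v.state, v.county, s.survey_response
FROM   lake.raw.voter_data     v
LEFT JOIN lake.raw.survey_data s ON s.voter_id = v.voter_id;

-- Break that out into per-state views that analysts query directly.
CREATE VIEW analytics.ohio_voters AS
SELECT * FROM analytics.enriched_voter_data WHERE state = 'OH';

CREATE VIEW analytics.illinois_voters AS
SELECT * FROM analytics.enriched_voter_data WHERE state = 'IL';
```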

Cool. The thing is that these are all logical views. So in a standard, unintelligent query system, what's going to happen is that when I query that Ohio voters view, it's going to go back, start over here, and reprocess this join between my two tables, and then, once it's recreated that view, it's going to run the query on Ohio voters. So I have to wait for that whole chain to play out for every query. While this may be an easier, more self-service way to access the data, the problem is going to be performance. So what do we do to deal with performance? We start introducing the copies. That enriched voter data––we create a materialized view of it. We create materialized physical copies. So again, those purple data sets are physical copies.

Self-Curated Copies as Traditional Solution

Alex Merced: 

So we create a physical copy of that voter data, and then maybe create new views on that. Because these new views are based on the materialized view, they're going to be quicker than the original ones that are based on the view version of the join. But the problem is, now my end users need to know that if they query the original view, they're still going to get that slower performance––they need to query this other name instead. So that means I have to make sure they understand that, and if they don't, I'm going to have to build mechanisms that adjust their queries––some system that filters their query and says, oh, you queried enriched voter data, let me swap that out for MV-enriched voter data. It just creates more work.

And on top of that, we create precomputed metrics––those extracts, those cubes––that power the BI dashboards down here. Then I have to make sure I maintain those, and when people build those BI dashboards, they have to make sure they target those cubes and extracts, so they have to be aware of them. The idea is that there's just more the end user has to be aware of, and more that I'm maintaining. So you break this whole self-service paradigm. It's not self-service anymore, because it's gotten too complicated for that end user to serve themselves. I have to be involved, maintaining all this stuff to make it work. 

Return to Logical Modeling with Dremio’s Intelligent Querying

Alex Merced: 

So we broke that original goal: we have the performance, but we don't get the self-service anymore. I'd like to get back to that view-only world, where it was much more self-service, but still have the performance. And this is where Dremio comes in. Now I'm back to where I just have the views, but all I have to do is flip a switch. I just flip a switch here and add a reflection. And now, going forward, these views are going to be much faster to query, because instead of redoing the join, it's always going to start with the reflection that was created on enriched voter data––it's starting from that materialized result set that Dremio is going to maintain and keep up to date. Once I create that reflection with the SQL recommended by Dremio, I'm good to go. I don't have to do anything else; it just maintains itself. And my users are not going to know why. They're just going to notice that things are faster. So again, it keeps that self-service feel: they don't need to be aware of all these layers of stuff, and you don't have to maintain them. And then over here, for the BI dashboards that we're creating on the Ohio voters view, Dremio recommended that I create an aggregate reflection, so I run the SQL to create that aggregate reflection. Now when someone builds a dashboard, they're still just building the dashboard on the Ohio voters view. They're just going to notice that the dashboard is fast, really crisp. They don't even know all this stuff is happening, and again, I don't have to maintain it. So it's easier for me, the data engineer––fewer headaches––and I can go focus on the quality of my modeling and expanding the reach of my data. And it's easier for my end users, because it's much more self-service: it's much easier for them to curate views on this, and even further views for their specific use cases from here. So overall, you end up with performance and self-service. It's no longer a trade-off between simplicity and speed. You can have it easy, and you can have it fast, with Dremio.

Part 2 - Reference Example

The Story Begins - Connect Your Data Sources Low-ETL or ZeroETL

Alex Merced: 

So how would this all work? Let's say you were deploying Dremio today––what would this whole thing look like? And again, just a little summary of what we're talking about: Dremio is eliminating the pipelines, making everything faster, and maintaining all these accelerations for you, making everything easy, fast, and open. And because it's open, you can begin by just connecting your sources wherever they are. So in this example, I have a Postgres and a MySQL database, my operational databases. I could connect them directly to Dremio, which may be fine for some use cases, or I could use something like Spark or Flink and land them as Iceberg tables in a data lake. One reason you would want them as Iceberg tables is that when you build reflections on Iceberg tables––if I use Iceberg tables as my raw data and then enable reflections on them, as we talked about before––those reflections get incrementally updated. So basically, they get updated even faster. 

So if you're talking about having the freshest of fresh data, starting with Iceberg tables and then building reflections on those Iceberg tables is going to give you the lightest footprint for updating those reflections, and the freshest data always available in them. It makes that acceleration that much more powerful. That would be more like low ETL: with ZeroETL, I connect directly; with low ETL, I just land the raw data as Iceberg tables and then do all my modeling virtually. But again, I can also upload CSV, XLS, and JSON files directly into Dremio. Maybe I have some third-party data sets that I got from the Snowflake marketplace, or that I got access to through Delta Sharing––I can go access those, because Dremio can read Delta Lake tables along with Iceberg tables, and can read that Snowflake data you got from the Snowflake marketplace right there, making data sharing much, much easier.
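
For the low-ETL path, one way to picture that single landing step is a simple CREATE TABLE AS from the operational source into an Iceberg table on the lake, whether you run it with Spark, Flink, or Dremio itself. The source and table names below are hypothetical and the exact DDL depends on the engine you use, so this is only a sketch:

```sql
-- One-time (or scheduled) landing of raw operational data as an Iceberg table.
-- Everything downstream of this is logical views; no further copies.
CREATE TABLE lake.raw.customers AS
SELECT customer_id, name, region, created_at
FROM   postgres.public.customers;
```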

Model Your Lakehouse - Virtual Data Marts

Alex Merced: 

So if I want to enrich my data over here with data from over there, I just connect it all to Dremio, bring it all into one place, and then I can model that data. I can create folders for each of my business units, the same way you would create a data mart for each of your business units, and I can do a medallion structure––bronze, silver, gold. Basically, I'll create a view of those raw data sets in my bronze folder; then clean, validate, and move that over to my silver folder; then denormalize, do some masking of PII, and move that to my gold folder…and I have that chain of views. And in this case, maybe I see that people will query these views directly, so I'll turn on reflections at that level. I never had to create a physical copy of my bronze level; I can just turn on reflections at the silver level, and Dremio will physically maintain those summaries, those caches, those materializations behind the scenes for you, and they will automatically, transparently be swapped in when people are querying here or there. And I can do that for each of these virtual data marts. So the modeling is all done virtually, and then I just turn on reflections where it makes sense. Again, a lot of that can be recommended to me through Dremio's reflection recommendations.
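
A minimal sketch of that medallion-style virtual data mart, with a reflection turned on only where consumers actually query (folder, view, and column names are hypothetical; the reflection DDL is approximate):

```sql
-- Bronze: a thin logical view over the raw landed table.
CREATE VIEW finance.bronze.payments AS
SELECT * FROM lake.raw.payments;

-- Silver: cleaned and validated, still just a view.
CREATE VIEW finance.silver.payments_clean AS
SELECT payment_id, customer_id, amount, CAST(paid_at AS DATE) AS paid_date
FROM   finance.bronze.payments
WHERE  amount IS NOT NULL;

-- Gold: business-ready, with PII-sensitive columns left out.
CREATE VIEW finance.gold.payments_by_customer AS
SELECT customer_id, paid_date, SUM(amount) AS total_paid
FROM   finance.silver.payments_clean
GROUP  BY customer_id, paid_date;

-- Turn on a reflection only at the level people actually query.
ALTER DATASET finance.gold.payments_by_customer
  CREATE RAW REFLECTION payments_by_customer_raw
  USING DISPLAY (customer_id, paid_date, total_paid);
```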

Methods to Generate Your Virtual Data Marts

Alex Merced: 

Awesome. Again, building those views requires me to write that SQL. You remember from a few slides ago––let me go back to that so you can see it again––I would write these different queries to create those views and those reflections. But what if I have a lot of views in a complex model? How do I do that? I could just go to the UI and generate the views directly from Dremio's UI, or I could send SQL queries through Dremio's REST API or JDBC directly to Dremio.

But Dremio also has a DBT integration, and using that DBT integration, I can focus on designing my models and let DBT handle what DBT handles: knowing which tables and views depend on which other views, and creating those views in the right order. I just focus on saying, hey, this view has this schema and depends on this field from this model over here, and when I run the DBT job, it will generate all those views for me. And again, using SQL with DBT, I can create spaces and folders, create views, create reflections, and grant––well, let me rephrase that: using Dremio and sending SQL to Dremio, whether through the DBT integration or one of these other paths––the Dremio UI, the Dremio REST API, JDBC, or ODBC––I can create spaces and folders, create views, create reflections, and grant access. Everything can be done via SQL in Dremio. It's just about sending Dremio the SQL to do what you need to do. 
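
As an illustration of that DBT workflow, a model in the dbt-dremio integration is just a SQL file, something like the hypothetical model below, where the file name, schema, and ref() target are all assumptions. DBT resolves the ref() dependency and creates the views in the right order when the job runs.

```sql
-- models/silver/payments_clean.sql (hypothetical dbt model)
-- The dbt-dremio adapter can materialize this as a Dremio view; the ref() below
-- is what lets DBT figure out creation order across the whole model graph.
{{ config(materialized='view') }}

SELECT payment_id,
       customer_id,
       amount,
       CAST(paid_at AS DATE) AS paid_date
FROM   {{ ref('bronze_payments') }}
WHERE  amount IS NOT NULL
```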

Govern Your Data - Role-Based Access Controls

Alex Merced: 

But once I have all those views curated, it's just a matter of granting access to them. As the admin, I have access to everything, but all my other users start with access to nothing. What I can do is grant them different levels of access to the individual datasets they should have access to. They don't have to have access to the raw dataset to have access to a view on it––I just give them access to the views that they need.
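
A minimal sketch of what those grants could look like in SQL, assuming a hypothetical role and the views from earlier (Dremio's exact grant syntax varies a bit by edition and version):

```sql
-- Analysts get the curated gold view only, never the raw landed tables.
GRANT SELECT ON VIEW finance.gold.payments_by_customer TO ROLE finance_analysts;

-- Nothing is granted on lake.raw.payments, so the raw data stays restricted.
```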

And again, Dremio will swap in the relevant reflections behind the scenes, so they just get to enjoy the performance. They see what's there. And the cool thing is, you can also write documentation, so when they see the data sets they have access to, they can read that documentation. There's also a new feature where you can use AI to help generate it: it'll automatically generate documentation on data sets based on their schema, making it easy to provide all that information so users can understand what's going on.

Govern Your Data - Column-Based Access Controls

Alex Merced: 

But not only can I say, hey, these users have access at this level––I can also create masking rules on the tables that filter different columns. So here, what we do is create a UDF that has the logic for how we mask the column, and then we apply that to the table. You can also do it for rows. 

Govern Your Data - Row-Based Access Controls

Alex Merced: 

Again, we create a UDF that applies the logic for how we filter the rows, and then we apply that to the table. So I can apply row- and column-based access controls in a very easy and intuitive way.
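
Roughly, that pattern could look like the following. The UDF, table, and column names are invented, and while this mirrors Dremio's column-masking and row-access-policy features, treat the exact DDL (and helpers like is_member()) as approximate for your version:

```sql
-- Column masking: a UDF that decides how a column appears for each user.
CREATE FUNCTION mask_ssn (ssn VARCHAR)
  RETURNS VARCHAR
  RETURN SELECT CASE WHEN is_member('hr_admins') THEN ssn
                     ELSE 'XXX-XX-XXXX' END;

ALTER TABLE lake.hr.employees
  MODIFY COLUMN ssn SET MASKING POLICY mask_ssn(ssn);

-- Row-level access: a boolean UDF that filters which rows a user can see.
CREATE FUNCTION us_rows_only (region VARCHAR)
  RETURNS BOOLEAN
  RETURN SELECT CASE WHEN is_member('global_admins') THEN TRUE
                     ELSE region = 'US' END;

ALTER TABLE lake.hr.employees
  ADD ROW ACCESS POLICY us_rows_only(region);
```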

Deliver Your Data - BI Dashboards

Alex Merced: 

And then I might have some users who are going to be using BI dashboards on certain data sets. If I know what those data sets are, I can just flip the aggregate reflection switch, and going forward they're going to have sub-second query performance when they connect from their favorite BI tools––whether that's through the built-in, literally one-click integrations with Tableau and Power BI, or integrations with pretty much any BI tool you can think of. And you're going to get those fast BI dashboards, because when those queries come in, Dremio knows to swap in those reflections––Dremio can intelligently figure these things out for you. You just worry about asking the questions you're trying to ask. 

Deliver Your Data - Notebooks

Alex Merced: 

That data can not only be delivered to BI dashboards––through ODBC and Arrow Flight, you can also pull it into your favorite notebook for things like AI and ML. So here we have an example of basically running some queries via ODBC or PyArrow against a Dremio cluster, right there in your Python notebook. And you can build full-on data applications too. 

Deliver Your Data - Data Applications (Use Dremio’s REST API)

Alex Merced: 

So in this case, using Dremio's REST API, we have an example of a Flask application that generates an HTML page that visualizes the data. Here we have a back-end Flask server––a REST API server––that has an endpoint that receives the SQL and gets the results of that SQL, then passes them on to another page that generates the HTML for a chart. You've now created a data app that does visualizations of the data, so you can create custom business data applications right off Dremio, across all the data you have federated and accelerated with Dremio at scale. So Dremio is very powerful and enables so much: it allows you to bring in your data from everywhere, model it with little to no copies in a very easy and intuitive, self-service way, and then deliver that data anywhere. You pull the data in from everywhere, model it easily in the semantic layer, and then deliver it everywhere, to be used performantly. This takes out a lot of the friction, so as the engineer you can focus on modeling your data, and as the analyst you can focus on the questions you're asking. That way, you can get the data to the analysts faster and smarter, and the analysts can use the data faster and smarter to generate business insights and business value. 

Thank you!

Alex Merced: 

So hopefully, you guys enjoyed that. Again, my name is Alex Merced, Developer Advocate here at Dremio. This has been our discussion of ZeroETL and virtual data marts, and basically why Dremio is awesome. If you have any questions, put them in the Zoom chat, and I will make a follow-up video going over any questions that come in there. With that, I'd like to thank you all for attending. I'll see you all soon.
