Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Opening
Alex Merced:
Hey, everybody! This is Alex Merced, developer advocate here at Dremio, with another episode of Gnarly Data Waves. We're gonna be talking about materialized views and Dremio reflections, an important topic when it comes to figuring out how to best to optimize your data for [the] best performance possible.
Apache Iceberg: The Definitive Guide
Alex Merced:
Now, before we get into that, let's talk about scanning this QR code, so that way you can get yourself an early copy of Apache Iceberg: The Definitive Guide, an upcoming O'Reilly book on the topic of Apache Iceberg. Let's just scan the QR code––this is about 180 pages in the current early-release version. The final manuscript will be about 300-plus pages. So a lot of content enjoy so definitely look forward to that file release early next year. Also, make sure to go try Dremio. If you haven't tried out Dremio Cloud, head over to dremio.com and click on Test Drive. It's no cost or obligation. You're not creating your own account, you can just try it out, and get hands-on. You just try it out and see what you think. Just go ahead over the dremio.com.
Dremio Events
Alex Merced:
Now, make sure to visit Dremio at any of these events, we’re gonna be at Big Data AI Paris and Big Data AI London, which I think might have passed by the time since this is airing, but in October we will be at Coalesce by DBt. And me and Dipankar Mazumdar, the developer advocates here, and Jeremiah, we're gonna be in Austin for Data Day Texas in January, so make sure to come. Join us there for an Apache Iceberg Ask Me Anything session where you can ask us any questions you have about Apache Iceberg.
What Do We Want?
Alex Merced:
So now onto our feature presentation, talking about materialized views and Dremio reflections. So bottom line, let's just kind of anchor this in––what is the point? Okay, what do we want? We all want to query our data and get business value. That's really everything else is just a distraction. So anything that gets in the way of me being able to query my data and discover those insights that derive business value are things we just want to have abstracted away. At the end of day, all I really want is for my analyst to be able to drill into insights, so we can make business decisions that derive value.
What is the Problem?
Alex Merced:
And that's usually straightforward enough, but what happens [when] your data gets bigger? That gets harder to do––queries start taking longer and longer because you have more data that you're processing. Now, there's a couple different ways you could fix that problem. There's the expensive way and the money saving way.
The expensive way is to throw more processing power at it. So this means, like bigger clusters more powerful instances on those clusters, which means a bigger compute bill. It'll get those queries done, but it'll be expensive.
What is the Solution? - Materialized Views
Alex Merced:
The other way of doing it is scanning less data. So as a solution, what we can do is instead of scanning all of the data, [we] scan a subsection of the data. And that's essentially the premise of the solution that we [call] the materialized view. So let's pretend that we have this table that’s 3 terabytes called voters. So it's a table of voter data. And you know, maybe right now, operatives in Ohio are doing a lot of voter analysis to plan canvassing efforts and whatnot.
So we create a view. Now, just a view––this is a logical view, meaning it's not a copy of the table. It's just essentially a preprogrammed SQL statement, so it's just a query I have pre-saved. But I'm still running the query against the original set of data. So I had this view for just the voters in Ohio, and we named the view voter_ohio.
So this again––not a view. But we want to kind of speed things up, and we don't have to scan the whole table. So [what] we do is we create a materialized view, which is a physical copy of the results of a query. So we would run that query for all voters in Ohio, and create a materialized view from it. And we had to give it a name, so we'd call it voter_ohio_matt, so that we know it's the materialized view. Now, when I scan this, it's only going to scan that subset. So, I mean much faster, because it's already starting with the subset of just Ohio voters, instead of starting with the full table.
Problem: Juggling Namespaces
Alex Merced:
That's good, [but] why? Here's the thing you need to know about this––so if you're a data analyst, someone has to tell you that you need to query this namespace, because otherwise, if I still query the voters table, I'm still doing a full table scan. If I query the view voters_ohio, but looks like it should work…but nope. Full table scan––okay, these are unaware that this is here to use. The only time it's going to use that physical copy is when I query that specific namespace, in this case voters_ohio_matt, and then I'll scan the materialized view. So even though the materialized view can then help me have a faster query, there's still really decent chance for people to accidentally scan the full table
because they require a certain awareness and effort to use the materialized view.
Other Problems
Alex Merced:
On top of that, if you have multiple views that you want to make, you might have to make multiple materialized views. If I wanted to create a materialized view for Illinois, for Nevada, then, in that case now, I have to create a materialized view for each one, and you should want to make a physical copy, and each one's going to be another namespace.
So now I'm multiplying my storage and namespaces quite quickly, and basically, I have to figure out how to keep it synced, because materialized views automatically update when the underlying data updates. So that 3 TB table updates; that materialized view may not necessarily update.
So now, I got to figure out: how am I going to sync it? How am I going to refresh this materialized view? There's all these kinds of issues. So these are typically the challenges that exist when you're using materialized views.
Raw Data Reflections
Alex Merced:
Now, data reflections attempt to improve upon all those challenges. So the way a data reflection works in Dremio is this, just a knob you can turn on. As simple as you can just turn it on, on any table or any view you have in your Dremio catalog, and what it'll do [is] it creates a representation of that data, but it doesn't just create a one-for-one representation, it creates a representation specifically in the Apache Iceberg table format. So just by doing that, you're gonna get improvement in performance, because of the Apache Iceberg metadata structure in a way, to help make smarter query plans. And Dremio is very optimized for scanning Iceberg metadata. So right there, you're gonna start seeing a performance boost. But the data reflection can also have custom sorting and partitioning rules. So I can say, you know what, I had this large table of voters. That's not partition. It's just a large table of voters. But I want you to create a data reflection that hasn't [had] the data partition by state. So already, we're gonna see benefits, because that data reflection is partitioned and it doesn't create a new namespace.
Reflection Matching
Alex Merced:
Okay, so the bottom line is, now, if I were to do that––so let's say we create that data reflection, what's gonna happen is that you just scan the things you were going to scan. So if I run a query on the voters_table_Dremio, under the hood, it’s gonna be aware that reflections were turned on on the table, and that there's this sort of physical Iceberg representation of the data and say, okay, well. you know what, we’re gonna use a data reflection to scan it. And then, if I do voters_ohio, which generally comes from that view, it's not the same namespace. But even though it's not the same namespace. Dremio is aware of the relation of this view to the table that's been reflected and we’ll know, hey, you know even better that this data reflection was partitioned by state, and take advantage of that and basically speed up the querying of that. We’ll basically use a data reflection to process that query, so those queries are going to get faster, and the end user didn't need to know a particular namespace. They didn't have to think, well, you know, to take advantage of the reflection. I gotta query X voters_ohio, reflected. No, they can just query the voters table. They can query the voters_ohio view, because if Dremio knows that the particular view you're querying is derived from another data set or view that has a reflection on it, it will check to see if that reflection would apply or would help the card query. So you get and get much more bang for your buck.
This makes life easier for the data engineer, because the data engineer doesn't have to worry about syncing, because that reflections will be auto-synced, they will be refreshed periodically to match the original data. You can also trigger manual refreshes and then it's easier for the data analyst, because they don't have to think about juggling this bunch of different namespaces. They just query the tables that have the data they want, and
Dremio will figure [it] out, if there's any reflections that would accelerate the query. And even better is, you can create multiple reflections, because, going back to the materialized view world, you might create the same materialized view on the same data multiple times, because you want the data sorted differently, or partitioned differently to optimize for different queries.
Multiple Reflections on the Same Data Set
Alex Merced:
But now people need to know which namespace they'd have to do for each different setup of the table––not with reflections!
With reflections. I can create multiple reflections, each one with a different partition rule or different sorting rule. And again, it doesn't change the namespace. So what's going to happen is that a user just queries that voters table, and based on the query, it will then get routed to the reflection that will best handle that query. So in this case, we have 3 reflections, one that's partitioned by state and then sorted by ZIP. So each partition has all the data for predicting state, and then sorted by zip code within those files.
In this case, we have one that's partitioned by party. So basically, each partition has data files only covering that party. And each of those files are sorted by the state that they're in. And then we have a third one that's partitioned by the first letter of the last name, because you can use Iceberg transforms in your partition rules, which is pretty cool because that gives you extra flexibility in how you partition these reflections––it's sorted by state. So in this case, a query to select from voters from the State of Ohio, it'll automatically go to here, because that's already partitioned by state and Dremio will recognize that, and just send and use that reflection to execute that query. Me, the analysts, don't even have to think about it. I don't have to be aware of all these reflections. Dremio’s just matching it behind the hood. Me, as a data engineer. I just create the data reflections that I think I need based on the query patterns that I'm seeing.
Okay, that gets even easier. We'll talk about that a little bit later on. I just match it up, and the matching will just happen. I don't need to go tell the analyst, hey, just so you know, query these namespaces going forward, they're just gonna start seeing a benefit. They're just gonna start seeing faster queries. So it makes life again easier for the data engineer and easier for the data analyst. But in this case, people who are now praying for voters in Ohio are going to see a performance benefit. And the cool thing is, I don't necessarily now have to go create a separate thing for every other state, since the table is partitioned by state. Basically, everybody, for every state is going to start noticing a performance benefit, because that partition's going to do that. And then if people start querying it by party, they're also going to know its performance boost, because it'll use this reflection instead for those queries [whether] we square based on a specific party, or if we query based on a specific letter and/or a specific last name. Or even if I just said, hey, a last name––that's you know, the last name, Sarandon, you know. It'll automatically go find the partition for people whose last name starts with S. And then it'll use this reflection to do that, because this reflection is partitioned that way.
Logical Views Benefit from Reflections
Alex Merced:
And even better, as I mentioned earlier, views that are derived from those data sets also benefit from the reflections. So if we do create logical views, we're not duplicating data, but we create logical views that cover different things, like voters_ohio table or a voter_green table. Those aren't going to result in extra copies of the data, but they're still gonna be performant because they're gonna be benefiting from the same group of data reflections that have the different partitioning and sorting rules. So no matter what the query pattern is, I can have data reflections that kind of lead into it. Now, again, you don't want to proactively make data reflections for every possible query pattern––you create them as you notice the story––query patterns that are regular and need an extra performance boost.
You don't want to make the mistake of prematurely creating data reflections and then incurring any storage costs associated with those reflections when you didn't need to. But when you do notice that there's certain regular query patterns that need a performance boost, data reflections give you a very cheap way to do it, and a very easy way to do it, overall, you end up spending less, because, instead of me having to create a separate materialized view for both these situations, this one data reflection will handle both. So basically, you're not creating as many copies of the data, depending on how you set up your reflections.
Other Raw Reflection Use Cases
Alex Merced:
Now, other use cases you have for these raw reflections, where you're basically creating physical Iceberg representations of the raw data, could be for optimizing non data lake sources. So, for example, you might be connecting a Postgres database to Dremio or Snowflake to Dremio, or using JSON and CSV files. Now for database and data warehouses, queries are gonna get pushed down, which may not be a behavior you want to be seeing. Because, you know, this Postgres database is my transactional database from my application. I don't want a bunch of queries from Dremio, also, [it’s] clotting up the compute for my transactional application, or my JSON CSV files are just generally not going to be as performant as something like Parquet. So by turning on reflections on any of these, you're going to create that Iceberg representation that is backed by a bunch of Parquet files,
and it'll use that to run those queries. But again, it'll keep it in sync, so periodically it'll go back and check the database and update the reflection based on what's inside the database.
But the queries, the analytical queries coming in, will be against a reflection, so they'll be faster. And you're not necessarily disrupting the resources available for the database itself. But again, you don't have to think about how to keep the 2 in sync––reflections basically handle a lot of these things you would create data pipelines for and basically mix it with those data pipelines [that] are not necessary. The data reflections, in a sense, become your data pipeline, because you're not creating as many duplications of the data that you need to manually keep in sync. Reflections are doing all that thinking for you.
And also, if you're using patterns like ELT, where you load the data first, you connect the raw data to Dremio, and then start doing your normalization or denormalization on Dremio. But what you could do is you could denormalize the data––so let's say I join some fact tables that I have, such as in a Postgres database and in a Snowflake table, and then I join them together. And I have now, with this joined view, I can turn on data reflections on the join view set, and they'll create that physical representation of the join. Then again, It'll refresh and keep it in sync with all the underlying sources, and then that reflection will then accelerate any queries of the join, and any views derived from the join.
So there's a lot of really strategic ways to get a really good bang for your buck here and again to your end users. They're just querying the things they would normally be querying, they don't have to really think too hard about taking advantage of this, it just works.
Aggregate Reflections
Alex Merced:
And there's other types of reflections––so in Dremio, there's also aggregate reflections which can be turned on in any view or any data set. And what this does is that instead of creating a physical representation of the raw data like in raw reflections, you're creating a physical representation of the aggregated information––so like count, min, max, average, the kind of data you use for a BI dashboard. So in this case, this is really, really, really helpful when it comes to BI, because usually what happens is your big table is just too big, and then every time you change the setting on Tableau, you wait like 5 min for the Tableau to update, and that’s just not going to work for you. So what do you do? You end up having to make a physical copy of [the] BI extract, the smaller subset of the data, to create multiple dashboards that are somewhat faster. But now we have to worry about––how do you keep this data in sync and data freshness and regulatory concerns? You would rather just have the BI tools work directly from the raw data, but you wanted to do it performantly, and this enables that.
So basically Tableau, or powerBI, or superset, or your favorite BI tool choice, can connect to Dremio through a live connection. And then when I make changes to the dashboard––let's say I change the dashboard, so that way I see total voters by state––a live query is sent to that live connection to Dremio. But since Dremio knows that there's an AG reflection on this data set, instead of you running a scan of the full data set to do this long-winded aggregate computation, because it's got to go through all the records in that column to create and reduce it to a single value, what it can do is it can use the pre-computed AG reflection, and you get that subsecond response, because the number already has a number ready waiting for you to update your data. You can now have much faster BI dashboards using these aggregate reflections. So generally, if you ever had a problem with. you know, BI dashboards that just weren't performant enough, this is generally the answer. You got to see it to believe it. It's pretty amazing.
New Features
Alex Merced:
Now, there's many new features that are coming to Dremio, or actually are here. These are features that we've just announced within the last week, and they're pretty cool. [The] first one is incremental reflections on Apache Iceberg tables. So generally what happens is, when a reflection refreshes, it would normally have to recreate the reflection. So essentially, if you have a really big table and you have a reflection on that table, well, it's gonna take a while for it to recreate an entire reflection on that table. So even if you set the reflections to update every 5 min, if the process takes longer than 5 min, this could create a not ideal situation. Now you can incrementally refresh reflections, if the source is an Apache Iceberg table. So more reason to make sure that all your data is an Apache Iceberg! There's plenty of reasons for that. But if your data is in Apache Iceberg, then what happens is that Dremio, when that table updates, it's able to just capture the differences in the change. So it could be like a 5 petabyte size table, but when the table gets updated, Dremio is able to discover the changes to the table, and then use that information to update the reflections right away in a very lightweight manner. So that way, basically, everything stays in sync pretty much real time.
So that adds a tremendous amount of value to this whole process. Also, whatever I mentioned, you don't want to just create reflections on every possible query pattern. There's now a reflection recommender that'll analyze your query patterns and recommend which reflections you should create to get the best value for yourself, and it'll give you the SQL to create those reflections.
And then there's reflection hints––so sometimes you don't want to use a particular reflection, or you want to only use a particular reflection. So reflection hints are going to be like little hints you put in comments in your SQL to tell the Dremio engine to change its behavior regarding reflections for any use case that pops up, or if you just really just want to turn off reflections for that particular query, for whatever reason.
Closing
Alex Merced:
Okay, so yeah, that's a start with reflections. It's a pretty amazing feature––again, it's not just giving you sort of something like materialized views. It's better. But it's also something that pretty much eliminates many of the pipelines you probably already have, because a lot of the pipelines you have are to manage multiple copies of the data optimized for different use cases. In this case, this just automates that process you can create, you basically tell it, hey, I need these versions of the data that are sorted and partitioned in these ways. And it's just gonna maintain the sinking, and handle the matching of queries to it. So that way it's easier for the engineer, and it's easier for the analysts. It's a very powerful feature. It's one of the key things that make Dremio the powerful platform that it is.
So hopefully, you guys enjoyed this presentation! Head over to dremio.com and try this out for yourself.