March 2, 2023
11:45 am - 12:15 pm PST
Partition and File Pruning for Dremio’s Apache Iceberg-backed Reflections
Learn how partition design can be used to improve query performance in the context of Apache Iceberg and Dremio Reflections. We will start with an overview of Dremio Reflections and how partition design plays a key role in query planning and execution. Then, we will look into some Dremio Sonar query execution profiles to see partition pruning and stats usage in action.
Sign up to watch all Subsurface 2023 sessions
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Thank you guys for attending today’s talk. I’m a principal software engineer at Dremio. I basically lead the Reflections development. Today’s talk is a pretty technical one. It’s about partition and pruning in the context of Dremio Reflections and Apache Iceberg Let’s get started. In terms of the agendaI just want to quickly talk to you. What are exactly Dremio Reflections? Then we’re going to talk about how partition pruning works during query execution. We’ll look at some query plans. Then also talk a bit about how partitioning stats can help us with query planning in terms of selecting the correct Reflection to use logical planning. And finally, we just opened up the Q&A.
What Are Our Dremio Reflections?
What are our Dremio Reflections? Dremio Reflections is basically a technology that we have to accelerate user queries. The way I like to think about it is, it’s basically a Relational cache that’s basically backed by materialized views. These views are transparent substituted into user query. Which means that as a user writing a SQL statement, they don’t actually have to directly reference the materialized views, you put ’em in. To illustrate as an example, let’s say you have this simple SQL statement, like select c1, sum(c2)group byc1, and you’re joining three different tables. Typically when you do query planning, you are going to convert this into a sequel node tree and then into a Relational Tree.
Then at that point, this is when Dremio’s planner will then figure out how to substitute Reflections into this query plan. One of the things that we do is we will basically normalize queries into various alternatives that make substitution matching easier. For example, in this example, you can see how this query gets turned into a Relational Tree that has an Agg. on top of a Multi Join with these three tables. The reason why we have a Multi Join is because suppose these are inner joins, then you can swap the tables in any order and still get the same result. The query basically gets normalized into this form. Then we’ll look for our Reflections that can match into this query.
Suppose we got a Reflection that has materialized and an expensive join between table A and table B. Then we will basically rewrite the query so it looks like this. Whether this plan actually gets used in the end, gets selected in, of course, it’s still going to depend on costs. That’s something we’ll talk about a little bit later. Dremio Reflections is basically married to Apache Iceberg because the Reflection is physically stored in Apache Iceberg table format. Then we basically leverage the Iceberg Metadata to do all the partitions involved pruning. To give you an example, in the future slides, I’m going to be showing you a bunch of plans that are based on the San Francisco Police Incident Dataset.
In all the times when customers use Reflections, you are going to catch the same piece of data in different ways. Depending on the user query access pattern. Maybe you have a Reflection that’s partitioned by category by day and by day of week. Each Reflection , as I said, is stored, is materialized, in its own Iceberg table. Then the Iceberg table itself, underneath the covers, is going to organize its underlying Parquet Data Files by the Iceberg Tables partitioning scheme. That’s what I illustrated here. I think most people already know what Apache Iceberg is but underneath the covers, the way I think of it is, it’s a lot of metadata around your data files. In our case, it’s Parquet Data Files.
At the bottom you have lots of Parquet Data Files that represent the data inside your Reflection, maybe partitioned by category. Then on top of that, you’ve got manifest files. For each manifest file, you have a range of partition values, and they point to the data files, and then that’s organized together into your manifest list. Then because you have different snapshots of your data than you have these different snapshots, and then finally, your catalog points to the most recent version. The table format is a lot of metadata, but around these Parquet Data Files.
How Does Partition Pruning Work During Query Execution?
The second agenda topic is how does partition pruning work during query execution? In these examples I want to illustrate using the San Francisco Police Incident Dataset. In this dataset I’ve got a pretty small data set. It’s 2.2 million records based on the year range that I pulled off the internet. In order to show you the split involved pruning in action, I artificially set the Max Parquet File size of 1-megabyte. In reality, you usually want to set your Parquet File sizes to 256, 512-megabyte sizes. For the demo purposes I set it really small so that we generate a lot of files, a lot of splits. In this dataset, I just want to point out that the category field has 39 distinct values and the day of week, obviously has seven distinct values.
We’re going to be looking at query execution. We’re already done with the planning. Let’s make an assumption that you’ve got to use your query access pattern that can benefit from a Reflection like this. This Reflection is partitioned by category and then sorted by day of week. Once you build this Reflection, then this Reflection is stored as an Auto/Parquet Table in your distributed storage. Let’s take a look at what that Iceberg Metadata looks like. The first thing that you have is your manifest list file. The manifest list file lists all the manifest files and for each manifest file, you have your lower and upper balance partition value. What this means is that, if you’ve got a query that comes in that filters on, for example category equals arson.
Then you only need to open up this particular manifest file to find your Parquet Files. You can prune out all these other files. In this example, just keep in mind that I have 18 manifest files in this Reflection. Now to drill down this hierarchy of metadata, let’s look at the manifest file associated with the LARCENY/THEFT partition value. What you’ll notice here is that this partition value contains 21 Parquet Files. Because I did a local short on day of week, these Parquet Files are sorted by day of week. What that means is, that if I filter on this partition value and a particular day of week, I’m likely not going to need to open up every single Parquet File. For example, Friday is going to be localized to a much smaller number of Parquet Files.
It’s not going to be scattered across all Parquet Files. That’s going to be a second level of pruning that you can do. In addition to any other additional columns that you filter on, this manifest file also has lower and upper bound values for every column that’s in the Parquet Files. Finally, I want to point out that if you were to just run a simple unit command, like, find all the Parquet Files and count the lines. You’ll come up with 115 splits or files. In this particular Reflection, that’s stored in Iceberg, I have 115 physical Parquet Files. Let’s take a look at the query execution and see the partition and pruning in action.
Partition and Pruning In Action
Suppose you have this query, you’re going to group by PD District. Then count one and then filter on which category is equal to LARCENY/THEFT and day of week is equal, these two values. Because you know category is a partition of one of the partition fields we’re going to see, we would expect to see partition pruning on this category column. Now, if we look at the physical plan for this query, this is just a snippet, but the way that you read these query plans is you read them bottom up. 02- 03 is the first operator, which corresponds to the manifest list false scan, and here we can see that it has pushed down that partition filter where the category is equal to LARCENY/THEFT.
This will do the first level of the manifest file pruning. Then the next operator on top, 02-02 is now scanning the manifest files, which are looking for which Parquet Files so I need to go and scan. Here we’ve also pushed down both the category and day of week filters. Remember, because we did a local sort by day of week, we’re going to expect to prune out many, many Parquet Files. Then finally 01-04 is the operator associated with the Parquet File Scan. Here we’re going to do additional pruning, because now the row groups, we’re going to do row group header pruning. Here, because the Parquet Files don’t scatter the days of week values, we’re going to also be the rows that we read from each Parquet File are likely to be from Saturday and Sunday. There may be other ones, but most of them will be Saturday and Sunday.
Inside the Actual Execution Plan
Let’s take a look at how this looks inside the actual execution plan. This is a screenshot from Dremio’s query profile analyzer. These are the actual number of records coming out of each operator. For the Iceberg Manifest list sub-scan you can see that seven records come out for this query. We basically pruned out six files, so we’re left with seven. Only seven manifest files came out of the Iceberg list sub-scan. This is the first level pruning where we filtered on category is equal to LARCENY/THEFT. Then if you look at the second, if you look at the operator above 02-02, this table function is associated with the manifest false scan. Here you can see we pruned a lot of Parquet Files. We went from 115 down to seven. This is the seven Parquet Files that we have to open up to satisfy that user query.
Then 01- 04 corresponds to the Parquet data files scan. Then you can see that we open up the seven Parquet Files, and we scan about 150 K records. Then on top of that, we have to go and apply the Saturday Sunday filter, and that results in 140 K. You can see that this number and this number are pretty close together. Most of the data as I said before, that comes out of these Parquet Files are associated with Saturday and Sunday because we sorted the data by Saturday by day of week. Another interesting thing to point out is you can see that there’s like this lightning bolt icon. That tells us that our data was scanned from the materialized view or the Reflection.
Just to look at what happens if you didn’t have a partition. If you didn’t have partitioning of sorts, this is how the query plan would’ve looked like. In this case O2-02 corresponds to the manifest files scan, which produces the Parquet Files that you have to each edge you have to go and open up and look. Here we have 89 files, 89 splits. This is significantly a lot more than seven. Then another thing to note is that when you actually go ahead and scan the Parquet Files, now we’re coming out with 480,000 records as opposed to 150,000 before. That’s because each Parquet File now contains other days of weeks. Now we have to do additional filtering on top to get that down to 140K. It’s pretty clear, if you want performance, one way to do it is to design your partitions and your sorts correctly so that you maximize your pruning at all the different levels. At the Iceberg Manifest List File, the Manifest File level and also, the Parquet Row Group level.
How Partition Scans Can Help With Query Planning
Let’s take a look at our third agenda item, which is how partition stats can help with query planning. When it comes to query planning you have to figure out what plan will have the lowest cost. There’s going to be lots of different alternative plans. In this example, I have three A, B, and C. Which one should I use? Maybe I have a combination like A and B or A and C. Once I figured out which combination of Reflections should I use in my query, I also need to do join ordering. For example, if I’m doing a hash join, I want to have the bigger table on my probe side and smaller table on my build side. In order to make these types of decisions, you need a cost-based optimizer that has good estimates of the row counts coming out of the different scans. Partition stats can help with this because it can help you get more accurate row count estimates.
Let’s take a look at an example. Suppose you have a query that looks like this, you’re just projecting three columns, and then you’re filtering on category equals burglary. In this case, dir0 is an implicit partition column in the underlying dataset. Meaning that the underlying data set’s already partitioned by category here. Then also filtering on day of week equals Sunday. When planning this query, this is one option, which is to go directly just scan the base table, with no Reflection. That’s because the base table is partitioned by category. Now, let’s imagine for this example that I also create two additional Reflections. Here I’m going to partition by day of week and you can see it’s checked here.
Now I’m also going to create a second Reflection that doesn’t partition by any columns. Oftentimes customers may do this because maybe they want to accelerate slow physical storage. Maybe the data is stored in JSON or CSV, but they want to have the data physically stored in Parquet File comstore format, which is much more optimized for querying. These are three different alternative plans that could be considered for entering this user query. Let’s take a look at the logical plan in terms of how we determine which one we should use, or which one the planner should use.
This is the first logical plan. This plan turns out to be the best plan to use. This is the one that doesn’t use a Reflection in the scan base table. If you look at the logical plan, it’s pretty simple. It was just a project on top of a scan. Here the scan, if you look at the table involved, this is the actual underlying base table, Incidents. We can see that, first of all, that a partition filter was pushed down. Here it was in the category equals burglary, Then besides that, there were also the other filters, like the day of week, this non partition range filter here on Sunday. Now the estimated row count for this scan was 13,000.
Where did this number come from? This partition filter was pushed down. We were able to access partition stats on this Iceberg Table to get a very exact row count for how many rows are in burglary. Based on this number, then we further reduce this using heuristics based on the filter on day of week equals Sunday. That number is further reduced by some heuristic factor and that results in 13,000 estimated row counts. The cost based planner is trying to figure out which plan to use. This 13,000 is very key in terms of trying to cost which plan to use.
Let’s look at what happens with the second plan. In the second plan you can see here this is the one that is scanning the materialized through, that’s partitioned by day of week. In this case here, you see that the table here that’s used is the actual Reflection. This is the Reflection ID, and this is the materialization ID. Then this case, the partition filter that we pushed down was the one that’s on daily equals Sunday. For this particular partition value, we were able to get a very accurate row count. Then from there, we further reduced it based on that category filter. This resulted in 17,000 rows. Because 17,000 is greater than 13,000, then this plan was not selected.
Then finally, if you look at the third plan, the one that had no partition. The Reflection that had no partition, you can see here, again, we’re still scanning a different Reflection table. Reflection is a different Reflection ID, but we come up with a much higher row count estimate. Where did this row count estimate come from? This row count estimate comes from knowing that the total number of rows in the table, which we have, and then we further reduce it using rules in terms of, okay, I’ve got a category filter equals burglary, and another one her day of week equals Sunday. We further reduced the count, but it’s not going to be as accurate as like the first two plans where we actually had the partition stats. I just also want to point out that we do support analyzing table commands on Iceberg Tables. That can help you generate more accurate stats that we can use during planning.
In summary your Reflection partition design is going to be based on your user query access patterns. What that means is, that if you’re trying to accelerate some dashboard, and your dashboard always filters on some column or always prompts on some column, and that column is low to medium cardinality. That would be a good column to consider, like partitioning on your Reflection. This piece of light is especially true when your Reflection is placed on the top of your semantic lyric in Dremio. When it’s right underneath, your dashboards or reports. For the second bullet fundamentally, the reason why you’re partitioning and sorting your Reflections is to really fundamentally minimize you the expensive data access.
If your Reflection has a hundred thousand splits, a hundred thousand files, you want to minimize how many Parquet Files you actually go and have to open up and scan, to satisfy the user query. Then as I showed you that, the Reflection design, I mean, the partition design is important across all phases of planning. In planning, you can see how it helped the cost space optimizer take which Reflections to use, and help with joint ordering. Then also, like during execution, you’ll see the actual splits being pruned. It was pruning at different levels. It was pruned at the Manifest File level, Manifest List file level, Manifest File, and in the Parquet Files, Parquet Row Groups.
Everything I’ve shown you here is that pruning logic, even though I showed you in the context of Reflections, it’s true for any Iceberg Table that the Dremio Sonar core engine queries works. If you look at the plans, it’s the same logic. You can observe the pruning happening and the partition filter push downs. Finally, I just want to point out there’s two more resources you can go to learn more about this or get help. First of all, today’s presentation was based on a blog that was published maybe two weeks ago on the Dremio website. It’s called this, you can just go directly to Dremio’s website and read more about everything I talked about today. The blog contains the full query profiles. You can see the full plans for both logical and physical planning. Then finally, we’ve got a Dremio community forum online where you can just ask questions and get help. I’m pretty active there, so if you ask me, I’ll respond pretty quickly. Thank you for attending my talks.