March 1, 2023

12:20 pm - 12:50 pm PST

Listing Files on Object Storage is Slow How Dremio Addresses That for All Datasets With Apache Iceberg

Listing the files needed for a query on object storage can be fairly slow. Unlike other query engines, Dremio built a metadata caching layer to solve this and improve query performance. Because this metadata was cached, it did not need to be retrieved at runtime, which can be slow on data lakes, resulting in the ability to provide sub-second response times. However, as data and deployments grew, over time it became clear the caching layer’s architecture needed to be updated.

The change that was made was to still address the problem of listing files on object storage taking a long time, but leverage Apache Iceberg to address the issues that eventually arose in the previous architecture of the metadata caching layer.

In this talk, we’ll go through why Dremio caches metadata, the original architecture, the pros and cons of that architecture, why the new architecture relies on Apache Iceberg, and the net benefits of this new architecture that Dremio administrators and users experience.

Topics Covered

Iceberg

Sign up to watch all Subsurface 2023 sessions

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Balaji Ramaswamy:

How Does Dremio Handle Querying Metadata?

So I wanted to talk about how Dremio addresses the concept of when a database or even a query in general, like Dremio, has to plan a query it needs metadata, right? 

Without metadata, it’s going to be difficult to generate an optimized plan. Now, obviously, there are two places where you can collect data. You can collect it during query runtime, and that is probably going to be expensive because that’s going to be a heavy lifting work, right? You have to go open the Parquet folder, look for all the statistics that the Parque file has, and then use those statistics to estimate row counts, what joint, what exchange is needed. So Dremio came up with a solution where it pre-caches this metadata by collecting metadata in the background.

And the frequency can be set every hour or every two hours, depending on your data freshness. Or you can also do ad hoc on demand metadata collection. Now, the biggest problem this solves is when the query is executed, the planner says, oh, I have all the information needed to plan this query in a highly optimized way. And then the execution is also perfect, right? This was working out very well, but obviously there were a few challenges that the version one of this metadata collection had, right? So this whole metadata process was done on the coordinator and the coordinator is obviously a single node, and because it cannot be scaled out as deployments grew bigger, production data grew, more use cases came in.

The other issue was the metadata was stored locally on the coordinator, even though it was a fast SSD or it was on a network at our storage. It was, again, a single point of failure, right? Like, so if your local crashes, then your Dremio’s metadata layer is absolutely needed to plan the query until you restore the backup and bring back Dremio, that metadata layer is not available. And finally it was also the whole topic about listing files is, obviously in the lake big data sets have tons and tons of files. And a rerun of these metadata would do a full refresh every time. So, right, it would go and refresh the entire copy of the Parquet data set on every round.

Benefits of Using Unlimited Splits With Dremio

So what we did at Dremio is we came in with the concept of unlimited splits, where Dremio could now make this even more performant and optimized. And how we got there is that Dremio wouldnow list only the incremental files. So if, say, 10,000 Parquet files under a partition and you’ve just added one new Parquet file, then Dremio would only refresh that Parquet file. Because it already has information about the older thousand files, which have not changed. The second thing that was done is this heavy lifting work was taken off the coordinator because it was a single node and it was given to the executors. And when we have executors, you obviously can scale horizontally depending on how much work the metadata layer has.

And, that brought in a much more powerful engine because executors are one too many from the coordinator. And the third thing we also did was we would move this metadata layer into the distributor storage. So if your data is on S3, we coexist the metadata also on S3 so that there is no downtime. S3 doesn’t go down, and it’s always high availability. Now I want to show a Dremio profile that walks through this entire process, and that will be very clear, right?

Ben Hudson:

Do you want to share some stuff on the screen as well? Oh, yeah.

Balaji Ramaswamy:

Yes. I have to share it now, sorry.

Ben Hudson:

Yep. Feel free to go ahead and share whatever media you have.

Balaji Ramaswamy:

I’m already sharing. Should be coming on the screen, right? It says stop sharing.

I’m trying to see where the share button is, sorry.

Using Unlimited Splits to Reduce Metadata Refresh Time

Oh I got it. Good. So you should be able to see my chrome tabs. And, if you see, we have a table called catalog sales. This is from the TPC-DS benchmarking data. And when we refresh metadata for catalog sales, and you go look at the visualized plan, Dremio will only refresh what has changed, right?

So it would go and list what is there on S3 for this data set. It would list what Dremio has got, has already learned from the previous metadata pressures, and then it’ll only do a full outer join and filter out only the differences. This way the refresh data set is super fast. That’s number one. And number two, you could also see that this work is distributed across executors now. So they’re all going to multiple nodes instead of one node. And there is also very detailed instrumentation on where the time was spent. And if the metadata effort was slow, then we can find out that the listing took time, or the actual scan of those Parquet files to collect metadata was taking time.

So this way we were able to eliminate a single point of failure on the coordinator, distribute the metadata refresh work across executors, and only do incremental work. So those three gave an excellent performance benefit over the old method. Now, what we want to do next is to look at a profile where Dremio is actually running this query on a partition column. So this column is a partition column in Parquet, and there is also another column, which is a non partition column, and which is also used as a filter clause. And we want to see how Dremio uses Iceberg and how the metadata was calculated for this. And the query performance was optimized by using partitioning, right? So, if you go look at the planning tab, which gives all the details about how the query was planned, you can see that there is something called an Iceberg manifest list table, right?

So I think sometime back somebody gave an in-person presentation about the Iceberg flow, where there are two types of files in the Iceberg flow. The manifest list file, which actually stores the paths to the manifest files. But not only that, it also stores the max and min partition values in each of these manifest files. So what wound up happening here is Dremio used that information, and if you can go look at the profile here, it’ll actually tell you the manifest list filter expression is for the partition column, right? So we are applying the max and min here, so that if there are 20 manifest files, because we have the max and min part partition values, we are applying on each of these files, Dremio would probably only read two out of those 10 manifest files.

Filtering Metadata Based On Partition Values

And I’m also going to show you how we are going to see how many manifest files and data files we did. Then using that information, Dremio again applies the filter. So you can go look at the manifest read, which is again, in the Apache Iceberg flow, where it applies this filter to the manifest files. Now, the manifest files contain paths to the data files. Not only that, it also contains the different partition values in each of these data files, because that information is stored, and this is applied to the manifest files. Dremio will say, oh, there are 2000 data files, but this partition value only exists in two files, for example. And Dremio will only read those two files. So any filter you see in the manifest list scan and the manifest scan, which you can see the filter, again, the same filter, are actually partition values. So if you look at the column name, you can see that it’s a partition value, but if you remember in my query, I also gave another column name. Now, where is that column name, right? So let’s look at that. So we go back to the planning tab now, and you go to the actual Parquet scan. It’s again called table functions, so it’s confusing, but if you look at the table operator 105, you can actually see the table name obviously, because if you go to the right.

You can see the table name, which we just used, catalog sales. But in the scan, you can actually see the second filter.

Partition Pruning With Dremio

Now, this is not partition pruning, as everybody may know, and this is what some people call this predicate push down or filter push down. Now what is the difference, right? So the difference is in partition pruning, we don’t even touch those files or directories. We just completely ignored them. But in filter push down, we have to actually open the Parquet folder, look at those values on the Parquet statistics of max and min, and then basically we can either prune the file completely if that value does not exist, or you can prune some row groups or prune some pages. So obviously partition pruning is most desirable, but sometimes, or many times end users and business users also run queries on non partition column filters. And that time definitely the Parquet Folder has to be opened. So if you go look at one five and you see a pretty high wait time on one five here, then that means that Dremio has opened and closed too many data files, right?

And maybe S3 or HDFS is slow. So one other indication, what you can do is you can go to the operator details of one five, I’m sorry, operator metrics of one five and Dremio will actually tell you the number of pages it pruned, right? So based on the filter, how many pages read and how many were pruned, right? So if it’s a very high selectivity filter, you should see a lot of pages pruned. Let me also tell you how to find out how many manifest files and data files we read. So if you go to the actual manifest files scan you should be able to tell the number of data files and number of manifest files, right? Just to say how much we proved, I also did a count from the same table. And if you actually go to the table function here on, you can see that this table has 1,840 data files.

How Iceberg and Dremio Work Together to Filter Metadata

So out of 1,840 data files, because we went and stored all this information in the Iceberg layer during metadata collection, Dremio ended up only reading one data file. Doing only one data file, right? So it’s highly desirable that this flow is required, but the only requirement for this flow to be required is you have to turn on unlimited splits, which is documented. And the second thing is the file format to take advantage of this has to be either Parquet or it has to be ORC, which is another column format file format or ARO. If you have one of these file formats in your data layer, and once you turn on unlimited splits, which by default is turned on from Dremio 21.x, you would be able to use this feature and it’ll be super fast.

So that’s one part of it, right? Why is this called unlimited splits? Is the second part to it, right? So going back to the old flow where Dremio would do all the maternity work on the coordinator, Dremio was dependent on the source row group size, the Parquet row group size, to calculate the number of splits. So when you say splits, it actually helps in parallelism and how many readers we can deploy for the scan. Just like any hybrid Spark jobs, I think it’s high. So there’s high splits. Now what that wound up happening is if for some reason there are too many small files and there are too many splits, somebody was just talking about how to address small file problems just now if Dremio has to load all these splits information into the heap memory.

Why Can You Just Not Increase the Heap Size?

And that time, if we remember all this metadata work was done on the coordinator, the coordinator is a single node, and all this information was getting loaded into the heap memory of the coordinator. And once it crosses a certain number of splits, the heap will be maxed out. Now, obviously, everybody can ask, Hey, can we just not increase heap? Now, obviously, we all know that when you start increasing heap, then you start running into garbage collection and long GC post times. So to avoid that, in the old V1 metadata flow, we had put a cap on the number of splits, but lots of customers and end users are like, oh, we have a million splits and how do we use Dremio, right? So when the V2 version from Dremio was brought, which we’ve been talking about now, Dremio said, irrespective of the number of splits in that Parquet file, let us go and scan 300 megabyte chunks of data and call them as a Dremio split.

How Does Dremio’s Unlimited Splits Work?

So if you had a one GB Parquet file and the Parquet file prior to the unlimited splits feature, the row group size was say, 128 megs, then that would be eight splits. But with the new Dremio version, now Dremio is scanning 300 megabyte chunks, it is only three splits. So Dremio said, now I’m able to open up the gates and have unlimited splits. Not only that, but the heavy lifting work of metadata has now moved to the executors and it’s going to be shared across all the executors, and each executor is only going to do incremental work. So it solves all the issues that we had on V1 where there was botlink by the coordinator, and the performance was a problem. And obviously the last one is troubleshooting. It is obvious that sometimes your source S3 or HDFS source is slow.

And the metadata refresh takes time, and in the old flow, there was not enough information in the profile. But right now, if you actually go see the whole metadata process is like a query execution. It gives you exactly what are the phases, how many phases, what is the parallelism of each of these phases, and what are the different operators in each of these phases. Not only that, you can go and see which queue this metadata refresh happened. So as we all know, Dremio supports engines, and we can actually go create a separate engine with a separate pool of executors and offload the metadata work into that engine so that it does not conflict with your dashboard queries. Lot of times people when they turn on unlimited splits, they’re like, Hey, my dashboard query is waiting for a long time, it’s not getting a slot.

That’s because now that the metadata refreshes happen on the executor, it is just like a query execution. So you have to tune Dremio to run these metadata refreshes in its own engine. And once you do that, the dashboards go to their own engine. And that could be a much powerful engine with say, 20 executors or 40 executors, but the metadata engine could be just two, three, or a few executors, a smaller pool of executors. That’s the work and they don’t conflict with the actual dashboarding work. So this way we are able to solve the performance issues faced in the partition. One other thing I want to touch upon is if you do have a brand new data set, which has never been learned by Dremio, that will always be a full refresh, right? So if you have a newly created table on Hive or a newly generated table on S3, the first run will always be a fresh run, right? So for example, I can take a couple of minutes and give you a quick demo on that.

Brand New Data Set Demo

And so let me go and forget metadata on this. So Dremio removes it. And let’s run a query on this. So this is a brand new data set. Now I forgot metadata and Dremio has to learn the metadata on the fly. So if you go look for the internal job, which is a refresh data set, you would now see that, so you see that you don’t no longer see the hash join and the Iceberg listing because it does not have any listing for it. So Dremio just goes and does a full, so for a really big table, the first time Dremio is going to learn in the new flow, could be time consuming, but then after that everything should be incremental and super cool. I think that’s all I had to cover.

header-bg