Running SQL-Based Workloads in the Cloud Using Apache Arrow

Transcript

Jacques Nadeau:

All right, let's get started. Hi there everybody, I'm Jacques. I am going to be talking about running SQL-based workloads faster using Apache Arrow. So, quickly about myself, I am co-founder and CTO of Dremio. Dremio is a company that's been around for about four years. We're downstairs if you want to know more about Dremio, and I'll talk about some of the things we use inside of Dremio with Arrow in this talk. I'm also the PMC chair of the Apache Arrow project. Apache Arrow started about three years ago when Wes McKinney and I got together and realized that data scientists and database people needed to get closer together and work well together. And so that's what a lot of this is built on top of.

So, let's start out. So, what I want to talk about here is, if you think about performance, performance by itself is actually meaningless in the cloud. Like, on prem, when you have fixed resources, it's all about how much power, how much speed, can I get out of this set of resources. In the cloud, I can always add more resources and so, really performance is about cost. And how can I balance my performance and my cost, most effectively?

And so, in many cases, I can scale, but I'd rather not just spend millions of dollars on my infrastructure. And so, rather than talking about "look how fast things are," which is what this talk could be, I'm going to talk about a specific set of things that we do to make things faster and why they work, and give you an overview of the technical side of those things.

But really, different kinds of use cases and different kinds of challenges have different sets of technical solutions. And so, for a particular use case, some of these solutions may not be appropriate and others may be appropriate. For some use cases, all of them may be appropriate. And so it's really a set of tools that you can be thinking about, in terms of how you're solving your problems and which tools you can bring to bear for this particular set of challenges.

I would say that the last thing is that if you think about the way of approaching things, one of the things that I think is really important is that you want to think about things as being loosely coupled. So, if you think about the old school way of doing things: build a data warehouse, load everything into the data warehouse and do everything inside of the data warehouse, there were some nice things about that. But, in reality, we have a much more complicated data ecosystem today, and so being able to loosely couple those things and solve things whether or not you've loaded the data or whether you've interacted with the data before, all of those things become very important.

So, let's talk briefly, just one slide each, on Dremio and Arrow. So, Arrow, as I mentioned before, the project's been around about three years. It's seeing huge adoption, really beyond, I think, everybody's expectations, including my own, in terms of how well it's gone. The goal with Arrow was really to try to provide a common way to represent data and a way to process data very efficiently. And it's been incorporated into a bunch of different projects, some of which you probably know: Python and Spark are two big consumers of Arrow. NVIDIA, with their RAPIDS GPU initiative, also uses Arrow. Dremio, we use Arrow extensively. And really, what Arrow's about is allowing you to have this loosely coupled set of technologies that work together and achieve faster analytical performance.

The second thing is Dremio. I'm going to be talking about that and talking about Arrow in the context of Dremio and how to make things go faster. And so Dremio, quickly, is a way to interact with data, find it, access it, share it, curate it, collaborate with others on it. It's built for the cloud; more than half of all Dremio usage is in the cloud, whether that's Azure, AWS or Google. It is, at its core, a processing engine for Arrow data. And so inside of Dremio everything that we do is against the Arrow representation of data. And really, it's trying to provide SQL for any source. So no matter whether you're interacting with a relational database or a non-relational database, a NoSQL system, a data lake, or cloud storage, all of those things should have a common interface that allows people to think about how they're interacting with the data rather than the mechanics behind how that data is being stored.

There's an open source and an enterprise edition, so you can go download the open source community edition and play with it all you want. Everything that I'm demoing here is something that you can try out in the community edition. And one of the key ways we approached this world a little differently is by trying to integrate multiple technologies together. There are lots of solutions out there that are trying to solve what I would describe as one niche or another. I want to just curate data and make those data sets available for analysis. I want to figure out what all my data sets are and catalog those and share those with my colleagues. I want to accelerate the performance of access to data. I want to federate multiple sources together.

Those are all really useful things to be able to achieve, but if you can combine them together we think that you have a much better experience in terms of how you're trying to achieve sort of your end goals, which is to get business value out of your data.

So, these are the four things that I'm going to talk about. So, I'm going to go through each of these and sort of talk about how they work and how they can impact your workload. So, the first one is Gandiva. So Gandiva is a processing initiative that was built in Dremio and then we contributed it to the Arrow project so that anyone can use it if they're working with Arrow data. Then, secondly, I'll talk about column-aware predictive coalescing IO and how that can substantially impact your performance working in the cloud. Then I'll talk about what we call our columnar cloud cache or C3. Which is about how you can cache portions of data to make things go faster. And then I'll also talk about Arrow Flight and how that can accelerate the access of data from whatever application it is that is consuming the data at the end of the day.

So, let's start with Gandiva. So, I didn't go into a lot of detail about Arrow, I think that I've done enough talks with Arrow that hopefully most of you have some sense of what Arrow is, but let me talk about Gandiva. So Arrow is, at its core, a representation of data, and it's a language-agnostic, in-memory representation. The goal behind Arrow was to basically have a better way of moving data between systems. And the way we figured out how to solve the transport problem was to solve it for processing. To say, "Hey, if two systems have a common way of representing data for processing purposes, then it can be very efficient to move data between those systems." And so Arrow's designed, first and foremost, for processing even though it allows communication between different systems.

And, Gandiva delivers on the promise of that by basically saying, "Hey, if we have a canonical representation of data, then we can do really efficient things in terms of how we process that data." I don't have to work through any different APIs to interact with that data. So if you think about old ways of interacting with data, it was very much API based. So you would call an interface where you'd say, "Hey, give me the next record. Give me the next cell in this record."

And that worked well because you could build an application and the internals didn't matter to you, but the problem there is that if you're doing huge amounts of processing, then all of those interactions, all of those method calls and invocations, can be a substantial performance penalty. So, if you can have a representation of data that is very well known and interact directly with that memory representation, then you can gain a lot of things. And so that's really what Gandiva's about. High speed compute, and we leverage it inside of Dremio. So if you download Dremio today, we're using Gandiva to process and that's one of the key reasons that we process as fast as we do.

It is built primarily for CPUs today, but the vision is actually to also extend it to GPUs in the future. So at its core, Gandiva is built on top of LLVM. And so it will take an expression tree and compile that down. So the expression tree is generated in whatever application you have, you compile that down with the Gandiva compiler into an actual piece of execution code. And then the execution code does something very simple. The execution code is simply responsible for taking in batches of Arrow data and then outputting batches of Arrow data. And so the complexity is all in how we compile this information, but how it actually plugs into the application is quite simple in that you just say, "Hey, I got a stream of Arrow data and I can apply this operation on top of it and now I've got a new stream of Arrow data."
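
To make that concrete, here's a minimal sketch of that flow using the Gandiva bindings that ship with pyarrow (the `pyarrow.gandiva` module is only present in builds compiled with Gandiva enabled, and the column names and sample values below are made up for illustration): build an expression tree, compile it once into a projector, then feed it batches of Arrow data.

```python
import pyarrow as pa
import pyarrow.gandiva as gandiva

# Two int32 columns and a small batch of Arrow data (sample values).
field_a, field_b = pa.field("a", pa.int32()), pa.field("b", pa.int32())
schema = pa.schema([field_a, field_b])
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, None], type=pa.int32()),
     pa.array([10, None, 30], type=pa.int32())],
    schema=schema)

# Build the expression tree for "a + b" and compile it once with LLVM.
builder = gandiva.TreeExprBuilder()
add_node = builder.make_function(
    "add", [builder.make_field(field_a), builder.make_field(field_b)], pa.int32())
expr = builder.make_expression(add_node, pa.field("a_plus_b", pa.int32()))
projector = gandiva.make_projector(schema, [expr], pa.default_memory_pool())

# The compiled kernel just consumes Arrow batches and emits Arrow arrays.
(result,) = projector.evaluate(batch)
print(result)  # [11, null, null] -- SQL null-in/null-out semantics
```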

And so what does it actually do? Well, it supports an arbitrary set of expressions. And so when you think about compilation inside of query execution, there are lots of different operations. You might be doing a hash join, you might be doing aggregation, you might be doing some kind of expression calculation. If you look at those different things, things like aggregation and joins don't need that much runtime compilation to perform well. The reason is that the patterns and where they spend their time are fairly consistent across different people using those operations. With expression evaluation, runtime compilation becomes extremely critical to performance because every single expression that someone will write is going to be a little different. And so, a plus b plus c, or a plus b divided by c, are going to be different expressions and I want to figure out what's the fastest way to run each of those things and compile those things. And so that's really what Gandiva's about, is saying, "Hey, whatever arbitrary expression you have, I can compile that very, very efficiently."

And there's simple versions of this, a plus b, that's a pretty straightforward thing and there's actually primitives inside of LLVM that already allow that basic operation, but then you have to apply things like SQL semantics on top of that. And SQL semantics include the consideration of null and nullability and how that impacts the different operations. And so how do I calculate that correctly? And so the Gandiva expression library has, I think, a thousand or several thousand expressions now, basically supporting filtering and projecting operations to apply a set of conditions to a stream of Arrow records.
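
As a rough illustration of the filtering side (again with `pyarrow.gandiva`; the function name "greater_than" is the comparison as I understand it to be registered in Gandiva's function registry, and the data is made up), a condition is compiled the same way and produces a selection vector of the rows that pass, with null inputs dropping out just as a SQL filter would drop them:

```python
import pyarrow as pa
import pyarrow.gandiva as gandiva

schema = pa.schema([pa.field("a", pa.int32())])
batch = pa.RecordBatch.from_arrays(
    [pa.array([5, None, 42, 7], type=pa.int32())], schema=schema)

# Compile the condition "a > 10"; rows where a is null evaluate to null
# and are excluded, matching SQL filter semantics.
builder = gandiva.TreeExprBuilder()
condition = builder.make_condition(
    builder.make_function(
        "greater_than",
        [builder.make_field(schema.field("a")), builder.make_literal(10, pa.int32())],
        pa.bool_()))
filt = gandiva.make_filter(schema, condition)

selection = filt.evaluate(batch, pa.default_memory_pool())
print(selection.to_array())  # indices of rows that passed, here [2]
```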

The other thing to note — I think I've got that in the next section. So, how does it perform? Well, this goes to the Arrow representation. So, for those of you that don't remember or need a reminder, the Arrow representation of data is a columnar representation. It's columnar, in memory; Parquet and ORC are columnar, on-disk representations. Most systems will typically bring data into memory as a row-wise representation. They take advantage of the underlying IO capabilities by using columnar there, but they don't actually take advantage of the CPU capabilities around vectorization by using columnar in memory. Arrow is trying to change that by saying, "Hey, the best representation in memory is also columnar."

Now, the representation of an Arrow integer. So let's say we're doing an a plus b operation, where a and b are both four-byte integers. The Arrow representation of that is actually independent chunks of memory for each of those things. So the data is a contiguous chunk of four-byte integers, end to end, one for each value. And then, separately from that, there's a second structure which is a bitmap of whether each value is valid or not. So it describes the nullability of each of those values. And so when I'm doing a plus b, I'm actually working with four data structures: the validity and the data for each of a and b. Okay.

And so because we know that's the representation, inside of the LLVM code and the Gandiva operations, we can actually take advantage of this and start doing null decomposition, and work with null values and null resolution completely independently of the data itself. And so in this example here, right, we've got a validity and a data vector for a, and we've got a validity and a data vector for b. We can actually take the validity vectors, and the AND of the validity vectors will define the validity of the output according to SQL's null-in, null-out semantics. So, SQL's standard semantic is, "If one of my arguments is null, then my output is null."

And so normally when you would write code to solve this problem, you would actually write a bunch of conditional statements. You would say, "If a is null, then null. If b is null, then null. Else, a plus b." Okay, but if you know CPUs well, then you know that all this branching logic can substantially stall the CPU's pipeline. And so, instead of doing that, you just decompose it into these two different structures and you can say, "Okay, I'm going to do a bitwise AND." And that's going to allow me to work on each word, most likely, or maybe larger than a word if I've got SIMD instructions, but let's say each word. So I can handle 64 values at once. Where I say, "Hey, I'm going to take these 64 values and these 64 values, do a bitwise AND, and now I know the nullability of those 64 output values." Right?
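
Here's a tiny sketch of that decomposition idea in plain NumPy (just to illustrate the concept, not how Gandiva's generated code actually looks): the validity bits live in their own words, so one bitwise AND settles the nullability of a whole word of values, and the data add runs with no branches at all.

```python
import numpy as np

# Data for a and b, plus one validity bit per value packed into a word.
a_data = np.array([1, 2, 3, 4], dtype=np.int32)
b_data = np.array([10, 20, 30, 40], dtype=np.int32)
a_valid = 0b1011   # values 0, 1, 3 are valid; value 2 is null
b_valid = 0b1110   # values 1, 2, 3 are valid; value 0 is null

# Output validity: one bitwise AND per word (64 values at a time for real
# 64-bit words) implements SQL's null-in/null-out rule with no branching.
out_valid = a_valid & b_valid        # 0b1010 -> only values 1 and 3 are valid

# Output data: a branch-free vectorized add; whatever lands in the null
# slots is ignored because the validity bitmap marks them invalid.
out_data = a_data + b_data
print(bin(out_valid), out_data)      # 0b1010 [11 22 33 44]
```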

Then separately, I can do the SIMD add for the data. Okay. And so I can add the two vectors together. And so the simplest version of an add would be, "Hey, let me just loop through these things and add them." Okay? A more complex one would be, "Hey, I can work with things that are wider than four bytes and so I'm actually going to add both of them together with the larger SIMD operations." But in either case, there are no conditions there either. And so what happens then is that this means that there's no conditions anywhere in this logic, no branching anywhere in this logic, so the performance can be substantially faster. Plus, when we have SIMD, we can take advantage of those operations.

Okay? And so this is one very trivial example of how you can customize what's going on in the compilation to improve the performance of things. So when you talk about that, the null management, this is one example of null management. Nulls are a big problem in terms of processing, for improving performance. So inside of Gandiva there's basically a whole set of primitives around how nulls are managed. So there are expressions that are null-if-null, there are expressions that always produce a non-null output even if I have a null input, and there are expressions where, depending on what's going on, there may be a nullable input or a non-nullable input. Case statements are actually a really interesting thing when you're applying null semantics to them to improve performance. So, basically, inside of Gandiva is all of the logic to deal with all of these things, so that you can just build an expression that's going to be working against the Arrow data and have very good performance.

And so, as I mentioned, the other thing you can do is that because these representations are all columnar in nature, they're actually designed to be consistent with the representations that the underlying CPU expects, so that you can do a SIMD operation where you're adding multiple values. And so here's an example with a bunch of two-byte integers. I've got this collection of two-byte integers, a and b — so this is a two-byte version instead of a four-byte version — and I'm going to add them together. When I do that, I could do it where I read the first integer on each side and then add those together, then I read the second integer on each side and add those together, right? The number of instructions is basically the number of reads plus the number of adds, so it's going to be 2x to 3x the number of records in total instructions.

But the reality is that CPUs are better at this than that. So we can take the CPU and say, "Hey, CPU, you know how to do 128-bit adds at one time where these things are co-located inside these chunks of memory, so do that add all together." And so all of a sudden this becomes three instructions instead of, whatever it is, 24 instructions. So an 8x improvement in the number of instructions that we have to apply here to complete this operation. Now, it's not something that you actually have to worry about once you've got something like Gandiva to solve the problem, but this is why Gandiva can go more quickly than other things. And the core of this is that the underlying representation also lines up with this. If step one is to reformat my data into a representation that the CPU can understand for SIMD operations, and step two is to take advantage of the SIMD operations, you're going to lose a lot of the benefits you could otherwise gain by using these operations. In this case, because the representation is consistent with what the CPU expects, we can improve performance substantially.
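
A rough way to see the same effect from Python is to compare a scalar loop with NumPy's vectorized add, which compiles down to packed SIMD adds over contiguous buffers (this is only an analogy for what the Gandiva/LLVM-generated code does against Arrow buffers):

```python
import numpy as np

a = np.arange(8, dtype=np.int16)        # eight 2-byte integers = 128 bits
b = np.arange(8, dtype=np.int16) * 10

# Scalar version: a read of a, a read of b and an add per value,
# so roughly 3x the number of values in instructions.
out_scalar = np.empty(8, dtype=np.int16)
for i in range(8):
    out_scalar[i] = a[i] + b[i]

# Vectorized version: because the values are fixed-width and contiguous,
# the whole 128 bits can be added with a handful of packed instructions.
out_simd = a + b
assert np.array_equal(out_scalar, out_simd)
```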

So, how is Gandiva used? Well the nice thing about Gandiva is that it's language agnostic. So it's trying to continue to achieve on that promise that Arrow has, which is, "We don't care what language it is." So, Arrow has something like 10 or 12 different language bindings, all the common languages you might work with today. Gandiva has several of those already, I believe it has Java, C++, Python and Rust is actually not up here, but I believe we have a Rust set of bindings now for Gandiva. In all those cases, you build up the expression tree, the expression tree is communicated with Protobuf, because that's a common format that works across different languages, and then hand that to Gandiva to compile and use.

And so in the context of Dremio, we actually run Dremio inside of the JVM. And the JVM is notorious for not working well with other languages, because of Arrow we can actually work very well with Gandiva and make that work extremely efficiently. So that's Gandiva, that's the first of the four topics that we're going to cover.

The second one is reading. So, Gandiva will make you go faster whether you're in the cloud or on prem; it's just about making processing go faster. One of the things that becomes a really interesting challenge in the cloud is that object stores are awesome, but they're also slow. They're way slower than if I've got an NVMe on my local machine. And so let's talk about the representation of columnar data and how that maps to the performance that we can get by changing how we do the underlying reads.

And so, this is a quick review of how columnar data formats work. So imagine you have a table of data off here on the left and you have these columns — I've got color-coded columns so that you can see what's going on with this data. So the first thing that happens is that you will generally take a subset of rows and put them into some kind of chunk, what might be called a row group or a stripe or something like that. I'll use the Parquet terminology because I use Parquet the most, so that's called a Parquet row group. And so basically you say, "Okay, I'm going to take this subset of data and I'm going to put those together so that all of those rows are together."

Because I want that to be sort of a manageable size. So, the size might be 128 megabytes, it might be one gigabyte, something like that. Once I have those rows of data together, I then coalesce the values for each column independently. Right? So I put all of the oranges together, all the yellows, all the greens, all the, whatever, light blues, then all the dark blues, then all the grays. Right? So you put those together and that's actually the format of that row group inside of the Parquet file.

And so, this is designed so that you can improve the performance of reading from disk. And so what happens is that if I only need orange and green, then I don't have to read yellow, the blues, or the grays. Right? And so I can substantially reduce my read on disk. Okay? Inside of one of those column chunks, as we call them in Parquet, are what are called pages, which are little compressed units that are the actual data itself. Okay? So this is how, roughly speaking, all of the columnar formats are composed. There are slight variations to these things, but this is basically how they all look. Okay?

And so, this is great, it works really well for a lot of things. But if you start thinking about how it reads, it becomes a little bit problematic. Because think about those columns: I want to read data and turn it into a row-wise representation, which is how most processing systems work — or even a columnar representation where I'm not trying to read all the data at once. If I read all the data at once, I can start from the beginning of the file and go to the end of the file; I read all the data and everything is fine in terms of reading. But I want to take advantage of the fact that I'm only going to read some of the columns, so I don't really want to read from the start to the finish.

Okay? If I want an orange column or a green column, I could read from the start of the orange to the end of the orange, then the start of the green to the end of the green, in that order. But that's also pretty inefficient, because orange may be a really big chunk here — and this is something to note in this picture: different columns inside that chunk are going to use substantially different amounts of space depending on how well they can be compressed. Okay? And so if I read all of the orange before I read any of the green, I may be using a substantial amount of memory holding the orange in memory before I even get to reading the green. And depending on the operation, it may be that I'm wasting all of those reads and all that memory.

And so when you read from disk, there's this thing called read ahead. It's trying to figure out how we can go faster reading from disk. And so if you think about a read: if every time you asked for one byte of data from disk, the disk had to go get that one byte, read it off of disk and then hand it back to you, things would be ridiculously slow. And so at various layers of your IO there are basically buffers that say, "Hey, I'm going to read this much at a time." And then there's this read ahead concept, which is, "Hey, if you just asked for the first 4k of the file, then odds are you're going to want the next 4k in the file."

Okay? And that's a simplified version of read ahead. But basically a client's going to say, "Hey, I want the first bytes." Okay? And I've used some color coding in here which I'm going to continue to use. So the orange here means, hey, I've got to read this inline — I'm waiting for it. And so this first set of data, I'm going to have to wait for, because I'm asking for it and then I have to go and read from disk to get it. Okay? Now most systems will say, "Hey, you just asked for the first 4k, I might as well get the next 4k because odds are you're going to want that too." And this is to support pipelining. Because if you then ask for the next 4k, I already have that, so I can hand that back to you and then ask for the next 4k while you're consuming that 4k. The goal being that the underlying IO is ahead of you so that you're not waiting for it.
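
To make the mechanism concrete, here's a minimal, hypothetical sketch of that kind of sequential read ahead (the block size and single background worker are simplifications, not how any particular filesystem implements it): serve a block, then start fetching the next one in the background so a sequential reader finds it already there. The color comments match the coding used in the talk.

```python
import concurrent.futures


class ReadAheadReader:
    """Naive sequential read-ahead over any file-like object with seek/read."""

    def __init__(self, raw, block_size=4096):
        self.raw = raw
        self.block_size = block_size
        # One worker so all seek/read calls on `raw` stay serialized.
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        self.pending = None              # (offset, future) for the block read ahead

    def _fetch(self, offset):
        self.raw.seek(offset)
        return self.raw.read(self.block_size)

    def read_block(self, offset):
        if self.pending and self.pending[0] == offset:
            data = self.pending[1].result()            # "green": already in flight
        else:
            if self.pending:
                self.pending[1].result()               # "red": wasted read ahead
            data = self.pool.submit(self._fetch, offset).result()   # "orange": wait
        # "yellow": kick off the next sequential block in the background.
        next_offset = offset + self.block_size
        self.pending = (next_offset, self.pool.submit(self._fetch, next_offset))
        return data
```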

Okay? So if I read the first 4k, the system says, "Okay. I'm going to pull 8k from disk." So then when I have the second request, I get a green request, which means that when I asked for it, it was already ready for me and therefore I don't have to wait for the disk again. Okay? And this is critical to performance; it's basically anticipating what is needed in order to improve the performance of what we're going to get. And so, again, the read ahead's going to say, "Okay. Well, you just asked for that 4k, I better go get the next 4k." Okay? And this works really well with files that are read sequentially, okay? But it doesn't work as well when you look at something like a Parquet format. So this is the columnar format — I've rotated it here so that it's along the top, and down the left-hand side I've got time. And what I want to show you is what happens when a naïve read ahead approach interacts with the reads of these columnar formats. Okay?

And so what happens is that most systems will start by reading the first column. And they'll read a subset of the first column in order to get some of the data. They don't want to read the whole of orange into memory, so they're going to read only the first little bit. They're going to say, "Well, I only want 5,000 records, and I want to send those 5,000 records down the pipeline, because there's a bunch of other operations. If I'm doing SQL, I might be doing: read from disk, do a join, then do an aggregation, then do some other operation, a filter or something like that. I want to get the data through that pipeline as quickly as possible, and so I don't want to read all of orange before I start returning data downstream."

And so I make that first request, that orange request, which is, "Hey, I'm going to read this, I'm going to wait for the read." And at the same time the read ahead's going to say, "Okay, let me go and read the next 4k." Right? This is where it breaks down. So at time two, I'm now going to say, "Okay, well, I want the yellow column." So the read ahead is probably going to say, "Oh shoot." And this depends on the particular read ahead — there are more advanced read aheads that are better than this — but for illustration purposes, this is basically what happens. "Oh shoot. That read ahead, I didn't need that, let me throw that away." And that's the red. I'm throwing away the data that I just read, because it turns out that this person seeked to another place in the file and is now reading in that location.

And so now I'm going to say, "Oh okay, they're reading in this location. I'm going to do another read ahead at this location." Right? Well, guess what happens? Time three: I'm going to read someplace else, because that's the part of the file I need now. Okay? And again, I'm going to throw things away. And this continues on, and in some cases it might be okay. In this case, one column is so narrow that the read ahead happened to overlap with what was being read next, and there, all of a sudden, we could get a green block. But mostly what you're seeing is waiting for data, and the yellows and reds are basically pulling down data but then not using that data. Okay?

So this just goes on indefinitely. Okay? Now realistically, most read aheads, if they see that they're constantly wasting their reads, they'll actually turn themselves off. So at some point the yellows and the reds will just go away and you'll just have a bunch of oranges. Okay? The problem though is that each time you do one of those oranges it means that I ask for the data and then I have to wait for the data to come back from the underlying system. Now this isn't a huge problem, historically, on local disks because local disks are pretty fast. I know that they're not the fastest thing in the world, but generally speaking you set the buffers at a reasonable size, they have a reasonable latency and things are okay.

When you go into the cloud, the latencies are one to two orders of magnitude higher to get data back, and that means that all of a sudden these oranges, where we wait and try to get the information back from the underlying system, start to make things way slower. And so when you're interacting with these systems, you'll start to see substantially lower performance than if you're using local disks. And this is actually what leads many people to implement things like using EMR or using HDFS on Amazon to do interactive workloads, because they can't get the performance from S3.

And this is one of the contributors. The latency itself is actually a problem, the actual read pattern and how we read these things, is also a problem.

So, basically, systems need to consume all these columns and we really want to pipeline things, but we're basically waiting. We're not keeping the pipeline populated and that becomes very painful. And so, as I kind of talked about, one solution is you can read the whole file at once. A second solution is you can say, "Well, I'm going to read large chunks of the file at once." You say, "Well, I'm only going to read two big chunks." And then two megs at a time is not as bad, but I've still got a lot of wastage, and there's still this question of how do I anticipate these things?

And so the third option, and this is what we've actually found to be very, very valuable, is you merge the IO layer with an understanding of the file format. And you say, "The read ahead should be aware of how the file format works, or else you're not going to be able to get to the performance that you want to get to." And so that's what we did — we built what we call, basically, a predictive columnar reader. Okay?

So, it's going to understand the columnar formats and read those things accordingly. And so what happens is that when you first start, we actually go and request a bunch of stuff before you even start reading. Okay? We're going to read the first little bit of all of the columns, because we know that this is going to be the most common pattern that happens. And obviously, if you only read three of the columns, we only ask for three of the columns. Okay?

And so what that means is that when the normal read pattern layers on top — which continues to happen, because the natural way to decode is one column at a time: you're decompressing this column, then that column, then the other column — you get greens. And so you actually maintain an understanding of every column, and the read aheads for every column independently, so that you can continue to make sure that each of these things is populated. So every time I go to request something, hey, I'm going to get that data, and I'm going to keep track of that as a read ahead that's independent of the other read aheads to improve performance. And this comes down all the way to where you actually always have greens. Which means that the application is reading some data, then decompressing the data and sending it downstream, then reading some more data, then decompressing, then sending downstream. Basically, all of the requests are happening while it's processing and decompressing that data, and so the next time it goes to read some more data, that data's already going to be there. And that's really what you want to have.
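
Here's a rough sketch — not Dremio's actual implementation — of what a format-aware prefetch can look like from Python: use the Parquet footer to find the byte ranges of just the projected column chunks, then warm an independent read ahead for each of them before decoding starts (the filesystem object, prefetch size, and helper name are assumptions for illustration).

```python
import concurrent.futures
import pyarrow.parquet as pq


def prefetch_projected_columns(fs, path, columns, head_bytes=1 << 20):
    """Fetch the first `head_bytes` of every projected column chunk so each
    column's read ahead is already warm when the decoder asks for it.
    `fs` is any pyarrow filesystem (e.g. pyarrow.fs.S3FileSystem)."""
    meta = pq.ParquetFile(fs.open_input_file(path)).metadata
    ranges = []
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for c in range(row_group.num_columns):
            col = row_group.column(c)
            if col.path_in_schema in columns:
                start = col.data_page_offset
                if col.has_dictionary_page:
                    start = min(start, col.dictionary_page_offset)
                ranges.append((start, min(head_bytes, col.total_compressed_size)))

    def fetch(start, length):
        with fs.open_input_file(path) as f:
            f.seek(start)
            return f.read(length)

    # One ranged read per projected column chunk, issued concurrently.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetch, s, n) for s, n in ranges]
    return {start: fut.result() for (start, _), fut in zip(ranges, futures)}
```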

And so, to give you an example of the impact that this had: we ran some example queries — this is on, I think, Azure Storage Gen2 — and this was the performance improvement of this example workload, in terms of just how much faster you can get by applying this technique.

So I'm going to jump to Arrow Flight and how that can make things faster too. I'm realizing that I'm using more time than I should, so I'm going to go a little quicker here. So, Arrow Flight is a high-performance protocol for moving Arrow data between two different systems. Okay?

Basically you have streams, and you have clients and servers, but realistically a stream can go in either direction. So I can be pushing the stream up to my server, or I can be consuming it from my server. Now, because Arrow Flight focuses on analytical data sets, it's very important to support parallelization. And so it allows a system — say you have 100 servers talking to 100 servers — where you can actually send Arrow streams between all of those 100 servers. So it's not a single-stream concept. If you think about ODBC and JDBC, which are things you can kind of think about this replacing a little bit — there's a lot more to ODBC and JDBC than this, but for the purposes of receiving data, this is a good analog — those are single stream. Here we're saying, "Hey, if I want to have 100 streams to 100 streams, I can do that. If I want to have 1,000 streams to 1,000 streams, I can do that." So we can move data very, very efficiently, and it's all moved in this Arrow format.
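
As a sketch of what that parallel model looks like from a client (the server address and the command payload below are hypothetical placeholders, and authentication is left out): the server hands back a FlightInfo whose endpoints can point at many different nodes, and the client pulls each stream independently.

```python
import pyarrow.flight as flight

# Hypothetical coordinator address; a Flight query is just a descriptor.
client = flight.connect("grpc://coordinator.example.com:32010")
info = client.get_flight_info(
    flight.FlightDescriptor.for_command(b"SELECT * FROM some_dataset"))

tables = []
for endpoint in info.endpoints:
    # Each endpoint carries a ticket plus the location(s) that serve it,
    # so N servers can stream to N readers in parallel.
    location = endpoint.locations[0] if endpoint.locations else None
    reader = (flight.connect(location) if location else client).do_get(endpoint.ticket)
    tables.append(reader.read_all())          # one Arrow table per stream
```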

And so I'm going to show you a quick example of this. So what I have here is just a simple little Jupyter notebook that hopefully will run for me. And what I'm going to do is I'm just going to run a SQL query. This is running against Dremio — it's just a single node on EC2 in this case. And so what you see is that- is that big enough for people to see? Sorry, let me make it a little bigger there. Is that big enough for the back? Yeah? Okay. Cool. So I ran a SQL query here where I'm just getting 2,000 records back, because Arrow Flight is about transport, not about processing. And so, how do I efficiently get stuff back to the client? And the client could be Python, it could be a BI tool, it could be something else entirely; Arrow Flight doesn't really care. I'm using ODBC and Python here as examples. But basically, if you run a small query, it doesn't really matter what the format of the transport is. It's pretty fast either way. So we're at .06 seconds and .05 seconds. So Flight is slightly faster, but they're basically in the same ballpark.

So what I'll do here is I'm going to go get 5,000,000 records now. Okay? And so let's see if I can run this thing. And so 5,000,000 records is actually not that many records. I am actually constantly shocked by how slow this goes. But what's happening here is that Dremio is here, it's reading some data — there's some amount of time it takes to read the data from its underlying storage — it's then converting that into a representation and sending that over the network to the client that's running inside of Python, and the client is then going through that interface. And if you look at the ODBC interface, it's, "Give me the next record. Okay, ask for the cell, ask for the cell, ask for the cell."

And so that's actually what's happening inside of the pyodbc client: it's basically reading every cell out of every record. And I think that we're reading something like 10 columns here, so you're talking about 50,000,000 invocations and method calls to get this data back into Python. And, right now, I'm not even converting it into a data frame or something else. I'm just doing something very simple — I'm getting this back into its native representation, which is the pyodbc representation. And so, it took 60 seconds — 58 seconds. All right? Which is a long time to stand up here and talk and wait for it and hope that it finishes.

So I'm going to run the same exact thing with Flight. So, exact same query, against the exact same system, same locations. Okay? And the goal is: faster. As you can see, it's way faster. Drastically faster. And the reason is pretty simple: if you think about that first scenario, every single one of those invocations and all of that code that had to be run doesn't have to be run in the second case. Because, guess what? The Arrow representation in one language is the same as the Arrow representation in another language. And even though Dremio's running in the JVM over here, it can communicate this Arrow Flight data back in its memory representation, and the other side is like, "Hey, I know exactly how to consume the Arrow representation and start working with it."
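
A stripped-down version of what the notebook is doing looks roughly like this (the DSN, host, port and table name are hypothetical, and the Dremio authentication step is omitted); the point is that the ODBC path materializes rows cell by cell, while the Flight path hands back Arrow batches the client can use directly.

```python
import pyodbc
import pyarrow.flight as flight

query = "SELECT * FROM demo_table LIMIT 5000000"        # hypothetical table

# ODBC path: every record, and every cell of every record, crosses a
# Python method-call boundary before anything is usable.
odbc_conn = pyodbc.connect("DSN=dremio")                 # hypothetical DSN
rows = odbc_conn.cursor().execute(query).fetchall()

# Flight path: the server streams Arrow record batches; the client reads
# them as-is, with no per-cell calls and no row-wise conversion.
client = flight.connect("grpc://dremio-host:32010")      # hypothetical address
info = client.get_flight_info(flight.FlightDescriptor.for_command(query))
table = client.do_get(info.endpoints[0].ticket).read_all()
df = table.to_pandas()
```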

And so, what this allows you to do — and this one I'm not going to run, because it takes about 3 minutes — is load a billion records into a data frame in about 3 minutes. Okay? Now what would this have taken if I had done it with pyodbc, which I would honestly have never done? About two and a half hours. Okay? And the goal, again, being that by using these techniques, all of a sudden things that were not achievable before, or were unreasonable for people to wait for, you can now actually achieve with reasonable effort.

Now if you know Python well, you may know that there's actually something faster than pyodbc, and so I actually have another notebook here, which is turbodbc. Which is hard to say, and it's actually also Arrow native, so it knows how to do Arrow well, and that's one of the things it does. I won't run it for you, but it takes about 23 seconds for that exact same workload that was taking 58 seconds on pyodbc. So, optimizing within Python is always a good place to start, because frankly, a lot of times there can be a bunch of optimizations done just there. But even with the fastest implementation of how you get data into a Python data frame from a database today, with turbodbc, you're still looking at 23, 24 seconds.
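
For reference, the turbodbc path looks roughly like this (the DSN and table are hypothetical): it pulls the result set back as Arrow in one call, which is why it beats pyodbc while still paying for the ODBC transport underneath.

```python
import turbodbc

conn = turbodbc.connect(dsn="dremio")                    # hypothetical DSN
cursor = conn.cursor()
cursor.execute("SELECT * FROM demo_table LIMIT 5000000")
table = cursor.fetchallarrow()      # result set as a pyarrow.Table
df = table.to_pandas()
```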

And so that's the kind of difference you can have by using Arrow to solve data access. So lastly, let me just talk briefly about what we call the columnar cloud cache, or C3 for short, because it's kind of a mouthful. So this is talking about storage in the cloud. And so when you think about storage in the cloud, you don't have a great option. You have some really great things, but they're only great at some things. And so you've got cloud storage, whether it's S3 or ADLS or Azure storage or Blob storage or GCS or whatever you have, right? Unlimited scale, super cheap, no management burden. Of course, you could spend a bunch of money because you didn't know you had a bunch of data sitting there, because it's unlimited scale. But, by and large, it's the best of infrastructure as a service.

Now, it's slow though. Hopefully most of you have seen this. It's not super slow, it's getting better all the time, but it's way slower than if I've got a locally attached drive — especially if I've got a locally attached high-performance drive like NVMe. Okay? And it actually can be relatively expensive for hot data. If you're querying the same thing over and over again from S3, you start to realize the storage costs are not the biggest cost for you, it's actually the access costs. And so that's something else that you have to really think about. So then you've got this other option, which is, hey, locally attached storage — like locally attached SSDs. And you can just do that with basic techniques, or you can say, "Let me apply an HDFS or something like that on top of that," to have more layers on top of it. Right?

And so this actually allows you to get substantially better performance and it gets rid of the costs of doing those accesses in S3 and whatnot. The downside, though, is that it is also a huge amount of maintenance and management burden. It's the reason we went to infrastructure as a service, to get rid of all this bullshit. But if you're thinking about applications, yes, cloud storage is great for long-term storage and for analytical data sets where you don't need to hit interactive performance. But if you want low-latency, critical applications, probably local SSDs. Well, that sucks, because you don't want to have to make a choice.

And so what we did with Dremio is we said, "Hey, let's do something that combines the two of these things." So let's leverage the elasticity of cloud storage — that's where you put your data — but then let's actually build a high-performance local caching layer on ephemeral storage. Okay? On local storage. And we can do this across all workloads because the cost to implement and use this is less than ten percent for any missed read. So the cost of actually caching locally is low enough that it actually can work really, really well.

So we're not caching files. So caching files, again, is kind of a problem because if you think about that previous picture when we talked about Parquet, I may only read three of the columns of that file. And so if I cache the whole file, I'm wasting a huge amount of local storage. In fact, inside of a file, in one of those column chunks, I may not read all of the pages in that column chunk. And the reason is that frequently you apply common predicates. And so only some of the pages in some of those columns are actually what I want to store.

And so Dremio C3 is about that. It's basically allowing you to store sub-chunks of sub-files of sub-columns locally to improve the performance. And one of the things that I call this is screen door caching. And so what happens is that, over time, you see different subsets of the file being read and then you can start to bring those into cache. And so the greens here are reads going into the cache and the grays are reads from the cache — let me clear this up: green means a user accessed it, gray means it's still in the cache.

And so, over time, what will happen is that you will see that some things get accessed a lot, so they'll stay in the cache. Other things maybe only get accessed once, so they'll very quickly come into and out of cache again. Okay? But this requires no management at all. You simply check a box, and then all of a sudden we're using that local storage to make things much, much faster. And so how does it work? Well, Dremio has these things called executors — these are the things that do the heavy lifting on the data. Each executor works independently, and each executor has this local cache, which is a combination of using RocksDB to maintain what's called a page translation table along with file system storage directly for the blobs or chunks of data.
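
Here's a heavily simplified, hypothetical sketch of that idea — page-granular, read-through/write-through caching of remote byte ranges on local disk. Dremio's real implementation keeps its page translation table in RocksDB; a plain dict stands in for it here, and `fetch_range` is whatever function reads a byte range from your object store.

```python
import hashlib
import os


class LocalPageCache:
    """Cache fixed-size pages of remote objects on local disk, keyed by
    (object path, page number), instead of caching whole files."""

    def __init__(self, cache_dir, fetch_range, page_size=1 << 20):
        os.makedirs(cache_dir, exist_ok=True)
        self.cache_dir = cache_dir
        self.fetch_range = fetch_range      # fn(path, offset, length) -> bytes
        self.page_size = page_size
        self.translation = {}               # page translation table (RocksDB in Dremio)

    def _local_path(self, path, page_no):
        key = hashlib.sha1(f"{path}:{page_no}".encode()).hexdigest()
        return os.path.join(self.cache_dir, key)

    def _get_page(self, path, page_no):
        key = (path, page_no)
        if key in self.translation:                      # hit: serve from local disk
            with open(self.translation[key], "rb") as f:
                return f.read()
        data = self.fetch_range(path, page_no * self.page_size, self.page_size)
        local = self._local_path(path, page_no)
        with open(local, "wb") as f:                     # miss: write through to disk
            f.write(data)
        self.translation[key] = local
        return data

    def read(self, path, offset, length):
        """Read-through: assemble the requested range from cached pages,
        fetching only the pages that are not yet local."""
        out, end = bytearray(), offset + length
        page_no = offset // self.page_size
        while page_no * self.page_size < end:
            page = self._get_page(path, page_no)
            start = max(offset - page_no * self.page_size, 0)
            stop = min(end - page_no * self.page_size, self.page_size)
            out.extend(page[start:stop])
            page_no += 1
        return bytes(out)
```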

Okay? Each executor works independently, so there are no scaling limits here. So we have customers that are starting to deploy this, running 10,000 queries on 200-terabyte Dremio clusters with this. So it can run at massive scale and in parallel. And then basically it does a read-through and a write-through cache. But it's also designed so that if nodes are coming in and out of the cluster, the system survives and does a good job of accommodating where that data should be afterwards. And so if you lose a node, that's okay. That's not the end of the world; we'll continue to be fast and we'll actually reposition the stuff in an effective manner.

So what does the performance look like when you do this? Well, one of the key things that we figured out was basically doing what we call generational operations. Normally in caching layers there are a lot of locking and concurrency problems. And so what we did is basically have these generations that are concurrently happening, with operations happening within one generation, and that allows us to drastically reduce the contention. Which allows us to do over 50,000 page translations a second, which is more than enough for most sizes of nodes that exist today. We can do five gigabytes of reads per second with this, against the standard high-end storage you can get inside of Amazon or something like that. As I mentioned, the cache-miss overhead is very, very low, and it supports a gigabyte a second of churn. What that means is identifying new data that needs to be read that should replace old data that already exists inside the cache. So it can very quickly churn out things that we're not actually reading commonly. And then there's also the memory flow coming into the cache itself.

So we use the disk and not the memory. A lot of people use memory to do caching, but the problem is memory is a scarce resource, and using it for a cache that may or may not be used is very expensive. And so what we do instead is we basically have a very small footprint of memory that covers a huge amount of on-disk cache. So if we look at most of the largest machines with NVMe direct-attached storage, you're not going to go above 16 terabytes today, and 16 terabytes requires less than one gigabyte of memory in order to cover the information. The other thing that we have is an incomplete LRU. And what I mean by this is that an LRU tries to maintain an understanding of the entirety of that data set — most LRUs try to understand every single value in the data set. The problem there is, again, huge amounts of pressure on your memory in order to maintain all that data for every single translation or every single object you're matching.

And so we have what we call an incomplete LRU, which is keeping statistics at a higher level to improve the performance of the LRU without having to maintain everything in there at once.

And so how do these things all work together? Let's say you have a use case where you're trying to access Azure storage with Python, right? So the initial approach is simple: I'm just going to use Python to access Azure storage. And that's fine. But the reality is that that's probably not going to be fast enough for you unless it's a really small data set. And so you can scale things out, and Dremio's one way you can scale things out — there are other things you could put here to scale things out, but Dremio's one example of that. And that allows you to then scale up how many things are processing. Because the odds are the data set that you want isn't exactly the data set that's in Azure storage. There's some set of processing that must happen before you get to the data set that you want: filtering things out, reorganizing things, cleansing things. Things like that.

But what really happens is that all these different techniques come together. And I actually didn't even talk about relational caching, because I didn't have enough time. But you can basically put the stuff in Azure storage, use the predictive reader to read the stuff quickly into Dremio, Dremio can then use the local cache to accelerate access to some of this data, which is hot data, you then process it with Gandiva, and then you move it back to the user via Arrow Flight transport. Okay?

So you can combine all these things together and get a substantial lift depending on your situation. The amount you can gain is substantial — of course, if you just multiplied all these things together, it would be like a 40,000x improvement, and that's not actually true, because it depends on where your time is. If most of your time is not in transport, then making transport 20 times faster doesn't actually help you that much. Right? So it's about which techniques actually have an impact on your situation. That's what I've got. So let me know your thoughts. I've got like 30 seconds for questions, although I think there's a break now, so if people want to ask questions beyond that, I think that's probably okay, or people can come up. Thank you.

Any questions, or, no? There's a question.

Speaker 2:

So, for on-premise use cases, what's still valuable?

Jacques Nadeau:

So the question was, in on-premise use cases, how many of these things are still valuable? So that's a good question. So if you think about it, Gandiva is just as impactful on prem as it is in the cloud. Arrow Flight is just as impactful on prem as in the cloud. The IO pattern is more impactful if you're working in a situation where you've got remote clusters. So for Dremio, we have many customers that have multiple clusters, and in that situation, some of the clusters are going to be more distant. And so taking advantage of the way that we're reading can be hugely advantageous, as well as the columnar cache. So those two are really more about whether or not there's enough distance between the data and where you're at. If you've got low-latency local storage, then those two things probably will have less impact on you. Other questions? Thanks everybody.