Apache Arrow In Theory, In Practice




Hey, everybody, thanks for being here today. First Arrow meetup in San Francisco, West Coast, first Arrow meetup in West Coast, I guess.

So, quickly, who am I? My name is Jacques, I am CTO and co-founder of Dremio, I'm also an Apache member, involved with several Apache products. I'm the VP of Apache Arrow, but also involved with [Calsite 00:00:48] and the Incubator Project out of Apache. To start off, I just want to say thank you, to of course Thumbtack, as well as the people from Dremio and IBM, as speakers as well as Apache Foundation for being able to provide us a place to build something that's so important and built on sort of a consensus approach to things.

So, I'm gonna talk about two things. I'm gonna talk about sort of Arrow in theory, and them in gonna talk a little bit about sort of basic practice with Arrow, and then we'll have, after me, we're gonna have a special guest speaker, Brian from IBM, who'll talk about the Arrow Spark integration, and then we'll follow that up with Sid, who's gonna talk about some sort of heavy usage of how you can use Arrow in practice.

So, first to just sort of get you on a primer here, Apache Arrow, think of it as a standard for columnar in memory processing and transport. Okay? It's both processing and transport, which is an important thing that I'll hit on in a minute. It's really focused on analytics, there's other representations that are better for streaming things, Arrow is really focused on sort of large-scale reads and analytics.

And in many cases, using Arrow and the techniques that are sort of around Arrow, you can get a substantial performance in your workloads. Performance improvement in your workloads. You can think of it kind of as a common data layer. I think the goal with Arrow is try to come up with a way to do a much better job with having a loosely coupled data ecosystem, so rather than having to think about things monolithic-ally and having to be very expensive to move from one system to another, we want to make it easy so that you can pick the right system for the right use case and combine those things together.

It's designed to be programming language agnostic, so it's an in memory represent, rather than a programming API, or something like that, which is a little different than most things. And it's designed for both traditional sort of rows and columns, a relational model, as well as sort of arbitrarily complex data that you might see in sort of document databases or JSON files.

And the most important thing about Arrow is it's an Apache Foundation project driven by a large group of people who all have their own sort of motivations around it, and so it's very consensus-driven. And so, using Arrow is something that you can feel very safe about because you know that there's a lot of different people driving it together. So, really goals, very simple. Well documented cross language compatible. Designed to take advantage of modern CPUs, embeddable in all sorts of things.

So one of the important things to understand about Arrow is it's a library more than a system. And so it's designed to be incorporated into data applications you might be building, or data infrastructure products that other people are building, that you will benefit from. And designed to be very very inter-operable.

It's a shredded, nested representation of data, so Arrow is columnar in memory, kind of like something that was columnar on disk. A lot of the same kinds of ideas about how to make something efficient. A different sort of tactics to how to solve those things, though, in memory, for example, we can't to focus on random accessibility, whereas on disk you're really only going to expect more streaming reads.

And so that changes some of the representation. You also chose to use different types of impression on memory, versus on disk, because you have your interaction being different. So want to maximize CPU throughput. The way I kind of look at the world is that your first problem is pulling stuff of the disk. These days people have such efficient formats on disk that they can pretty quickly pull stuff off of disk, and they have a lot of different spindles or SSDs driving information being pulled off of disk.

Your next question is whether or not you can fit this stuff in memory. If you can't, then that's where your performance is gonna be a challenge. But then once you get that stuff into memory, the real issue is about CPU bottlenecks. So how fast can we push things through the CPU, and that's where Arrow's really designed to sort of work very efficiently.

It's also designed to allow you to do scatter gather IO, when you're interacting between different systems, which can be very powerful. And so, when you think about Arrow, it's kind of a chicken and an egg a little bit, in that Arrow's goal is to try to make it so it's easier to move data between different systems, but the question about how you do that, we were thinking a lot about how to do this, and early on there was a bunch of us involved in sort of brainstorming, why does Arrow need to exist before it was even a project.

And really it comes down to this sort of you want systems to all speak the same language, and the way that that happens today is that you have basically people come up sort of these serialization formats, these transport formats, which are useful, and then what happens is each system that's producing data will take from an internal representation of data, to this transport format, move the data across, either the wire or local socket, or something like that, to a second system, and then the second system will take the transport format and move it back to an internal representation.

And so what happens is that you can have these sort of standardized transport formats, but what happens is you use huge amounts of CPU cycles moving from internal representation to transport format, and then back again. And so, when we thought about how do you solve for having a very efficient transport format, it turns out that you actually need to solve first for processing. You need to come up with something that's very efficient to process, because that allows people to use something internally, and then be able to hand it to someone else in a transport format.

And so if you can avoid the serialization and de-serialization boundaries between systems, then that's a huge benefit, but the way you do that is to get people to adopt a format internally for processing purposes. And so, as much as Arrow is designed to transport and move between different systems, it's core is, hey, what is a very efficient processing format, internal to a particular system that then there's a benefit to adopting this for just the internal systems, not just for communication between systems?

And so that's really these two parts, right, is on the left hand side, you can think about hey, this is how we want to have better transport between the systems, let's avoid serialization, let's come up with a way to push things on the network or on local or shared memory or whatever very efficiently. But in order to do that, let's come up with a representation that's columnar, that's very efficient for CPU to process. So it's kind of the two sort of sides to what Arrow is. So it's not just a transport format, and it's not just a processing format. It's really trying to bring those two things together.

This is kind of an older chart here, but we've seen really good progress, in terms of the adoption of Arrow. So Arrow had its first release, I think a year ago, plus or minus, something like that, maybe a little more than a year ago. And we've seen towards the end of last year, we already were surpassing a hundred thousand downloads a month, which is great. And we've seen adoption by a large group of different people. So, that's really what you want to see with an early open-sourced project, in terms of its adoption success, it's tracking very well, versus everything that you need has become very common today.

So let's talk about Arrow basics. So Arrow has some sort of basic data types. You get scalar data types, Boolean, different size integers, unsigned and signed, decimals, floating points. You got your sort of standard date, time, time stamps with different levels of precision, you got your strings in your binaries. So that's sort of your standard scalar types that everybody might expect.

And then you layer on top of that the ability to create maps and lists, which kind of common things that everybody can understand, if you're gonna build that sort of hierarchal object, and then lastly we have a union concept as well. So you can express things that are heterogeneous in nature. So for example, my first record has a column which, Column A is an integer, and my second record it's an array of strings.

So Arrow can also express those concepts, so you can pretty much express any data that you're interacting with with Arrow. There's some more advanced types, like a type, and some of those things as well that are kind of early on in sort of the development, and so we're always looking at different ways. There's discussion, for example, of how you express graphs inside of Arrow, and people are thinking about that as well. So there's more things happening, but these are sort of the core types right now.

So, when you think about arrow, there's really this sort of how do I communicate Arrow between systems? And so we basically think about it in terms of the first step between two systems is thinking about how do I negotiate schema? And so, schema is like, hey these are the fields, this is all the structure, here's the data types. We only need to communicate that once for a stream of Arrow records. And it's sort of the logical description of things, right?

After that, Arrow has one encoding that it's allowed, which is a dictionary encoding, in addition to a raw encoding. And so, if it's a dictionary encoded set of information, then you get a one or more dictionary batches, which describe what is the nature of the dictionary data. And then lastly, you get what are record batches, and so the way that we at Dremio use is we always try to keep record batches to 64 thousand records or less. You can actually have larger ones if you're working inside of a system, it's not a limit of the specification, but it's useful if you're trying to do pipeline between systems.

And generally speaking when you go down to the leaves of a record, or of a single record, you can't have more than two billion values, because of the nature of how the spec is written. But usually that's not a problem for people. When you think about actually how it's represented, so Arrow, you gotta get it in your head, Arrow is an in memory representation of data, and so what's specified is how the data's actually laid out in memory. It's like, okay, this is the first four bytes are this, the second four bytes are this, the third four bytes are this, and so, it's designed columnar, and so basically for each data type you have one or more arrays of bytes that are describing different parts of it.

And so let's start with the middle column here. If we're saying this might be a four byte integer, which is age, expressing this data that's on the left hand side, you're gonna have two four byte integers that are back to back in the data structure. And so what you can see is that the age data is one chunk of memory, completely independent of the name data. Name, as a string, actually has two chunks of memory, which are independent and can be interacted with independently.

The first is what we call the offset structure, which describe the start and endpoint of each string value, and the second structure is the data structure, which actually has the individual characters for this string. And so you start to introduce these offset vectors into the data structures, as data structures get more complex. And so, the phones example here, where you have arrays of strings, introduced one additional offset vector, which is the list offset vector, describing what are the start and end points of individual sets of strings inside the underlying structure.

And so, you basically can use these types of techniques to build up arbitrarily complex data structures, and then when you're processing them, you can choose to only interact with portions of the memory, depending on what the algorithm is that you're using. So, for example, if you're trying to calculate how long the strings are, you can interact with the name offset vector, but you never have to interact with the values themselves.

So, there's a bunch of sort of benefits that you can do in terms of how you're processing. I think Sid'll talk about that more in detail, when you think about the data, in this example, I actually simplified things a little bit. In Arrow everything can be null or not null, and so there's actually one more piece of structure for all of these things, which is a validity vector, which describes whether or not for each individual record, whether or not the value is null or not null.

And so, when you actually lay these things out, this is the same data structure, when you actually lay these things out, you're gonna see, basically as you're communicating, so we're talking about schema and dictionary up top, but then followed down below by the record batch. So these are record batches, one particular record batch is gonna include initially what's called a data header, this is a very compact representation of how long each of the underlying buffers are that are associated with the data that's coming from this particular batch of records.

That is followed by the three different chunks of memory associated with name, followed by the two different chunks of memory associated with the age, followed by the four chunks of memory associated with phone, okay? Now, I showed these as being contiguous behind each other, and that's the requirement if you're gonna go across the wire to somebody. But, Arrow does support this concept of shared memory, of having a single data set that multiple processes can interact with, and in that case, it's not necessary that you actually put these things contiguously. So they could be all over memory space, as long as both systems interact with the same shared memory space, you can actually interact with these things without compacting them into a single spot.

The other nice part about this is these data structures, you can actually line these things up, and then get use scatter gather IO to send it across the wire. So, in terms of components, I think I lost a slide here. Okay. I'm gonna go into basically what are the components that make up Arrow. So the first thing is what we call the core libraries, and these are the different ways that you can interact with the representation, okay?

So if you want to create an Arrow data structure, or you want to consume an Arrow data structure, then there's basically bindings for a bunch of different languages to do that. You know, Java and C++ are the first that came along, but now there's Python, Sea, Ruby, Java Script and Rust, all available to interact with Arrow representations.

Now some of them don't swap every data type today, but we're working hard to make sure that they do, and most of them support all the most common types people are interacting with. And so, if you're building an application today that is data related, especially if you're moving between multiple languages, one of the best ways to build that application is to use these tools to write Arrow memory and then share that memory across the applications.

So, one step on top of the core libraries, is what I call the building blocks, that's an inform term that I call it. It's not what everybody calls it. And the four things that I think about here, two of them already exist, one is called Plasma. And Plasma is a single node in memory store for Arrow representation of data that specifically allows you to have multiple processes sharing memory with interaction to that data. And so you can spin this up and you can load data into it, and then you can have different users who are potentially doing a Python application to grab some memory, interact with it, or grab an application or whatever, grab some memory and interact with it, but several people can interact with the same memory so that you don't have to have multiple copies of your working data set.

And so, super interesting development there. Feather is another thing that's been built inside of Arrow. Feather is an ephemeral on disk format. And so, Arrow is really about in memory stuff, and we don't recommend that the representation be used on disk for long term purposes, because it's not necessarily as efficient as other on disk formats, it's also not designed to sort of ... It's not optimal for on disk.

That being said, if you have two different systems that you want to share data quickly, and you don't yet have a data to do that over RBC or IPC, then one of the easiest things to do is simply use the disk as the intermediary. And so that's what Feather's about, it's about dropping the Arrow memory straight to disk, it's basically the exact same representation as in memory, drop that to disk, you can then theoretically actually MMAP that in another application, and treat it like memory, and it's all in the same representation.

So, Feather's very efficient way to move things between different applications, or if you're running out of memory and want to drop stuff to disk temporarily until you have enough memory again. The other two things that we're working on in the community, one is called Arrow RPC, and this is about trying to deliver on the concept of I'll be able to move data between any two systems.

And so, there's definitely lots of techniques to move data between systems today, but they're not necessarily optimized for this analytical workflow that Arrow is focused on. And so, Arrow RPC is about having a standardized library that works with the core libraries in the different languages, allowing us to move data very quickly between two different systems. It's designed to ... Or it's gonna be designed, or it's in the process of being designed to support both a RPC or an IPC approach, so I can use shared memory if I'm local.

And then lastly, the other really important thing here is since we're dealing with large amounts of data, we're also very interested in supporting the concept that they parallel stream. So, it's great if I'm interacting between two systems that are small, that a single stream is fine, but realistically with the data sets that everybody works with today, we really want to be able to paralyze the set of streams, across many nodes so that two different systems can interact at massive scale.

The other thing that people are working on is something we're calling Arrow [Colonels 00:16:07] right now. That's the working name. And the concept here is that it's great to have a representation, the most common way that people adopt this, if they're trying to adopt it for their own application today is because they're working in two different technologies that don't necessarily work well together, and they want to move data quickly between one and the other. Okay?

The additional option that we see, the addition sort of opportunity that we see is that if we can give people very efficient ways to do a certain algorithm, let's say sort or dictionary encode something, then we can give them those tools and say hey, if your data's in Arrow representation, we can do these very very efficiently, and that's another way for people to start taking advantage of Arrow for a random use case that they might have.

So, on top of that, so you've got core libraries, building blocks, and then you've got Arrow integrations. And so, there are several different projects, this is just a subset of the projects that have all integrated Arrow in one way or another. So, one of the great things is [Pandis 00:17:00] has adopted it as an internal representation for lots of their processing, and we're really trying to drive that direction.

Spark now supports it, and that's what's Brian's gonna talk about, so I won't steal the thunder there, Dremio has a project that we work on, at Dremio. Which is built entirely on top of Arrow. The [Parkay 00:17:17] community, basically we figured out, Arrow one of the things that we did is we built Arrow to also think about how on disk representations existed, and what is an efficient way to move between the two? And so, the Parkay community has heavily adopted Arrow as a way to make it very efficient way to get data out of disk and into an in memory representation.

And so the C++ library built entirely on top of Arrow. And then the last thing I'll mention there is that the GPU open analytics initiative, driven by Anadia and several other companies also has adopted Arrow as the in memory representation for things, and so, that's really nice to hear, because one it's more support for the project, but then also it kind of sort of fits what we approached it with, which is just that the representation should be really efficient, whether it's for CPU or GPU. So that's sort of reinforcement of that point.

So, couple more slides, then I'll hand it off here. So couple ways to think about Arrow. So, we worked with Arrow for several years now at Dremio, and we've learned some lessons, some of them the hard way, of course, sometimes happens that way. And so I just want to share a few of the things that we've learned, and how we've kind of approached things.

And so these are all in context to something we call [Sibau 00:18:22] which is the engine that we have built around Arrow inside of the Dremio product. And the Dremio product's an open search project, to go look at this code, and I have some links I think in some of those stacks so you can look at stuff. But it's built sort of entirely around the Arrow libraries, it's built on top of the JVM.

So the three things I kind of want to share in terms of sort of real world experiences, the first is memory management, the second vector sizing, the third sort of how we approaching RBC communication. So, the first is memory management, and so one of the things that's different than sort of maybe a traditional database is that you've got these chunks of memory because you have all these independent pieces of memory for the different Arrow representations. And so what we do with ... What we do very extensively when we work with Arrow is we work hard on building hierarchical allocators. And having a very clear concept of what is ownership of memory. Because we're constantly moving these vectors between different operations inside this system.

And so what we did is we built something that's on top of implementation that allows you to create a tree of allocators, and this is actually code that's inside of Arrow that we built through sort of a bunch of sort of trial and error that allows you to sort of set two things. It lets you set a reservation for each of these different nodes inside this graph, or this tree. And then, additionally allow you to set limits around things, and so you can say, okay, I want to have a reservation of this operation of one megabyte of memory, I want to make sure it's no more than 10 megabytes of memory, and the allocator will help you sort of reinforce that to make sure you're managing memory throughout the system.

And when you think about Arrow, you're actually thinking about many individual chunks of memory, and so what we always interact with, at least in the Java library, it's called arrays in the C++, in Java it's called vector, but it's basically when we say vector, we typically mean the collection of all of the buffers associated with a particular scalar data type or a particular data type node, if you will.

So, something very important to focus on, because as you start using Arrow, you'll find that because of the random accessibility and some of the other things, it can be a lot of memory. And obviously memory is scare, and so being very aware of memory management is something important to think about when you're working with Arrow.

This is kind of some stuff I talked about, but basically this concept of when we're working through ... For example, an operator pipeline, we make sure we have a very clear concept of ownership across those things, and how we can transfer between them. The second thing is that how big you create chunks of records is also kind of an interesting question.

And so, all of this is kind of based on some early work done by the ... What is it called? The X100 [Monay 00:20:51] X100 papers. Where they started talking about, hey, rather that doing the C Store stuff where I have entire chunks of columns and memory, I'm gonna have sort of like rogue groups or smaller chunks of columns, columnar data in memory, but then I break it up by rows.

And so, that paper kind of identified this concept that like, hey the optimal size of this might be, you know, one to four thousand records, or something like that. Maybe it was one to 16 thousand in their particular example. And so, we actually work on that same model, which is that we generally will start by targeting about four thousand records at a time. And the reason is that you want to figure out a size of record set that is efficient for the overheads associated with the data structures, so if you picked one record, that wouldn't be very good, because the overhead of the data structures would be very great and be very inefficient in sort of terms of representation.

You also wouldn't get any of the benefits of sort of the columnization or vectorization. And so, if you pick a very large chunk of data, then it takes a long time, it's very hard to pipeline it through systems. And it's just, like if I imagine, for example, a batch of records that could be like, a hundred megabytes in size, when I'm gonna push that across the wire, it's gonna be very hard to sort of pipeline that through the system.

And so, basically what we did was we started realizing that depending on the width of the records, you're gonna want to resize how many records you have in the set. And so, we start with four thousand, but we'll actually size it down as the record gets wider. So if you have a thousand columns, you probably need something a lot less than four thousand records, because otherwise the overall data set will be too large. And that means that you're gonna pay a higher fixed cost for the overhead of the representations, but the flip side is it means you're still piping them through the system and not dealing with giant allocations, which can also cause fragmentation problems.

And so, that's something we kind of work through and people should watch out for is that as you get to you know, wide batches, you want to sort of shrink down those things, if you're trying to pipeline them through the system. The last thing to sort of hit on is that RPC communication, so we spent a lot of time, we used the structure that identified it in the Arrow respect to move data between systems, but worked very hard to built it so that we have a gathering and write approach to things.

And then we also have as close to possible as zero copy approach between systems. So, yes you're gonna push it between two systems across the socket, but beyond that try to minimize the number of copies to avoid problems. And so, we worked through that a lot, and basically created this concept of a side car message, and so we've actually been ... So we have our own protocol today, open sources in there, in our hub, but we have our own protocol today because we really wanted this ability to have the side car messages, which hold the bulk of the data, and are not processed through a normal RPC framework.

One of things we've been working with the GPRC community with for some time is a way to try to express this in GRPC, because that's a very nice library that has a lot of good support. And so probably the initial Arrow RPC will be built on top of a GRPC, even though it may not always have zero copy. It may have it in some cases and not in other cases. So that's what we did with RPC, so that was my main slides, oh, that's right, I had two more, I forgot.

So one is that I've done enough talks to Arrow to realize that sometimes people just want to have a very ... It's not necessarily clear. It's kind of abstract. What is it, what is intended? Okay, so what is it? It's a library, or it's a set of libraries, that's what it is. It's a set of tools, it's a specification, it's a set of standards, it's really designed for being efficient in memory, and it's efficient and convertible to things like Parkay where you're going columnar to columnar, very very fast.

What is it not? Well, Arrow is not an install-able system, right? Arrow does not replace X. Arrow is a new thing, it's trying to solve some problems that other systems solve pieces of, but it's a very sort of focused use case, in terms of what Arrow's trying to do, and so you will not install Arrow, you may use Arrow in an application you're building, or you may use an application that has Arrow built into it.

But those are gonna be the two situations that you're most likely to use Arrow today. It's not an in memory grid. There's an exception here, because Plasma's actually in memory single node solution, but don't confuse Arrow with sort of like, an Ignite or a [Tackeon 00:24:47] because it's not about that. There may be people who build something like that on top of Arrow, so for example, someone like, the Tackeon guys may so, you know what, I want to support Arrow as an in memory format in addition to this sort of disc format, or the file system, in memory file system approach.

And they may support that, or one of the other sort of in memory grids may support that, but Arrow is not really trying to drive towards that today. And the other thing is it's not designed for streaming or single record operations. It's designed for working with these large batches of records, so you can work through them very efficiently, and so if you're trying to solve an application where you're interacting with one record at a time and you want to move it through the system very quickly, Arrow is probably not the right thing.

So that's kind of just sort of help with the what it is and what it isn't. I have to get a quick pitch, just one minute, which is what is Dremio? So Dremio is a startup, we're based in Mountain View, we were established in 2015, we launched the product in 2017. And we're really trying to build what we call a self service data platform. It's a way to interact with data that's more easy to interact with. The narwhal's name is Narly, and I hope Justin, we have stickers.

Speaker 2:

We have lots of stickers.


Okay. We have stickers, if you like Narly, you can get a sticker of Narly. And I think we even have stickers that don't even have our name on them. Do we have those? No.

Speaker 2:

Yes, that don't say anything. Just the ones with the name.


Exactly. Kids like the narwhal. But anyway, our project ... Our product is an Apache licensed product, it's built on top of Arrow, we use Arrow extensively, everything that we do inside of our system is basically we read data off disk, immediately turn it onto Arrow, use it throughout the system, all the way out to the clients as Arrow. So we're big proponents of Arrow, and so that's what Dremio is. So that's all I got. Questions for me, before we hand it off to Brian?

Speaker 3:

How does Arrow achieve fast processing?


How does Arrow achieve fast processing? So, Arrow does not today, in itself achieve fast processing. It is a representation which is optimal for fast processing. Right, and so there are certain algorithms implemented in Arrow, there's a bunch of work about implementing a bunch more algorithms in Arrow, but the representation as it's laid out, for example, and actually I think Sid's gonna hit a bunch of examples of this, but basically the representation is designed so that the types of operations you typically do against data are very efficient for how the CPUs like to interact with memory and cash and that kind of thing.

Other questions? Back there?

Speaker 4:

So double clicking on the thing that you mentioned about streaming and transactional use cases, if I have a bunch of records in memory, in Arrow, representation, can I change their values? Like can I update those records in memory?


The libraries that are built for Arrow, think of Arrow as immutable. Sense it's a specification, as long you're managing the locking, you can. Right, now there are some complexities to that, and so the example might be, is I have a bunch of strings that are sitting next to each other. Okay, and the strings are all different sizes. If I want to go in and change one of the strings, it's gonna be hard for me, because I may not have enough space, or I may have too much space to replace the string.

Right, and so there's no extra space, and there's not a bunch of pointers, and so that can be kind of a challenge, but as long as you're controlling the locking, and we actually do this. Sometimes we will actually update the representation. But we never do that sort of ... We're trying to communicate or share with someone else.

Speaker 4:

So, fix up linked values are okay, but variable linked represent ... Like, values, are gonna cause problems, that's what you're saying?


Yeah, that's right.

Speaker 4:

And there is not support in the Arrow libraries themselves to handle this?


That's correct.

Speaker 4:

Okay, thanks.