45-minute read · April 16, 2019
Data Science Across Data Sources With Apache Arrow
Webinar Transcript
Jacques: Hello everybody, thanks for being here late on a Friday afternoon. Hopefully I can keep this interesting for you. As mentioned, I wanna talk about Apache Arrow and what that's about, and specifically, in the context of working with different kinds of data, how it can help you get your job done.

So, as mentioned, I have two hats and I'm actually wearing both of them for this talk. The first is that I started up a company called Dremio, which, you may have seen the booth outside, we are sponsoring the conference. The second one is that I've been driving, with others, the Apache Arrow project for quite some time, since the beginning of it actually. So, those are my two roles.

So, let's start out with a story. I got here late last night and I was riding the Uber from the airport, and the guy is like, "Oh, you gotta have some barbecue, you gotta have some barbecue." And so, I'm like, "Okay. Okay, I'll have some barbecue." So I got here and I'm like, "Where should I go?" He's like, "You gotta go to Terry Black's. That's the good place to go." So anyway, I get to the hotel, I look online. Sure enough, there's good reviews for Terry Black's. I know nothing about it. It's a mile away and I'm like, "What's the best way to get there?" I look, and they're closing in like 35 minutes. I'm like, "Okay, I'm gonna get an Uber. I don't know how else to get there." I could walk it, but that's gonna be … not the best walking shoes for a mile, a mile and a half, or whatever it was.

So, that's my thinking. I live in the South Bay, in the San Francisco area, and so I drive; I don't live in downtown San Francisco. I'd never scootered. So, I ordered the Uber inside the lobby, and immediately I walk outside and I see scooters. And I'm like, "Ha, I should've used a scooter. That would've gotten me there in time." And I'm like, "Well, I'll use a scooter on the way back."

So anyway, I go to Terry's, and as I'm walking into Terry's I see scooters. And I'm like, "Well, I don't know what brands of scooters are here. I know about Bird, I know about Lime, but I don't know if those are the ones that are here." As I'm walking in I see a bunch of scooters that look like this. Now, if you know what this is you know it's a Jump scooter, but if you don't know what it is, it's really hard to read what it actually says it is. Okay? And so, there was a bunch of them, they were kinda worn out, and I'm like, "I can't tell what that is actually saying on the side of it." And so, I'm like, "Well, I'll Google online and see what's going on with scooters in Austin."

So, I think you're the only place in the world that has this much content on scooters. I was shocked. There's this whole thing called Dockless Mobility, it's like this big initiative. And so, first of all I was able to find a list of all the scooter companies around, which was cool. So I went through and I actually started clicking on them to figure out what the hell scooter this was. Sure enough, I figured out it was a Jump scooter, but the second thing I saw was, hey, look at this. You guys, or at least the ones of you that are here from Austin, expose data about what's going on with the scooters. So, I'm curious whether or not we can start to see patterns, through this data, about what's going on with people at AnacondaCON using scooters. Okay? So I'm like, "Hey, this is an interesting thing to take a look at."

So anyway, I saw the information last night.
This morning I'm like, "Hey, let me see if I can pull something together about that." And the question was, are there others here that are like me at AnacondaCON? Because honestly, I'd never ridden a scooter before. I did ride a scooter last night coming back to the hotel. It was fun. It was interesting. I don't know that I'll do it again anytime soon, but it was good. Anyway, the question was, are there others like me? And so, how do you solve this problem? How do you start to understand what's going on with this dataset? This is a new dataset, I've never looked at it before. I wanna figure out what's going on and start to do some analysis against it. Okay?

So this brings me to the product that we've built, which is called Dremio, which is a tool to help you find, access, and understand data and connect it to whatever tool it is that, at the end of the day, you wanna actually analyze with. So, it's not an analysis tool itself, it's trying to connect you with data. Okay?

Now, before any of these thoughts around scooters, I had actually built up a little instance myself for the demos that I was gonna do later in this talk. So, what I did, in my S3 bucket, I dropped the CSV for the Austin Dockless Mobility data. Okay?

Now, what's cool about this data is it's actually updated every night. So, I downloaded it this morning and popped it up on S3. You go to Dremio and you can simply set what the format is. In this case it's basically a line-delimited CSV file which has headers, which is nice. So anyway, I dropped it into here and that allows me, in Dremio, to look at it. Okay? So I can just see the data and start looking at it.

Okay, I'm like, "Well, what do I wanna do with this data?" And I start to notice, well, it's a CSV file, so it doesn't have data types. So, I went through the tool, and the tool allows you to do things like set data types for all the different fields, clean up data, remove data that's not interesting to you. And so, for example, a bunch of scooter entries had no start or end locations. So, I don't know what that's about, but that was part of the data. As you always expect, there's lots of dirty data, and so what I did is I actually built up what we call a virtual dataset in Dremio, which is a bunch of different operations. Nothing that interesting, basically just cleaning up the data, setting timestamps, eliminating data that seems to be empty, things like that. And so, that gave me a base dataset to look at.

And in Dremio you create these things called spaces, which allow you to store a set of content that's interesting to you. And so, this virtual dataset is something called Scooters in the Austin space that I created for this talk.

Now, I could just interact with that data and start playing with it. But I wanted actually to look at it in the context of what's going on here at AnacondaCON, and so what I did is I actually built a dataset on top of that scooter dataset called the Fairmont Austin data. And for this Fairmont Austin data I said, "Okay, I'm gonna take that data and I'm gonna wrap a bounding box around it." I went and found out what a good bounding box was for the Fairmont, this hotel, 'cause I didn't know what that was. And then I basically categorized the data in a couple of different ways.

So, I started out by categorizing the data by whether the ride either started at the Fairmont or ended at the Fairmont. Okay? So, the same bounding box, but was it the start location or the end location.
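(As a rough illustration of that classification step, here's what the started-at versus ended-at-the-Fairmont flags might look like if you expressed the same logic in pandas rather than as a Dremio virtual dataset. The bounding-box coordinates and column names below are made up for the sketch, not taken from the actual dataset.)

```python
import pandas as pd

# Hypothetical bounding box around the Fairmont Austin (illustrative, not the real one).
LAT_MIN, LAT_MAX = 30.2605, 30.2630
LON_MIN, LON_MAX = -97.7410, -97.7380

def in_fairmont_box(lat, lon):
    """True where a (lat, lon) pair falls inside the bounding box."""
    return lat.between(LAT_MIN, LAT_MAX) & lon.between(LON_MIN, LON_MAX)

# A couple of fake trips standing in for the cleaned-up scooter dataset.
trips = pd.DataFrame({
    "start_lat": [30.2612, 30.2700], "start_lon": [-97.7395, -97.7500],
    "end_lat":   [30.2800, 30.2615], "end_lon":   [-97.7300, -97.7400],
})
trips["started_at_fairmont"] = in_fairmont_box(trips["start_lat"], trips["start_lon"])
trips["ended_at_fairmont"] = in_fairmont_box(trips["end_lat"], trips["end_lon"])
print(trips[["started_at_fairmont", "ended_at_fairmont"]])
```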
I also went further and categorized the trip distance, because I started to see some patterns in the data as I started to analyze this, and so I started adding more information in here and basically categorized trip distance. So, anybody who went under 100 meters I considered to be a newb, because that probably means you started the scooter and stopped the scooter without actually moving the scooter. So, you just paid the dollar to that company. I think they're happy you do that, but to me that's a suggestion that you didn't necessarily have that much comfort with a scooter. Then I actually said, "Hey, anybody who's going under 500 meters is probably someone who doesn't like to walk." And so I categorized them as lazy, and then everybody else was normal. Okay?

And then I said, "Let's look at the data." And so, in this case I built this dataset up and then what I did, I'll close this out here, is I dropped it into Tableau. Now, I know that not everybody here uses Tableau. A lot of people use different kinds of tools to look at data. I happen to use Tableau quite a bit, and I opened up the dataset there. You can click in Dremio and it pops up and lets you log in and then access that dataset directly.

Now, so as not to take you through the whole process of building up analyses, I actually built up some analyses that I was looking at earlier today to see what was going on with the data. And so, here is the time analysis of scooter rides that started at … So, I actually applied a filter here. Oh no, actually I didn't. I have both routes here. So, this is the number of records that ended at the Fairmont, this is the number of records that started at the Fairmont, and then the time since the beginning of the year and the activity. Okay?

And my first question was gonna be, "Hey, can we see a spike because the Anaconda event is here? That there's more scooter users?" And the reality is no, we can't. Right? We're right here at the end. Right? So, in fact, we actually use the scooters less than, it seems, everybody else does. Interesting. The second thing is like, "Oh wow, yeah, I wonder what's going on here." Well, South by Southwest must have been going on there. Right? So, clearly a substantial amount of use during South by Southwest. Right?

And so I was like, "Okay. Well, let's look at just the last 48 hours of scooter use." Right? And I was trying to say, "Hey, I wonder how much of a pattern you can see around the agenda of AnacondaCON and whether or not people are going and using scooters at different times." Didn't come up with anything super interesting here. Well, one, it looks like people are going to lunch. Okay? So, people apparently didn't like the lunch here and wanted to go off to lunch. And then the other one is that at nine o'clock a bunch of people left. And so, nine o'clock was after the party last night. I'm not sure what was happening that everybody decided to go someplace else. And so anyway, I'm looking at that.

The other thing that I saw here, and I saw this across the board, right, even in the general time analysis, is that apparently people like to take scooters from the Fairmont, but when they're coming back they don't use scooters. The number of people who take the scooters away is way more than the number who come back. Right?

So anyway, I also looked at what's going on with the map here. So hopefully this map will reload. The map doesn't reload. Basically, if you can see my cursor here … Oh, there we go, there's the map. So, the Fairmont Hotel is basically right here. Okay?
And so, if you can see my cursor, it's kind of in this zone. Okay? And I was looking at this map, and this is when I started to actually say, "Hey, maybe we should look at what's going on here." Because if you look at where people went with scooters from AnacondaCON, most of them went nowhere. Right? If you look at it, they went here, they went here. These are the end destinations of people who started at the Fairmont. Right? So basically they didn't go anywhere. Okay? And that was when I started saying, "Hey, what is going on here?"

And so, I broke it down and I said, "Hey, what is the minimum trip distance, what is the max trip distance by hour, as well as the average?" And what you see here is these crazy short minimum trip distances. Right? So, the min trip distance here is 112 meters, but then all of a sudden at noon there's all those people who are actually just going zero distance. Now, that could've been one guy or gal who was just trying to figure out how to use the scooter and stopped it 12 times, but you basically see a bunch of situations where people went very little distance on their scooters.

And so, this is when I said, "Hey, what I wanna do is actually take a look at the use of scooters and see whether or not the people who use these scooters at AnacondaCON are different than others." Right? And so, I came up with those categories. One, the newb, who is someone who doesn't actually go anywhere. And the second one is someone who's lazy, who doesn't go more than 500 meters. Okay? Probably something that could've been walked. Okay? And I apologize to those people who actually have injured feet or whatever, not talking about you.

So, I was like, "Okay, let's look at those two things." And this is when I dropped into a notebook and I said, "Okay, let me build up stuff." Now, I'm a SQL person. I could've done this with different kinds of tools, but I'm a SQL person, so I actually built it up using SQL, and I basically wanted to calculate how many people were newbs out of the total number of people. And I was like, "Well, what am I gonna compare it to for a baseline?" And I said, "Well, South by Southwest might be an interesting baseline." Right? On the one hand I'm like, "Well, it's gotta be a lot of people who are using scooters a lot, it seems like, coming to South by Southwest." And so, anyway, I was like, "Oh, let's compare these things."

And so, this is what I came up with: for the South by Southwest audience, about 22% of people are lazy. Okay? Out of the total scooter rides. And actually AnacondaCON is a little bit less. So, hey, good for us. Right? On the flip side, it turns out that there's more newbs for us. So, apparently people use scooters more in the audience that goes to South by Southwest.

Now, the one thing I wanted to compare it to was, how does this actually compare to when no event is going on? And what's interesting there is that apparently there's more lazy people at AnacondaCON and South by Southwest as opposed to the norm. But on the flip side, we're generally slightly better at using scooters. Okay?

So, jumping around here, with access, whether it's in one tool or another tool, the goal is to be able to get to data very quickly. So let's go to a couple of slides that talk about that. But the good news here, and I apologize ahead of time to the vegetarians, was I did in fact have success and enjoy some brisket and a beef rib.

So, let's talk about what that Dremio tool is, and then we'll talk about Arrow, and then we'll talk about how those things fit together.
So, Dremio is an open source product. We have a community edition and an enterprise edition. It's basically designed to help you find, access, curate, share, and secure your data. Right? So, think about all the things that you need to do with data before you can actually start to consume it. You're gonna have to clean it up, you may have to canonicalize some things, you may need to join it with some data that's very specific to your business. All those things are things that actually should be shared.

That's one of the main ideas we have with Dremio: today, many times, several different people inside an organization are doing the same steps with the same data. Okay? And that's because you start building up your analysis inside your notebook and then someone else has to go through the same set of steps in their notebook. And so, with Dremio the goal is that one person can go through these steps, other people can then go and find the data, and they can collaborate together around it. Okay?

It's designed to run on your laptop, or it can run in a cluster. So, we have customers that are running five, six, seven hundred nodes of this in a cluster environment, and then we also have people who just use it on their laptop. And it's really, really fast. The goal is to be able to get access to data faster than what you can do today. Okay? And the way that we do that is in large part built on top of something called Apache Arrow, which I'll talk about. We also use Parquet a lot, and several other things.

And one of the key things that we do is we try to abstract away the details of what data source you're accessing. So, whether you're accessing a CSV file or you're accessing an Elasticsearch cluster or you're accessing MongoDB or Oracle or SQL Server, all of those things have their own interface. Some of those are actually SQL, but even their SQL dialects are different. And so, Dremio says, "You know what? Someone who's consuming data shouldn't have to worry about all of those details and should be able to just access a data endpoint and always use a single language, in our case SQL, to access that data." Okay?

And then lastly, if you haven't already met him, if you like our narwhal, his name is Gnarly and he's our mascot. And so, you'll see him around. You probably saw him outside there in the booth.

I'll go through a couple of slides here in detail, but basically we look at the problem of data access as a problem that two main groups of users have. Okay? The first are data consumers, so people who are actually trying to get the data, right? They wanna figure out how they can get this data into a shape that's useful for whatever tool they have. And the second one is data engineers. These are people who are actually responsible for getting data, the mechanical heavy lifting of data, reorganizing data so that it makes sense for the business, that kind of thing.

And Dremio is really built to help these two users do their jobs on a daily basis. Okay? And the way that it works is it sits between whatever sources of data you have and whatever tools you wanna use to actually analyze your data. Okay? And there's a bunch of components to how it does this, but at its core it's about how do I make this data available to people.
Okay?

Now, I'm not gonna talk a lot about it in this talk, but one of the key things that we do is something we call data reflections, which is the ability to materialize different versions of data, and then when a user asks for data, or is consuming data for a particular purpose, we can pick the fastest version of that data to get them the data as quickly as possible. Okay? And so, in this example here I created a reflection on the CSV file, because parsing the CSV file over and over again and casting all of those things actually takes some time, and I want as interactive a response time as possible.

Another way to think about it is Google Docs for your data. This idea that you can create an artifact and share that artifact with others, you can set permissions around that artifact, and basically avoid doing the duplicative work that people typically do.

As I said, we are fast. The reason we are fast, in large part, is because of Apache Arrow. So, the product has been out for a little less than two years; Arrow has been out for more than three. So, when we started the company we started thinking about how do we make this a fast product when it's sitting between those other things. And one of the key things we came up with was the idea of Apache Arrow, and I'll talk a little bit more about that in a second.

Yeah. So let's talk about Apache Arrow. So, I think probably most of you have heard about Apache Arrow by now. It's great news that you have, because a few years ago that wasn't the case. So, let me talk a little bit about that. Arrow, at its core, is the standard for in-memory data and how you represent that data for high performance processing. Okay? It's designed to speed up a bunch of different workloads, it's focused on analytics, and it's really about a consensus-driven approach to this representation of data. And so, the project has, I don't know, 30, 40 different people who are involved in it, that are stewarding the project, that are from many, many, many different companies. And that's a key part of why it's important and why I think it's been as successful as it has.

Arrow had this idea, right? When we were looking at the problem of interoperability we thought about, well shoot, there's a huge number of formats that are already designed for interoperability. I can just send stuff in REST and JSON, I can send it in Thrift, I could send it in ProtoBuf, I can write it into a file with various different formats and communicate it that way. The problem with all these things is that they're fairly expensive. Okay? And the reason that they're fairly expensive is not about the format itself. The format may be an efficient format to move data; the problem is that systems don't use those formats internally when they're working with data. Okay?

So, what happens is, every time you move from one system to another system, you take an internal representation of data, you go through several steps of serialization and encoding, you put it on, say, the wire, then you get it to the new system, and that system has to deserialize the data and put it into an internal representation again. Okay? And not only is there that transfer between those different representations, but in most cases those APIs are typically built on top of methods. Right? So, I wanna access the value in this cell. Okay?
And so, even the requirement of reading cell by cell out of data between one system and another system becomes very expensive.

And so, when we were looking at interoperability, what became clear was that it was more important to think about how people adopt a format internally than it is to think about the actual format that's being used to do interoperability. Okay? And so, we basically said, "Forget everything you think about when you think interoperability of data formats and focus on processing." Because if we can build a representation of data that's very efficient and common for processing purposes, then as systems adopt that representation to improve their processing performance, all of a sudden when one system is talking to another system they happen to have the same internal format, and therefore the interoperability can be very, very highly efficient. Okay?

And so, that's kind of the idea behind Arrow. And how it started was, about four years ago I got to know Wes McKinney. Okay? I came from the database side, he came from the data science side, and what we identified together was that those two communities had never done a good job of integrating with each other. They're kind of out of two different worlds. Okay? And so, when we came together we basically spent the next six months talking to different people, starting to socialize this concept of Arrow, and then after about six, eight months, something like that, we launched the Apache project.

So, since that time we've seen a huge amount of adoption. Apparently I've lost that slide. Well, that's okay. I believe that last month there were about two and a half million downloads of Arrow in that one month. So, that's amazing given that the first artifact of Arrow was available, I think, two years ago or a little bit less. And so, a huge amount of success there.

What is the representation? I'll go through these slides pretty quickly, just to give you a quick overview. The representation is a shredded, nested data structure representation. So, it doesn't just support flat rows and columns, it supports arbitrarily nested data structures. It's designed specifically to be randomly accessible so that you can do high performance processing without having to encode into a different representation. It's very much focused on maximizing your CPU efficiency, pipelining things through the CPU, organizing things in a columnar way to improve cache locality, et cetera. And it's also designed to very easily and efficiently scatter and gather onto your network socket or your disk. Okay?

Oh, I do have the slide. Well, there it is. So, a huge number of projects have adopted the technology, and that really goes to the fact that it is a foundational technology. And as you see, the chart of downloads is a pretty amazing chart. So, I had the chart from a couple of months ago and I was like, "Hey, it's a million downloads a month." Now it's two and a half million, so it continues to do very, very well.

So, let's talk about the format in a little more detail. You basically have all the standard types that you would think about in most data representations. You've got scalars, things like Booleans and integers; you've got date, time, timestamp, and intervals; strings and binary data. You've got the main three complex types from my perspective, which are struct, map, and list. And then you also have a union type, which is for exposing things that are heterogeneous.
So, for example, if I'm processing a JSON file and in the first record field A is an integer, and in the second record field A is a list of strings, I should be able to support that even though there's no name to that union. And so, Arrow also supports that concept.

If you think about Arrow, it's this representation of data and how you lay it out in memory, and then how you communicate that between different systems. Okay? And the way you communicate it is that you start by sharing schema information between the two different systems; then potentially, if the data can be dictionary encoded, you will share a dictionary batch, which defines how to dictionary encode the different values; and then a bunch of batches of data, which are called record batches. These are actually chunks of records in a columnar representation, but they're all the columns for that chunk of records. Record batches, different people use different sizes, but we typically target around 256K of memory per record batch to keep things close to the cache.

Here are some details about how it lays out in memory. The main idea to understand here is that it's columnar in its representation. So, data is laid out next to each other: all the name data is next to each other, all the age data is next to each other, and all the phone data is next to each other. And we also encode the different kinds of data next to each other. So, if you're talking about strings, which is the name here, then we actually encode all the data end to end, and then in a separate vector we encode the start and end point of each string. Okay? And so, this allows you to do operations just on the metadata of the data structure, but it's also lined up very well to do all sorts of efficient things on the CPU. And so, if you actually look at the integer representation, this is designed very well to fit within the instructions that already exist. And we focus on alignment and whatnot to make sure that that happens.

You then take all these different structures and you put them together, and that is one record batch. So, let's say there's 64,000 records in this batch, so you're gonna have all the data for the … So, there are actually three vectors for a string: there are the offsets and the data, but then there's this third one, validity. Validity says whether or not this is a valid point of data. And so, these things are basically end to end; then you put the age information, then you put the phones information, et cetera. You put all these things together, and that's a record batch. Okay? And there's two ways to think about a record batch. One is that, in some situations, Arrow may actually be sharing data in the same process or in a shared memory context, and in those cases that data could be scattered all over your memory. Okay? But if you're putting it on the wire, and that's what I'm gonna talk about in a little bit, in those situations you generally compact the stuff all together; you do a gathering write when you send the stuff on the socket, so that the other side receives the data in this end-to-end format. But that's not actually a requirement of the format itself. Okay?

So, when you think about Arrow, the first thing that we did was we built a whole bunch of core libraries. These core libraries are in, I think, 14 different languages now. This is only a subset of all of them. And so, if you're working in any of these languages and you're building a data application, you can use these libraries to build up that representation in that native language. Okay?
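(For example, with the Python library, pyarrow, a minimal sketch of building a small record batch and looking at the buffers behind a string column, the validity bitmap, the offsets, and the concatenated string bytes, might look like this. The column names and values are just illustrative.)

```python
import pyarrow as pa

# Columnar arrays; None shows up as a 0 bit in the validity bitmap.
names = pa.array(["dremio", None, "arrow"], type=pa.string())
ages = pa.array([34, 29, None], type=pa.int32())

# A record batch is just those column chunks plus a schema.
batch = pa.RecordBatch.from_arrays([names, ages], names=["name", "age"])
print(batch.schema)

# A string array is backed by three buffers:
# validity bitmap, offsets into the data, and the string bytes end to end.
validity, offsets, data = names.buffers()
print(validity.size, offsets.size, data.size)

# Putting it on the wire: an IPC stream is the schema followed by record batches.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
table = pa.ipc.open_stream(sink.getvalue()).read_all()
print(table.num_rows)  # 3, reconstructed from the stream
```

The stream the reader gets back is just the schema followed by those same buffers laid end to end, so nothing has to be re-encoded on either side.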
Not every one of these is a native implementation; some of them use a C binding, but most of them are actually native implementations so that you can debug it and understand it yourself.

So, on top of those libraries we then have what I call building blocks, which are additional components in the Arrow collection of tools that are for solving specific problems. And four of them I think are important to know about. First is the Arrow Gandiva project, which is a sub-project of Arrow that is focused on building an LLVM-based compiler for arbitrary expressions. So, in many situations you're gonna do a calculation, like A plus B minus two divided by 0.2, or whatever that might be, right? Those expression trees can typically be a fairly expensive part of your processing pipeline, depending on what you're doing. And so, having runtime code generation that allows you to process those things very efficiently can be very cool. And the reason that Gandiva can exist and actually generate very, very efficient algorithms is because Gandiva knows exactly what the representation of the data is in memory. And so, you can build up an Arrow structure in JavaScript and then hand it to Gandiva, and it will process that data with that arbitrary expression much more efficiently than you can do any other way. All right? And it doesn't really matter what language you wrote it in. Gandiva is written in LLVM and C++, but it also has bindings for Java and Rust and several others now. Okay?

The second one, and this is where I'm gonna go into a lot more detail, is called Arrow Flight. And that's really about communicating data between different processes that wanna share Arrow data. Okay?

Two others that you should know about. One is Feather. Feather was a very early project that was added to the Arrow initiative, and the Feather project is really about serializing data for an ephemeral purpose. So, if I wanna go from one application to another and I don't have Arrow Flight yet, one of the easy ways to solve things is to just write stuff to disk and then go into the other application and read it out of disk. But you generally don't wanna spend any time encoding that data, because you know you're gonna be throwing it away right after you read it. And so, Feather is a very lightweight format that allows you to drop data from memory onto disk and then bring it back out.

And then lastly, Plasma is a shared memory storage layer, focused on a single node, that allows you to take a representation of data, Arrow being an example, store it in shared memory, and then access it from multiple applications.

So, on top of the building blocks there are actually several really important integrations that you probably know about. Or if you don't, here they are. The first is that, because we're working closely with Wes, pandas is doing a lot of adoption of Arrow to improve the performance of a bunch of core processing algorithms. We're also working with the Spark community, and so now if you wanna do Python operations inside of Spark, they are something like 70 times faster than they used to be, because the Spark libraries actually use Arrow as the representation to move between the Spark context and the Python context.

Dremio, the project that I work on, I talked about that a little bit. The Parquet project has adopted it heavily to improve the performance of pulling data out of Parquet files and writing Parquet files. And then I … Shoot, I forgot to update this. The NVIDIA guys are gonna get on me on this.
The NVIDIA RAPIDS initiative, which grew out of the GPU Open Analytics Initiative, has also adopted Arrow as the internal format and the internal representation for GPU analytics. So, basically you're seeing it in a lot of the common things that you use now, and the goal is to basically speed those up.

So, I always hit this slide, because Arrow is a lot of things, but there's a lot of things that it isn't, and I wanna make sure that you don't think, "Oh, well, it competes with this," or, "How is it gonna replace that?" It's a foundational layer. Okay? So, it's a specification, it's a set of libraries and tools for dealing with data, standards for how to move that data around between different systems, obviously designed for very efficient processing. But what it isn't: it's not an installable system, so there is no Arrow daemon that you run. It isn't a distributed system. Some people think that it's an in-memory grid; it's not that either. You could build all these things with it, but it itself is not any of these things. It's also not designed for single-record or streaming applications where you can't deal with these batches of data. Okay? It's really focused on analytical workloads where you're gonna have a decent number of records in each one of those record batches before you push it along the wire or whatever.

So, that's Arrow. So let's talk a little about Flight. I'm gonna go fast here because I know I'm gonna run out of time. Arrow Flight is the next phase of Arrow. So, Arrow came out two or three years ago, whatever it was. The project started three years ago, the downloads came out like two years ago, and we've grown it since then. Arrow Flight is the next phase.

And so, about a year ago we started the Arrow Flight initiative, and the goal here is to have a standardized way of moving data between different systems. A high performance wire protocol that you can have in multiple different languages to move data between different applications that are not necessarily on the same system. So, you could do shared memory before, but if I'm not on the same system, or if I'm in containers where shared memory is not really a thing, I can use Arrow Flight to communicate that data.

If you think about it, there's building blocks, right? We started out with, hey, we need a representation that's common. Then we need libraries for that representation. Now we wanna be able to communicate that representation on the wire between different systems. So, it's really trying to fulfill the promise of interoperability of systems, and that's sort of the core of what Arrow Flight is.

In Arrow Flight there are two main types of operations: data operations and actions. Okay? Data operations are generic. Like, I wanna get a stream of data, I wanna give you a stream of data. Those are the two main data operations. Okay? On top of that we then have actions. Now, actions are per application. So, it's a generic interface in the Arrow Flight libraries to allow you to have different concepts of what an action might be for a data microservice.

And so, I'll talk about this a little bit, this concept of a data microservice. So, you guys are all familiar with microservices in general, but one of the things that happens today is that people generally don't have a way of binding different data services together if they're doing analytical workloads. What they do instead is they usually write data to disk. Okay? 'Cause that's the easiest way to get a large amount of data from one application to another.
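(As an aside, the Feather format mentioned a moment ago is one lightweight way to do exactly that disk handoff. A minimal sketch, assuming pandas and pyarrow are installed and using a made-up file name and data:)

```python
import pandas as pd
import pyarrow.feather as feather

# Producer side: dump an in-memory frame to disk with minimal encoding work.
df = pd.DataFrame({"trip_distance": [87.0, 412.5, 2250.0],
                   "category": ["newb", "lazy", "normal"]})
feather.write_feather(df, "handoff.feather")

# Consumer side, possibly a completely different process: read it straight back.
df2 = feather.read_feather("handoff.feather")
print(df2.head())
```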
The goal with Arrow Flight is that you no longer need to do that. And so, people can build a service which scores a bunch of data, package that up, and then have an Arrow Flight endpoint where you send data in and you pull data out. Okay? Or a system that may be caching data in memory, right, so that you can write data into it and then someone else can read it out. And so, it's basically trying to abstract away the details of what that system is doing and say, "Hey, let's just have a common way of communicating large streams of data between different systems."

But on top of that you may wanna have different operations. So, if you think about the in-memory cache scenario, you might wanna be able to expire something in the cache, or say, "Hey, pull this data out of S3 and load it into the cache." Those would be specific actions for that application, and the Arrow Flight protocol supports defining these actions for each type of application you might build with Arrow Flight.

I'm gonna skip over this, 'cause I know I'm not gonna have enough time. Basically, I'll post the slides; this is an overview of how the protocol works in detail. One of the important things to note about the protocol is that it's parallel by definition. Okay? So, if you realize, when you're working with large datasets, odds are you don't want one stream to move data between two or three systems. So, as I talked about before, we work with a lot of customers, we're running on hundreds of nodes at a time, and I wanna take hundreds of nodes and communicate with hundreds of nodes. And the last thing I wanna do is have to do that through one stream. And so, when you go and get what is called a flight from one node and say, "Hey, how do I access you?" it will actually return and say, "Hey, I've got these 150 streams, and here's all the places you can get each of those streams," so that you can go to all those places at once with all of your nodes and pull that data in parallel. Okay?

And so, more and more people are doing distributed computing, and having that as a core component of Arrow Flight is very nice. Now, it doesn't require that the consumers or the producers are multi-stream. So, if I'm a consumer and I'm on my laptop, which is what I'm gonna demo in a second, in that case I'm just gonna pull one stream after another. I can choose to parallelize it on that one laptop, but I don't have to. Okay?

There's also a security and authentication set of pieces to the protocol, so you can go and read about those online. And it's also designed to support backpressure and stream management. So, if you don't know, we built our stuff on top of gRPC, which is a great library, although it's more focused on application communication, so we had to make several changes. But one of the nice things about gRPC is that it's built on top of HTTP/2, and HTTP/2 has a concept of streams. And if you think about pulling data between two systems, one of the things you really have to worry about is whether or not the backpressure works correctly for what you might wanna do.

And so, the common use case where this could be a problem is that you've got two different people who are consuming from the same system and they're consuming in a shared context, so they're consuming over the same socket. In that situation you may have user A who's pulling data very slowly and user B who's pulling data quickly. In a system where they don't manage this well, if user A doesn't pull the data, user B can never get his data. Okay?
So, with HTTP/2 streams and the way that we built this with Arrow Flight, that's not at all a problem. If user A is not pulling quickly, then that backpressure gets pushed across the wire so that the sender will actually stop sending, but at the same time user B can continue to pull. So, you're basically multiplexing on a single socket to make sure that individuals are not impacted by each other.

So, let's talk about it in action. I've got a little example I wanna show here, which is a Jupyter notebook, so I've got Python 3.7, and I'm gonna compare three different things: pyodbc, turbodbc, I don't know the right way to say that, and pyarrow.flight. Okay? And I've actually connected it up to Dremio, and I'm running it on some EC2 instances.

So, let's go over here to the first example. And I'm gonna just reload this, because I know the notebook never likes to run unless you reload. Okay. So all I've got going on here is I'm gonna show you basically two different things to start. I'm gonna be running with pyodbc and I'm gonna be running with Arrow Flight, and I wanna show you what the difference is in performance once you start to use Arrow Flight. Okay?

And so, here I've just got a couple of imports, and then I've actually written some utility functions to just print the time of doing two different operations, the pyodbc query and a Flight query. Okay? And so, the first thing to note about this is what happens if you run small queries. So here I have a random dataset, it's the [inaudible] table from the TPC-H benchmark. It's maybe 10, 12 columns, 15 columns, something like that. The table is fairly large on this cluster so that I can do a large query on it to show you, but for the first query I'm just gonna go get 2,000 records. Okay? Well, 2,000 records isn't a lot, so if I run this thing you'll see they're about the same, right? So, it's 30 milliseconds versus 54 milliseconds. I think if I run this five times they'll kinda swap around, because it's kind of in the noise. Okay?

So, for small datasets it doesn't matter. You can use your ODBC stuff that you have today. The question is when you do larger stuff. And so, let's go here and we're gonna run the pyodbc run. And let's wait a little bit. Okay. So, if you think about it, this goes to exactly what I was talking about before, right, interoperability. Okay? When you're communicating between two systems, if you think about this example, this is talking to Dremio. So Dremio has to get the data into a representation that can be communicated [inaudible] on the sending side. So there's work that has to be done there. And then on the receiving side pyodbc has to take the data at the cell level and read it out into an internal representation so that it actually can work with the data. That internal representation transformation can be very expensive. And so, I'll actually show you turbodbc in a second, and then you'll see how that's faster.

And so, hopefully, unless I completely lost my demo, we should get a result back here. And so, the goal is, "Hey, I'm gonna pass some data into a data frame," or whatever I'm using. The problem with that is that, if it takes a long time to get data back, it means that I start to go and lose my focus. Right? I'm a data scientist, I'm trying to figure some stuff out, and I wanna give it some data and start working with it. And guess what? It took a long time to do, so I start surfing the web, I go and get a cup of coffee, and I lose my focus and I lose the efficiency that I might've had.

And so, here's an example.
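(The notebook itself isn't reproduced in this transcript, but a comparison along these lines might look roughly like the sketch below. The host, port, DSN, and query are placeholders, and Dremio-specific details such as ODBC driver setup and Flight authentication are left out.)

```python
import time
import pyodbc
import pyarrow as pa
from pyarrow import flight

SQL = "SELECT * FROM lineitem LIMIT 5000000"  # placeholder query

def time_odbc(sql):
    # Row-at-a-time path: the driver converts every cell into Python objects.
    conn = pyodbc.connect("DSN=Dremio;UID=user;PWD=secret")  # placeholder DSN
    start = time.time()
    rows = conn.cursor().execute(sql).fetchall()
    print(f"pyodbc: {len(rows)} rows in {time.time() - start:.1f}s")

def time_flight(sql):
    # Columnar path: the server streams Arrow record batches as-is.
    client = flight.FlightClient("grpc+tcp://dremio-host:47470")  # placeholder location
    info = client.get_flight_info(flight.FlightDescriptor.for_command(sql))
    start = time.time()
    # A laptop client can just walk the endpoints one after another; a cluster
    # could dial each endpoint's listed locations and pull them in parallel.
    tables = [client.do_get(endpoint.ticket).read_all() for endpoint in info.endpoints]
    df = pa.concat_tables(tables).to_pandas()
    print(f"flight: {len(df)} rows in {time.time() - start:.1f}s")

time_odbc(SQL)
time_flight(SQL)
```

The gap shown next comes almost entirely from skipping that per-cell conversion on the ODBC side.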
So this is five million records. It's not a big dataset. This is how long it took: 53 seconds to get it back from ODBC. Okay? So, what's the difference with Arrow Flight? Well, let's run that one too. Okay? 1.3 seconds. Okay? So, what you guys are doing today, we can make it faster with Arrow Flight, and that should make it more efficient for you guys to work on stuff.

And so, to give you some sense of this, I'm not gonna wait for it now, but I actually ran it last night and wanted to figure out how long it would take to pull a billion records into the data frame, which I actually can do with this and did last night. I'm not gonna make you wait now, but with Arrow Flight it was 199 seconds. So that's a little bit like, "Okay, I'm gonna go get coffee for that. I don't really wanna wait three and a half minutes." Okay? But it's still a completely reasonable amount of time to pull a billion records into my data frame for analysis purposes. Okay?

Now, to give you some sense of what that means when we compare the two different things: if I used pyodbc to do that, it would be two and a half hours. Right? So, if I can take two and a half hours down to about three minutes, that's got to improve my productivity in a day. Okay?

Now, I showed this once before and someone was like, "Well, turbodbc is actually way more efficient than pyodbc, so it's not really fair to compare that." And so I actually did the same set of operations on turbodbc earlier today. So, 22 seconds. And that's not for the billion, that's the five million. Okay? And so, if you put this in numbers, let's go back over here, this is what it looks like. Okay? This is the five million records. This is a previous number, I didn't update it right now, so it was 54 seconds when I ran it earlier, 22 seconds for turbodbc, and 1.3 seconds for pyarrow.flight. Right? So, this should hopefully help you be much more productive in doing your work.

Now, it'll take time for different technologies to adopt Arrow Flight, but at Dremio we're pushing it hard, and I think that other people are starting to look at it seriously across the board. So, it should help you guys' productivity on a day-to-day basis, and when you layer that with other things like Dremio, it should be really helpful for you.

So, on Arrow Flight status, just to clarify for you: Flight is actually not GA yet. We've got all the things working together, and the next version of Arrow will most likely include the GA version of Flight, but we already have bindings for C++, Python, and Java. We're gonna work on other languages after that; those are sort of the first three languages we usually start with in the Arrow project. As you saw, it's between a 20 and 50x performance improvement. That's pretty nice. We've actually tested performance extensively with Arrow Flight, that was part of building it up, and we can do 20 gigabits per second on a single core in a single stream scenario when we're using an average sized batch of data. So, pretty awesome there.

We'll focus on getting the GAs out, and then we're actually working with some community members who are working on a Kerberos integration for Arrow Flight, more language bindings, and then ultimately support for upgrading the connection. So, we're doing TCP/IP here, but if you happen to have a better substrate, then Arrow Flight should be able to upgrade to that. So, I can do my negotiation over TCP/IP, but then, hey, you know what, I'm gonna use this RDMA channel for something else. Right?
To help things out.

And so, when you think about it, this is kinda how we think about data access. Right? You've got your sources of data. You could go straight at the sources, and there's tools to do that, but if you do that you're probably gonna put the logic of doing that, and the logic of cleansing, in a place that's not easily shareable. So build that inside of Dremio instead, and then Dremio, well, as soon as Flight is GA'd in the community, will expose an Arrow Flight endpoint so that you can get that data frame back in next to no time.

So that, I think, is the set of things I have. I'd love for you to try out our product. It's free to download and try out. And also, join the Arrow community and chat with us, ask questions, whatever. We're happy to talk with you. That's what I got. Thanks, guys.