Arrow Flight and Flight SQL: Accelerating Data Movement

Video Session

Data movement and access makes or breaks the data tools we rely on. This talk will look at Apache Arrow Flight and how it plays a key role in accelerating the movement of data in modern data architectures, especially compared to other existing protocols. Additionally, this talk will look at Flight SQL which is a next-generation standard for data access that utilizes the tried and true SQL as its query language.

Video Transcript

Kyle: Thanks for joining us. So today I’m going to talk about Arrow Flight and more interestingly Flights SQL. So the plan is just to give you guys a little bit of a background on both Arrow and Arrow Flight, go over a few performance comparisons and then dive into Flight SQL. So, first off, if you don’t know about Arrow Flight, it is [00:00:30] another transport protocol. So you might ask why we would want to introduce another transport protocol. And this was really targeted towards data movement. So you may be familiar with ODBC and JDBC. They’re quite old standards. They were developed like, ODBC, I think it was early nineties, back then the cloud didn’t even exist. There was no such thing as a multi core processor in for consumers. Networking was orders of magnitude smaller. [00:01:00] The data scale was much-much smaller and the standards were really designed for those.

And that is a reflected because when you run them on modern hardware than the APIs themselves become a bottleneck, there’s significant serialization de-serialization overhead. So if you’ve attended one of the ARrow talks or know much about Arrow you’ll know that serialization de-serialization can be 80% upwards of some of the costs with some of these APIs, JDBC as an API actually [00:01:30] enforces sequential access of data, right? You have to move to the next row and then go get sell then repeat. Additionally the standard APIs for data access, they are standard, but the implementations around them are all proprietary. And the implementations are of differing very quality and performance. And you don’t control that. That’s totally up to the vendor who provides them.

[00:02:00] So Arrow Flight was introduced as a modern high-performance data transfer protocol. So it actually leverages Arrow. And as part of the Arrow project, as a standard for data representation and interoperability, and it was really designed for bulk data transfer. Arrow is in memory format, meant for sharing data amongst programs, in flight extends that across machines. It supports a ton of different languages right now, Python, C++ [00:02:30] Java, Go, C Sharp and warming implemented. It’s fully open source. And as I mentioned, as part of the Arrow project, and you can actually go and you can see the source code and get hub, and it really, it delivers on the original vision of Arrow for seamless interoperability. And on the right you can see a little chart in terms of the momentum that Arrow has, as of late last year was over 15 million downloads a month.

So Arrow Flight is really targeted at sufficient [00:03:00] bowl transfer of streams across machines, where you’re avoiding a lot of the sturdy costs when you’re transferring that data. So if your system uses Arrow already and you’re using Arrow Flight to do the transfers, you can literally avoid all of that sturdy costs. Well, most of it, I should say. And the diagram at the bottom shows you a really simplistic diagram of how flight with work. There’s a source in the left [00:03:30] target on the right. So the source would say, get me the information for this flight, target would say, here’s your information. And then source would say, get and then you stream back a set of record batches, and it allows you to get close to the line rate of modern networks.

In terms of the client interfaces for Flight. As I mentioned, they’re available in a bunch of different languages. The most, really recent release or major release for Arrow was 4.0. And I believe that gearing up to do a 5.0 quite [00:04:00] soon. Flight SQL, which is the most interesting part of this talk. I think we’ll get into a little bit later, but the idea is that really we’re going to apply JDBC semantics on top of Arrow Flight, I’m allowing us to standardize some of the pieces. Again reiterating that this is all open source. It’s non-proprietary. All the clients, the protocol, the server pieces, everything is open source and it’s compatible with all the Arrow Flight endpoints. On the right-hand side I have some pseudo [00:04:30] code in Python showing you how an Arrow Flight connection would be made. And this is if you were to use Dremio, then this is actually what’s exposed in terms of a Dremio endpoint.

So the top we can create a connection using basic auth just to use name and password. Then we can create something called a Flight descriptor and we would pass it to query. So that’s how I set up the description for a query. Then I can get my schema for that query. And then I can eventually, I can kick off the execution by calling get Flight info. And [00:05:00] then I can call my do get using the end points. And I’ll get into that in just a little bit. So I’ll just give you a really, really brief introduction to Arrow Flight basics before we get into the Flight SQL portion.

So if you don’t know what Arrow Flight is, essentially, it is a server framework transport protocol and client implementations. So I don’t know if many of you have tried to implement a client server framework before, but it’s a non-trivial task [00:05:30] and arrow does all the hard work for you. It has a server framework. It has client orientations. It has the data protocol. What’s really nice is it has built in support for parallelism. That’s one of the big strengths of Arrow. And again, I’ll show you in a second, what that looks like in practice. And it also provides an efficient data format, in terms of we already use Arrow for our applications and we’re going to use Arrow record batches in terms of putting it on the wire. So you don’t need [00:06:00] to spend time trying to figure out how to most efficiently represent your data.

Arrow Flight already takes care of that for you. And what you’re using when you use Arrow flight to implement your client server implementation has already been used elsewhere. It’s not like you’re creating something new and you need to validate everything. You’re reusing software that’s already in use elsewhere. Another large benefit is the client implementations, the client libraries are standard. If you have your Arrow Flight clients in Java that will talk to any [00:06:30] Arrow Flight server, you don’t need to have your SQL server driver and your Oracle driver and your My SQL driver and they’re each different. There’s only one Arrow Flight client. The nice thing about Arrow Flight is there’s a small set of primitive actions. The AP is actually relatively simple to reason about and use. If you’ve ever looked at something like would you be seen where there’s 70 odd functions, you know that you can need to have at the spec at hand to even know what you’re really doing. [00:07:00] Arrow Flight is not like that. I think there’s seven or eight actual actions.

So what does it actually look like? This is a one node system. So if you were to use Dremio and this would be an example of what you could do, my client on the right-hand side issues a get flight info, passing the query information, Jeremy would send back the information with the schema of the results and then a set of end points. And an end point is a location of a machine and then a no pick ticket and then my client can [00:07:30] turn around and for each of those end points I can issue a do get passing the associated ticket and then the circuit would stream back the record batches.

So five N Nodes. What does that look like? So we’re going to omit the get flight info call here. So if you recall here, I call it get flight info. I get back a set of endpoints. So assuming we already had that, but I have multiple end points I can turn around. And for each location I can issue a do get with a ticket. And then I can retrieve that either sequentially or in parallel and that would be up to you. [00:08:00] The benefit here obviously is that this can be much more performant than a serial system. JDBC or ODBC would enforce even if they have an MPV system or a cluster, then it would force you to send it to a single machine and stream it through a single end point. And this can be orders of magnitude. I think Ryan Murray whose given the talk I think earlier today, he had actually done a simulation and a large amount of data which took days to transfer in terms [00:08:30] of one node was down two hours, if not an hour with a multi note system.

That’s really, I just wanted to give you the highest level overview of Flight itself. If you want more details, there’s a link, when you look at everything later, which describes the basic protocol. Then of course, you’re probably wondering why we think so much of Arrow Flight, which is really around performance. So I wanted to give you guys an idea of what we were looking at. So from a Dremio instance, [00:09:00] I picked Arrow versus ODBC because ODBC is a mature standard. It’s used in a lot of different areas and this was probably something most of you would be familiar with. So we’re just using a One node Dremio instance. It was just doing a simple select star of a parquet data set in Dremio. And in case you didn’t know that was done to try and minimize any processing on the of sideshow.

We’re really just trying to test the protocol. And so if I just do a selection of a hundred rows they’re roughly equivalent in my setup [00:09:30] and it was a 0.35 of a second versus 0.34 of a second. And this was done 10 iterations throwing out the outliers and taking the average. So one small sets, Arrows got a slight edge, but it’s within noise. We scale up to something like a 100,000 rows. You can start to see where some of the difference comes in. So Arrow Flight here is about 5.158 seconds or 5 and change. What do we see is 37 seconds in change? So Flight hears about seven times [00:10:00] faster than what you’d see, even for a relatively small amount of data. If we go up to 10 million rows of data, Flight is about 234 seconds and change. And ODBC is 340,054 seconds and change. So we’re talking about the difference of just going and make yourself a cup of coffee versus maybe taking a nap, if not more. Which can make a real difference if you’re moving any sort of amount of data around

[00:10:30] So a different comparison which might be more applicable if we use machine learning or using Jupiter notebooks with Python is a comparison of about half a million rows. A lot of the time when we use Python and we’re interacting with other databases, you’ll actually wrap ODBC under the hood using something like pi ODBC or triple ODBC. So chose those two and then just use the native Python Arrow flight client. So pi ODBC comes out about 452 seconds and change to triple ODBC is 265 seconds [00:11:00] and Arrow Flights is 24 seconds and change. Again there’s a massive difference here in terms of efficiency and performance.

So hopefully you all have seen kind of why we think Arrow Flight is cool with the parallelism, the avoidance of the sturdy costs and just the straight up performance that you get from it. So now let’s talk about Arrow Flight SQL, which is the newest portion of Arrow Flight that we’re working on. [00:11:30] So what is it? So Flight SQL is a proposal to add a standard way of accessing database SQL like semantics to Flight. So it exposes things like database metadata, catalog metadata, query execution, prepares statements and parameters to Arrow Flight. So it is tried and true concepts from ODBC and JDBC. Layers on top of Arrow Flight while maintaining the speed of Arrow Flight. You can still go back and you can get the benefits of [00:12:00] or you can use Arrow Flight like Flight SQL manually. But we’re providing a set of libraries on top and I’ll show you what that means in a sec.

So plain SQL is a way of standardizing some of these things. So why do we want to standardize this stuff? So today, a lot of applications use ODBC or JDBC as a standard abstraction for data access. So you get a write once use many paradigm. So I can write [00:12:30] an implementation that uses the JDBC driver and then I can run my application against SQL server or Oracle or Dremio, without having to change much of my code if any. I put an asterisk there because even though the APIs may be standard, there’s still SQL to deal with. So ODBC and JDBC and most of the others say that SQL is the way that you’re going to interact with these sources and each source tends to have its own dialect of SQL, but the APIs themselves are the same. [00:13:00] So there’s also the need, as I mentioned before that you need to bundle each different driver along with your application. So you need to have the Microsoft driver, you need to have the Oracle driver, you need to have the Postgres driver and so on.

So flight on the other hand is open-ended and it allows any type of server to be built. So whenever I want to have my application interact with an Arrow Flight server, I need custom application logic. So ODBC and JDBC I have standard [00:13:30] application logic or Flight I have custom application logic. So even though Arrow Flight has the same set of primitives like do get, do put and so on, the meanings can be different per server implementation and Arrow Flight doesn’t say really anything about a query language. So how do I interact with my data behind the scenes. Again that would be on a per server basis. So really what we want is that same, right? Once use many benefit, that ODBC JDBC and the other standards get us [00:14:00] while using Arrow Flight.

So we put more concretely, so imagine you think Arrow Flight’s cool and you’re like, “Hi, I want to build an adapter for my application.” So you can initiate a connection using Arrow Flight pretty easily. The next thing you probably want to do is list what’s there. So how do I list the tables or wheels or whatever else is there. There’s no standard way to do that. Okay. How do I determine the server capabilities to even say what I can do? There’s [00:14:30] no standard way to do that either. Flight has the idea of actions but the actions are server dependent. Maybe I can use something with the information schema, so I can issue a SQL query.

Again there’s no standard way in Arrow Flight to do that right now. So Flight SQL aims to take all of these and standardized them, such that, when I’m writing it, there’s only one way to issue a SQL query. There’s only one way to list a metadata and I could turn around and [00:15:00] I could build other APIs on top of Flights SQL. So I could actually go and I could build the JDBC driver on top of Flight SQL If I wanted to. So it’s really meant to be like a foundational API.

So again, the goal is to implement a Flight SQL connector once, and have it be able to connect and communicate with all Flight SQL enabled servers. And again, that would mean I only need one client and it really aims to book client connectivity into the community hands. So you don’t have to worry about out of date drivers, or if there’s a bug in a driver, [00:15:30] then you can take action and you can fix it. Or if you want a specific feature, then you can work with the community, get it implemented. But it is not attempting to standardize the query language. So it says that we’re going to use SQL, but it doesn’t attempt to standardize the dialect. So you will have to still deal with pl SQL versus t SQL. If those are what’s in use, but it will do more to describe what’s capable.

So how was it doing this? So it’s really [00:16:00] aiming to reuse the existing flight primitive, such a schemas actions do get, do put. It’s just layering on semantics on top of that. So all the data will be returned via Arrow record batches. So the same method of retrieval that you use with Arrow Flight, you will use with Arrow Flight SQL. And it will add rapper implementation around a Flight client, which has a standard way of getting catalog metadata, asking the database for what it supports in terms of metadata query and update execution, [00:16:30] prepare statements and parameters and so on.

So as an example, here’s a bit of Java code, which shows Hollywood request some catalog metadata. So the top, assuming I have an established Flight client, then I would create a new Flight SQL client by passing in the Flight client. And then I would there on use the SQL client, the Flight SQL client. So there is something called This Tables. So I can just call This Tables and the [00:17:00] pseudo code here passes null for everything if false, but the nulls all represent filters. So I could filter by catalog, by schema, by table, by type. And the last one says whether or not I want the table schema to come back with each of the tables.

So I call This Tables, I get back at Flight info. And then I can just call SQL client I get stream and I can pass at the end point that I specifically want along with the ticket. And then it opens up a normal Arrow Flight stream. And I would consume that [00:17:30] stream as normal. Here it would actually return a result set with the catalog schema table, table name, table type. And if I’ve requested at the table schema. So if I want to retrieve the database metadata, again I open up a Flight SQL client before I call get SQL info and I can actually pass it the bits of information that I want to retrieve. So in this case, I’m retrieving the server name and the version. I also just pass no, if I want all the metadata back from the [00:18:00] server. And again, this actually comes back as an Arrow record batch, so stream. Just this one has a schema of info value and then a union, depending on the type of value coming back.

Want one execute query. Hopefully this looks very similar. I have a Flight SQL client, I just call execute and I pass it my query. So I can get the schema back. I could do that in any of the other ones. I just didn’t show that explicitly, but I can get the schema for the [00:18:30] query that issued. And then again, I can consume as normal. So it wouldn’t matter what type of server I’m talking to as long as it had play SQL enabled, I could execute a query, I could get metadata in a standard fashion from the client.

So here’s a little bit more of a detailed example of what’s going on under the hood. So on the left-hand side I have the client, on the right side inside I have my server. On left-hand side I have the flight SQL primitives, on the right- [00:19:00] hand side I have the Arrow Flight predator. So when I issue a get tables call from Flight SQL, it’s actually translated into a get Flight info call with a command of get tables. This server sends me back my Flight info. I call it do get, I get back my record batches. If I want to do an execute on Flight SQL I just call dot execute. That’s translated again into a get flight info with a command of whatever query I passed and I get back my flight info. And again, I call do get and got back my results. [00:19:30] So it’s really layering these other higher level commands on top of arrow flight.

So I wanted to show just a little bit of a demo. So this is in progress. Implementation is in progress right now. Around this, so I have some low level of code here in terms of some tests, but really what’s going on here is that I have a local flight server set up, which is connecting locally. And what I wanted to show is, this is connecting [00:20:00] to my local host. I’m creating a Flight client as normal, and then I’m just wrapping it in the Flight SQL client. So here’s an example of the get tables where I’m passing nothing in and false, and they would expect back the schema catalog, schema table, table type, table schema. So I should run that. This’ll take a second.

While, they’re just doing that. I also have a subsequent test where [00:20:30] I’m actually going to be retrieving results, but I’m going to filter it out to just the table type. So I’ll just wait for this one to finish here. The first one usually takes a few more seconds. Yeah. So we can see that one successful. So in the second one, as I was saying, I’m going to issue get tables again. It’s just this time I’m going to say, I only want tables that are up type table. So I don’t want views, I don’t want system tables or anything. And I expect one rollback, which is just going to have the schema app, the name of table [00:21:00] and a type of table. That’s fine.

Take your full time. There we go. And I probably should have mentioned that this was running against an example, play server, play SQL server, which is backed by a Derby database [00:21:30] just because it was simple. And that will be shipped on as part of the Flight secret lesson example. I’ll skip down here just in terms of time. But if I wanted to prepare all statements, then I would call prepare. I would pass it to my query. I’m executing my prepared statement right away and I’m getting a stream and I’m expecting the table. The schema of that comes back and my table is just two values. You name the value of our triangle. And then I’m getting three rows back with values [00:22:00] of one in one, zero in zero and negative one in the negative one. Okay.

There we go. [00:22:30] So I wanted to show you that to show that it actually does work. And this is an implementation in terms of current status for Flight SQL, the design initial design is being finalized. We’re working on modification with the community. There was a partial implementation of the Flight SQL plant and that’s ongoing. Things that aren’t there yet are things like parameters, but that’s going to come soon. There is a partial implementation of [00:23:00] an example, Derby based Flight SQL server. Derby was chosen because it’s relatively simple and self-contained, and that allows us to easily write tests against it.

So currently we’re working on finalizing the implementation and then also flushing out testing and Polish. Additionally, as I mentioned, you can build higher level APIs on top of Flight SQL and we’re currently working on a JDBC implementation as well. That would be open-sourced. In terms of the future, [00:23:30] we would ideally like to work with the community to get C++, Python, Go and et cetera implementations created. And then this is the view one of Flight SQL. Additional releases would opt for things like output parameters, transactions, additional meditating and whatever else the community decided that we wanted to go after. So thank you for attending the talk. Here’s some helpful links on the right-hand side. There’s the link to the benchmarks, the link to [00:24:00] the actual PR for the Flight SQL proposal and examples where if you want to use Arrow Flight off under the hood. Thanks, David.

David: Great. All right. Thanks a lot, Kyle. All right. So let’s go ahead and open up for Q & A. Remember if you have a question, you want to use the video, you can use the button in the upper right-hand corner to share your audio and video. And you’ll automatically be put into a queue. And if for some reason you have trouble with that, you can ask [00:24:30] your question in the text. Okay. So don’t see any audio-video, but we do have a couple of questions here. Let me first start with the ones that came in on the chat. You can probably put them in the Q & A tab over there, but it’s okay. I’ve thought I start off with Igor’s question here. He asks “What’s the expected performance and client CPU usage difference between Arrow Flight and Flight SQL for a very large 1 million rows result set?”

Kyle: That [00:25:00] is an excellent question. To which I wish I had a better answer. I don’t have a concrete answer. Maybe you can follow up with me in slack. I would expect to see that the CPU, so it should be significantly less when we’re using Arrow Flight, because the point about Arrow and Arrow Flight is that you’re avoiding a lot of the sturdy costs. So if you think about it, when you’re using one of the other APIs. You’ll read from your data source in one format. You’ll probably put it into some intermediary format. You’ll then serialize it into a third format to get it on the wire on [00:25:30] your client’s side, you’re going to read it off the wire. You’re likely going to put it into another intermediary format or something like ODBC or JDBC. You’re probably going to again, but we didn’t see the format that they required to get through the API.

You’re going to read it into your application, probably into some other format, and then you’re going to actually work with it. So with Arrow winner on flight, your database is probably hopefully working with Arrow already. It’s going to pass that Arrow Flight, which expects Arrow. It’s going to throw it onto the wire. You’re going to read it off the wire in Arrow format. And your application on the [00:26:00] client side is ideally aren’t using that. And if not, then it’s fairly trivial to convert into one of the applications. So you’re going to wait a lot of the surgery costs there.

David: Okay, cool. Before I get to the next questions we have both Steve and Igor. I think they wanted to see your screen again. Maybe you have some links up there. [crosstalk 00:26:18]

Kyle: I’m going to share if I did not share at the very end there. I apologize. I will just put that back up.

David: I just might’ve been too quick.

Kyle: Keep it for you guys. And again, Dave, I believe the Flights in the recording is going [00:26:30] to be shared afterwards. Correct?

David: It will be. That will be, yeah. So we have another question over here in the Q & A, “Are there plans to use QUIC instead of TCP?”

Kyle: Quick and center TCP. So, I can’t speak to the larger Arrow Flight community, but Arrow Flight was designed with the intention that you would be able to flip the underlying protocols out. So it wouldn’t be possible to work, [00:27:00] but the community to implement quick currently, that’s not one of the goals, mostly I’m involved around Flight SQL right now.

David: Okay. And it looks like, let’s see. So I think we have a comment. I’m not sure, I think this is a question from Alaura, “For the same client server shouldn’t the cost difference be negligible between Flight and FLight SQL since the data transfer [00:27:30] is the same?” I think this is the follow-up on the previous question.

Kyle: Yes. So maybe I misspoke or something. Yes, I did misspeak. So ALaura you’re correct. There is exactly zero difference or there should be zero difference in performance between Flight and Flight SQL because Flight SQL is simply a semantic layer on top of Flight. If you think about Flight versus something else, like ODBC or JDBC, that’s really what I was talking to. So I apologize for [00:28:00] the confusion.

David: Okay. I think we’re almost out of time. So Alaura is great. So I think we’re almost out of time. So let me put this link. For those of you who want to continue the conversation or have any more questions now or later, you can log into slack. I did provide a link to do that easily to join slack. Also, there’s a link I just posted. It will take you directly to Kyle-Porter channel in slack, [00:28:30] where you can continue the conversation. And let me just double check in Q & A. Okay. Yeah, I think that’s it.

So we’re out of time and out of questions. So I’d like to thank you very much for your time and presentation, Kyle. It’s a lot of detail there and a lot to learn, so thank you. And let’s see, and that’s all the time. Here we go. Let me wrap this up. So before you guys leave, if you wouldn’t mind filling out [00:29:00] the super short Slido session survey that you’ll find right before you leave and the next sessions are coming up, I think there are some panels actually, or there’s a keynote and then a panel. You should definitely check out the open architecture panel. That should be really good. And of course the keynote looks like it’d be fun to and that’s it. So thank you for everyone. Enjoy the rest of the show and thanks again Kyle.

Kyle: Bye.

David: Bye.