

Subsurface LIVE Winter 2021

Arrow Flight and Flight SQL: Accelerating Data Movement

Session Abstract

Data movement and access make or break the data tools we rely on. This talk will look at Apache Arrow Flight and how it plays a key role in accelerating the movement of data in modern data architectures, especially compared to other existing protocols. Additionally, this talk will look at Flight SQL, a next-generation standard for data access that utilizes tried-and-true SQL as its query language.

Presented By

Kyle Porter, CEO, Bit Quill Technologies

Kyle Porter is the CEO of Bit Quill Technologies.


Tom Fry, Senior Director of Product Management, Dremio


Webinar Transcript

Emily:

Welcome, everyone, and thank you for joining our session. My name is Emily and I’m from Dremio. I’ll be moderating this presentation today, and I’m glad to welcome Kyle Porter and Tom Fry, who will present Arrow Flight and Flight SQL: Accelerating Data Movement. If you have any questions during the presentation, please type them into the chat at any time. We will address the questions live during the Q&A session at the end of today’s [00:00:30] presentation. And now I’ll turn it over to Tom.

Tom Fry:

Great. Thanks a lot for that, Emily. So again, my name’s Tom Fry. I lead product management for Dremio Core; we call that all aspects of the core Dremio engine, our connectors, et cetera. And joining me today is Kyle Porter, CEO of Bit Quill, who we’ve partnered and worked very closely with. Today we’re really happy to talk to you about Arrow Flight, which we really see as the future direction and standard for how different systems transmit data [00:01:00] between each other. So before getting into the details, we want to start off with some of the motivation for this: why introduce a new transport protocol when we already have standards, and they’ve worked very well for quite a long period of time? Well, the reality is that the standards we have, ODBC and JDBC, are supported across the industry; pretty much all client tools work with them, and all data services support them.

They’ve been around for quite a while, and as a result they really are starting to show their [00:01:30] age in a few different areas. The first aspect is that they were developed in a different infrastructure era: networks were much slower, and processing looked very different. The hardware stack has developed quite a bit in the decades since they first came out. The other aspect is that they were really designed for smaller query results, not for data lake-scale datasets. What we mean by this is that the standards came out of a lot of collaboration [00:02:00] between the major databases of the time on how the industry could standardize getting end query results out of systems in a common format across different databases. So they are very much focused on query results and getting those out of the databases.

And again, that’s very different from what we see in a modern data lake architecture stack. As a result, the standard protocols themselves have really become a major bottleneck. There’s a significant amount of overhead [00:02:30] in the details, if you start to peel back the layers and look at what’s happening inside them, and quite a bit of serialization: serialization steps throughout the pipeline. As a result, they actually underperform the raw capabilities of the hardware that’s presented to us essentially for free today. The other aspect that has become somewhat of a difficulty is that while the protocols themselves may be standard (an application that knows how to talk to JDBC can work with any [00:03:00] JDBC driver, for example), the drivers themselves are proprietary. What we mean by this is that every data service develops its own drivers and distributes them to clients.

And what that means for client tooling is that if you’re going to work with 20 different data services, you need 20 different drivers, and you need to stay on top of updating those drivers and integrating them into your applications, et cetera. So again, the protocols themselves are standard, but the drivers are proprietary to every system, and that creates a lot of complexity. The other aspect is the [00:03:30] varying quality that comes with that: instead of just a single standard way of doing things, we have multiple different versions of things, and we see a variety of quality across them.

So what we want to talk about and introduce today is the concept of Arrow Flight, which is a modern, high-performance data transport protocol built for data lake-scale datasets; we’ll get into what we mean by that throughout this talk. Arrow Flight itself leverages Apache Arrow behind the scenes [00:04:00] for data transfer between systems. Dremio was the initial developer of Arrow and contributed the project to the Apache community several years ago.

And since then, Arrow has become a wildly popular project within the community; it’s up to over 10 million downloads a month now and is widely adopted across many different systems. For those of you who may not be as familiar with it, Arrow is essentially an in-memory representation of columnar data, in [00:04:30] a very efficient form, and it’s become a common standard for how different applications store and work with columnar data internally. And so Arrow Flight is designed not just for query results; it’s really designed for analytical workloads at data lake scale, and it’s cross-platform. As a transport layer it offers a variety of different options, and Kyle’s going to walk through some of the details. Today it supports key languages used by data scientists, including Python, [00:05:00] Java, and C++, and quite a few more languages are being added as we speak. FactSet actually had a great talk yesterday where they said they’re contributing a Go connector.

So it’s already being adopted, and more support is being added to it. The other aspect about Arrow Flight is that it’s not just an open source protocol; it’s an open driver as well. The drivers and the client libraries themselves come from the Apache Arrow community. In fact, they’re going to be available in Apache Arrow 3.0, which is just coming [00:05:30] out this week.

And what this means is that an application that gets its drivers from the Apache Arrow community can work with any data service that supports an Arrow Flight endpoint. This is very important for compatibility across systems: it makes it very easy to develop an application that can talk to dozens of different data services without having to continuously find and update all these different drivers. The other aspect about Arrow Flight is that it really realizes Dremio’s initial vision for Arrow. [00:06:00] Arrow was put out in the community as a high-performance way to think about columnar data in memory, but a big part of the motivation, in terms of the long-term story, was that if this was the right way to represent data in memory, and it was widely adopted, there could be a lot of other side benefits, particularly for transferring data between systems. If multiple systems all use the same in-memory representation of data, we can essentially enable seamless transport of data between those systems, because they’re all working on the exact [00:06:30] same structures.

And so what Arrow Flight really is, is the realization of this long-term story that we’ve had with Arrow and a lot of the adoption of it over the years. Arrow Flight itself, when we peel back the covers, is essentially a mechanism to very efficiently transfer large streams of data between a source system and a target system. As you can see here in the yellow line, the source system processes data in memory; [00:07:00] it keeps data in memory for processing, and the format it uses for that is Apache Arrow. The target system uses the same in-memory representation of data. And because of that, we can just stream data straight from the source to the target, very, very efficiently. Both the source and target can essentially DMA with the network, straight from memory onto the wire and then straight into the other processor’s memory.

And so this is a very, very efficient way to transport data that essentially eliminates the software [00:07:30] overheads and bottlenecks: it avoids those overheads, it avoids all the serialization steps, et cetera. One of the things you see here is that I’m talking about a source system and a target system. People commonly think about Arrow Flight as a way for end clients to get data from data services, but it can also be utilized for very efficient data transport between servers themselves. So if your data lake architecture has multiple different applications in the pipeline, this enables you to build essentially a complex pipeline of multiple different tools [00:08:00] with very efficient data transfer throughout the entire chain, keeping data in Arrow format from beginning to end. And it also operates at close to the line rate of modern networks.
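As a rough sketch of that source-to-target streaming with PyArrow (the target URI and dataset path here are illustrative placeholders, not from the talk):

```python
import pyarrow as pa
import pyarrow.flight as flight

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

client = flight.FlightClient("grpc://target-host:32010")  # hypothetical target
descriptor = flight.FlightDescriptor.for_path("demo", "table")

# DoPut streams record batches to the target; the data stays in Arrow
# format end to end, so neither side pays a serialization penalty.
writer, _ = client.do_put(descriptor, table.schema)
writer.write_table(table)
writer.close()
```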

What we’re going to talk about today are some of the different options for client interfaces to Arrow Flight. The first is that in release 3.0 of Apache Arrow, the community is releasing libraries for multiple programming languages: that includes Python, Java, and C++ today. FactSet, [00:08:30] as they discussed yesterday, are releasing a Go connector as well, which is fantastic. Kyle is also going to introduce what we call Flight SQL, which is in progress and will be released in some number of months. What Flight SQL provides is a JDBC-like interface to Arrow Flight: it uses JDBC semantics for the authentication, catalog, and SQL preparation steps, et cetera, but behind the scenes it uses Arrow Flight for data transport. [00:09:00] So you could think of it as an extremely fast form of JDBC.

These client interfaces and drivers are all fully open source and fully non-proprietary, both the protocols and the drivers, and available from the Apache Arrow community. Again, this means the community can really focus its efforts on a single, high-performance, best-in-class connector that works with essentially all data services that support an Arrow Flight endpoint. This can really simplify application development while providing [00:09:30] seamless high performance. And what you see here is a very simple example of a Python client, where it’s very simple to get the flight descriptor and schema for a query, and then execute an Arrow Flight retrieval of the results. Kyle will walk through a lot more of the details.
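The slide itself isn’t reproduced in this transcript, but a minimal PyArrow client along the lines Tom describes might look like the sketch below; the endpoint URI and query are hypothetical placeholders.

```python
import pyarrow.flight as flight

# Hypothetical Flight endpoint; Dremio commonly serves Flight on port 32010.
client = flight.FlightClient("grpc://dremio-host:32010")

# Describe the query and ask the server for its schema and endpoints.
descriptor = flight.FlightDescriptor.for_command(b"SELECT * FROM demo.example")
info = client.get_flight_info(descriptor)
print(info.schema)

# Stream the results back as Arrow record batches via DoGet.
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()
```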

Kyle Porter:

Cool. Thanks, Tom. Let me just get my screen set up here. [00:10:00] Perfect. So as Tom said, I’m going to dive a little bit deeper into Arrow Flight, and then talk a little bit about Flight SQL. What I’m not going to do is talk about Arrow itself; I’m hoping you have an idea of what Arrow is, but as Tom mentioned, it is an efficient way of representing data in memory, such that if multiple applications support Arrow, you essentially [00:10:30] remove the serialization and deserialization costs that are common when you’re transferring data. So, as a basic primer around Arrow Flight: what is it? Arrow Flight is a framework for fast data transfer. And when we say it’s a framework, there is no single predefined Arrow Flight server that you can just take; rather, it exposes a set of requests that are implemented by each server. [00:11:00] Fortunately, like I said, that set of requests is actually fairly small, so it’s not something large. If you’ve ever looked at the ODBC API, it has over 70 different methods and it’s quite complex; Arrow Flight’s is nice and compact.

So it exposes a number of requests. GetFlightInfo you can think of as a way to set up an access plan for data; it returns you the [00:11:30] schema for that dataset and then one or more endpoints, and I’ll talk a little more about endpoints in a second. With GetSchema we can also just get the schema for one of our datasets. ListFlights gives us the set of Flights that’s available on a server: what data can I get that is currently active? Similarly, with DoGet and DoPut we can retrieve data from, or put data onto, the server. And the interesting thing here is that Flight also [00:12:00] has the idea of actions, and an action is server-defined. So I can use ListActions to discover the set of server-defined actions, and then I can turn around and call DoAction. You’ll see how that’s actually quite important to Flight SQL as we go forward.
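For a sense of how compact that surface is, here is a hedged Python sketch exercising those requests with PyArrow; the server URI and the "refresh" action name are invented for illustration.

```python
import pyarrow.flight as flight

client = flight.FlightClient("grpc://flight-host:32010")  # hypothetical server

# ListFlights: which datasets are currently available?
for info in client.list_flights():
    print(info.descriptor, info.total_records)

# ListActions: which server-defined actions exist?
for action in client.list_actions():
    print(action.type, action.description)

# DoAction: invoke one of them ("refresh" is purely illustrative).
for result in client.do_action(flight.Action("refresh", b"")):
    print(result.body.to_pybytes())
```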

So this is an example of what an Arrow Flight system might look like with one server and one client. [00:12:30] If I wanted to issue a query (and this is actually what’s available in Dremio right now), I would use the GetFlightInfo request and pass my query to the server. The server would then respond with a FlightInfo, which has the schema for my result set, and then a set of endpoints. An endpoint is essentially a pair of a location and a ticket (I’m simplifying), [00:13:00] and for that location and ticket I can then issue a DoGet request to start streaming the data back. And as a reminder, Arrow Flight currently leverages gRPC, so these are streams, although it’s not tied to gRPC; one of the goals is that other transport mechanisms could be implemented there as well.

One of the strengths of Arrow Flight is actually that it has parallelism built into the architecture. [00:13:30] So recall those endpoints: on a single-node system, I may get back one location and one ticket. On a multi-node system, I can get back multiple endpoints, and then as a client I can issue DoGet requests to each of those nodes in parallel. That is a huge upgrade over what we typically see when working with something like ODBC or JDBC, because those are inherently [00:14:00] serial, and for a lot of the newer systems, where scaling out is essential to their performance, ODBC and JDBC are very commonly the bottleneck. Arrow Flight has parallelism built in to work around that.
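A sketch of that fan-out in Python, assuming a coordinator that hands back one endpoint per node; the host names and query are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.flight as flight

def read_endpoint(endpoint):
    # Each endpoint names the node holding one slice of the result.
    client = flight.FlightClient(endpoint.locations[0])
    return client.do_get(endpoint.ticket).read_all()

coordinator = flight.FlightClient("grpc://coordinator:32010")  # hypothetical
info = coordinator.get_flight_info(
    flight.FlightDescriptor.for_command(b"SELECT * FROM big_table"))

# Pull every endpoint's stream in parallel, then stitch the slices together.
with ThreadPoolExecutor() as pool:
    tables = list(pool.map(read_endpoint, info.endpoints))
result = pa.concat_tables(tables)
```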

And lest you think that writing a Flight server yourself is some massive undertaking, what I also appreciate is how concisely you can do things. This example is actually taken from the Apache Arrow [00:14:30] site, and it implements only one method, but it shows how you would handle one of those requests in C++. Here it’s ListFlights, and there’s a “dot, dot, dot” where it says you should get your Flights, but you can probably imagine how you would hook this up to an existing system and just return the list. The point here is that a lot of the [00:15:00] framework and the scaffolding that we typically have to implement ourselves, if you’ve ever implemented a server, is already taken care of for us, and we can just focus on the basics, which is a huge win.
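The C++ listing from the Apache Arrow site isn’t reproduced here, but the same ListFlights-only server is about as short in Python with PyArrow; the class name and port are arbitrary.

```python
import pyarrow.flight as flight

class DemoFlightServer(flight.FlightServerBase):
    """A minimal Flight server implementing only ListFlights."""

    def list_flights(self, context, criteria):
        # ... look up your datasets in an existing system here and
        # yield a flight.FlightInfo for each one ...
        return iter([])

if __name__ == "__main__":
    # The base class supplies the gRPC scaffolding; we only fill in requests.
    DemoFlightServer("grpc://0.0.0.0:8815").serve()
```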

So that’s a little bit of what Arrow Flight is. It’s also interesting to look at what else is out there that we may already be using, and why we [00:15:30] consider Arrow Flight to be really important. So the big one is ODBC. That’s an old API; it was created back in the early nineties. ODBC itself is an API, which means that it specifies a set of methods and the behavior of those methods, but it doesn’t specify any implementation. So as Tom was saying, every ODBC driver is proprietary; they’re each implemented differently. [00:16:00] ODBC is actually a C-language API (I don’t even know how many people at this conference know C anymore), so it’s getting a little long in the tooth. The ODBC API, as I said, is large and complex; there are lots of dark corners and skeletons around. But one of the biggest drawbacks is that the way ODBC operates implies a serial method of access.

It’s also [00:16:30] inherently tied to SQL concepts, so it’s harder to write an ODBC driver for a service that is not SQL-based, whereas Arrow Flight does not tie itself to SQL. And then finally, ODBC defines the data format used to transfer data through it, which means that when I am retrieving data, I have to pay the serialization and deserialization costs. [00:17:00] I believe there was a talk by Wes and Ryan where they quoted numbers of up to 80% of your time actually being spent on serialization and deserialization. Because Arrow Flight uses Arrow as its data format, we can essentially sidestep that.

So JDBC is the other, probably most popular, API. Again, it’s a little bit older; it was released five years after ODBC. And if you’ve ever used both ODBC and JDBC [00:17:30] yourself, you’ll see that JDBC is essentially an object-oriented version of ODBC; they’re very much the same concepts. It doesn’t imply any underlying implementation, it just specifies an API, but it is serial. And JDBC is actually serial to the point where, if I’m retrieving data, I literally have to move the cursor by saying “move next.”

And then I go, “get this column, this column, this column, this column, move next,” and [00:18:00] so on. And then again, you have to pay your serialization and deserialization costs. So you might also say, “All right, well, I’m not going to use ODBC or JDBC. I’m just going to write my own, and maybe I’m going to use Thrift or gRPC.” However, just by the nature of creating something of your own, it’s going to be much more complex. You have to create the service yourself, you have to make sure that it interacts [00:18:30] properly with the applications you want, and then you’re going to be on the hook for maintaining it for however long it’s in use.

None of those are usually desirable. There’s also the straight copy: I think probably everyone on the call here has taken some CSV file or Excel spreadsheet or something of the sort, copied it over, realized it wasn’t exactly what you wanted, made some modifications, cleaned it up [00:19:00] and integrated it, got it ready to present, and then found out, “Oh, the data is actually more up-to-date where you originally copied it from.” That means you have to go back through that same process, and there’s no standardization on that either. There are other options, but these are probably some of the most common. The nice thing about Arrow Flight is that it does away with a lot of these problems: it implements most of the framework for you, it standardizes [00:19:30] the data format, it doesn’t impose any concepts on you, and it allows parallelization.

So it was also interesting to look at some benchmarks. I took ODBC because this is typically one of the most heavily used; if you’ve ever used Python, you’ve likely used something like pyodbc to hook it up to a database. This was running against Dremio, with a simple SELECT [00:20:00] * query over a Parquet dataset about 50 or 60 columns wide. When that dataset was limited to about a hundred rows, the two were fairly comparable; Flight had a little bit of an advantage, but that’s basically noise.

However, when we start scaling up the number of rows, this is where the serialization and deserialization costs really start to make themselves apparent. Here, for a hundred thousand rows, ODBC takes roughly 37 seconds [00:20:30] and Arrow Flight about five, so seven to seven and a half times faster. And if we go bigger [inaudible 00:20:42], at 10 million rows Arrow Flight has roughly a 15x advantage, which turns an analysis from “hit the go button, go get your coffee, come back, finish your coffee, see your result” [00:21:00] into “hit a button and wait a couple of seconds.”

So that’s pretty impressive. Now, I bet a lot of us are also used to working in Python doing data science, and the benchmarks above were straight-up retrieval of data; they didn’t work with the data after the fact. So what happens when I want to use Python to do a little bit of analysis and pull that data through [00:21:30] the ODBC driver? This is where the advantage really starts to make itself apparent. I think this is almost a 20x speedup comparing pyodbc and Turbodbc against Flight used directly in Python: 452 seconds versus 24 seconds to pull half a million rows of that same 50-column dataset. This is really, I think, game-changing in terms of performance. [00:22:00] And from here, adding more parallelism will just make this faster.
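A sketch of how such a comparison could be reproduced; the DSN, endpoint, and query are placeholders, and the timings will of course vary by environment.

```python
import time

import pyarrow.flight as flight
import pyodbc

QUERY = "SELECT * FROM demo.wide_table"  # hypothetical 50-column dataset

start = time.perf_counter()
rows = pyodbc.connect("DSN=Dremio").cursor().execute(QUERY).fetchall()
print(f"ODBC:   {time.perf_counter() - start:.1f}s, {len(rows)} rows")

start = time.perf_counter()
client = flight.FlightClient("grpc://dremio-host:32010")
info = client.get_flight_info(flight.FlightDescriptor.for_command(QUERY.encode()))
frame = client.do_get(info.endpoints[0].ticket).read_all().to_pandas()
print(f"Flight: {time.perf_counter() - start:.1f}s, {len(frame)} rows")
```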

So hopefully you’ve looked at this, you’ve seen the benchmarks, and you’re like, “Great, Arrow Flight, best thing since sliced bread. Let’s go use it.” Maybe you’ve already implemented a service, and now you want to implement a generic service that knows how to talk to SQL-enabled data sources. That’s what Flight SQL is for. [00:22:30] Flight SQL is a proposal to add a standard way of accessing data with SQL-like semantics.

This would introduce concepts like database metadata (aka what can my database do, what does it support?); catalog metadata for discovery, exposing concepts like tables and schemas; prepared statements and parameters, [00:23:00] so if you want to re-issue a query with multiple different sets of parameters you can; and a differentiation between reads and writes. You’ll notice a lot of these concepts are borrowed from ODBC and JDBC; those are the tried-and-true concepts that most people are familiar with if you’ve worked in this space. But the nice thing is that we’re not giving up any of the speed advantage that Arrow Flight gives us; we’re just layering those concepts over top of it. [00:23:30] So you may ask, why Flight SQL?

So compare against ODBC and JDBC: those provide a standard abstraction. If I’m an analytics tool, I can write against the ODBC API and then interact with multiple different data sources, provided there’s an ODBC driver for them. And the idea is that I don’t have to change anything in the application, even though I’m talking to different databases.

[00:24:00] Well, Flight is open-ended. It has a set of requests, but it doesn’t specify the behavior of those requests. So imagine you’re going to write an adapter for Arrow Flight. Initiating a connection is fairly standard, but then I want to expose what’s in that service to my users, so I’ll probably want to do a listing, and depending on how the service is [00:24:30] implemented, it’s not necessarily obvious how: maybe some services use an action, maybe some use ListFlights. So it’s not super clear how to do that in a standard way. Similarly, what if I want to issue a SQL query? Some services may use a custom action; some may just use GetFlightInfo instead. So again, it’s not standard.

And that’s really the motivation behind Flight SQL: to provide a standard for accessing SQL-enabled sources. [00:25:00] Getting a little bit more into how it works, the proposal is to add a wrapper over a Flight client, which would be what you use in your application. It would expose a few more Flight actions: we can get the SQL capabilities of our server; we can do discovery, so catalogs, schemas, and tables, similar to ODBC and JDBC; there’d be an action for creating a prepared statement; [00:25:30] we’d be able to supply parameter values via DoPut, which is a standard Flight request; and then finally, we can clean up our prepared statements. I’ll show you a little bit of code which demonstrates this, as well as how it really maps to some of the underlying Flight requests.

So, similarly, on the server side there would be a new set of hooks. So again: [00:26:00] SQL info, so what does my server support; discovery; and the Flight info specific to a statement. Many of these have a little superscript one there; they also have a prepared statement equivalent, so whether I’m getting the information for a statement or for a prepared statement.

We can get streams, so get the actual data for a query. We can accept a put, which would typically be for a parameter, [00:26:30] and then again the prepared statement equivalents, plus closing a prepared statement. And again, this is a proposal; we’re working with the community on this, and so the details may change as it goes forward and evolves. So here’s a little example of how you might use it from the client’s perspective. The proposal introduces a little bit of helper code (this is Java): a Flight [00:27:00] SQL client utilities class. Again, that may be subject to change, but if I want to list tables, it’s a one-liner. Provided I already have a Flight client created, I simply pass that in, and I pass (currently these are null) the filters for my catalog, schema, or table. I say what kinds of tables I want to get back; I could include system tables and such. So you specify the table types and you get back the list of results. [00:27:30] So pretty simple, pretty straightforward.

If I want to do a prepared statement, it’s also fairly straightforward. I would call getPreparedStatement, give it my client, give it the query that I want, and I get back a prepared statement. I can see the schema from that, and I can execute it. If I wanted to, I could push some parameter values to the server there. And then, just as if I were using a normal Arrow Flight server, [00:28:00] I could get the streams for each of the endpoints. The point here is that these actions have been standardized, so provided you’re interacting with a Flight SQL service, they will all be implemented in a standard way.

So here’s an actual example of the exchange, with the server at the top. I may issue a GetTables [00:28:30] request in Flight SQL, which would be translated into a GetTables action, in which case I get back my tables result. I may issue a GetSqlInfo (what does my service support?), which is again translated into an action, a GetSqlInfo action. ExecuteQuery would be translated into GetFlightInfo, in which case I get back my Flight information with my schema and my endpoints. And then the retrieval is done the normal way: we do a get, passing in a ticket.

[00:29:00] Again, if you didn’t want to use the client-side wrappers that simplify things, you could just directly implement the requests yourself. And that’s it. One of the exciting things I just want to say is that Arrow 3.0 got released, I believe, yesterday. If you’d like to get involved, or would like to see the examples and the benchmarks, the links are all on the right.
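As a hedged sketch of that “implement the requests yourself” path from Python: note that the actual Flight SQL proposal encodes these calls as protobuf messages, so the action name and payload below are simplified placeholders.

```python
import pyarrow.flight as flight

client = flight.FlightClient("grpc://flight-sql-host:32010")  # hypothetical

# A GetTables-style call: a server-defined action carrying the filters.
# (In the real proposal the body is a protobuf message, not a plain string.)
for result in client.do_action(flight.Action("GetTables", b"catalog=demo")):
    print(result.body.to_pybytes())

# ExecuteQuery maps onto plain GetFlightInfo plus DoGet, as in the diagram.
info = client.get_flight_info(
    flight.FlightDescriptor.for_command(b"SELECT * FROM demo.orders"))
table = client.do_get(info.endpoints[0].ticket).read_all()
```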

Emily:

Great. [00:29:30] Thanks, Kyle and Tom. We only have one minute for questions, but it looks like a lot of people are asking about what security is available. Can you elaborate a little bit there?

Kyle Porter:

So, do you want me to take this one?

Emily:

Sure, go ahead.

Kyle Porter:

Okay. So the security is currently no different from what’s enabled in Arrow Flight. With the 3.0 release, support was added for bearer tokens, and so [00:30:00] the idea of Flight SQL specifically is to layer on top of what Flight already has. And so we would look to work with the community to improve that as there are any specific asks.

Tom Fry:

Okay, just to add to that: with Dremio specifically, we support using username/password authentication, with a bearer token for follow-up requests. One of the things that we’re looking at adding, or participating in adding, to the community is support for OAuth access tokens as well, so that in the future [00:30:30] you can go through the OAuth authorization code flow and work with any identity provider. A client can essentially get an access token through the authorization code flow from an identity provider and submit that to Dremio. So that’s something we’re looking at, and we hope to upstream it to the community.
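A sketch of that username/password to bearer-token flow with PyArrow’s Flight client; the endpoint and credentials are placeholders, and this assumes a PyArrow recent enough to expose the basic-auth helper.

```python
import pyarrow.flight as flight

client = flight.FlightClient("grpc://dremio-host:32010")  # hypothetical endpoint

# Exchange username/password for a bearer token via the auth handshake.
token_pair = client.authenticate_basic_token(b"user", b"password")
options = flight.FlightCallOptions(headers=[token_pair])

# Subsequent calls present the bearer token on every request.
info = client.get_flight_info(
    flight.FlightDescriptor.for_command(b"SELECT 1"), options)
```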

Emily:

All right, we are actually at time. Before everyone leaves today, can you please go over to the Slido tab at the top of the chat and give us your review of this session. [00:31:00] Additionally, Tom and Kyle are available for questions over in the Slack community. I know there were a lot of questions that didn’t get answered, and I’m sorry about that, but please follow up over there; they’ll be able to answer your questions. Thank you to both Tom and Kyle and to everyone participating. This was a fantastic session, so I really appreciate everyone’s input. Enjoy the rest of your day. Thanks, everyone.

Tom Fry:

Thank you.

Kyle Porter:

Thanks.