Apache Arrow Flight SQL: High Performance, Simplicity, and Interoperability for Data Transfers

Network protocols for transferring data generally have one of two problems: they’re slow for large data transfers but have simple APIs (e.g. JDBC) or they’re fast for large data transfers but have complex APIs specific to the system. Apache Arrow Flight addresses the former by providing high performance data transfers and half of the latter by having a standard API independent of systems. However, while the Arrow Flight API is performant and an open standard, it can be more complex to use than simpler APIs like JDBC.

Arrow Flight SQL rounds out the solution, providing both great performance and a simple universal API.

In this talk, we’ll show the performance benefits of Arrow Flight, the client difference between interacting with Arrow Flight and Arrow Flight SQL, and an overview of a JDBC driver built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.

Topics Covered

Open Source

Sign up to watch all Subsurface 2023 sessions

Speaker

Dipankar Mazumdar

Developer Advocate

Transcript

Note: This transcript was created using speech recognition software. It may contain errors.

Dipankar Mazumdar:

Hi everyone and welcome to Subsurface 23. Hope you guys are having a good time here. So I’ll, I’ll quickly share my screen and start the presentation. All right. Okay. So yeah, today we are going to talk about data transfer from databases and data stores and how basically the Apache Air ecosystem and particularly how a Flight Sequel really brings in those advantages with large data transfer. Yeah. So a little bit about me. I’m currently a developer advocate here at, as part of the technical evangelism team. I kind of help people like data engineering teams and researchers to like understand about Lakehouse and various open source projects around that. I work predominantly in the area of like large scale data transfer with projects like Apache Iceberg, Apache Arrow, Nessi, you know, that helps people build modern data analytical solutions. and my background is kind of like b n machine learning and, and that’s pretty much about me.

So we’ll start with some future options. Like what, what if we could have these for our data transfer, right? What option do we have to meet those requirements? And we will do a quick intro about Apache and talk about why Arrow isn’t just enough. And then we will introduce a flight and understand how aero flight itself isn’t just enough. And finally we’ll come to our topic, which is arrow flight sql and talk about what flight SQL brings to the table and you know, various of the scenarios as well. So let’s talk about an ideal best case world, right? Where we have the following things for our data transfer. first, you know, the standardized api. So it doesn’t matter what system you’re using or what language you’re using, you have a standardized set of APIs which makes it easier to develop stuff, right?

And there is no need for different drivers right now. you know, right now if you have to connect to any system in any data store, you need to get the system driver, which means you also need to manage them. And you know, if you’re in a centralized IT team, that can be such a pain, right? So what if we could use a standardized a P I and actually have just one single driver? And obviously performance is a huge aspect, specifically when you’re dealing with a lot of data. we need responses to be pretty quick. So that’s not something you can trade off easily as well as along with the high performance, at some point you’re gonna saturate a single network connection. So we wanna do those peril data transfer, both one to many as well as many to many, right? and you optimally want this to be easy.

It’s not just the standardized E P I and all the remaining stuff, but we also want this to be easy to be implemented on both the server and the client side, which helps in adoption and ultimately makes it the standard. So if we kind of look from those requirements, why don’t O D B C and G D BBC fit in today’s kind of analytical world, right? Why don’t it fit those requirements? well the reason is that these were built back in the older time. O D BBC was built in 1992 and J D B C 97 and they’re just inherently robust cause they’re a product of their environment, right? And back then, basically most ST were robust or suddenly a vast majority of them were robust. and it did some good things, right? It prevented the you know, the many to many client to database kind of API mapping, but also it solved only half of a problem. It resulted in one to many database protocols. So for every data store that you have, each one has its own

D, c and ODBC driver. So that’s a pain. But if you look at it today, right? The world isn’t drug-based you know anymore. Now with OLA Analytics specifically with huge volume of data, you are a good column, right? And most of the tool involved in the OLA workload that at least process this non-negligible amount of data are also coer to take advantage of those same benefits. that has to be provided by the column standard, right? But there is no column that transfer standard, right? When you’re transferring that column data, there is no standard for transferring that data. and as you can see in this particular diagram here we have a column and you know, we got to convert into ROE over because we are sending it via od, BBC or gbc and then convert it back to column in the client side again.

So why do we need to do that? Like this process can take anywhere from 60 to 90 per 90% of the time of the particular transfer process, right? and why, because of the simulation and ion on the client and service side which as you can see here, is totally unnecessary. I mean, there is no reason we have to do it. So if you look at it, well column, there is really the standard now and certainly, you know, Ola analytics world, but what if you could have a column based transfer framework where you don’t need to do that pro based conversion, right? So what options do we have to actually keep things column, right? One, we can do like a custom protocol thing, right? You know, you can build a custom connector for each client and server. You can write code for each of the client server and they would need to bely implemented, right?

And that would be performing obviously, but you’ll end up with a many too many kind of mapping in that case. So if you look at kind of a comparison, right? You know that this metrics that I have presented is that we lose that standardized client client interface. we lose that standardized client database interface. we also lose the standardized survey interface in this custom implementation, but we will prevent the unnecessary serialization and this simulation, which is good. but that’s really where aero flight comes in and kind of bridges that gap, right? So if we took take a look at how we can keep this in column there, right? That’s really where aero flight comes in. And you know, so if the database already column there converting into Arrow is generally pretty quick and we can just keep it in Arrow format all the way to the client Elastic as you can see in the diagram.

So now we don’t need to pay that 60 to 90%, you know, cost with simulation and simulation resulting in much higher performance, not to mention the performance that it gained out of transferring the data in column standard, right? That’s, that’s an added advantage as well. so you get things like compression and those kind of you know, added benefits with this kind of you know, having a column standard. But before we go to the aero flight, I quickly wanted to do a recap about Apache Arrow because that’s what we are setting the base upon. So Arrow, you might have heard a lot about it in the past few, like, you know, past few years, it’s an in-memory column format and you know, it includes various libraries to work in the format. it’s really a polyglot format, right? So, you know, trail language is supported today and there are more supports happening specifically in the Go ecosystem, Python as well. And it was co-created by Dremio and we McKinney a while ago. interestingly it started as Dr. U’S internal memory format. So, and there has been just this crazy rapid adoption, you know, not to mention the capability growth and its integration with various tool. So if

You can see in the image here, you really see an exponential growth of the downloads of the Python package, of which is Piru. and this is really amazing even for my personal workloads. So also the tools that kind of support this arrow ecosystem is also skyrocketing, right? This is not even a comprehensive list, but you can see quite a few well-known projects like, you know, reo, dv, spark pandas and you know, TensorFlow and stuff. So with that said, you know, the question is isn’t arrow enough, right? You know, why do we need something else when Arrow is there? And really the key is that Arrow is primarily geared towards those operation on data in a single node, but there is no specification for like across network transfer, right? So it’s more like both for like within a cluster if you’re doing things like shuffling as well as for client server transfer.

And that’s really the problem Aero flight kind of sets out to solve, okay? that it basically is a network transfer for aero specific data. So now you have the aero format, but now you have the standard to like transfer that data over the wire. Okay? So let’s quickly go into, you know, what is aero flight? so it’s a general proposed client server framework to simplify high performance transport of basically any kind of large data over network interfaces. And it really is designed for data in the modern world, right? we went through how kind of basically we have data in the column standard and not even on the disk like, but even when it’s in memory and trusting and it has results to send back to your client and it’s often coer, right? And, and at a minimum you can take advantage of the column benefits of the data transfer.

So it’s column oriented and as well as that I mentioned before on compressibility, you know, and those kind of stuff. But the arrow format is specifically designed for high compressibility and large number of rows. So these are the two things that kind of are really good for efficient data transfer and it really makes including client side in the distributing computing model easier, right? again, there has never been a silver bullet and likely will never be that you know, we can’t do everything with every solution. for example, like, you know, your database can do a lot of things, right? But it’s not still going to be able to do like a hundred percent like, you know, let’s say you have some complex machine learning, right? Your SQL database or your core business data lake is not going to be able to do that.

So what we can do is we can do some processing and we can do some filtering and pre aggregation and stuff and you can minimize the amount of data that you kind of load on the, you know, computing side. So it really brings your client into the more of the distributed competing model. And let’s say your client is a parallel spark cluster so your client is isn’t just a single machine, right? It’s a, it’s a cluster. So really makes that disability competing and using the best tool for the job much easier and better. Also it enables parallel data transfer. So that’s another important thing and interesting thing. So let’s say you wanna do machine learning using Spark, but you have your core business data stored in the data lake, right? And you access it by something like let’s say reu. Well what error flight enables is it leaves the foundation to have your smart cluster.

So let’s say a hundred node access your DR cluster and let’s say a hundred nodes, do you know, kind of they do kind of like a node to node communication. So you get that a hundred times throughput even above what it already provides serially. So that’s a great use case as well. And I mean if you see the performance results and these aren’t even the serial results, you can just see the crazy difference in general, right? typically this is great with large number of records as you can see with lower number of records. The differences aren’t that great obviously because this is the, this is for higher, higher and larger volume of data. So we can see the difference that we kind of talked about. So if we kind of take a visual representation of what that kind of looks like, this is the one-to-one, one to many parallel data transfer, okay?

Where you have client on the right hand side and it’s actually retrieving the data from a full cluster so the cluster doesn’t have to send all of the data to one node and be bottlenecked on that just for the transfer, which even serially like would still be faster with error flight. But you can actually do this parallel, right? So you can actually have all of a different network connection over here and you know, get that distributed computing model over here. So that all sounds cool. And the question is are we done? Like is aero flight, I mean a flight is basically the standard that will help us, you know, transfer that data of column data or the aero data over the wire, but is it isn’t aero flight enough? That’s the question. but if we go back again and look at this particular metrics we talked about we can see how flight standardizes the client interface, which is great.

However, the thing it doesn’t do is it doesn’t standardizes the client database access interface, okay? And we will go through that in a moment, but it also standardizes the server interface and prevents that unnecessarily serialization and DC relation which we talked about. So we are almost there, you know, we still need to standardize the client database access interface but you know, we are almost there. So that’s, that’s, that’s a good start. So with error flight, the way it is built is that it is designed for this arbitrary bites, you know, so the server is going to send the results and it’s made to be dynamic so it can send anything, right? Because who know what you might be sending, right? And you want to be able to support that so that the by stream really only has meaning for a particular server that chosen to support it, right?

So you kind of have to negotiate that outside of a flight and you can see basically the bite and how it works. But we have to negotiate that kind of how to interpret those bite on the client side. So for example, if you send something to dremeo, like we are going to interpret that bite stream to be U T F eight SQL query stream, okay? But if you were to send that to a different system like some other like you know, client system or some, you know, some other stores, they might interpret that totally differently. So that’s really what we need to standardize, you know, that is how do you communicate specifically around databases and SQL access, which O D B C and GBC already does that today, right? So another example is like catalog information. So alright, give me a list of the tables and you know databases and all those kind of stuff.

Well that’s not part of it but you can do it but different systems are going to interpret it differently so we don’t end up with the same fragmentation we were trying to avoid it the first place and if we just take a step back, haven’t we seen those problem before with the custom protocol? And the answer is yes we have. And in fact how we have really solved us using the O D BBC and G D B C protocol, like they standardized the database activities, right? And they have this standardized a API I call for like you know, how to submit the query and get the results back and those kind of manipulations. So if we kind of take a page out of the O D BBC and J D B C book of standardizing those, a API I call, that’s exactly what a flight sequel aims to.

Okay? So now let’s talk about what aero flight SQL is. So it’s a protocol for interacting with SQL databases built on top of aero flight. Okay? So we are not replacing anything that aero flight did to be honest. We are just actually building on top of it and what we get there for is we have all the advantages that we discussed until now. So it allows databases to use error flight protocol for transfer, but it also standardizes the thing that we talked about to make it easy for client side, you know, and also not result in having so many different protocols, right? And of course it leverages the performance benefit that arrow and error flight provides for a database access. So it’s super fast. So if you look at it more concretely, what it really is is it is a set of commands to standardize a SQL interface on flight.

So for example, it standardizes things like how do you do query execution, right? How do you submit a query and get the results back And specifically you can do things like prepared statement and those kind of things. Also things like catalog we talked about before. How do we actually access the metadata on the tables, columns, et cetera and those things and more importantly the SQL syntax capabilities, right? you know, although you know, even if you have a standard interface today, even with ODBC and gbc, different databases support different subset of SQL syntax, right? and sometimes they even do extract thing on top of that. So these standards are basically provided by flights, okay? And one of the really nice thing here, and it’s the same with arrow or error flight for that matter is that it it is language and database agnostic.

So this is great in terms of getting the benefits of the api but regardless of the language you’re using, so you could use any kind of language that is Python go or US whatever it’s, so what this also means is that if you’re implementing this once on the client side, let’s say we’re accessing using Python we implement that once and you can connect to any flight SQL server that has their endpoint open. So like any server that implements the flight SQL site, now the client can talk to that particular flight SQL endpoint, okay? And clients can now interface with database implementing that protocol. So for example, if you’re using something like Taboo, normally in the taboo screen you see a list of the connectors, right? You see a lot of connectors over there. Well, with flight cycle what that list of connector becomes, it is just connected data, right?

There is no different connectors like many too many connector and there isn’t a choose a database or data store, you know, the figuring out what drivers to download and you know, even if you download something you will have to manage them over a period of time which becomes really cumbersome and you know, it it kind of really expensive. You know, it’s a technical debt as well. So if you look at this diagram, you know, this will be like, you know, let’s take an example. For example, I wanna list a table with arrow flights, right? So the client is basically gonna say hey, you know, use the get flight info, you know api and that’s basically error flight sequel. So anything you see in this particular diagram that is not bold is just error flight and all the bald bold font that you see is error flight sql.

So as you can see here, we are not replacing anything that a flight does. It just standardizes it saying hey you know, it’s a get table command and the database know, okay, I want, I know what the client wants and I want to get it back to the client as soon as I can. So now I want, if I want to query a table, you know you can use the same standard protocol you just have to use a command statement query as you can see in this diagram and you can supply all of your parameters and configuration details and it’ll get the data for you on the client side. So here’s a list of the flight sequel command that I thought might be just to give an idea of like what things you can do, like things like manager request and even things like primary key, getting the primary key for database and you know, the query side command and prepare statement and those kind of stuff.

So these are a couple of the, you know, flight sequel command that already are there as of now. Okay, so here we are again, right? Are we, are we done yet? Right? And and as you can see in this particular metrics, like the main four requirements that we talked about are all checked with error flight, right? And the answer is mostly we are done, right? And the only reason I kind of say mostly and not a definitive yes is because we basically want to make it easier to adopt flight SQL today, right? We want to provide all the benefits of flight SQL as much as we can, but not everything support flight cycle today, right? Again, it’s a fairly new thing and specifically in the BI world, right? It’s going to take some time to widely adopt and even with machine language libraries. so it’s kind of new to them as well, right?

You know, this is a new interface all this while they were depending on O D BBC and J D BBC and that’s what they were like, you know, too. So that’s really where aero flight SQL L od BBC and J D BBC driver come into play. So that’s another thing that, so basically these are the drivers that are built using the air flight SQL protocol, but these are like the OD BBC and GBC drivers. So let’s talk a bit about that. So they are OD B C they’re basically O D B C and G D BBC drivers built on top of flight SQL libraries and you know that that’s kind of like a single driver and, and that well basically simplifies everything. You don’t have to implement the code again, all those kind of things, okay? And that bridges the gap between ODBC and GBC interface.

And the nice thing here is that the drivers will work with any ODBC or GBC client without really having to make that code change, right? You know, so that’s value for BI tool but it’s also true for any application that you might be writing. For example, in trope notebook you might be using Python or something to do that. So you don’t have to change a code for that, you know, it’s already there with this particular drivers and both of these drivers are completely made open source, you know, which is fairly rare like for O D B C and G C driver. So we developed both of these things here at Dr. U and we have our open source stream and we contributed to the, our open source project. So let’s take a look at the traditional O d, BBC and JDBC kind of thing and error flights from two perspective driver management and performance.

So for the driver management for the traditional OD BBC and gbc, you just need to in, you basically need to install and manage a driver for each database with traditional ones. However, with five sql what that happens is you just need to install and manage one single driver and that’s all, that’s all you need to do In terms of the performance, obviously traditional is like robes transfer mostly. and even it’s not even IF’S od bbc, it’s sometime like you have to convert it on the client side. So it is fine for a small amount of data, but when it’s like larger volume of data, you might want to look for something like flight cycle which has this column of benefits and everything. Okay, so should you use this new driver every single time, right? That’s the question. Like why would I use my flight cycle driver?

Why not? I’m already into ODBC and JDBC and for my access and everything, why should I use this? Well it’s generally preferable to have a native flight cycle application and actually use a native interface here A because it’s easier to harness those feature features like multiple endpoints and parallelism and those kinda stuff. But it also be gives better performance that we talked about. So if we kind of look at this diagram quickly, like I, you know, have a couple less minutes here, but if you see in this first example the database is column based, right? But the client is row based, okay? So in this case maybe it makes sense to use the old ODBC and GBC driver, you know, and because like you know there might be things like conversion and stuff so that that can be on one scenario but if the client is column there, well in the second case with O D C and G C, we are converting it from ER over the wire to just row, right?

And then you have to go back to ER again because the client is coer, right? So this is again a unnecessary simulation thing that we kind of avoiding in the first place. So that’s where flight sql, ODBC and GBC driver shines. So if you look at this particular stack, generally the way you think about is if you have a legacy client such as Old BI tools that just needs to get some data or is row based, that’s where you’re going to use this new OBC and G C drivers server flight. However, if you have a SQL native line, that’s where you want to do something like flight sql. So if you’re running something like SQL from a Python code, there is native flying for five sql. So you might want to use that. And if you’re not doing anything with sql, like you’re not connecting to any database or stuff, you can just use like error flight natively as protocol.

So to end this particular presentation, I want you to take this mental picture cuz it might be a bit confusing like with Arrow and error Flight and error flight sequel. Basically error flight is the column in memory format that basically, you know, speeds up everything and you know, helps you in memory transfers. But Arrow Flight is basically the client server framework that helps to transfer that Arrow data that that column on the firm. And data and aero flight sequel is a protocol, it’s kind of a protocol to interact with the databases, right? You still need to cont contact with the databases. So that’s kind of a full stack solution. So that is pretty much all, and you know, I, you know, you can reach out to me even externally and after session, but you know, I’m open to any questions that you might have. Thanks so much.