March 1, 2023
12:55 pm - 1:25 pm PST
ADBC: Arrow Database Connectivity
The Apache Arrow ecosystem lacked standard database interfaces built around using Arrow Data, particularly for efficient fetching of large data (i.e., with minimal or no serialization and copying). Without a common API, the end result is a mix of custom protocols (e.g., BigQuery, Snowflake) and adapters (e.g., turbodbc) scattered across languages. ADBC aims to provide a minimal database client API standard, based on Arrow, for C, Go, and Java (with bindings for other languages). Applications can code to this API standard much like they would for JDBC or ODBC, but fetch result sets in the Apache Arrow format.
This talk will cover goals, use cases, and examples of using ADBC to communicate with different data APIs (such as Flight SQL or Postgres) with Arrow-native in-memory data.
Note: This transcript was created using speech recognition software. It may contain errors.
Hi everybody. So yeah, I'll be talking about Arrow Database Connectivity, ADBC. So of course the first thing that we need to get out of the way is essentially: why should you care what I have to say? I work at Voltron Data, and my day job is working on the Apache Arrow libraries. That's what I do. I also wrote the first, and I think still currently the only, book on Apache Arrow itself. So if you're curious, pick up a copy; there's a lot of great information, diagrams, and documentation links in there. But in order to understand what I'm talking about with Arrow Database Connectivity, we need to first give a quick primer on Apache Arrow itself.
So Arrow is itself a high-performance in-memory columnar format. Now, of course, that's a lot of buzzwords, but the point is that it is designed to be identical in memory as it is on the wire, no matter what programming language you're working in. The benefit there is that you can pass this data between processes, or within the same process, without paying any serialization cost; it becomes super quick to pass the data around. There are implementations for a whole number of languages: C++, Go, Python, Rust, Java, R, and even more than what's listed on this slide. The idea is that if everyone can agree on the same representation, then data can be passed around very quickly and very efficiently, without spending 80% of your CPU time just serializing and deserializing data every time it goes from one piece of your architecture to another.
Now, if you're not familiar with the term columnar or column-oriented, it's pretty straightforward. Typically, we think of structures and tables in terms of the row-oriented memory buffer you're seeing on the slide: an array of structs. A column-oriented buffer is going to be more like a struct of arrays, where you're grouping everything in the same column together in memory. Now, why would you do this? Because when you're talking about analytical workflows and computational access, you're going to get better I/O, lower memory usage, fewer page faults at the CPU level, and things fit in cache better. For example, on that table we just looked at, if you wanted to compute "get me all the archers in Europe," you don't even need some of those columns at all. In the row-oriented buffer, you still have to read in the data for every column, and then you have to jump from value to value in memory, probably causing some page faults.
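To make the struct-of-arrays idea concrete, here's a small sketch in plain Python. The table contents and field names are made up for illustration, not taken from the slide:

```python
# Row-oriented: an array of structs, where each record's fields sit together.
rows = [
    {"name": "Robin", "class": "archer", "region": "Europe"},
    {"name": "Tell", "class": "archer", "region": "Europe"},
    {"name": "Conan", "class": "barbarian", "region": "Asia"},
]

# Column-oriented: a struct of arrays, where each field's values are contiguous.
columns = {
    "name": ["Robin", "Tell", "Conan"],
    "class": ["archer", "archer", "barbarian"],
    "region": ["Europe", "Europe", "Asia"],
}

# "All the archers in Europe": scan just the two relevant columns,
# never touching the "name" values until we know which rows match.
matches = [
    i
    for i, (cls, region) in enumerate(zip(columns["class"], columns["region"]))
    if cls == "archer" and region == "Europe"
]
archers = [columns["name"][i] for i in matches]
print(archers)  # → ['Robin', 'Tell']
```

With the row layout, answering the same question would mean touching every field of every record; with the column layout, each predicate reads one contiguous list.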
And it's just not as quick as being able to read one chunk of the data, look at the first chunk to get the years, and then process the second chunk of the names. Same thing goes for vectorization, which is big in computation right now, like SIMD: single instruction, multiple data. To benefit from that, the data has to be contiguous in memory as one chunk so that you can pass it all to the processor and vectorize the computation. When you column-orient the data, it's already in that contiguous chunk that you can operate on vectorized. So, Arrow Database Connectivity. The first thing we're going to do is talk about the common workflow when you're communicating with databases. Nowadays, you're most likely going to use ODBC or JDBC, and your workflow looks something like this. First, your application submits a SQL query to some API, probably JDBC or ODBC. Then you have a backend driver, which translates that query to some database-specific protocol and sends it across the wire.
The database then executes your query, you get your result set back in the database-specific format, and the driver takes the results you got back from the database and translates them into the format that JDBC and ODBC require. Then your application can iterate over those results using a row-based interface, which, as we said: why would you do that if you're doing an analytical workflow and want columnar data? More to the point, if your database does column-oriented computation and sends that data all the way across the wire, then if you're using JDBC or ODBC, you now have to convert it into a row-based representation before the application can get it. And it's likely you might just convert it right back into a column-oriented representation, which is time wasted and memory used. Not optimal.
So there are pros and cons, though. JDBC and ODBC are ubiquitous; they're not going anywhere. If you're using JDBC or ODBC to access, say, SQL Server or what have you, it's not the easiest thing to switch away from, because that's what a lot of your clients are probably expecting. And so you end up with that conversion I just mentioned, where you're converting data from those column-oriented native databases into a row-based representation and paying that cost. Then, when you want things to go faster, you avoid that cost by instead integrating with a specific SDK for that database, so that you can get the representation you want natively and not pay for all those conversions. But now you're dealing with: I need an SDK for Postgres, I need an SDK for Google BigQuery, I need another SDK for ClickHouse. And so you end up with tons of connectors to do all the adaptation. That's where ADBC comes in.
ADBC is a single API for getting Apache Arrow-native data in and out of databases: straightforward, column-oriented, vendor-agnostic, with as minimal overhead as we can get. The idea is not to completely replace ODBC and JDBC; we're not going to, and we know that. The point is that for bulk columnar access, if you are doing analytical workflows and want large amounts of data in a column-oriented representation, ADBC is where you want to go instead of ODBC or JDBC, which would cost you having to convert things to a row-based format.
So let's take a look at what that workflow looks like if you're using ADBC. Same workflow: you submit a query from the application, and the driver translates it to a database-specific protocol and sends it across. If you're lucky and that database is using Arrow Flight for its communication, it becomes really, really easy, and the data can just pass right through when the query executes, giving you Arrow data the whole way through. No conversion necessary, no serialization, no deserialization; just quick, efficient, column-oriented data. If it's necessary, the driver will also convert the results into native Arrow format. That allows you to query row-oriented databases like Postgres while your application is coded solely against native Arrow record batches that you can iterate over and operate on efficiently.
So ADBC is a specification, along with a bunch of things that implement that specification. At its core, ADBC is a header file that defines an interface. The reason for this is that most things can interface with a C interface, and Arrow exposes a C data interface, so you can take the data that came back from the driver and, with zero copies, pass it off to the caller without paying any copying cost; you're just passing some pointers around. The intent is for ADBC to be ABI-compatible across releases. It uses free functions, enums, and other mechanisms in the way it's defined to be as compatible as possible, so that things aren't going to break on you.
And it's relatively new, so the drivers and the APIs are still under development; there's still a lot of work going on. But there are currently already C/C++, Python, Go, Java, and Ruby implementations of the interfaces, along with several drivers. So where does ADBC fit? Here's a helpful little guide. If your workflow is primarily row-oriented, maybe transactional OLTP, then you're probably going to use JDBC or ODBC with a row-based interface; it depends on how you integrate with your vendor and your database and what you're doing. If you want to do anything Arrow-native, you have your protocols, like Arrow Flight SQL, the BigQuery Storage gRPC Arrow protocols, and then for the actual API, you can utilize ADBC.
Now, if you've heard of Arrow Flight and Arrow Flight SQL, you might be wondering: why ADBC, if Flight and Flight SQL exist? It's not a replacement; it's a complement. When you're talking about any kind of complex data system, you essentially have, for the most part, three pieces: a client, a transport, and a server. ADBC handles the client side, so you can code against a single client API using Arrow-native record batches regardless of what your backend data sources are. Then you can just swap out the driver for a different one, and your application code doesn't have to change. It's the same reason why we use ODBC and JDBC, only we're getting a column-oriented representation natively. On the server side, if you're a database developer and you build an Arrow Flight SQL interface, you don't have to build a driver: Arrow Flight SQL as a protocol already has a driver for ADBC. There's even an ODBC and JDBC driver for Arrow Flight SQL. So if you're developing a database and you give it a Flight SQL interface, you've saved yourself a ton of work, because you don't need to create a specific driver for your database like you normally would have to for ODBC or JDBC.
So it's a complementary effect. Let's see what it actually looks like in practice. We'll start with the Python implementation.
Recently we released ADBC version 0.2.0, which included drivers for SQLite, Postgres, and Flight SQL. The ADBC drivers for Python include a DBAPI interface, so you can just use DBAPI 2.0, completely compatible with whatever you're using in Python for DBAPI things, and benefit from ADBC right away. If you want the data to stay in Arrow format, you can install PyArrow and use fetch_arrow_table, and now your data never gets converted into a row-oriented format; you get the columns of data the whole way through. You can see the two examples are very, very similar; the difference is just which driver you imported and your connection URI.
There's also a driver manager, so you don't have to import different drivers. If you want that lower-level (or even higher-level) interaction or bindings, or maybe you want multiple instances that interact with different databases, you can use the driver manager and create database objects that use whichever driver you want. Remember, this is a specification defined by a C interface, so as long as there exists something it can load that exposes the ADBC header's extern "C" functions, it can work. Period.
So that driver manager allows you to use multiple drivers simultaneously without different imports, and it decouples your code from the specific drivers. You don't need to import the SQLite driver module directly; you can just import the driver manager and parameterize the driver string so that you can load whatever driver you need. It also allows the drivers to become reusable. For example, the Flight SQL driver that ADBC Python uses is actually written in Go, compiled to a shared library, and distributed with the wheels via the extern "C" functions. The reason for that is it means we don't have to maintain multiple implementations of the Flight driver, and we only have to ship the one shared object library with the wheel, with almost no dependencies. Rather than having to ship the ADBC implementation and gRPC and Arrow Flight and libarrow, all of it is one almost-zero-dependency shared object shipped with the wheel for Python to use via extern "C". That way, everything's reusable.
Here's an example using Go. The Go implementation contains a wrapper that allows you, if you really want to, to use Go's standard database/sql package with any native Go ADBC driver, that is, any ADBC driver that meets the interfaces, through Go's SQL database interface. If you're able to use cgo, it also has its own driver manager, which would let you use from Go any ADBC driver written in anything that exposes the C interface; it just loads the shared object that implements the interfaces.
So, to kind of drive home the point that this is not something that's terribly difficult to understand, let's look at how the concepts match up. You've got the different concepts here for ADBC against Go's default database/sql, DBAPI for Python, the Flight SQL connection if you're using Flight SQL directly, and JDBC and ODBC, as they'd appear in code if you were writing an application that utilizes them. You can see that all the things map up pretty easily, pretty straightforwardly. The idea is to be simple, efficient, and effective.
Above and beyond just executing queries, ADBC also defines in its interface setups for bulk ingestion: you can stream Arrow record batches into your database as imports. You can handle partitioned result sets: if your database allows fetching your results from different locations to benefit from partitioning the result set, ADBC has it covered. It also provides some facilities for transactions, so you're not left out in the cold; you're not giving up any functionality of your database to use ADBC. It complements your database. So if you want more information about ADBC or PyArrow, here are some links to documentation. If you want even more in-depth material about Apache Arrow, yes, shameless plug: get my book. Lots of great information in it, and there's also a link to a GitHub repo that has all the code samples from the book. The examples in the book are in C++, Python, and Go, so no matter what your language of choice is, you're not left out in the cold with the examples.
With that, thank you very much. That's me. I'm on Twitter and LinkedIn.