Let’s quickly go over the Apache Arrow ecosystem to appreciate the impact of this new announcement.
The Apache Arrow Advantage
Columnar formats are a proven tool in improving the ability to analyze data for analytics. The columnar approach has long been used in big data technologies like the Apache Parquet file format where users have been taking advantage of the benefits of columnar layouts of data, not to mention the traditional data warehouse world with systems like Teradata, Snowflake, and Redshift all relying on columnar data structures in storing data.
There’s also been an effort to take advantage of these same columnar benefits in how we represent data in memory. Most systems in the past have approached this by implementing their own custom memory data format.
Apache Arrow brings us an industry-standard columnar in-memory data format to improve the speed of analytics in-memory. Arrow is supported by many libraries across all of today’s most popular programming languages that leverage the Arrow format to realize these benefits.
Apache Arrow Flight
As mentioned above, many systems implemented their own in-memory data format, and while it provided users many of the benefits of columnar memory formats, every system having their own custom format led to a lack of standardization in the industry. This had a downstream effect of the continued dominance of JDBC and ODBC standards for interacting with data systems. These two standards worked well for the world of data they were created in, 25 and 30 years ago respectively, but their cracks have shown in the big data world.
JDBC and ODBC require converting/serializing from the system-specific columnar format to the JDBC/ODBC row format and then back again to the client’s columnar format, which can slow down the movement of data 60-90%.
With a standard columnar memory format that the source and destination both use, we eliminate the need to serialize and deserialize the data, resulting in the elimination of that 60-90% overhead in data transfers.
Apache Arrow Flight provides a new protocol that can take full advantage of the Arrow format to greatly accelerate data transfers.
Arrow Flight can also take advantage of multi-node and multi-core architectures, so the ability to optimize throughput is almost endless via full parallelization.
But even when you compare the performance of just serial Arrow Flight to ODBC, Arrow Flight still greatly outperforms ODBC. See the amount of time data transfers take based on the number of records in the chart below.
With the right format and the right connector we can accelerate data transport and processing but it has to be developer friendly. While slower, JDBC and ODBC drivers have simple APIs that are easy for developers to use.
To address this, Arrow Flight SQL has been introduced to round out the Arrow Flight project by providing a developer-friendly interface for connecting to data systems with Arrow Flight endpoints.
One additional very valuable capability that Arrow Flight SQL lays the groundwork for is a universal JDBC driver, which you can read more about here.
An Intro to Arrow Flight SQL RPC
The Arrow Flight protocol is implemented using RPCs (Remote Procedure Calls) which is a communication protocol similar to REST but instead of sending a traditional REST-like request to a remote server, interaction looks more like calling functions in a normal local application.
In the Arrow Flight RPC specification, serial data transfers will follow a pattern like the below:
The client sends a GetFlightInfo request passing a procedure they’d like to complete (e.g. add data, get metadata, run a sql query). The response will be a FlightInfo object with the details of the request including a ticket to use to receive the results of the request.
Then a DoGet request is made, passing the ticket received earlier to get the results of the initial request. During this entire process, the data is never converted to an intermediate row-based format, allowing the data to be transferred from server to client at much higher speeds.
Arrow Flight SQL enables you to execute the above request/response cycle with ease regardless of the language you use, without having to implement your own client and spend time dealing with the performance and security considerations of doing so.
While the official documentation is forthcoming, there are several samples of how Flight SQL implementations can look like in action in the Apache Arrow Repository.
The following example uses an example of a Java-based Arrow Flight SQL CLI can which can be found here. Using it, we can connect to platforms that are Arrow Flight SQL enabled.
For the purpose of demonstration, we will assume flight-sqlrepresents the command to run this tool.
The tool can be passed many option flags whenever we run a command such as:
-command: The command you want to run
-username: username for database authentication
-password: password for authentication
-host: the host of the data
-port: the port of the database
-query: query to be executed
The commands that can be passed can be found here, but include:
So if I have a Dremio instance with an Arrow Flight SQL endpoint running on port 32010 with username/password of “user”/”pass”, I could fetch a list of all the tables like so.
I then get some output like this:
Now let’s say I want to get all the data from that zips table listed above, I can execute the following:
The response appears in less than a second as I see thousands of records populate my terminal like so:
While this dataset is relatively small, even with very large result sets as we saw in the chart above, the response is very fast because after the results are gathered, they didn’t have to be converted into a row-based format at any point, allowing the data to be transferred from server to client without the delay of serialization and deserialization.
Bottom line, with the Arrow ecosystem we make data fast to transport and easy to develop with to meet today’s data demands.