In the current world of data science and business intelligence, each tool requires a separate driver for each database it connects to. These drivers are sometimes bundled with the tool, but more often they are separate add-ons that users must install. Installing these add-ons creates extra work for end users and IT administrators and gets in the way of simply letting users analyze their own data. Additionally, individual drivers can be hundreds of megabytes in size, so tools that support a large number of data sources can quickly balloon in size purely from bundling drivers.
With the advancements made as part of the Arrow Flight SQL initiative for Apache Arrow, this is no longer necessary. Arrow Flight SQL provides a lightning-fast protocol for sending data remotely and provides everything needed to describe the schema of a database. A database provider can expose an Arrow Flight SQL endpoint and any application written for Arrow Flight SQL will be able to connect to it.
How does this help with BI and data science tools that aren’t written for Arrow Flight SQL? A JDBC driver is being written for the Arrow Flight SQL protocol itself, rather than following the traditional approach of writing a driver for a particular database. The idea is that this is a “one-size-fits-all” driver: a user or tool vendor only needs to supply one generic driver, and it can connect to any database that exposes an Arrow Flight SQL endpoint. This is even future-proof: if a new database comes out, it can work with existing tools as long as it provides an Arrow Flight SQL endpoint. In fact, by adding an Arrow Flight SQL endpoint, a database vendor automatically enables JDBC connectivity too.
Not only will Arrow Flight SQL reduce the technical burden on applications and users, but because it builds on Arrow, it should also provide better performance. And, just like Arrow, it will be open source, so as rough edges and bugs are found, they will be fixed by an active community. Since the same driver will be exercised by a wide variety of sources, it is also more likely to be of high quality. Having a single reference JDBC driver allows any data source that adds an Arrow Flight SQL endpoint to get JDBC “for free” as an on-ramp. So the selling point is simple: add an Arrow Flight SQL endpoint to your data source and automatically get JDBC connectivity.
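To make the "one generic driver" idea concrete, here is a minimal Java sketch of what connecting through such a driver might look like. It uses only the standard java.sql API. The `jdbc:arrow-flight-sql://host:port` URL scheme is an assumption based on the driver that later shipped with Apache Arrow, and the host names, port, and helper names (`jdbcUrl`, `countRows`) are hypothetical and chosen only for illustration. Treat it as a sketch, not as the driver's definitive usage.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class FlightSqlJdbcSketch {

    // Builds a JDBC URL for a Flight SQL endpoint.
    // ASSUMPTION: the "jdbc:arrow-flight-sql://" scheme; check the driver's docs.
    public static String jdbcUrl(String host, int port) {
        return "jdbc:arrow-flight-sql://" + host + ":" + port;
    }

    // Runs a query against any Flight SQL endpoint and counts the rows returned.
    // The same code path is used no matter which database sits behind the endpoint.
    // Requires a Flight SQL JDBC driver on the classpath at runtime.
    public static int countRows(String host, int port, String user,
                                String password, String sql) throws SQLException {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);

        int rows = 0;
        try (Connection conn = DriverManager.getConnection(jdbcUrl(host, port), props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical endpoints for two unrelated databases; only the host differs.
        String[] hosts = {"warehouse.example.com", "lake.example.com"};
        for (String host : hosts) {
            System.out.println(jdbcUrl(host, 32010));
        }
    }
}
```

As written, `main` only prints the URLs, so it runs without a server or driver present. Calling `countRows` against a real endpoint would need the driver JAR on the classpath. The point of the sketch is that switching databases changes the host in the URL, not the driver or the application code.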
Example BI Tool: Tableau
Tableau is one of the most popular analytics tools on the market. It has three variants: Tableau Desktop, Tableau Server, and Tableau Online.
Tableau ships with a set of named connectors that provide connectivity between Tableau and various data sources (for example, relational, flat-file, or multi-dimensional sources).
There are over 90 named connectors in Tableau Desktop (see Figure 3) as of version 2021.2.
However, not every source has its driver included with Tableau, and those that don't require extra installation steps. This driver download page lists instructions for installing the driver for each of the 90 sources, and the instructions vary by Tableau version and operating system (Windows, Mac, and Linux). Another problem is that some sources do not provide a driver for every operating system that Tableau Desktop and Tableau Server run on.
Under the Arrow Flight SQL model, any source that provides an Arrow Flight SQL endpoint can share the same driver, and that driver would work on all operating systems that Tableau Desktop and Server can run on.
To learn more about Arrow Flight SQL, watch the Arrow Flight and Arrow Flight SQL Accelerating Data Movement video from Subsurface LIVE. You can also follow the status of the Flight SQL pull request on GitHub. To learn more about Apache Arrow and ways to contribute to the project, check out the Apache Arrow documentation.
As a co-founder of Bit Quill Technologies, James helped jump-start several projects around heterogeneous data connectivity to sources such as Hadoop, relational databases, text file formats, and web services. His primary role has been to develop a distributed database engine that supports aggregation of data across a mix of sources while leveraging the querying features available in each source.