ODBC and JDBC were invented in 1992 and 1997, respectively, so that all databases could expose a common API. These interfaces enable an application to connect to a database, run queries, and consume the results (both schema and data). Amazingly, it’s 27 years later, and these interfaces are still the de-facto standard way to do this.
But 1992 was a different era. Data was small, CPUs didn’t have dozens of cores, and systems were monolithic. The rise of Big Data, distributed systems and data science has created a world where ODBC and JDBC no longer provide the necessary performance and scalability.
However, we’re no longer living in a world where representatives from a few big companies can gather in a room and create a new standard by drafting an IETF RFC. Instead, standards are now established de-facto through bottom-up, open source development and adoption. Therefore, we decided that the most practical way to replace ODBC and JDBC was a two-phase approach:
Three years ago we co-created an open source project called Apache Arrow, seeding the project with our internal columnar memory implementation, and teaming up with Wes McKinney and the Python community. The goal of Arrow is to provide a common way to represent data in memory and process it very efficiently. There are now libraries in over 10 different programming languages, making it easy to work with Arrow. In addition, there are various execution kernels as well as a runtime code generator (Gandiva) that provide high-performance processing on Arrow. Although the details of the format and libraries are beyond the scope of this article, the following illustration highlights some of the key concepts that make up an Arrow table.
The exponential growth of Arrow can be seen in the following chart, which is the approximate number of downloads of the Python library pyarrow (4 million in the last month):
Arrow is well on its way to becoming a de-facto industry standard. This has created a once-in-a-generation opportunity to revisit data interoperability and supersede the antiquated ODBC and JDBC standards.
In the Arrow 0.14 release last week we introduced a new data interoperability technology called Arrow Flight. Flight provides a high-performance wire protocol for large-volume data transfer for analytics, designed for the modern world.
A Flight is a high-throughput pipe through which data can be transferred. For example, a Python client that wants to retrieve data from a Dremio engine would establish a Flight to the Dremio engine. Each Flight is composed of one or more parallel Streams, as shown in the following diagram:
When a client wishes to establish a connection to a service, it establishes the connection via a Handshake, and then uses ListFlights or GetFlightInfo to identify a specific Flight (represented by a FlightInfo object). For example, GetFlightInfo accepts a FlightDescriptor object, which is either a path or a command (e.g., a SQL SELECT statement):
GetFlightInfo(path.to.dataset)
GetFlightInfo(SELECT name, age FROM path.to.dataset WHERE age > 18)
The returned FlightInfo includes the schema for the dataset, as well as the endpoints (each represented by a FlightEndpoint object) for the parallel Streams that compose this Flight. Note that the FlightEndpoint is composed of a location (URI identifying the hostname/port) and an opaque ticket. The following code leverages pyarrow, the Python implementation of Apache Arrow, to talk to a Flight-enabled service (Dremio in this case):
import pyarrow.flight as flt

c = flt.connect("grpc://localhost:47470")
fd = flt.FlightDescriptor.for_command(sql)  # sql holds the SELECT statement
fi = c.get_flight_info(fd)
# This is a single-process server, so no need to inspect the endpoints/locations
ticket = fi.endpoints[0].ticket
df = c.do_get(ticket).read_all()
In some cases, the client is a single multi-threaded process, such as a Python or R application, while the server is a distributed system, such as Dremio. The following diagram highlights a simple flow in which the client is asking the service to execute a SQL SELECT statement and the results are returned in parallel.
In some cases, the client is a distributed process, such as a Spark application, while the server is a distributed system, such as Dremio. The following diagram highlights a simple flow in which the client is asking the service to execute a SQL SELECT statement and the results are returned in parallel from multiple nodes to multiple nodes.
It is possible for a Flight to consist of more than one Stream (i.e., endpoint) at the same location. The ticket is responsible for identifying the data to be returned by the server process. In the following example, data is pulled in parallel from two streams on the second executor (location2).
In some cases, the same data may be available in multiple locations. This could be to enable lower latency access to the data from different places in the world, or due to a caching layer. In such cases, the response to GetFlightInfo may include endpoints that have more than one location, and the client has the right to choose which location it prefers to obtain the stream from. In the following example, the server is offering the client the opportunity to obtain the data from a region in the US and a region in Germany, and the client is choosing Germany.
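The protocol leaves the choice policy to the client. One possible policy, sketched here with invented URIs and region names, is to prefer a location whose URI matches the client's region and fall back to the first advertised location otherwise:

```python
# Hypothetical list of locations advertised for one endpoint.
locations = [
    "grpc://us-east.example.com:47470",
    "grpc://eu-central.example.com:47470",
]

def choose_location(locations, preferred_region):
    """Prefer a location whose URI mentions the region; else take the first."""
    for uri in locations:
        if preferred_region in uri:
            return uri
    return locations[0]

print(choose_location(locations, "eu-central"))
```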
With an average batch size of 256K records, the performance of Flight exceeded 20 Gb/s for a single stream running on a single core. Because Flight can be parallelized as much as necessary, there is no practical limit on the throughput of a Flight, so benchmarking against non-scalable solutions like ODBC or JDBC can seem unfair! That said, even without parallelism, Arrow Flight can significantly outperform these older approaches. Here are a few examples involving a single stream (non-distributed client + non-distributed server).
When JDBC and ODBC were released, analysts and data scientists weren’t dealing with anywhere close to the volumes of data that we have today. Arrow Flight provides a high-performance wire protocol for large-volume data transfer for analytics, designed for the needs of the modern data world including cross-platform language support, infinite parallelism, high efficiency, robust security, multi-region distribution, and efficient network utilization.
Want to learn more about Apache Arrow? Check out our explainer and our history of Apache Arrow – or download Dremio to see Arrow in action.