Table of Contents
It’s Time to Replace ODBC & JDBC
ODBC and JDBC were invented in 1992 and 1997, respectively, so that all databases could expose a common API. These interfaces enable an application to connect to a database, run queries, and consume the results (both schema and data). Amazingly, it’s 27 years later, and these interfaces are still the de-facto standard way to do this.
But 1992 was a different era. Data was small, CPUs didn’t have dozens of cores, and systems were monolithic. The rise of Big Data, distributed systems and data science has created a world where ODBC and JDBC no longer provide the necessary performance and scalability.
However, we’re no longer living in a world where representatives from a few big companies can gather in a room and create a new standard by drafting an IETF RFC. Instead, standards are created de-facto through open source and bottoms-up develop adoption. Therefore, we decided that the most practical way to replace ODBC and JDBC was a two-phase approach:
- Create an industry-standard memory representation by providing a valuable columnar memory representation, including convenient libraries in a variety of languages and high-performance data processing kernels.
- Create an industry-standard RPC layer, based on the common memory representation.
Creating an Industry-Standard Memory Representation with Apache Arrow
Three years ago we co-created an open source project called Apache Arrow, seeding the project with our internal columnar memory implementation, and teaming up with Wes McKinney and the Python community. The goal of Arrow is to provide a common way to represent data in memory and process it very efficiently. There are now libraries in over 10 different programming languages, making it easy to work with Arrow. In addition, there are various execution kernels as well as a runtime code generator (Gandiva) that provide high-performance processing on Arrow. Although the details of the format and libraries are beyond the scope of this article, the following illustration highlights some of the key concepts that make up an Arrow table.
The exponential growth of Arrow can be seen in the following chart, which is the approximate number of downloads of the Python library pyarrow (4 million in the last month):
Arrow is well on its way to becoming a de-facto industry standard. This has created a once-in-a-generation opportunity to revisit data interoperability and supersede the antiquated ODBC and JDBC standards.
Introducing an Industry-Standard RPC Layer with Arrow Flight
In the Arrow 0.14 release last week we introduced a new data interoperability technology called Arrow Flight. Flight provides a high-performance wire protocol for large-volume data transfer for analytics. Flight is designed for the modern world:
- Cross-platform. Out of the gate, Flight supports C++, Java, and Python, with many other languages on the way.
- Parallelism. A single data transfer can span multiple cores, processors and systems in parallel.
- High efficiency. Flight is designed to work without any serialization or deserialization of records, and with zero memory copies, achieving over 20 gbps per core.
- Security. Authentication and encryption are included out of the box, and additional protocols/handshakes can be added.
- Geographic distribution. With companies and systems increasingly distributed around the globe (due to performance or data sovereignty reasons), Flight can support multi-region use cases.
- Built on open source standards. Arrow Flight is built on open source and standards such as gRPC, Protocol Buffers and FlatBuffers.
The Arrow Flight Protocol
A Flight is a high-throughput pipe through which data can be transferred. For example, a Python client that wants to retrieve data from a Dremio engine would establish a Flight to the Dremio engine. Each Flight is composed of one or more parallel Streams, as shown in the following diagram:
When a client wishes to establish a connection to a service, it establishes the connection via a Handshake, and then uses ListFlights or GetFlightInfo to identify a specific Flight (represented by a FlightInfo object). For example, GetFlightInfo **accepts a **FlightDescriptor object, which is either a path or command (e.g., a SQL SELECT statement):
GetFlightInfo(path.to.dataset) GetFlightInfo(SELECT name, age FROM path.to.dataset WHERE age > 18)
The returned FlightInfo includes the schema for the dataset, as well as the endpoints (each represented by a FlightEndpoint object) for the parallel Streams that compose this Flight. Note that the FlightEndpoint is composed of a location (URI identifying the hostname/port) and an opaque ticket. The following code leverages pyarrow, the Python implementation of Apache Arrow, to talk to a Flight-enabled service (Dremio in this case):
import pyarrow.flight as flt c = flt.FlightClient.connect("localhost", 47470) fd = flt.FlightDescriptor.for_command(sql) fi = c.get_flight_info(fd) ticket = fi.endpoints.ticket df = c.do_get(ticket0).read_all() // This is a single-process server, no need to look at the endpoints/locations
Message Flow: Non-Distributed Client + Distributed Server
In some cases, the client is a single multi-threaded process, such as a Python or R application, while the server is a distributed system, such as Dremio. The following diagram highlights a simple flow in which the client is asking the service to execute a SQL SELECT statement and the results are returned in parallel.
Message Flow: Distributed Client + Distributed Server
In some cases, the client is a distributed process, such as a Spark application, while the server is a distributed system, such as Dremio. The following diagram highlights a simple flow in which the client is asking the service to execute a SQL SELECT statement and the results are returned in parallel from multiple nodes to multiple nodes.
It is possible for a Flight to consist of more than one Stream (i.e., endpoint) at the same location. The ticket is responsible for identifying the data to be returned by the server process. In the following example, data is pulled in parallel from two streams on the second executor (location2).
Message Flow: Distributed Client + Geographically Distributed Server
In some cases, the same data may be available in multiple locations. This could be to enable lower latency access to the data from different places in the world, or due to a caching layer. In such cases, the response to GetFlightInfo may include endpoints that have more than one location, and the client has the right to choose which location it prefers to obtain the stream from. In the following example, the server is offering the client the opportunity to obtain the data from a region in the US and a region in Germany, and the client is choosing Germany.
What Makes Arrow Flight Fast?
- No serialization/deserialization. The Arrow memory representation is the same across all languages as well as on the wire (within Arrow Flight). As a result, the data doesn’t have to be reorganized when it crosses process boundaries.
- Bulk operations. Flight operates on record batches without having to access individual columns, records or cells. For comparison, an ODBC interface involves asking for each cell individually. Assuming 1.5 million records, each with 10 columns, that’s 15 million function calls to get this data back into, say, Python.
- Infinite parallelism. Flight is a scale-out technology, so for all practical purposes, the throughput is only limited by the capabilities of the client and server, as well as the network in between.
- Efficient network utilization. Flight uses gRPC and HTTP/2 to transfer data, providing high network utilization.
With an average size batch size (256K records), the performance of Flight exceeded 20 Gb/s for a single stream running on a single core. Because Flight can be parallelized as much as necessary, there is no practical limit on the throughput of a Flight, so benchmarking vs. non-scalable solutions like ODBC or JDBC can seem unfair! That said, even without parallelism, Arrow Flight can significantly outperform these older approaches. Here are a few examples involving a single stream (non-distributed client + non-distributed server).
When JDBC and ODBC were released, analysts and data scientists weren’t dealing with anywhere close to the volumes of data that we have today. Arrow Flight provides a high-performance wire protocol for large-volume data transfer for analytics, designed for the needs of the modern data world including cross-platform language support, infinite parallelism, high efficiency, robust security, multi-region distribution, and efficient network utilization.
Want to learn more about Apache Arrow? Check out our explainer and our history of Apache Arrow – or download Dremio to see Arrow in action.