Get Started Free
No time limit - totally free - just the way you like it.Sign Up Now
Getting Started with Apache Arrow Flight
As mentioned above, many systems implemented their own in-memory data format, and while it provided users many of the benefits of columnar memory formats, every system having their own custom format led to a lack of standardization in the industry. This had a downstream effect of the continued dominance of JDBC and ODBC standards for interacting with data systems. These two standards worked well for the world of data they were created in, 25 and 30 years ago respectively, but their cracks have shown in the big data world.
JDBC and ODBC require converting/serializing from the system-specific columnar format to the JDBC/ODBC row format and then back again to the client’s columnar format, which can slow down the movement of data 60-90%.
With a standard columnar memory format that the source and destination both use, we eliminate the need to serialize and deserialize the data, resulting in the elimination of that 60-90% overhead in data transfers.
Apache Arrow Flight provides a new protocol that can take full advantage of the Arrow format to greatly accelerate data transfers.
Arrow Flight can also take advantage of multi-node and multi-core architectures, so the ability to optimize throughput is almost endless via full parallelization.
But even when you compare the performance of just serial Arrow Flight to ODBC, Arrow Flight still greatly outperforms ODBC. See the amount of time data transfers take based on the number of records in the chart below.
Apache Arrow Flight is a high-performance data transfer framework that utilizes the Arrow memory format to enable efficient data transfer between systems. Its design goal is to provide fast and efficient data transfer for a variety of use cases such as data processing frameworks and data storage systems. Arrow Flight uses a binary protocol and can be used over a variety of transport protocols such as gRPC, HTTP/2, and RDMA to ensure low-latency data transfer. Additionally, it provides support for data encryption, authentication, and authorization to ensure secure data transfer. The primary objective of Arrow Flight is to enable fast, efficient data transfer in big data and machine learning applications, thus allowing data engineers and scientists to work with data more easily and efficiently.
The Arrow Flight protocol is implemented using RPCs (Remote Procedure Calls) which is a communication protocol similar to REST but instead of sending a traditional REST-like request to a remote server, interaction looks more like calling functions in a normal local application.
In the Arrow Flight RPC specification, serial data transfers will follow a pattern like the below:
The client sends a GetFlightInfo request passing a procedure they’d like to complete (e.g. add data, get metadata, run a sql query). The response will be a FlightInfo object with the details of the request including a ticket to use to receive the results of the request.
Then a DoGet request is made, passing the ticket received earlier to get the results of the initial request. During this entire process, the data is never converted to an intermediate row-based format, allowing the data to be transferred from server to client at much higher speeds.
Arrow Flight SQL enables you to execute the above request/response cycle with ease regardless of the language you use, without having to implement your own client and spend time dealing with the performance and security considerations of doing so.
While the official documentation is forthcoming, there are several samples of how Flight SQL implementations can look like in action in the Apache Arrow Repository.
Apache Arrow Flight offers two distinct modes of operation: RPC and SQL. Each mode has its own unique characteristics and is designed to address specific use cases.
RPC (Remote Procedure Call) mode is a method of requesting and receiving data across a network. In the context of Arrow Flight, data is serialized using the Arrow memory format and transmitted over the network using gRPC, HTTP/2, or RDMA protocols. The data can be sent in one or multiple chunks, enabling efficient transfer of large datasets.
SQL (Structured Query Language) mode, on the other hand, is a method of querying and retrieving data stored in databases or data lakes using SQL-like syntax. In the context of Arrow Flight, SQL mode allows for querying and retrieving data stored in data storage systems such as Apache Parquet and Apache ORC, using SQL-like syntax, thus allows querying large datasets without the need to load the entire dataset into memory.
In summary, Arrow Flight RPC is a means of transferring data over a network, while Arrow Flight SQL is a means of querying and retrieving data stored in data storage systems using SQL-like syntax. Both modes utilize the Arrow memory format and can be used over gRPC, HTTP/2, and RDMA protocols, but they are tailored to address different requirements.
Apache Arrow Flight was developed to address the challenges of transferring large amounts of data efficiently and quickly in big data and machine learning applications. Its main goals are:
Getting started with Apache Arrow Flight is relatively simple. The initial step is to ensure that the necessary dependencies are installed, including the Arrow C++ libraries and the Flight C++ library. After the dependencies are in place, developers can familiarize themselves with the Flight API, which serves as the foundation for creating a Flight server and client. The API includes functions for creating and interacting with Flight service, and for sending and receiving Arrow data. To aid the development process, developers can refer to the available documentation and examples. Once the basics have been understood, developers can move on to implementing the Flight server and client in their applications. To simplify the process, it is recommended to use the official client library for the specific programming language being used, such as the Python library or Java library. These libraries provide a set of functions and classes that can be used to interact with the Flight service and send and receive Arrow data.
Apache Arrow Flight is a valuable tool that addresses the challenges of transferring large amounts of data efficiently and quickly in big data and machine learning applications. It achieves high-performance data transfer by utilizing the Arrow memory format for serialization and a binary protocol for data transfer, resulting in low-latency data transfer. Interoperability is also a key feature, as it provides a standard way to transfer data between systems and programming languages, regardless of the underlying technology. Additionally, Arrow Flight is designed to handle large amounts of data and can be used to transfer data in one or multiple chunks, enabling efficient transfer of large datasets. Furthermore, it also provides security features such as data encryption, authentication, and authorization, ensuring secure and controlled data transfer in sensitive environments.