Understanding Apache Arrow Flight
August 21, 2019
Arrow Flight provides a high-performance wire protocol for large-volume data transfer for analytics, designed for the needs of the modern data world including cross-platform language support, infinite parallelism, high efficiency, robust security, multi-region distribution, and efficient network utilization.
What is Apache Arrow?
Over the past few decades, databases and data analysis have changed dramatically.
- Businesses have increasingly complex requirements for analyzing and using data – and increasingly high standards for query performance.
- Memory has become inexpensive, enabling a new set of performance strategies based on in-memory analysis.
- CPUs and GPUs have increased in performance, but have also evolved to optimize processing data in parallel
- New types of databases have emerged for different use cases, each with its own way of storing and indexing data. For example, because real-world objects are easier to represent as hierarchical and nested data structures, JSON and document databases have become popular.
- New disciplines have emerged, including data engineering and data science, both with dozens of new tools to achieve specific analytical goals.
- Columnar data representations have become mainstream for analytical workloads because they provide dramatic advantages in terms of speed and efficiency.
With these trends in mind, a clear opportunity emerged for a standard in-memory representation that every engine can use; one that’s modern, and that takes advantage of all the new performance strategies that are now available; and one that makes sharing of data across platforms seamless and efficient. This is the goal of Apache Arrow.
Learn more about the origins and history of Apache Arrow
To use an analogy, consider traveling to Europe on vacation before the EU. To visit 5 countries in 7 days, you could count on the fact that you were going to spend a few hours at the border for passport control, and you were going to lose some of your money in the currency exchange. This is how working with data in-memory works without Apache Arrow: enormous inefficiencies exist to serialize and deserialize data structures, and a copy is made in the process, wasting precious memory and CPU resources.
In contrast, Apache Arrow is like visiting Europe after the EU and the Euro: you don’t have to wait at the border, and there is one type of currency used everywhere. Apache Arrow combines the benefits of columnar data structures with in-memory computing. It provides the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. And it does all of this in an open source and standardized way.
Apache Arrow Flight Overview
Interoperability is one of the main pillars of Arrow, however, its primary medium is in-memory. While most modern applications and platforms are distributed, Arrow needs a Remote Procedure Call (RPC) layer to overcome any process and networking limitations and deliver on its promise.
In the Arrow 0.14 release, Flight was introduced as a new data interoperability technology to deliver a high-performance protocol for big data transfer for analytics across different applications and platforms.
Advantages of Apache Arrow Flight
- Platform and language-independent. Out of the gate, Flight supports C++, Java, and Python, with many other languages on the way.
- Parallelism. A single data transfer can span multiple nodes, processors and systems in parallel.
- High efficiency. Flight is designed to work without any serialization or deserialization of records, and with zero memory copies, achieving over 20 Gbps per core.
- Security. Authentication and encryption are included out of the box, and additional authentication protocols encryption algorithms can be added.
- Geographic distribution. With companies and systems increasingly distributed around the globe (due to performance or data sovereignty reasons), Flight can support multi-region use cases.
- Built on open-source standards. Arrow Flight is built on open source and standards such as gRPC, Protocol Buffers and FlatBuffers.
What Makes Apache Arrow Flight Fast?
- No serialization/deserialization. The Apache Arrow memory representation is the same across all languages as well as on the wire (within Arrow Flight). As a result, the data doesn’t have to be reorganized when it crosses process boundaries.
- Bulk operations. Flight operates on record batches without having to access individual columns, records or cells. For comparison, an ODBC interface involves asking for each cell individually. Assuming 1.5 million records, each with 10 columns, that’s 15 million function calls to get this data back into, say, Python.
- Infinite parallelism. Flight is a scale-out technology, so for all practical purposes, the throughput is only limited by the capabilities of the client and server, as well as the network in between.
- Efficient network utilization. Flight uses gRPC and HTTP/2 to transfer data, providing high network utilization,
Want to learn more?
Check out these resources that will walk you through the basics and also deep technical details about Apache Arrow and Arrow Flight.