Dremio Blog

5 minute read · August 21, 2019

Understanding Apache Arrow Flight

Lucio Daza Director of Technical Marketing, Dremio

Arrow Flight provides a high-performance wire protocol for large-volume data transfer in analytics. It is designed for the needs of the modern data world: cross-platform and cross-language support, scale-out parallelism, high efficiency, robust security, multi-region distribution, and efficient network utilization.

What is Apache Arrow?

Over the past few decades, databases and data analysis have changed dramatically.

Businesses have increasingly complex requirements for analyzing and using data – and increasingly high standards for query performance.
Memory has become inexpensive, enabling a new set of performance strategies based on in-memory analysis.
CPUs and GPUs have increased in performance, but have also evolved to optimize processing data in parallel.
New types of databases have emerged for different use cases, each with its own way of storing and indexing data. For example, because real-world objects are easier to represent as hierarchical and nested data structures, JSON and document databases have become popular.
New disciplines have emerged, including data engineering and data science, both with dozens of new tools to achieve specific analytical goals.
Columnar data representations have become mainstream for analytical workloads because they provide dramatic advantages in terms of speed and efficiency.

With these trends in mind, a clear opportunity emerged for a standard in-memory representation that every engine can use; one that’s modern, and that takes advantage of all the new performance strategies that are now available; and one that makes sharing of data across platforms seamless and efficient. This is the goal of Apache Arrow.

To use an analogy, consider traveling to Europe on vacation before the EU. To visit 5 countries in 7 days, you could count on spending a few hours at each border for passport control and losing some of your money to currency exchange. This is what working with in-memory data is like without Apache Arrow: systems spend enormous effort serializing and deserializing data structures, and a copy is made in the process, wasting precious memory and CPU resources.

In contrast, Apache Arrow is like visiting Europe after the EU and the Euro: you don’t have to wait at the border, and there is one type of currency used everywhere. Apache Arrow combines the benefits of columnar data structures with in-memory computing. It provides the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. And it does all of this in an open source and standardized way.
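
The difference between the two halves of the analogy can be sketched in a few lines of Python. This is only an illustration using the standard library, not Arrow's API: the `prices` column and the JSON round trip are assumptions chosen to stand in for "serialize, copy, deserialize", and `memoryview` stands in for two components agreeing on one in-memory layout.

```python
import json
from array import array

# A "column" of prices stored contiguously, the way a columnar format lays out values.
prices = array("d", [9.99, 14.50, 3.25, 20.00])

# Approach 1: serialize, then deserialize (the border crossing).
# The consumer ends up holding a brand-new copy of the data.
wire_text = json.dumps(prices.tolist())
copied = array("d", json.loads(wire_text))

# Approach 2: both sides agree on the in-memory layout (the single currency).
# memoryview exposes the same buffer without copying it.
shared = memoryview(prices)

# A change to the original is visible through the shared view, but not in the copy.
prices[0] = 1.00

print(shared[0])
print(copied[0])
```

The shared view reflects the update because it points at the same bytes, while the round-tripped copy still holds the old value and cost an extra allocation plus encode and decode work.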

Apache Arrow Flight Overview

Interoperability is one of the main pillars of Arrow; however, its primary medium is in-memory. Because most modern applications and platforms are distributed, Arrow needs a Remote Procedure Call (RPC) layer to cross process and network boundaries and deliver on its promise.

In the Arrow 0.14 release, Flight was introduced as a new data interoperability technology to deliver a high-performance protocol for big data transfer for analytics across different applications and platforms.

Advantages of Apache Arrow Flight

Platform and language-independent. Out of the gate, Flight supports C++, Java, and Python, with many other languages on the way.
Parallelism. A single data transfer can span multiple nodes, processors and systems in parallel.
High efficiency. Flight is designed to work without any serialization or deserialization of records, and with zero memory copies, achieving over 20 Gbps per core.
Security. Authentication and encryption are included out of the box, and additional authentication protocols and encryption algorithms can be added.
Geographic distribution. With companies and systems increasingly distributed around the globe (due to performance or data sovereignty reasons), Flight can support multi-region use cases.
Built on open-source standards. Arrow Flight is built on open-source projects and standards such as gRPC, Protocol Buffers, and FlatBuffers.
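
The parallelism point above can be illustrated with a small sketch. It does not use the Flight API: the `NODES` dictionary and the `fetch_partition` and `fetch_all` helpers are hypothetical stand-ins for server nodes that each hold one partition of a dataset, and no real network I/O takes place. The idea it shows is simply that one logical transfer can be split into partitions that a client pulls concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for three server nodes, each holding one partition
# of the same dataset as a column of integers.
NODES = {
    "node-a": [1, 2, 3],
    "node-b": [4, 5, 6],
    "node-c": [7, 8, 9],
}

def fetch_partition(node_name):
    """Pretend to stream one partition from one node (no real network I/O)."""
    return list(NODES[node_name])

def fetch_all(node_names):
    """Fetch every partition concurrently and return them in request order."""
    with ThreadPoolExecutor(max_workers=len(node_names)) as pool:
        return list(pool.map(fetch_partition, node_names))

partitions = fetch_all(sorted(NODES))
total_rows = sum(len(p) for p in partitions)
print(total_rows)
```

In a real deployment each fetch would be a network stream, so adding nodes or client workers would raise aggregate throughput until the client, the servers, or the network becomes the bottleneck.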

What Makes Apache Arrow Flight Fast?

No serialization/deserialization. The Apache Arrow memory representation is the same across all languages as well as on the wire (within Arrow Flight). As a result, the data doesn’t have to be reorganized when it crosses process boundaries.
Bulk operations. Flight operates on record batches without having to access individual columns, records or cells. For comparison, an ODBC interface involves asking for each cell individually. Assuming 1.5 million records, each with 10 columns, that’s 15 million function calls to get this data back into, say, Python.
Infinite parallelism. Flight is a scale-out technology, so for all practical purposes, the throughput is only limited by the capabilities of the client and server, as well as the network in between.
Efficient network utilization. Flight uses gRPC and HTTP/2 to transfer data, providing high network utilization.
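
The arithmetic in the bulk operations point can be worked through directly. The row and column counts come from the example above; the batch size of 100,000 rows is an assumption picked only for illustration, since real batch sizes vary by server and workload.

```python
ROWS = 1_500_000
COLUMNS = 10
BATCH_ROWS = 100_000  # assumed batch size, chosen only for illustration

# Cell-at-a-time interface: one call per cell.
cell_calls = ROWS * COLUMNS

# Batch-at-a-time interface: one call per record batch,
# regardless of how many columns each batch carries.
batch_calls = -(-ROWS // BATCH_ROWS)  # ceiling division

print(cell_calls)   # 15000000
print(batch_calls)  # 15
```

Under that assumed batch size, the client makes 15 calls instead of 15 million, which is why per-call overhead matters far less when data moves in record batches.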

Want to learn more?

Check out these resources that will walk you through the basics and also deep technical details about Apache Arrow and Arrow Flight.
