Apache Arrow

Learn about the benefits of Apache Arrow.

What Is Apache Arrow?

Apache Arrow combines the benefits of columnar data structures with in-memory computing. It delivers the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. And it does all of this in an open-source and standardized way.

Over the past few decades, databases and data analysis have changed dramatically.

  • Businesses have increasingly complex requirements for analyzing and using data — and increasingly high standards for query performance.
  • Memory has become inexpensive, enabling a new set of performance strategies based on in-memory analysis.
  • CPUs and GPUs have increased in performance but have also evolved to optimize processing data in parallel.
  • New types of databases have emerged for different use cases, each with its own way of storing and indexing data. For example, because real-world objects are easier to represent as hierarchical and nested data structures, JSON and document-oriented databases have become popular.
  • New disciplines have emerged, including data engineering and data science, both with dozens of new tools to achieve specific analytical goals.
  • Columnar data representations have become mainstream for analytical workloads because they provide dramatic advantages in terms of speed and efficiency.

With these trends in mind, a clear opportunity emerged for a standard in-memory representation that every engine can use—one that’s modern, takes advantage of all the new performance strategies that are available, and makes sharing of data across platforms seamless and efficient. This is the goal of Apache Arrow.

To use an analogy, consider traveling through Europe for vacation before the European Union (EU) was established. To visit five countries in seven days, you could count on spending a few hours at each border for passport control, and on losing some of your money's value in currency exchange. This is what working with in-memory data looks like without Arrow: enormous inefficiencies arise from serializing and deserializing data structures, and a copy is made in the process, wasting precious memory and CPU resources. In contrast, Arrow is like visiting Europe after the EU and the adoption of the euro: you don't wait at the border, and one currency is used everywhere.

Apache Arrow Core Technologies

Arrow itself is not a storage or execution engine. It is designed to serve as a shared foundation for the following types of systems:

  • SQL execution engines (e.g., Drill and Impala)
  • Data analysis systems (e.g., Pandas and Spark)
  • Streaming and queueing systems (e.g., Kafka and Storm)
  • Storage systems (e.g., Parquet, Kudu, Cassandra and HBase)

Arrow consists of a number of connected technologies designed to be integrated into storage and execution engines. The key components of Arrow include:

  • Defined data type sets: both SQL and JSON types, such as int, BigInt, decimal, varchar, map, struct and array.
  • Canonical representations: columnar in-memory representations of data that support arbitrarily complex record structures built on top of the defined data types.
  • Common data structures: Arrow-aware companion data structures, including pick-lists, hash tables and queues.
  • Inter-process communication: data exchange over shared memory, TCP/IP and RDMA.
  • Data libraries: libraries for reading and writing columnar data in multiple languages, including Java, C++, Python, Ruby, Rust, Go and JavaScript.
  • Pipeline and SIMD algorithms: algorithms for operations such as bitmap selection, hashing, filtering, bucketing, sorting and matching.
  • Columnar in-memory compression: a collection of techniques to increase memory efficiency.
  • Memory persistence tools: tools for short-term persistence through non-volatile memory, SSD or HDD.
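
As a rough illustration of the defined data types and the canonical columnar representation, here is a minimal PyArrow sketch; the schema, field names and values are invented for the example:

```python
import pyarrow as pa
from decimal import Decimal

# Defined data types: SQL-style scalars plus JSON-style nested types.
schema = pa.schema([
    ("id", pa.int64()),
    ("price", pa.decimal128(10, 2)),
    ("tags", pa.list_(pa.string())),                    # array type
    ("address", pa.struct([("city", pa.string()),
                           ("zip", pa.string())])),     # struct type
])

# Canonical columnar representation: one contiguous vector per column.
batch = pa.record_batch([
    pa.array([1, 2], type=pa.int64()),
    pa.array([Decimal("9.99"), Decimal("5.00")], type=pa.decimal128(10, 2)),
    pa.array([["new", "sale"], ["clearance"]], type=pa.list_(pa.string())),
    pa.array([{"city": "Oslo", "zip": "0150"},
              {"city": "Kyiv", "zip": "01001"}],
             type=pa.struct([("city", pa.string()), ("zip", pa.string())])),
], schema=schema)

print(batch.schema)
print(batch.num_rows)  # 2
```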

As such, Arrow doesn’t compete with any of these projects. Its core goal is to work within each of them to provide increased performance and stronger interoperability. In fact, Arrow is built by the lead developers of many of these projects.

What Are the Benefits of Apache Arrow?

Performance

The faster a user can get to the answer, the faster they can ask other questions. High performance results in more analysis, more use cases and further innovation. As CPUs become faster and more sophisticated, one of the key challenges is making sure processing technology uses CPUs efficiently.

Arrow is specifically designed to maximize:

  • Cache locality: Memory buffers are compact representations of data designed for modern CPUs. The structures are defined linearly, matching typical read patterns. That means that data of similar type is co-located in memory. This makes cache prefetching more effective, minimizing CPU stalls resulting from cache misses and main memory accesses. These CPU-efficient data structures and access patterns extend to both traditional flat relational structures and modern complex data structures.
  • Pipelining: Execution patterns are designed to take advantage of the superscalar and pipelined nature of modern processors, minimizing in-loop instruction count and loop complexity. These tight loops lead to better performance and fewer branch-prediction failures.
  • SIMD instructions: Single Instruction, Multiple Data (SIMD) instructions let execution algorithms operate more efficiently by performing the same operation on multiple data values at once. Arrow organizes data to be well suited for SIMD operations, as the sketch after this list illustrates.
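
As a concrete sketch of what this looks like from user code, the PyArrow compute kernels below run as tight, vectorizable loops over contiguous column buffers instead of interpreting one record at a time (the array contents are invented; actual speedups depend on hardware):

```python
import pyarrow as pa
import pyarrow.compute as pc

values = pa.array(range(10_000_000), type=pa.int64())

# A vectorized kernel: one tight loop over a contiguous int64 buffer,
# amenable to pipelining and SIMD.
total = pc.sum(values)

# Filtering is likewise a columnar kernel: build a boolean mask, then
# gather the selected values, rather than branching per record.
mask = pc.greater(values, 9_999_990)
selected = values.filter(mask)

print(total.as_py(), selected.to_pylist())
```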

Cache locality, pipelining and superscalar operations frequently provide 10-100x faster execution performance. Since many analytical workloads are CPU-bound, these benefits translate into dramatic end-user performance gains.

Arrow also promotes zero-copy data sharing. As Arrow is adopted as the internal representation in each system, one system can hand data directly to the other system for consumption. And when these systems are located on the same node, the copy described above can also be avoided through the use of shared memory. This means that in many cases, moving data between two systems will have no overhead.
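
Here is a minimal sketch of zero-copy sharing using PyArrow and the Arrow IPC file format (the file name and data are invented). Because the on-disk layout matches the in-memory layout, a memory-mapped read references the buffers in place instead of parsing and copying them:

```python
import pyarrow as pa

table = pa.table({"user_id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})

# Write the table in the Arrow IPC file format.
with pa.OSFile("scores.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file and read it back. The buffers are referenced
# in place rather than deserialized and copied.
with pa.memory_map("scores.arrow", "r") as source:
    loaded = pa.ipc.open_file(source).read_all()

print(loaded.equals(table))  # True
```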

Memory Efficiency

In-memory performance is great, but memory can be scarce. Arrow is designed to work even if the data doesn't fit entirely in memory. The core data structure includes vectors of data and collections of these vectors (also called record batches). Record batches are typically 64 KB to 1 MB, depending on the workload, and are usually bounded at 2^16 (65,536) records. This not only improves cache locality, but also makes in-memory computing possible even in low-memory situations.
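
As a small sketch of working with record batches in PyArrow (the table contents are invented), a large table can be processed as a stream of bounded batches; the chunk size below mirrors the 2^16-record bound mentioned above:

```python
import pyarrow as pa

table = pa.table({"x": list(range(1_000_000))})

# Process the table as a stream of bounded record batches rather than
# materializing everything at once; 65_536 rows = 2**16.
batches = table.to_batches(max_chunksize=65_536)
for batch in batches:
    # Each batch is a set of column vectors sharing one row count,
    # small enough to stay cache- and memory-friendly.
    pass  # e.g., aggregate, filter, or ship each batch over the wire

print(table.num_rows, len(batches))
```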

With many big data clusters ranging from hundreds to thousands of servers, systems must be able to take advantage of the aggregate memory of a cluster. Arrow is designed to minimize the cost of moving data on the network. It utilizes scatter/gather reads and writes and features a zero serialization/deserialization design, allowing low-cost data movement between nodes. Arrow also works directly with RDMA-capable interconnects to provide a single memory grid for larger in-memory workloads.

Programming Language Support

Another major benefit of adopting Arrow, besides stronger performance and interoperability, is a level playing field among different programming languages. Traditional data sharing is based on IPC and API-level integrations. While this is often simple, it hurts performance when the user's language is different from the underlying system's language. Depending on the language and the set of algorithms implemented, this language transformation can account for the majority of the processing time.

Arrow currently supports C, C++, Java, JavaScript, Go, Python, Rust and Ruby.
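
For example, with PyArrow, moving a dataset between pandas and any Arrow-speaking system is a library call rather than a bespoke serialization layer (the DataFrame below is invented):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"city": ["Oslo", "Kyiv"], "pop_m": [0.7, 2.9]})

# pandas -> Arrow: the resulting table uses the standard Arrow layout,
# so any Arrow-capable engine (Java, C++, Rust, ...) can consume it.
table = pa.Table.from_pandas(df)

# Arrow -> pandas for local analysis.
df_again = table.to_pandas()

print(table.schema)
```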

Open Source Community Contributors

Lead developers from 14 major open source projects are involved in Arrow:

  • Calcite
  • Cassandra
  • Dremio
  • Drill
  • Hadoop
  • HBase
  • Ibis
  • Impala
  • Kudu
  • Pandas
  • Parquet
  • Phoenix
  • Spark
  • Storm

Arrow-Enabled Data

Many projects include Arrow in order to improve performance and take advantage of the latest optimizations. These projects include Drill, Impala, Kudu, Ibis, Spark and many others. Arrow improves performance for data consumers and data engineers — which means less time spent gathering and processing data, and more time spent analyzing it and coming to useful conclusions.

Dremio is a data lake engine that uses end-to-end Apache Arrow to dramatically increase query performance. Users simplify and accelerate access to their data in any of their sources, making it easy for teams to find datasets, curate data, track data lineage and more. Dremio helps companies get more value from their data, faster. Try Dremio today!

Projects Powered by Apache Arrow

There are several projects that are powered by Apache Arrow, including:

  1. Apache Arrow Flight: A framework for high-performance data transfer that uses the Arrow memory format for efficient data transfer.
  2. Apache Arrow GLib: A library that provides GLib and GObject interfaces for working with Arrow data in C and other languages.
  3. Apache Parquet: A columnar storage format that is often used with Arrow data, providing efficient storage and retrieval of large datasets.
  4. Apache ORC: Another columnar storage format that is compatible with Arrow data, providing efficient storage and retrieval of large datasets.
  5. Apache Spark: A popular big data processing framework that has built-in support for Arrow data, allowing for more efficient data processing and analytics.
  6. Apache Kafka: A distributed streaming platform that can be used with Arrow data to enable high-performance data transfer between systems.
  7. Python libraries such as Pandas, PyArrow and Dask, which use Arrow as a foundation for their data structures, enabling faster performance when operating on large datasets.
  8. Other projects, such as PyTorch, TensorFlow and RAPIDS, which also use Arrow for better performance and data processing.

Note that this list is not exhaustive; other projects beyond those mentioned here also use Apache Arrow.

Apache Arrow is a powerful technology that is utilized in a wide range of projects, offering benefits such as high-performance data transfer, efficient storage and retrieval of large datasets, and improved data processing and analytics. One of the key projects powered by Apache Arrow is Apache Arrow Flight. This framework provides a high-performance data transfer mechanism that utilizes the Arrow memory format to ensure efficient data transfer between systems. Because the wire format is the same as the in-memory format, data moves with little or no serialization and deserialization, resulting in faster transfer and reduced network overhead. This makes Apache Arrow Flight an ideal solution for use cases that require high-performance data transfer, such as machine learning and big data analytics.
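
As a client-side sketch using PyArrow's Flight bindings, the snippet below streams a dataset from a Flight service; the server location and ticket contents are hypothetical and depend entirely on the service you are talking to:

```python
import pyarrow.flight as flight

# Hypothetical Flight server; replace with a real service endpoint.
client = flight.connect("grpc://localhost:8815")

# A ticket identifies a stream on the server; its contents are
# service-specific (here, an invented dataset name).
ticket = flight.Ticket(b"sales_2024")

# do_get streams Arrow record batches straight off the wire, with no
# row-by-row serialization or deserialization step.
reader = client.do_get(ticket)
table = reader.read_all()

print(table.num_rows)
```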

Another notable project powered by Apache Arrow is Apache Parquet. This columnar storage format is often used in conjunction with Arrow data, providing efficient storage and retrieval of large datasets. The columnar format reduces I/O and CPU costs, making it well suited for use cases such as data warehousing and analytics. Parquet's compatibility with Arrow also lets it work alongside other Arrow-powered projects such as Apache Spark, and makes it easy to integrate from a variety of programming languages, making it a versatile solution for a wide range of use cases.
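
A short sketch of the Arrow-Parquet round trip in PyArrow (the file name and data are invented); note the column selection on read, which is where the columnar format saves I/O:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [101, 102, 103],
    "amount": [9.99, 24.50, 3.75],
    "country": ["NO", "UA", "DE"],
})

# Arrow -> Parquet: compressed columnar storage on disk.
pq.write_table(table, "orders.parquet")

# Parquet -> Arrow, reading only the columns a query needs; the
# untouched columns are never read from disk.
amounts = pq.read_table("orders.parquet", columns=["amount"])

print(amounts)
```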
