Over the past few decades, databases and data analysis have changed dramatically.
- Businesses have increasingly complex requirements for analyzing and using data — and increasingly high standards for query performance.
- Memory has become inexpensive, enabling a new set of performance strategies based on in-memory analysis.
- CPUs and GPUs have increased in performance, but have also evolved to optimize processing data in parallel.
- New types of databases have emerged for different use cases, each with its own way of storing and indexing data. For example, because real-world objects are easier to represent as hierarchical and nested data structures, JSON and document-oriented databases have become popular.
- New disciplines have emerged, including data engineering and data science, both with dozens of new tools to achieve specific analytical goals.
- Columnar data representations have become mainstream for analytical workloads because they provide dramatic advantages in terms of speed and efficiency.
With these trends in mind, a clear opportunity emerged for a standard in-memory representation that every engine can use—one that’s modern, takes advantage of all the new performance strategies that are available, and makes sharing of data across platforms seamless and efficient. This is the goal of Apache Arrow.
To use an analogy, consider traveling to Europe for vacation before the European Union (EU) was established. To visit five countries in seven days, you could count on the fact that you were going to spend a few hours at each border for passport control, and you were going to lose some value of your money in the currency exchange. This is how working with data in-memory works without Arrow: enormous inefficiencies exist to serialize and deserialize data structures, and a copy is made in the process, wasting precious memory and CPU resources. In contrast, Arrow is like visiting Europe after the EU and the adoption of the common currency dubbed the euro: you don’t wait at the border, and one currency is used everywhere.
Arrow combines the benefits of columnar data structures with in-memory computing. It delivers the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. And it does all of this in an open source and standardized way. Learn more about the origins and history of Apache Arrow.
Apache Arrow Core Technologies
Arrow itself is not a storage or execution engine. It is designed to serve as a shared foundation for the following types of systems:
- SQL execution engines (e.g., Drill and Impala)
- Data analysis systems (e.g., Pandas and Spark)
- Streaming and queueing systems (e.g., Kafka and Storm)
- Storage systems (e.g., Parquet, Kudu, Cassandra and HBase)
Arrow consists of a number of connected technologies designed to be integrated into storage and execution engines. The key components of Arrow include:
- Defined data type sets including both SQL and JSON types, such as int, BigInt, decimal, varchar, map, struct and array.
- Canonical representations are columnar in-memory representations of data to support an arbitrarily complex record structure built on top of the defined data types.
- Common data structures arrow-aware companion data structures including pick-lists, hash tables and queues.
- Inter-process communication achieved within shared memory, TCP/IP and RDMA.
- Pipeline and SIMD algorithms for various operations including bitmap selection, hashing, filtering, bucketing, sorting and matching.
- Columnar in-memory compression including a collection of techniques to increase memory efficiency.
- Memory persistence tools for short-term persistence through non-volatile memory, SSD or HDD.
As such, Arrow doesn’t compete with any of these projects. Its core goal is to work within each of them to provide increased performance and stronger interoperability. In fact, Arrow is built by the lead developers of many of these projects.
The faster a user can get to the answer, the faster they can ask other questions. High performance results in more analysis, more use cases and further innovation. As CPUs become faster and more sophisticated, one of the key challenges is making sure processing technology uses CPUs efficiently.
Arrow is specifically designed to maximize:
- Cache locality: Memory buffers are compact representations of data designed for modern CPUs. The structures are defined linearly, matching typical read patterns. That means that data of similar type is co-located in memory. This makes cache prefetching more effective, minimizing CPU stalls resulting from cache misses and main memory accesses. These CPU-efficient data structures and access patterns extend to both traditional flat relational structures and modern complex data structures.
- Pipelining: Execution patterns are designed to take advantage of the superscalar and pipelined nature of modern processors. This is done by minimizing in-loop instruction count and loop complexity. These tight loops lead to better performance and less branch-prediction failures.
- SIMD instructions: Single Instruction Multiple Data (SIMD) instructions allow execution algorithms to operate more efficiently by executing multiple operations in a single clock cycle. Arrow organizes data to be well-suited for SIMD operations.
Cache locality, pipelining and superscalar operations frequently provide 10-100x faster execution performance. Since many analytical workloads are CPU-bound, these benefits translate into dramatic end-user performance gains. Here are some examples:
- PySpark: IBM measured a 53x speedup in data processing by Python and Spark after adding support for Arrow in PySpark
- Parquet and C++: Reading data into Parquet from C++ at up to 4GB/s
- Pandas: Reading into Pandas up to 10GB/s
Arrow also promotes zero-copy data sharing. As Arrow is adopted as the internal representation in each system, one system can hand data directly to the other system for consumption. And when these systems are located on the same node, the copy described above can also be avoided through the use of shared memory. This means that in many cases, moving data between two systems will have no overhead.
In-memory performance is great, but memory can be scarce. Arrow is designed to work even if the data doesn’t fit entirely in memory. The core data structure includes vectors of data and collections of these vectors (also called record batches). Record batches are typically 64KB-1MB, depending on the workload, and are usually bounded at 2^16 records. This not only improves cache locality, but also makes in-memory computing possible even in low-memory situations.
With many big data clusters ranging from hundreds to thousands of servers, systems must be able to take advantage of the aggregate memory of a cluster. Arrow is designed to minimize the cost of moving data on the network. It utilizes scatter/gather reads and writes and features a zero serialization/deserialization design, allowing low-cost data movement between nodes. Arrow also works directly with RDMA-capable interconnects to provide a single memory grid for larger in-memory workloads.
Programming Language Support
Another major benefit of adopting Arrow, besides stronger performance and interoperability, is a level playing field among different programming languages. Traditional data sharing is based on IPC and API-level integrations. While this is often simple, it hurts performance when the user’s language is different from the underlying system’s language. Depending on the language and the set of algorithms implemented, the language transformation causes the majority of the processing time.
Open Source Community Contributors
Lead developers from 14 major open source projects are involved in Arrow: