Over the past few decades, databases and data analysis have changed dramatically.
With these trends in mind, a clear opportunity emerged for a standard in-memory representation that every engine can use; one that's modern, and that takes advantage of all the new performance strategies that are now available; and one that makes sharing of data across platforms seamless and efficient. This is the goal of Apache Arrow. Learn more about the origins and history of Apache Arrow.
To use an analogy, consider traveling to Europe on vacation before the EU. To visit 5 countries in 7 days, you could count on the fact that you were going to spend a few hours at the border for passport control, and you were going to lose some of your money in the currency exchange. This is how working with data in-memory works without Arrow: enormous inefficiencies exist to serialize and de-serialize data structures, and a copy is made in the process, wasting precious memory and CPU resources. In contrast, Arrow is like visiting Europe after the EU and the Euro: you don't wait at the border, and there are one currency is used everywhere.
Arrow combines the benefits of columnar data structures with in-memory computing. It provides the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. And it does all of this in an open source and standardized way.
Apache Arrow itself is not a storage or execution engine. It is designed to serve as a shared foundation for the following types of systems:
Arrow consists of a number of connected technologies designed to be integrated into storage and execution engines. The key components of Arrow include:
As such, Arrow doesn't compete with any of these projects. Its core goal is to work within each of them to provide increased performance and stronger interoperability. In fact, Arrow is being built by the lead developers of many of these projects.
The faster a user can get to the answer, the faster they can ask other questions. High performance results in more analysis, more use cases and further innovation. As CPUs become faster and more sophisticated, one of the key challenges is making sure processing technology uses CPUs efficiently.
Arrow is specifically designed to maximize:
Cache locality, pipelining and superscalar operations frequently provide 10-100x faster execution performance. Since many analytical workloads are CPU-bound, these benefits translate into dramatic end-user performance gains. Here are some examples:
Arrow also promotes zero-copy data sharing. As Arrow is adopted as the internal representation in each system, one system can hand data directly to the other system for consumption. And when these systems are located on the same node, the copy described above can also be avoided through the use of shared memory. This means that in many cases, moving data between two systems will have no overhead.
In-memory performance is great, but memory can be scarce. Arrow is designed to work even if the data doesn't fit entirely in memory. The core data structure includes vectors of data and collections of these vectors (also called record batches). Record batches are typically 64KB-1MB, depending on the workload, and are usually bounded at 2^16 records. This not only improves cache locality, but also makes in-memory computing possible even in low-memory situations.
With many Big Data clusters ranging from 100's to 1000's of servers, systems must be able to take advantage of the aggregate memory of a cluster. Arrow is designed to minimize the cost of moving data on the network. It utilizes scatter/gather reads and writes and features a zero serialization/ deserialization design, allowing low-cost data movement between nodes. Arrow also works directly with RDMA-capable interconnects to provide a single memory grid for larger in-memory workloads.
Another major benefit of adopting Apache Arrow, besides stronger performance and interoperability, is a level-playing field among different programming languages. Traditional data sharing is based on IPC and API-level integrations. While this is often simple, it hurts performance when the user's language is different from the underlying system's language. Depending on the language and the set of algorithms implemented, the language transformation causes the majority of the processing time.
Lead developers from 14 major open source projects are involved in Apache Arrow:
Many projects are including Arrow in order to improve performance and take advantage of the latest optimizations. These projects include Drill, Impala, Kudu, Ibis, Spark, and many others. Arrow improves performance for data consumers and data engineers – which means less time spent gathering and processing data, and more time spent analyzing it and coming to useful conclusions.
Dremio is an open source Data-as-a-Service Platform that uses end-to-end Apache Arrow to dramatically increase query performance. Users simplify and accelerate access to their data in any of their sources, making it easy for teams to find datasets, curate data, track data lineage, and more. Dremio helps companies get more value from their data, faster. Try Dremio today!