I became a contributor to Apache Arrow fairly recently, but I have been using it (knowingly or not) for at least three years. What amazes me is how it has quietly become an integral component of the modern data analytics and data science world. As we welcome the 1.0 release of Arrow and the associated declaration of API stability, I thought it would be interesting to look at where Arrow started, how it got here and where it is headed in the next chapter.
Arrow was announced as a top-level project of the Apache Software Foundation in February 2016. Unusually, it skipped the incubation period and was established directly as a top-level project. One reason for the swift promotion was the star power involved in the project: Dremio CTO Jacques Nadeau, Wes McKinney and many others who have served as committers and PMC members on various Apache projects.
This swift promotion is a sure sign of the importance of Arrow to the wider data community. At the time, columnar data formats were just starting to come back into fashion (thanks to MonetDB and Dremel). The big data community was looking for ways to improve the speed of queries on CPUs, while the data science community was taking a hard look at the performance of its traditional tools. This led to a broad conversation about the next generation of data tools and the importance of columnar data formats. A columnar in-memory format is much better suited to modern analytics workflows and to modern hardware features like caching, pipelining and GPUs, and that recognition made the case for building columnar formats and tools.
A guiding principle was to not just start using columnar data formats in the tools, but to create a canonical representation of columnar data. The result is a lingua franca for in-memory data. Every application that understands the Arrow format is able to leverage tools built for Arrow as well as communicate with other applications via the Arrow format. Being able to move data easily between processes is particularly important for cross-language applications. With most big data tools written in Java and data science tools predominantly in Python or R (often backed by C/C++), being able to communicate large datasets between tools easily has helped unify the diverse data ecosystem. This is what, in my opinion, has fueled Arrow’s explosive growth. It is now possible to easily communicate between Spark and Pandas or execute a compute kernel written in C on a GPU directly from the JVM.
At Dremio we have been primarily involved in the Java side of Arrow, as it lies at the core of our data lake engine. Whether data comes from Amazon S3, Microsoft Azure, an RDBMS or a NoSQL engine, it is immediately marshalled into the Arrow format. All subsequent calculations are then performed on Arrow buffers. This allows Dremio to leverage community-built tools and functionality for fast query execution and transport and, more importantly, enables us to contribute our innovations back to the Arrow community.
The very early days of Arrow saw quick advancement on both the Python/C++ side and the Java side. Dremio put a great deal of effort into ValueVector, a complete Java implementation built for v1.0 of our query engine (for a more thorough history and some fun anecdotes, read the origin story written by Jacques). The ValueVector initiative at Dremio eventually became the core data structure for the Java side of the project.
On the Python side, Wes and team were busy with a C++ implementation of the Arrow spec and the subsequent Python/Cython bindings. This work formed the early part of Wes’s vision for PyArrow and the Apache Parquet to Pandas conversion layer in PyArrow. It also allowed for the subsequent work on Spark and PySpark. Feather was another early innovation for the Arrow community; it enabled fast and simple data exchange between R and Python.
As the Java and C++ libraries matured the community quickly grew. Some highlights:
While there’s no easy way to calculate the exact number of downloads of Arrow, one simple data point is the number of downloads of PyArrow through PyPI:
This takes into account downloads of the PyArrow project from PyPI but does not count all the other language bindings (or Conda downloads) or the variety of bundlings of Arrow in other projects. The growth of users of the project is really exciting to watch. I am not aware of any specialized technology that has grown at such a rate.
The broad language support drove a ton of adoption across diverse projects and companies, from large to small companies and open source projects:
Originally developed at Dremio and contributed back to Arrow, as described in “Introducing the Gandiva Initiative for Apache Arrow,” Gandiva is an LLVM-based execution kernel which performs JIT compilation on arbitrary expression trees. This means that applications submitting SQL-like filter or projection operations to Gandiva can leverage LLVM’s code generation and on-the-fly compilation to produce machine code uniquely tailored to the specific query they are executing. For most applications this produces a one to two orders of magnitude improvement in query performance and allows any Arrow user to take advantage of low-level SIMD instructions without requiring SIMD bindings in the client language.
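The core idea, generating code for a whole expression tree once and then running the compiled result over the data, can be sketched in a few lines of plain Python. This is only an analogy and not Gandiva’s API: the tuple-based tree format and the function names are hypothetical, and the sketch compiles to Python bytecode rather than LLVM machine code, so it shows the shape of the technique, not its performance.

```python
# Hypothetical mini expression tree, not Gandiva's API:
#   ("col", name) | ("lit", value) | (op, left, right)
BINARY_OPS = {"+", "-", "*", "<", ">", "==", "and", "or"}


def to_source(node):
    """Translate an expression tree into a Python source fragment."""
    kind = node[0]
    if kind == "col":
        if not node[1].isidentifier():
            raise ValueError(f"bad column name: {node[1]!r}")
        return node[1]
    if kind == "lit":
        return repr(node[1])
    if kind in BINARY_OPS:
        return f"({to_source(node[1])} {kind} {to_source(node[2])})"
    raise ValueError(f"unknown node type: {kind!r}")


def compile_predicate(tree, column_names):
    """Generate code for the whole tree once and compile it to a function."""
    source = f"lambda {', '.join(column_names)}: {to_source(tree)}"
    return eval(compile(source, "<expr>", "eval"), {"__builtins__": {}})


def filter_rows(columns, tree):
    """Return the indices of rows for which the compiled predicate is true."""
    names = list(columns)
    predicate = compile_predicate(tree, names)
    return [i for i, row in enumerate(zip(*(columns[n] for n in names)))
            if predicate(*row)]


# Illustrative data: roughly "WHERE price > 10 AND qty < 5".
columns = {"price": [5, 20, 12], "qty": [1, 2, 10]}
tree = ("and", (">", ("col", "price"), ("lit", 10)),
               ("<", ("col", "qty"), ("lit", 5)))
selected = filter_rows(columns, tree)
```

The tree is walked once, at compile time, instead of once per row. A JIT compiler such as the one described above takes the same step much further by emitting native, vectorized code for the expression.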
Originally conceptualized at Dremio, Flight is a remote procedure call (RPC) mechanism designed to fulfill the promise of data interoperability at the heart of Arrow. While the Arrow IPC format and in-memory specification have always existed, there was never an RPC mechanism to exchange data between processes in a coordinated way.
Flight aims to provide a user-friendly interface for quickly building data services with built-in authentication and security. It is also designed as a parallel transport mechanism, leading to very high transfer rates for highly parallel systems. At Dremio we see it as the first step on the way to creating a new and modern standard for transporting large data between networked applications. Benchmarks against JDBC/ODBC show the serial Flight mechanism to be roughly 50x faster, and the parallel case scales linearly with the number of workers.
With the new release of Arrow, the Filesystem API has been overhauled. The Filesystem API is an abstraction over file stores, with local disk, HDFS and Amazon S3 supported. More info can be found in the Arrow docs and in “Apache Arrow Datasets.” This simplifies doing data science directly on the data lake, as it allows Python-based data scientists to read large datasets incrementally from S3.
The future of Arrow promises to be even more exciting than the past few years. The near future will bring many performance and usability improvements to better leverage things like SIMD and GPUs and to further explore hybrid computing paradigms. Performance improvements to the Datasets API, compute kernels and Parquet I/O are also in the works. Dremio is also working toward a plugin for Flight which will make a Flight server look more like a traditional database. As more vendors start using Flight, a standardized protocol is important to ensure interoperability.
For me, the most exciting project in the near future is the Arrow Compute/Data Frame project. Discussed in “Arrow C++ Data Frame Project,” it promises to deliver on Wes’s original vision for Pandas Two and to address the perceived flaws in Pandas laid out in “Apache Arrow and the ‘10 Things I Hate About Pandas.’” It could have as big an impact as the original Pandas project and extend the use of Arrow even further.