As co-creators of Apache Arrow, we at Dremio have been excited to watch its tremendous growth over the past several years. Today, Apache Arrow is the de facto standard for efficient in-memory columnar analytics, delivering high performance when processing and transporting large volumes of data.
Since we co-created it in 2016, Arrow's journey to becoming the standard columnar in-memory format has been fascinating to watch. One standout is its diverse application across today's data landscape for a wide range of analytical use cases. The project has grown along many axes, including usage, ecosystem adoption, and capability. Let's take a look at some of these.
Because Apache Arrow has such diverse applications, its users are both direct and indirect. Software developers use Arrow directly to build high-performance applications, particularly around data transport and interoperability. Meanwhile, many data consumers never work with the format directly but have likely used it under the hood through high-level libraries such as Pandas, Spark, and Streamlit.
As a proxy for Arrow usage, the chart below shows the number of monthly downloads from PyPI of PyArrow, the Python binding for Apache Arrow.
While this chart demonstrates Arrow's impressive growth over the years, note that it actually undersells both interest in and use of Arrow, since:
From a capabilities perspective, Arrow has evolved considerably. What started as a format specification and a couple of basic implementations for interacting with that format has grown into a broad set of libraries and tools.
This growth has been genuinely community-driven, built on contributions from many individuals and organizations. The following highlights some of the most impactful projects that have expanded Arrow's capabilities over the years.
Arrow has grown tremendously in ecosystem adoption, both among technologies created after Arrow existed and among those that predate it. When new tools and libraries decide how to handle data in memory and in transit, they often choose Apache Arrow; examples include Google's BigLake, RISELab's Ray, and DuckDB. Technologies that predate Arrow, such as Apache Spark, Snowflake, and Amazon Athena, have also chosen to introduce Arrow into their stacks. These adoptions show that when teams decide on an in-memory format for data, they most often choose Apache Arrow.
The Arrow developer community has also grown significantly. The project today counts over 700 contributors, including roughly 67 committers. As the graph below illustrates, the number of pull requests has increased steadily over the years, reflecting the pace at which this community-driven project adds new features and fixes existing issues.
As an Apache Software Foundation project, Arrow benefits from clear governance and transparency, with decisions made by community consensus. The project also has great diversity, with contributions from many organizations. Read more about how you can contribute to Apache Arrow here.
There is no doubt that Apache Arrow will continue to be leveraged by existing tools as well as newer entrants in the data space. As we move toward a world where efficient, fast data processing is critical, more and more systems will speak “Arrow” to avoid data transport costs, making data objects interoperable between systems. If you are interested in reading more about the origins and history of Apache Arrow and how it fits into today's data landscape, check out this write-up.