November 11, 2025
Apache Arrow’s Role in Dremio’s Performance
Dremio is always striving to abstract away the physical concerns of data, whether that's storage location, partitioning scheme, or file-size optimisation. Thanks to features such as Data Federation, Iceberg Clustering, and Autonomous Performance, Dremio users get highly performant access to their data no matter where it lives.
One of the components that delivers this for Dremio is Apache Arrow, an open-source project that defines a cross-language, columnar in-memory data format. Arrow is designed to make analytics and data processing faster and more efficient, both within a single system and “in flight” between systems or programming languages.
What is Apache Arrow?
Whilst many databases share the concept of a structured, table-like dataset, their physical implementations differ entirely. Each data processing engine has its own custom data structure with its own data reading and writing processes. This would be fine if your data strategy only involved one database, but the reality is that most organisations use multiple data platforms and tools. This means that a typical data analysis pipeline involves multiple data structures, file formats, and programming languages that all must communicate with each other.
The old solution was to throw developers at the problem: have them write serialisation interfaces that convert one format or programming language to another. Not only did this require a lot of developer time to create and maintain custom interoperability libraries, it also required compute spend to translate between formats. In short, you wasted time, resources, and money moving data between systems and tools.
Co-created by Dremio in 2016, in collaboration with Wes McKinney (the creator of Pandas) and other open-source contributors, Apache Arrow originated as a way to make moving data from one system to another more efficient.
Arrow
The starting component of Apache Arrow was its in-memory columnar format, a standardised specification for representing structured, table-like datasets in memory. The project contains implementations of the Arrow format in many languages, including utilities for reading and writing it to many common storage formats. These official libraries enable third-party projects to work with Arrow data without the headache of implementing the Arrow format themselves. Its key characteristics (illustrated in the sketch after this list):
- Columnar Format: storing data in columns suits analytical workloads and makes efficient use of CPU caches.
- Rich Data Types: supports nested and user-defined data types.
- Zero-Copy Data Sharing: data serialisation/deserialisation is not required to exchange data between processes if they share the same format, eliminating a major source of overhead.
- Cross-Language Ecosystem: Arrow provides libraries for multiple programming languages, including C++, Java, Python (PyArrow), R, Go, and Rust, ensuring seamless interoperability.
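To make these properties concrete, here is a minimal PyArrow sketch (the file and column names are invented for the example) that builds an Arrow table with a nested column, round-trips it through Parquet, and hands a column to NumPy without copying:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table; each column is stored contiguously in memory.
table = pa.table({
    "trip_id": pa.array([1, 2, 3], type=pa.int64()),
    "fare": pa.array([12.5, 7.0, 23.1], type=pa.float64()),
    # Nested types such as lists are first-class in Arrow.
    "stops": pa.array([["A", "B"], ["C"], ["A", "C", "D"]],
                      type=pa.list_(pa.string())),
})

# Read and write common storage formats via the official utilities.
pq.write_table(table, "trips.parquet")
roundtrip = pq.read_table("trips.parquet")
assert roundtrip.equals(table)

# Zero-copy hand-off: the NumPy array views Arrow's buffer directly,
# so no serialisation or copying takes place.
fares = table["fare"].chunk(0).to_numpy(zero_copy_only=True)
print(fares.mean())
```

Because the NumPy array in the last step is a view over Arrow's own buffer, the hand-off between libraries costs nothing, which is exactly the overhead the Arrow format was designed to eliminate.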
Arrow Flight
Arrow addressed format inconsistencies within a single system, but these issues persisted when sharing data between systems. Even if two systems both used Arrow, data transfer would still be carried out over older interfaces such as JDBC or ODBC, which require the data to be serialised to a row-based format for transfer and then deserialised back to columns at the destination. This led to Arrow Flight, a high-performance data transport framework built on top of Arrow. Co-developed by Dremio, it is designed to efficiently move large datasets between systems by retaining the Arrow columnar format end-to-end (see the client sketch after this list).
- High Efficiency: CPU overhead is minimised as the data format is unchanged in transport.
- Scalable and Fast: data transfer can span multiple cores, processors, and systems in parallel.
- Open Source: built on open source and open standards such as gRPC, Protocol Buffers, and FlatBuffers.
- Secure: supports TLS encryption and authentication.
- Cross-platform: built on Arrow so retains the same broad cross-language ecosystem.
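Here is a minimal PyArrow Flight client sketch; the server address, credentials, and dataset path are placeholders, and the exact calls a given server supports may differ:

```python
import pyarrow.flight as flight

# Connect to a Flight server over TLS (placeholder address).
client = flight.FlightClient("grpc+tls://flight.example.com:32010")

# Basic-auth handshake; returns a header pair (e.g. a bearer token)
# to attach to subsequent calls.
token = client.authenticate_basic_token(b"user", b"password")
options = flight.FlightCallOptions(headers=[token])

# Ask the server where and how to fetch the dataset (placeholder path).
descriptor = flight.FlightDescriptor.for_path("example_dataset")
info = client.get_flight_info(descriptor, options)

# Stream Arrow record batches straight off the wire; the columnar
# format is preserved end-to-end, so there is no row-wise
# serialisation or deserialisation.
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.num_rows)
```

Note the two-step design: get_flight_info returns one or more endpoints, so a large dataset can be streamed from several nodes in parallel, which is where Flight's scalability comes from.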
How does Dremio use it?
Dremio has put Arrow at the heart of its in-memory query execution engine. Whenever Dremio interacts with data, it is first converted into Arrow so it can be processed with high efficiency, no serialisation overhead, and cache-friendly column access. The processed datasets can then be efficiently sent to storage or downstream to other tools with low latency and high throughput thanks to Arrow Flight. By embracing Arrow, Dremio becomes a highly interoperable and cost-efficient analytics platform for users and developers to integrate with.
In late 2022, Dremio contributed the Arrow Flight SQL JDBC driver to the Arrow community. Arrow Flight SQL is a client-server protocol that allows clients to send SQL queries and servers to efficiently return Arrow-formatted results. By standardising SQL-over-Flight, Dremio made Arrow Flight more developer-friendly, eliminating one of the remaining advantages of the older JDBC and ODBC drivers.
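As a sketch of what SQL-over-Flight looks like from Python, here is an example using the ADBC Flight SQL driver (the adbc_driver_flightsql package); the endpoint and credentials are placeholders, and the username/password options assume a server that accepts basic authentication:

```python
import adbc_driver_flightsql.dbapi as flightsql

# Placeholder endpoint and credentials.
conn = flightsql.connect(
    "grpc+tls://flight.example.com:32010",
    db_kwargs={"username": "user", "password": "password"},
)
cur = conn.cursor()
cur.execute("SELECT 1 AS answer")

# Results arrive as Arrow record batches rather than rows, so the
# data stays columnar from server to client.
table = cur.fetch_arrow_table()
print(table)

cur.close()
conn.close()
```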
How has it impacted the industry?
Open-source projects like Apache DataFusion and Comet demonstrate Arrow’s industry impact. DataFusion is a Rust-based query engine that uses Arrow to provide high-performance, embeddable, and modular query execution. By separating its components (SQL parser, logical/physical optimiser, and execution engine), DataFusion allows developers to embed just the parts they need. Comet, an open-source accelerator for Apache Spark, takes the next logical step: improving query performance for existing deployments. It achieves dramatic performance increases, without affecting code or workflows, by replacing Spark’s JVM-based execution engine with a native engine built on DataFusion. Together, these projects illustrate how Arrow has cemented itself in the analytics industry, which has, in turn, allowed more open-source projects to confidently build upon it and flourish.
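For a taste of how embeddable this is, here is a small sketch using DataFusion's Python bindings (the datafusion package); the Parquet file name is a placeholder:

```python
from datafusion import SessionContext

# DataFusion parses, plans, optimises, and executes the query,
# with Arrow as its native in-memory format throughout.
ctx = SessionContext()
ctx.register_parquet("trips", "trips.parquet")

df = ctx.sql("SELECT count(*) AS n_trips FROM trips")
result = df.to_arrow_table()  # execution output is already Arrow
print(result)
```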
Summary
Apache Arrow has fundamentally transformed the analytics industry by introducing a standardised, language-agnostic, columnar in-memory data format. It has seen broad adoption by developers and vendors because it is both highly performant and developer-friendly. This common format foundation has enabled modular analytics architectures that allow developers to innovate without reinventing core data handling components.
This has contributed to Arrow becoming a de facto standard for in-memory data representation and transport in analytics, promoting a faster, more interoperable, and more efficient data ecosystem. This is why Dremio co-created, contributes to, and utilises Apache Arrow. And why it can be found in data systems and libraries such as Pandas, DuckDB, Snowflake, BigQuery, Spark, and many, many others.