Java Vector Enhancements for Apache Arrow 0.8.0

Start For Free

Copied to clipboard

In a recent blog post I wrote about enhancement to vectorized processing in Apache Arrow 0.8.0 in the Java code base. In this post we will contrast latency figures for Arrow 0.7.0 and 0.8.0 to explore how these enhancements have helped to improve the overall performance of Arrow. Note that these changes are specific to the Java implementation.

For Arrow 0.8.0, we made significant enhancements to the Java vector implementation by improving the performance of critical code paths and reducing heap usage. In addition, these changes lay the foundation for additional efforts in the future to further reduce heap usage in vectors.

In the table below we compare PROJECT and FILTER operations of specific queries from the TPC-H benchmark using Dremio. Once these code changes were committed to Arrow, we changed our code in Dremio to move to the new implementation of Java vectors and ran a subset of the standard TPC-H performance tests using a 10 node cluster on GCP.

TPC-H Query	Project (Arrow 0.8)	Project (Arrow 0.7)	Project (% improvement)	Filter (Arrow 0.8)	Filter (Arrow 0.7)	Filter (% improvement)
Q1	101ms	120ms	15.8%	N/A	N/A	N/A
Q3	0.2ms	0.5ms	60%	N/A	N/A	N/A
Q4	3ms	3ms	0%	162ms	250ms	35.2%
Q5	23ms	38ms	39.4%	N/A	N/A	N/A
Q6	14ms	18ms	22.2%	42ms	50ms	16.0%
Q7	16ms	25ms	36.0%	87ms	150ms	42.0%
Q8	14ms	22ms	36.4%	N/A	N/A	N/A
Q9	22ms	34ms	35.3%	675ms	720ms	6.3%
Q10	15ms	20ms	25.0%	N/A	N/A	N/A
Q11	19ms	20ms	5.0%	10ms	12ms	16.7%
Q12	30ms	50ms	40.0%	80ms	142ms	43.7%
Q13	9ms	15ms	40.0%	3400ms	3500ms	2.9%
Q14	11ms	17ms	35.3%	N/A	N/A	N/A
Q17	12ms	19ms	36.8%	6ms	6ms	0.0%
Q18	12ms	20ms	40.0%	31ms	48ms	35.4%
Q20	3ms	4ms	25.0%	36ms	42ms	14.3%

Some notes on these results:

Project (Arrow 0.8) and Project (Arrow 0.7) represent the processing time (in milliseconds) consumed by the “Project” operator for a particular TPC-H query using the vector implementation in respective Arrow versions.
Filter (Arrow 0.8) and Filter (Arrow 0.7) represent the processing time (in milliseconds) consumed for predicate evaluation using the vector implementation in respective Arrow versions.
The presence of N/A for Filter (Arrow 0.8) and Filter (Arrow 0.7) indicates that the particular query didn’t require any WHERE clause evaluation and thus no “Filter” operator.

The reason for choosing specific operators for performance comparison rather than full elapsed time of query is that the performance improvements from the new vector implementation are likely to impact the non-vectorized operators since here we work with vector APIs extensively. Dremio’s vectorized query processing algorithms are very optimized and bypass most of the Arrow code stack to interact directly with the Arrow in-memory buffers to perform the meaningful work.

As Arrow vectors are used extensively in Dremio, this new implementation has provided overall performance improvements for our users as a result of fewer objects being created inside Arrow, and more efficient function calls associated with the vector APIs.

We are excited to be able to make these enhancements knowing there is much more we can do to further improve the efficiency of operations in Apache Arrow. Stay tuned for more in the near future!

Try Dremio Cloud free for 30 days

Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.

Start For Free

Article Topics

Product Insights from the Dremio Blog

Blog coverpage for Ingesting Data into Aparche Iceberg with Dremio

Feb 1, 2024 Product Insights from the Dremio Blog

Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg

By unifying data from diverse sources, simplifying data operations, and providing powerful tools for data management, Dremio stands out as a comprehensive solution for modern data needs. Whether you are a data engineer, business analyst, or data scientist, harnessing the combined power of Dremio and Apache Iceberg will undoubtedly be a valuable asset in your data management toolkit.