In a recent blog post I wrote about enhancement to vectorized processing in Apache Arrow 0.8.0 in the Java code base. In this post we will contrast latency figures for Arrow 0.7.0 and 0.8.0 to explore how these enhancements have helped to improve the overall performance of Arrow. Note that these changes are specific to the Java implementation.
For Arrow 0.8.0, we made significant enhancements to the Java vector implementation by improving the performance of critical code paths and reducing heap usage. In addition, these changes lay the foundation for additional efforts in the future to further reduce heap usage in vectors.
In the table below we compare PROJECT and FILTER operations of specific queries from the TPC-H benchmark using Dremio. Once these code changes were committed to Arrow, we changed our code in Dremio to move to the new implementation of Java vectors and ran a subset of the standard TPC-H performance tests using a 10 node cluster on GCP.
TPC-H Query
Project (Arrow 0.8)
Project (Arrow 0.7)
Project (% improvement)
Filter (Arrow 0.8)
Filter (Arrow 0.7)
Filter (% improvement)
Q1
101ms
120ms
15.8%
N/A
N/A
N/A
Q3
0.2ms
0.5ms
60%
N/A
N/A
N/A
Q4
3ms
3ms
0%
162ms
250ms
35.2%
Q5
23ms
38ms
39.4%
N/A
N/A
N/A
Q6
14ms
18ms
22.2%
42ms
50ms
16.0%
Q7
16ms
25ms
36.0%
87ms
150ms
42.0%
Q8
14ms
22ms
36.4%
N/A
N/A
N/A
Q9
22ms
34ms
35.3%
675ms
720ms
6.3%
Q10
15ms
20ms
25.0%
N/A
N/A
N/A
Q11
19ms
20ms
5.0%
10ms
12ms
16.7%
Q12
30ms
50ms
40.0%
80ms
142ms
43.7%
Q13
9ms
15ms
40.0%
3400ms
3500ms
2.9%
Q14
11ms
17ms
35.3%
N/A
N/A
N/A
Q17
12ms
19ms
36.8%
6ms
6ms
0.0%
Q18
12ms
20ms
40.0%
31ms
48ms
35.4%
Q20
3ms
4ms
25.0%
36ms
42ms
14.3%
Some notes on these results:
Project (Arrow 0.8) and Project (Arrow 0.7) represent the processing time (in milliseconds) consumed by the “Project” operator for a particular TPC-H query using the vector implementation in respective Arrow versions.
Filter (Arrow 0.8) and Filter (Arrow 0.7) represent the processing time (in milliseconds) consumed for predicate evaluation using the vector implementation in respective Arrow versions.
The presence of N/A for Filter (Arrow 0.8) and Filter (Arrow 0.7) indicates that the particular query didn’t require any WHERE clause evaluation and thus no “Filter” operator.
The reason for choosing specific operators for performance comparison rather than full elapsed time of query is that the performance improvements from the new vector implementation are likely to impact the non-vectorized operators since here we work with vector APIs extensively. Dremio’s vectorized query processing algorithms are very optimized and bypass most of the Arrow code stack to interact directly with the Arrow in-memory buffers to perform the meaningful work.
As Arrow vectors are used extensively in Dremio, this new implementation has provided overall performance improvements for our users as a result of fewer objects being created inside Arrow, and more efficient function calls associated with the vector APIs.
We are excited to be able to make these enhancements knowing there is much more we can do to further improve the efficiency of operations in Apache Arrow. Stay tuned for more in the near future!
Try Dremio’s Interactive Demo
Explore this interactive demo and see how Dremio's Intelligent Lakehouse enables Agentic AI
Try Dremio Cloud free for 30 days
Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.
Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg
By unifying data from diverse sources, simplifying data operations, and providing powerful tools for data management, Dremio stands out as a comprehensive solution for modern data needs. Whether you are a data engineer, business analyst, or data scientist, harnessing the combined power of Dremio and Apache Iceberg will undoubtedly be a valuable asset in your data management toolkit.
Oct 12, 2023·Product Insights from the Dremio Blog
Table-Driven Access Policies Using Subqueries
This blog helps you learn about table-driven access policies in Dremio Cloud and Dremio Software v24.1+.
Aug 31, 2023·Dremio Blog: News Highlights
Dremio Arctic is Now Your Data Lakehouse Catalog in Dremio Cloud
Dremio Arctic bring new features to Dremio Cloud, including Apache Iceberg table optimization and Data as Code.