We are thrilled to announce the release of an enhanced vectorized Parquet reader in Dremio software version 24.3 and Dremio Cloud. This Dremio-exclusive reader improves query performance by up to 75% for Parquet datasets encoded with the Parquet V2 encodings.
Apache Parquet-MR writer version PARQUET_2_0, which is widely adopted by engines that write Parquet data, supports delta encodings. However, Dremio's vectorized Parquet reader did not previously support these encodings, so queries on such data fell back to a slower row-wise reader. Now, in version 24.3 and Dremio Cloud, when you use the Dremio SQL query engine on Parquet datasets, you'll receive best-in-class performance.
The Dremio vectorized Parquet reader now supports the following encodings in addition to PLAIN, RLE, and PLAIN_DICTIONARY:
RLE_DICTIONARY
DELTA_BINARY_PACKED
DELTA_LENGTH_BYTE_ARRAY
DELTA_BYTE_ARRAY
Read Performance
Execution of TPC-DS queries on a Parquet dataset encoded with V2 encoding using the new vectorized reader delivered an average of 77% improvement in query performance compared to the previous version. Previously when dealing with Parquet datasets encoded in Parquet V2, Dremio utilized the Apache Parquet-MR row-wise reader.
NOTE: Dremio's vectorized reader already reads Parquet datasets encoded with Apache Parquet-MR writer version PARQUET_1_0, so this enhancement does not affect the performance of queries executed on such datasets.
Query Performance Improvements with 24.3

| Parquet file type | Least Improvement | Highest Improvement | Average Improvement |
| --- | --- | --- | --- |
| With V2 encodings | 22.5% | 97.2% | 77.3% |
Write Performance
For writing Parquet data, Dremio uses the Apache Parquet-MR writer. Writing TPC-DS data with Parquet-MR writer version V2 reduced the storage footprint by an average of 25% compared to V1. A smaller footprint also lets more data fit in Dremio's proprietary Columnar Cloud Cache (C3), which enables the Dremio query engine to achieve NVMe-level I/O performance on S3/ADLS/GCS by leveraging the NVMe/SSD storage built into cloud compute instances such as Amazon EC2 and Azure Virtual Machines.
Release 24.3 of Dremio will continue to write Parquet V1, since an average performance degradation of 1.5% was observed in writes and 6.5% was observed in queries when TPC-DS data was written using Parquet V2 instead of Parquet V1. The aforementioned query performance tests utilized the C3 cache to store data.
Storage Footprint & R/W Performance of Parquet V2 over V1 (Average)

| Storage footprint | Write performance | Read performance with C3 |
| --- | --- | --- |
| -24.8% | +1.5% | +6.4% |
Guidance for Dremio Users:
Dremio users who query Parquet datasets encoded in Parquet V2 should upgrade to Dremio version 24.3 to benefit from these substantial performance improvements. (Dremio Cloud users can benefit from this capability now.)
Dremio writes data to Reflections and Iceberg tables in Parquet format. Writing with Parquet V2 can reduce storage footprint by as much as 25% and should also improve utilization of the Dremio exclusive Columnar Cloud Cache (C3).
Users can enable Parquet V2 on write via a configuration key.
With its capabilities in on-prem to cloud migration, data warehouse offload, data virtualization, upgrading data lakes and lakehouses, and building customer-facing analytics applications, Dremio provides the tools and functionalities to streamline operations and unlock the full potential of data assets.