We are thrilled to announce the release of an enhanced vectorized Parquet reader in Dremio software version 24.3 and Dremio Cloud. This Dremio-exclusive reader improves query performance by up to 75% for Parquet datasets encoded with the Parquet V2 encodings.

Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by engines that write Parquet data, supports delta encodings. However, Dremio's vectorized Parquet reader did not previously support these encodings, so queries on such datasets ran slower. Now, in version 24.3 and Dremio Cloud, when you use the Dremio SQL query engine on Parquet datasets, you’ll receive best-in-class performance.

The Dremio vectorized Parquet reader now supports the following encodings in addition to PLAIN, RLE, and PLAIN_DICTIONARY: 

  • RLE_DICTIONARY
  • DELTA_BINARY_PACKED
  • DELTA_LENGTH_BYTE_ARRAY
  • DELTA_BYTE_ARRAY
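
To give a feel for why the delta encodings in this list compress well, here is a minimal Python sketch of the two underlying ideas: storing differences between consecutive integers (the idea behind DELTA_BINARY_PACKED) and storing a shared-prefix length plus a suffix for consecutive strings (the idea behind DELTA_BYTE_ARRAY). It is a simplified illustration only. It does not reproduce Parquet's actual block, miniblock, and bit-packing layout, it is not how Dremio's reader is implemented, and all function names are hypothetical.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> Tuple[int, List[int]]:
    """Return the first value plus the differences between neighbours."""
    if not values:
        raise ValueError("need at least one value")
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first: int, deltas: List[int]) -> List[int]:
    """Rebuild the original integers by accumulating the differences."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def prefix_encode(strings: List[str]) -> List[Tuple[int, str]]:
    """For each string, keep (length of prefix shared with the previous
    string, remaining suffix)."""
    encoded = []
    prev = ""
    for s in strings:
        n = 0
        while n < min(len(prev), len(s)) and prev[n] == s[n]:
            n += 1
        encoded.append((n, s[n:]))
        prev = s
    return encoded


def prefix_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the original strings from (prefix length, suffix) pairs."""
    out = []
    prev = ""
    for n, suffix in encoded:
        s = prev[:n] + suffix
        out.append(s)
        prev = s
    return out


ids = [1000, 1001, 1003, 1006]
first, deltas = delta_encode(ids)   # first == 1000, deltas == [1, 2, 3]
names = ["apple", "applet", "apply"]
pairs = prefix_encode(names)        # [(0, "apple"), (5, "t"), (4, "y")]
```

Small differences and short suffixes need far fewer bits than the full values, which is why sorted or slowly changing columns tend to benefit from these encodings.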

Read Performance

TPC-DS queries run on a Parquet dataset with V2 encodings showed an average 77% improvement in query performance with the new vectorized reader compared to the previous version. Previously, Dremio fell back to the Apache Parquet-MR row-wise reader for Parquet datasets encoded in Parquet V2.

NOTE: Dremio's vectorized reader already reads Parquet datasets encoded with Apache Parquet-MR writer version PARQUET_1_0, so this enhancement does not affect the performance of queries executed on such datasets.

Query Performance Improvements with 24.3

Parquet file type    Least Improvement    Highest Improvement    Average Improvement
With V2 encodings    22.5%                97.2%                  77.3%

Write Performance

For writing Parquet data, Dremio uses the Apache Parquet-MR Writer. With Parquet-MR Writer version V2, the storage footprint of TPC-DS data was on average 25% smaller than with V1. A smaller footprint also lets more data fit in Dremio’s proprietary Columnar Cloud Cache (C3). C3 enables the Dremio query engine to achieve NVMe-level I/O performance on S3/ADLS/GCS by leveraging the NVMe/SSD storage built into cloud compute instances, such as Amazon EC2 and Azure Virtual Machines.

Dremio release 24.3 continues to write Parquet V1 by default, because an average performance degradation of 1.5% in writes and 6.5% in queries was observed when TPC-DS data was written using Parquet V2 instead of Parquet V1. These query performance tests used the C3 cache to store data.

Storage Footprint & R/W Performance of Parquet V2 over V1 (Average)

Storage footprint    Write performance    Read performance with C3
-24.8%               +1.5%                +6.4%
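
As a rough worked example of what the storage figure could mean for C3, the sketch below converts a footprint reduction into the extra data that fits in a cache of fixed size. It assumes the cache capacity stays constant and that the 24.8% average reduction applies uniformly to whatever is cached, which may not hold for every dataset. The function name is hypothetical.

```python
def extra_capacity_fraction(footprint_reduction: float) -> float:
    """Fraction of additional data that fits in a fixed-size cache
    when every file shrinks by `footprint_reduction` (0 <= r < 1)."""
    if not 0.0 <= footprint_reduction < 1.0:
        raise ValueError("footprint_reduction must be in [0, 1)")
    # Each file now takes (1 - r) of its old size, so the same cache
    # holds 1 / (1 - r) times as much data as before.
    return 1.0 / (1.0 - footprint_reduction) - 1.0


gain = extra_capacity_fraction(0.248)
print(f"{gain:.0%} more data fits in the same cache")  # prints "33% more ..."
```

Under these assumptions, a 24.8% smaller footprint lets roughly one third more data fit in the same cache, which has to be weighed against the write and read overheads in the table.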

Guidance for Dremio Users:

Dremio users who query Parquet datasets encoded in Parquet V2 should upgrade to Dremio version 24.3 to benefit from these substantial performance improvements. (Dremio Cloud users can benefit from this capability now.)

Dremio writes data to Reflections and Iceberg tables in Parquet format. Writing with Parquet V2 can reduce storage footprint by as much as 25% and should also improve utilization of the Dremio-exclusive Columnar Cloud Cache (C3).

Users can enable Parquet V2 on write by setting the following configuration key:

ALTER SYSTEM SET "store.parquet.writer.version" = 'v2'
