We are thrilled to announce the release of an enhanced vectorized Parquet reader in Dremio software version 24.3 and Dremio Cloud. This Dremio-exclusive reader improves query performance by up to 75% for Parquet datasets encoded with the Parquet V2 encodings.
Apache Parquet-MR writer version PARQUET_2_0, which is widely adopted by engines that write Parquet data, supports delta encodings. However, Dremio's vectorized Parquet reader did not previously support these encodings, so queries on such data fell back to a slower row-wise reader. Now, in version 24.3 and Dremio Cloud, when you use the Dremio SQL query engine on Parquet datasets, you'll receive best-in-class performance.
The Dremio vectorized Parquet reader now supports the following encodings in addition to PLAIN, RLE, and PLAIN_DICTIONARY:
RLE_DICTIONARY
DELTA_BINARY_PACKED
DELTA_LENGTH_BYTE_ARRAY
DELTA_BYTE_ARRAY
Read Performance
Execution of TPC-DS queries on a Parquet dataset encoded with V2 encoding using the new vectorized reader delivered an average of 77% improvement in query performance compared to the previous version. Previously when dealing with Parquet datasets encoded in Parquet V2, Dremio utilized the Apache Parquet-MR row-wise reader.
NOTE: Dremio's vectorized reader already reads Parquet datasets encoded with Apache Parquet-MR writer version PARQUET_1_0, so this enhancement does not affect the performance of queries executed on such datasets.
Query Performance Improvements with 24.3

| Parquet file type | Least Improvement | Highest Improvement | Average Improvement |
| --- | --- | --- | --- |
| With V2 encodings | 22.5% | 97.2% | 77.3% |
Write Performance
For writing Parquet data, Dremio uses the Apache Parquet-MR writer. Writing TPC-DS data with Parquet-MR writer version V2 reduced the storage footprint by an average of 25% compared to V1. A smaller footprint also lets more data fit in Dremio's proprietary Columnar Cloud Cache (C3), which enables the Dremio query engine to achieve NVMe-level I/O performance on S3/ADLS/GCS by leveraging the NVMe/SSD storage built into cloud compute instances such as Amazon EC2 and Azure Virtual Machines.
Release 24.3 of Dremio will continue to write Parquet V1, since an average performance degradation of 1.5% was observed in writes and 6.5% was observed in queries when TPC-DS data was written using Parquet V2 instead of Parquet V1. The aforementioned query performance tests utilized the C3 cache to store data.
Storage Footprint & R/W Performance of Parquet V2 over V1 (Average)

| Storage footprint | Write performance | Read performance with C3 |
| --- | --- | --- |
| -24.8% | +1.5% | +6.4% |
Guidance for Dremio Users:
Dremio users who query Parquet datasets encoded in Parquet V2 should upgrade to Dremio version 24.3 to benefit from these substantial performance improvements. (Dremio Cloud users can benefit from this capability now.)
Dremio writes data to Reflections and Iceberg tables in Parquet format. Writing with Parquet V2 can reduce storage footprint by as much as 25% and should also improve utilization of the Dremio exclusive Columnar Cloud Cache (C3).
Users can enable Parquet V2 on write via a configuration key.
With its capabilities in on-prem to cloud migration, data warehouse offload, data virtualization, upgrading data lakes and lakehouses, and building customer-facing analytics applications, Dremio provides the tools and functionalities to streamline operations and unlock the full potential of data assets.