6 minute read · August 27, 2024

What’s New in Dremio: Enhanced Performance with Reflection Improvements, Result Set Caching, and Merge-on-Read

Mark Shainman · Principal Product Marketing Manager

Dremio's latest version sets a new standard for lakehouse platform performance. This release underscores Dremio's commitment to providing the highest-performance Iceberg lakehouse platform and positions it as the market's premier lakehouse analytics platform.

Reflection Enhancements 

A Reflection in Dremio is an optimized relational cache that takes advantage of the platform's advanced semantic layer, which deeply understands both raw datasets and the views derived from them. When a Reflection is created, Dremio builds an Apache Iceberg-based representation of the data within the data lake. When a query runs against the dataset or its derived views, Dremio’s optimizer dynamically determines whether a Reflection can deliver the results faster. If it can, Dremio substitutes the Reflection for the original dataset during query execution, greatly improving performance. The gains come not just from the optimized physical data structure but also from the planning advantages that Apache Iceberg metadata provides on top of that structure.

Live Reflections on Iceberg Tables 

With Dremio's new Live Reflections on Iceberg tables, any change to the underlying data automatically updates the Reflections built on it. This continuous refresh process keeps Reflections current, so queries keep benefiting from acceleration. Because Reflections stay up to date on their own, Dremio reduces the need for manual intervention and lowers management overhead. Overall, Live Reflections improve efficiency and reliability in data management and lead to substantial improvements in query performance.

Reflection Recommendations with ROI 

Dremio has improved its Reflection recommendations to take a more holistic view of the workloads on the system. Instead of looking at one specific workload and recommending a Reflection that could speed up only that workload, the recommendation engine now considers a larger workload profile. Reflection Recommendations with ROI looks for trends and patterns in the queries a company has been running, then proposes Reflections that can accelerate many different queries. The engine examines the jobs users have run over the last 7 days, analyzes the query patterns in those jobs, and recommends the Reflections that would deliver the best ROI for accelerating queries across all workloads.
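To make the idea of ranking by return concrete, here is a minimal Python sketch. It is not Dremio's recommendation engine: the job tuple layout, the `rank_candidates` function, the pattern labels, and the flat `est_speedup` factor are all hypothetical, and the cost side of ROI (building and refreshing a Reflection) is left out for brevity. It only illustrates grouping the last 7 days of jobs by query pattern and ordering patterns by estimated time saved.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def rank_candidates(jobs, now, est_speedup=0.8, window_days=7):
    """Rank query patterns by estimated seconds saved (a stand-in for ROI).

    jobs: iterable of (finished_at, pattern, runtime_seconds) tuples.
    est_speedup: assumed fraction of runtime a Reflection would remove
                 (hypothetical constant, not a measured value).
    """
    cutoff = now - timedelta(days=window_days)
    total_runtime = defaultdict(float)

    # Only jobs inside the look-back window contribute to a pattern's total.
    for finished_at, pattern, runtime_seconds in jobs:
        if finished_at >= cutoff:
            total_runtime[pattern] += runtime_seconds

    # Estimated benefit = total runtime for the pattern * assumed speedup.
    scored = [(pattern, secs * est_speedup) for pattern, secs in total_runtime.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


now = datetime(2024, 8, 27)
jobs = [
    (now - timedelta(days=1), "sales_by_region", 100.0),
    (now - timedelta(days=2), "sales_by_region", 50.0),
    (now - timedelta(days=3), "orders_daily", 120.0),
    (now - timedelta(days=10), "old_report", 900.0),  # outside the 7-day window
]
for pattern, saved in rank_candidates(jobs, now, est_speedup=0.5):
    print(pattern, saved)
```

The point of the sketch is that a pattern made of many moderately slow queries can outrank a single slow query, which is the shift from per-workload to whole-profile recommendations described above.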

Result Set Caching 

Dremio’s result set caching is a mechanism for speeding up analytical queries. It significantly improves data retrieval speeds by storing the results of frequently accessed queries, and we have seen up to 28x improvement in performance for frequently used queries with this new feature. Caching removes the need to reprocess complex computations, saving both time and computational resources.

Result set caching in Dremio works as follows. The system immediately writes query result sets to distributed storage as Apache Arrow files. The results are written asynchronously by operators that are part of the original query execution plan. A central system in Dremio maintains all of the result cache IDs. When a new query arrives with the same plan, the system matches it to an existing result cache ID and automatically substitutes a plan that simply reads the cached Arrow files from distributed storage. The result cache ID and the plan cache ID are always different, which avoids issues in deployments with multiple coordinators.
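The lookup-and-substitute flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dremio's implementation: the `ResultCache` class, the SHA-256 hash of a whitespace-normalized plan string used as the cache ID, and the in-memory dictionary standing in for Arrow files on distributed storage are all hypothetical.

```python
import hashlib


class ResultCache:
    """Toy plan-keyed result cache (illustrative only)."""

    def __init__(self):
        # Central index: result cache ID -> stored result.
        # A real system would store a pointer to files on distributed storage.
        self._index = {}
        self.executions = 0  # counts how often a query actually ran

    @staticmethod
    def cache_id(plan: str) -> str:
        # Hypothetical ID scheme: hash the plan text after collapsing whitespace.
        normalized = " ".join(plan.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def run(self, plan: str, execute):
        key = self.cache_id(plan)
        if key in self._index:
            # Same plan seen before: read the cached result instead of executing.
            return self._index[key]
        self.executions += 1
        rows = execute()
        self._index[key] = rows  # written synchronously here for simplicity
        return rows


cache = ResultCache()
first = cache.run("SELECT region, SUM(amount) FROM sales GROUP BY region",
                  lambda: [("east", 10), ("west", 7)])
second = cache.run("SELECT region,  SUM(amount) FROM sales GROUP BY region",
                   lambda: [("never", 0)])
print(first == second, cache.executions)
```

The second call never invokes its `execute` callback because the normalized plan hashes to the same ID. The sketch deliberately omits invalidation when the underlying data changes, which any real result cache has to handle.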

Dremio’s result set caching minimizes query response times, giving users faster insights and improved productivity. It also decreases the overall load on the system, leading to more efficient and cost-effective data management.

Merge on Read 

In modern analytical lakehouse environments, the ability to update data at a rapid pace is critical. Dremio’s new Merge-on-Read (MoR) capability addresses this need by efficiently managing and applying data changes without immediately rewriting large datasets. In traditional data warehouses, every update or delete operation requires modifying the actual data files, which can be resource-intensive and time-consuming, especially with large volumes of data. This direct modification approach can lead to significant downtime and system strain, hurting performance and delaying data availability. With Dremio’s MoR, changes are instead written to log or delta files, which are smaller and quicker to write, so the system continues to operate smoothly and efficiently even as data updates occur.

The benefits of Dremio’s new Merge-on-Read in an Iceberg lakehouse environment are substantial. First, it improves performance by minimizing extensive, immediate rewrites of large datasets, which speeds up data processing and reduces system load. We have seen up to 85% improvement in write times for some operations with this new feature. Second, it preserves data consistency and availability, because changes are dynamically merged with the base data during query execution, providing an up-to-date view without compromising performance. Finally, Merge-on-Read scales better, allowing the system to handle growing data volumes and more frequent updates without performance degradation. This approach improves your lakehouse's operational efficiency and optimizes resource utilization, making it a highly effective component of your Iceberg lakehouse environment.
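A small Python sketch can show why writes get cheaper and where the merge cost moves. It is a conceptual model only, not Dremio's or Iceberg's file format: the `MorTable` class, its dictionaries keyed by a row ID, and the `compact` method are hypothetical stand-ins for base data files, delete and delta files, and a later rewrite.

```python
from dataclasses import dataclass, field


@dataclass
class MorTable:
    """Toy merge-on-read table: the base is never rewritten on update or delete."""

    base_rows: dict                                   # row id -> row (large base files)
    deleted_ids: set = field(default_factory=set)     # small "delete file"
    delta_rows: dict = field(default_factory=dict)    # small "delta file"

    def delete(self, row_id):
        # Record the delete; the base data stays untouched.
        self.deleted_ids.add(row_id)
        self.delta_rows.pop(row_id, None)

    def upsert(self, row_id, row):
        # Mask any base version of the row and write the new version to the delta.
        if row_id in self.base_rows:
            self.deleted_ids.add(row_id)
        self.delta_rows[row_id] = row

    def scan(self):
        # The merge happens at read time: base minus deletes, plus deltas.
        merged = {k: v for k, v in self.base_rows.items() if k not in self.deleted_ids}
        merged.update(self.delta_rows)
        return merged

    def compact(self):
        # Optional later rewrite that folds the accumulated changes into the base.
        self.base_rows = self.scan()
        self.deleted_ids = set()
        self.delta_rows = {}


table = MorTable(base_rows={1: "a", 2: "b"})
table.upsert(2, "B")
table.delete(1)
table.upsert(3, "c")
print(table.scan())
```

Each write touches only the small delete and delta structures, while `scan` pays the cost of combining them with the base. That trade-off is why merge-on-read favors workloads with frequent updates, and why a periodic compaction step is commonly used to keep read cost bounded.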

Conclusion

Dremio’s latest version represents a significant leap forward in overall performance, making lakehouse analytics faster and easier. With its enhanced Reflection capabilities, smart and efficient result set caching, and improved data update capabilities, Dremio is setting a new standard for lakehouse performance. These improvements strengthen the overall analytical process by improving business insight, shortening time to value, and lowering TCO. Dremio continues to be an optimal choice for companies seeking advanced, high-performance analytics solutions.

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.