11 minute read · May 3, 2024

How Apache Iceberg, Dremio, and Lakehouse Architecture Can Optimize Your Cloud Data Platform Costs

Alex Merced · Senior Tech Evangelist, Dremio

Even if budget constraints aren't an immediate concern, optimizing your data platform costs should still be a priority. Reducing expenses is not only beneficial to the bottom line, but it also frees up resources that can be reinvested to expand your data footprint, enhance insights, and increase overall value. In this article, I will outline the various sources of these costs and demonstrate how a combination of technologies, best practices, and architectural adjustments can significantly reduce these expenses.

Categories of Costs

When addressing cloud data warehouse costs, you typically encounter three main categories:

- Compute: Costs accrue based on the runtime of the virtual machines provided by your cloud provider. Longer-running queries and data preparation tasks drive these costs up quickly, especially since data warehouse providers often mark up the underlying cloud provider's compute, amplifying the cost of every operation.

- Storage: The cost directly correlates with the amount of data stored; storing data in a data warehouse generally incurs a markup over the raw storage costs from the underlying cloud provider, escalating expenses as data volume increases.

- Movement: Costs are incurred for network requests when retrieving data from object storage and may include an egress fee for transferring data outside of your cloud provider's account. The frequent movement of data from a cloud data lake to a cloud data warehouse amplifies these costs with every ETL job.

A significant cost driver is employing an ELT (Extract, Load, Transform) pattern directly within the warehouse. This method involves landing large volumes of raw data in the warehouse, compounding storage and egress fees. Subsequent transformations create multiple data layers, each duplicating data and thus inflating both compute and storage costs.

These costs can be significantly mitigated by adopting a lakehouse architecture, which primarily involves two changes:

1. Centralize on a Single Data Copy: Utilize Apache Iceberg tables to work from a single copy of your data stored in your data lake.

2. Adopt a Data Lakehouse Platform: Platforms like Dremio can handle increasing workloads more cost-effectively by enabling fast and efficient data analytics directly on your data lake, bypassing the need for expensive data warehousing solutions.

These strategies not only reduce costs but also enhance the productivity of your analytics efforts by streamlining data management and processing. Let’s explore how these two components can specifically contribute to lowering your data platform costs.

Better Performance, Better Price

One effective way to reduce costs is by accelerating query execution. Faster queries mean compute instances can be shut down sooner, thereby reducing operational costs.

Apache Iceberg plays a pivotal role in this process. As a data lakehouse table format, Iceberg overlays a metadata layer on top of the Parquet files that make up a dataset, giving lakehouse query engines the statistics they need to skip irrelevant files and scan data efficiently. The Dremio Lakehouse platform builds on this with Apache Arrow-based processing, cost-based query optimization, and other performance features. According to Dremio founder Tomer Shiran in the day-one closing keynote at the 2024 Subsurface Lakehouse conference, these enhancements enable Dremio to run queries up to 100% faster than Trino and 70% faster than Snowflake. Faster queries translate directly into less compute time, and therefore lower costs.
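To make the metadata-pruning idea concrete, here is a minimal sketch using PyIceberg. The catalog endpoint and the sales.orders table are hypothetical stand-ins rather than anything from this article; the point is that the filter is evaluated against Iceberg's metadata first, so only matching Parquet files are ever read.

```python
# A minimal sketch of metadata pruning with PyIceberg. The catalog URI and the
# "sales.orders" table are hypothetical stand-ins, not values from this article.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

# Assumes an Iceberg REST catalog is reachable at this endpoint.
catalog = load_catalog("lakehouse", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")

# The scan is planned against Iceberg metadata (manifest files, column stats),
# so only data files that could contain region == 'EMEA' are ever opened.
scan = table.scan(row_filter=EqualTo("region", "EMEA"))
print(f"Parquet files left to read after pruning: {len(list(scan.plan_files()))}")

# Materialize only the pruned subset as an Arrow table.
arrow_table = scan.to_arrow()
print(arrow_table.num_rows)
```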

Historical Data, but at lower costs

Often, a significant factor behind inflated data warehouse bills is storage. Data warehouses typically retain up to 90 days of historical versions of your data, which means the live tables are only a fraction of the total volume you are charged to store. Retaining history is essential for disaster recovery, querying past versions of data, and other functions, but it can be managed far more economically.

Using Apache Iceberg, you can track the historical state of your tables at a much lower cost, because snapshots share unchanged data files instead of duplicating them. This efficiency is further boosted by the open-source Nessie catalog, which tracks historical changes at the catalog level across your Iceberg tables. That enables features like multi-table time travel, rollbacks, and transactions without growing your storage footprint, all while keeping the data itself on inexpensive object storage.
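As a rough illustration, the PyIceberg sketch below lists a table's snapshots and reads an older one. The catalog endpoint and table names are again hypothetical; Nessie applies the same idea at the catalog level, with branches and tags that span many tables, which is not shown here.

```python
# A sketch of snapshot history and time travel with PyIceberg. Catalog and table
# names are hypothetical assumptions for illustration.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")

# Every commit produces a snapshot, and snapshots share unchanged data files,
# so keeping history mostly costs metadata rather than full copies of the data.
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms)

# Time travel: read the table as it existed at its oldest retained snapshot.
oldest = table.snapshots()[0]
old_version = table.scan(snapshot_id=oldest.snapshot_id).to_arrow()
print(old_version.num_rows)
```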

Dremio Cloud simplifies the use of a Nessie catalog by integrating it into the Dremio Cloud catalog. This integration allows users to enjoy all the benefits of open-source Nessie without the complexities of deploying and maintaining a separate instance. Additional features include automatic table optimization, cleanup, and a user interface that provides visibility into catalog commits, branches, and tags, further enhancing the management and efficiency of your data resources.

Acceleration that’s Fast and Easy

Even with the strong performance of Apache Iceberg and Dremio, there are still cases where you need additional query acceleration. Traditionally, this means materialized views, BI extracts, and cubes, which require creating and keeping new data objects in sync. Data analysts and scientists must also learn when to use each of these objects, which complicates workflows.

Dremio's reflections feature significantly streamlines this process. Reflections are Apache Iceberg-based objects created directly on your data lake, and Dremio transparently substitutes them into any query that would benefit from them. Data engineers can set up reflections through a straightforward UI or with SQL commands, and Dremio maintains these objects automatically. Importantly, data analysts and scientists never need to know the reflections exist; their queries automatically take the accelerated path whenever it helps.
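As a hedged sketch of what this can look like in practice, the snippet below submits reflection DDL to a self-hosted Dremio cluster over Arrow Flight. The endpoint, credentials, dataset name, and the exact reflection syntax are illustrative assumptions; check the SQL against the Dremio documentation for your version.

```python
# A sketch of defining an aggregate reflection with SQL sent to Dremio over
# Arrow Flight. The endpoint, credentials, dataset name, and reflection DDL are
# illustrative assumptions; check the exact syntax in the Dremio docs.
from pyarrow import flight

# Connect to a hypothetical self-hosted Dremio Arrow Flight endpoint.
client = flight.FlightClient("grpc+tcp://localhost:32010")
token = client.authenticate_basic_token("data_engineer", "example_password")
options = flight.FlightCallOptions(headers=[token])

# Define an aggregate reflection on an Iceberg-backed dataset. Analysts keep
# querying sales.orders; Dremio substitutes the reflection when it helps.
ddl = """
ALTER DATASET sales.orders
CREATE AGGREGATE REFLECTION orders_by_region
USING DIMENSIONS (region, order_date)
MEASURES (amount (SUM, COUNT))
"""

info = client.get_flight_info(flight.FlightDescriptor.for_command(ddl), options)
print(client.do_get(info.endpoints[0].ticket, options).read_all())
```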

Unlike other acceleration solutions, Dremio's reflections are available in all versions of Dremio—both free and enterprise. Furthermore, reflections are capable of accelerating queries across all of Dremio's data sources, not just those using object storage. They can also enhance the performance of joins and other complex operations within your catalog without any hassle. The result is rapid query performance akin to having materialized views, BI extracts, or cubes without the overhead of managing them.

Moreover, when it comes to Business Intelligence (BI), these accelerations are tool-agnostic. This means that teams using different BI tools can benefit equally from the same enhancements to data access and processing speeds.

Reducing those Network Costs

Another standout Dremio feature that both reduces costs and improves performance is the Dremio Columnar Cloud Cache, commonly referred to as C3. This caching layer lets nodes in a Dremio cluster cache frequently accessed files and blocks on their local NVMe storage, giving queries near-local access to hot data.

By enabling this caching mechanism, nodes in the Dremio cluster do not have to make repeated object storage requests for the same data items. This efficiency not only speeds up query resolution times but also significantly cuts down on data access costs. The result is a reduction in both compute and network expenses, leading to considerable cost savings while maintaining high performance.

ELT on the Data Lake

While employing ELT (Extract, Load, Transform) directly in the data warehouse can escalate costs, running ELT within a lakehouse architecture can actually optimize them. The data stays in your data lake, which keeps storage costs low and egress fees minimal, and a platform like Dremio further reduces compute costs. Thanks to Dremio's no-copy approach, much of the data transformation and modeling can be done virtually through views built on top of the raw physical data. This dramatically shrinks the storage footprint of the ELT process while still delivering strong performance, especially when complemented by strategic use of Dremio's reflections.
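Below is a minimal sketch of that virtual, no-copy pattern: a view carrying the business logic is created over a raw Iceberg table via SQL sent to Dremio over Arrow Flight. The endpoint, credentials, space, and table names are assumptions for illustration, not values from the article.

```python
# A sketch of virtual, no-copy ELT: a view carries the business logic while the
# raw Iceberg table stays untouched in the lake. Endpoint, credentials, and
# object names ("marts", "sales.orders") are assumptions for illustration.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://localhost:32010")
token = client.authenticate_basic_token("data_engineer", "example_password")
options = flight.FlightCallOptions(headers=[token])

def run(sql: str):
    """Submit a SQL statement to Dremio over Arrow Flight and return the result."""
    info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
    return client.do_get(info.endpoints[0].ticket, options).read_all()

# Only the view definition is stored; no data is copied or duplicated.
run("""
CREATE OR REPLACE VIEW marts.revenue_by_region AS
SELECT region, SUM(amount) AS total_revenue
FROM sales.orders
GROUP BY region
""")

print(run("SELECT * FROM marts.revenue_by_region LIMIT 10"))
```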

This ELT pattern also enhances self-service capabilities, allowing data analysts and scientists to perform a greater portion of transformations and modeling virtually. This not only speeds up the data handling process but also empowers your team by simplifying access to data manipulation tools.

Conclusion

The combination of Dremio and Apache Iceberg presents a powerful way to optimize data platform costs through efficient data management, query acceleration, and features like Dremio's reflections and C3 caching. By adopting a lakehouse architecture, organizations can achieve significant savings on storage and compute, streamline transformations with virtual modeling, and make data more accessible to analysts and scientists. To see these capabilities for yourself, explore one of the many hands-on Dremio tutorials, which you can run from your laptop at no cost, and see firsthand how Dremio and Apache Iceberg can transform your data operations.
