The Databricks platform is widely used for extract, transform, and load (ETL), machine learning, and data science. When using Databricks, it's important to store your data in a table format that works with the platform's storage layer, the Databricks File System (DBFS), so that both the Databricks Spark and Databricks Photon engines can access it. Delta Lake is designed for this purpose: it provides ACID transactions, time travel, and schema evolution, features that are standard in many table formats. Moreover, Delta Lake on Databricks offers additional features that are not available in open source Delta Lake, such as generated columns.
One primary reason to consider Apache Iceberg over Delta Lake when working with Databricks is to avoid vendor lock-in. While Delta Lake is controlled by Databricks, Apache Iceberg provides a more neutral playing field. Aligning with a neutral, widely adopted standard means that your data solutions remain flexible and can easily transition between different platforms or vendors without significant overhaul. This ensures that your data infrastructure remains adaptable to changing business needs and technology landscapes.
Furthermore, the robust Apache Iceberg ecosystem offers a rich array of tools like Dremio, BigQuery, Apache Drill, and Snowflake, which have deeper integrations with Apache Iceberg than with Delta Lake. This ecosystem advantage means businesses can seamlessly leverage a broader range of technologies.
Additionally, Apache Iceberg boasts unique features that can be pivotal for many data operations. Its partition evolution capability allows for changes to partitioning strategies post facto, giving teams the flexibility to adapt to evolving data patterns without having to rewrite data. The hidden partitioning feature abstracts away the complexities of partition management, ensuring efficient data access while maintaining simplicity. These features, combined with its wide ecosystem, make Apache Iceberg an attractive choice for many organizations using Databricks.
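To make those two capabilities concrete, here is a minimal sketch using Iceberg's Spark SQL extensions. It assumes a SparkSession (`spark`) already configured for Iceberg (a configuration sketch appears under Method #1 below); the catalog name, namespace, and column names are placeholders.

```python
# Sketch: hidden partitioning and partition evolution with Apache Iceberg.
# Assumes a SparkSession `spark` with the Iceberg runtime, the Iceberg SQL
# extensions, and a catalog named "my_catalog" (placeholder) configured.

# Hidden partitioning: writers and readers work with event_ts directly, and
# Iceberg derives day partitions from it behind the scenes.
spark.sql("""
    CREATE TABLE my_catalog.db.events (
        id        BIGINT,
        event_ts  TIMESTAMP,
        payload   STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Partition evolution: switch to monthly partitions without rewriting existing
# data files; old files keep the old spec, new writes use the new one.
spark.sql("""
    ALTER TABLE my_catalog.db.events
    REPLACE PARTITION FIELD days(event_ts) WITH months(event_ts)
""")
```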
There are two approaches to using Apache Iceberg on Databricks: use Apache Iceberg natively as your table format, or use Delta Lake 3.0’s “UniForm” feature to expose Apache Iceberg metadata on your Delta Lake tables.
Method #1 – Use Apache Iceberg Natively
Using Apache Iceberg natively on Databricks comes with several advantages and a few considerations. By adding the Iceberg jar to your cluster and setting the appropriate Spark configurations, you can use Databricks Spark with Iceberg natively. This integration ensures that every transaction is captured as an Iceberg snapshot, enabling time travel with any tool that supports Apache Iceberg. Another benefit of this setup is the freedom to choose any catalog through Spark configuration. That flexibility paves the way for catalog versioning with tools like Nessie, which can facilitate multi-table transactions and create zero-copy environments both within and outside Databricks.
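What that looks like in practice depends on the catalog you choose. The sketch below shows one plausible setup with a Nessie catalog, assuming the iceberg-spark-runtime and Nessie extensions jars are attached to the cluster as libraries; the catalog name, Nessie endpoint, and warehouse path are placeholders, and on Databricks these properties would normally go into the cluster's Spark config rather than a session builder.

```python
# Minimal sketch of the Spark settings for using Iceberg natively with a Nessie
# catalog. Jar attachment is assumed; names, URIs, and paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Enable the Iceberg (and Nessie) SQL extensions
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    # Register an Iceberg catalog named "nessie", backed by a Nessie server
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")       # placeholder endpoint
    .config("spark.sql.catalog.nessie.ref", "main")                             # Nessie branch to commit to
    .config("spark.sql.catalog.nessie.warehouse", "s3://my-bucket/warehouse/")  # placeholder path
    .getOrCreate()
)

# From here on, every write produces an Iceberg snapshot on the "main" branch.
spark.sql("CREATE TABLE IF NOT EXISTS nessie.db.events (id BIGINT, payload STRING) USING iceberg")
```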
However, there are certain limitations. Because Databricks runs a customized version of Spark, MERGE INTO operations are not supported on Iceberg tables from Databricks Spark. This restriction does not apply when using the open source version of Apache Spark or other tools that support Iceberg.
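If MERGE INTO is part of your workflow, it can still be issued from open source Spark (or another Iceberg-capable engine) pointed at the same catalog and table. A brief sketch, reusing the placeholder names from the configuration above:

```python
# Sketch: MERGE INTO run from open source Apache Spark (not Databricks Spark)
# against the same Iceberg table; table and view names are placeholders.
updates = spark.createDataFrame([(1, "updated payload")], ["id", "payload"])
updates.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO nessie.db.events AS t
    USING updates AS u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET t.payload = u.payload
    WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (u.id, u.payload)
""")
```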
Method #2 – Using the Delta Lake UniForm Feature
Delta Lake's Universal Format (UniForm) bridges compatibility between Delta tables and Iceberg reader clients. In essence, UniForm leverages the shared foundation of Delta Lake and Iceberg: they utilize Parquet data files accompanied by a metadata layer. Rather than rewriting data, UniForm produces Iceberg metadata asynchronously, allowing Iceberg clients to interpret Delta tables as if they were native Iceberg tables. This functionality means a single set of data files can cater to both formats. This is achieved by exposing a “REST Catalog” Iceberg catalog interface so that Unity Catalog acts as the Iceberg catalog.
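As a rough sketch of what enabling UniForm looks like at table creation time (property names follow the Delta Lake 3.0 / Databricks documentation at the time of writing, and the table and column names are placeholders):

```python
# Sketch: creating a Delta table with UniForm enabled so that Iceberg metadata
# is generated alongside the Delta log. Table name is a placeholder.
spark.sql("""
    CREATE TABLE main.sales.orders (
        order_id  BIGINT,
        order_ts  TIMESTAMP,
        amount    DOUBLE
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```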
The primary benefit of UniForm is the interoperability it introduces. Given the vast ecosystem around data processing, the ability to read Delta tables with Iceberg reader clients broadens the scope of operations and analytics that can be performed. This can be valuable for organizations using a mixed environment of tools, such as Databricks, BigQuery, or Apache Spark.
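For example, an Iceberg reader such as PyIceberg can load the UniForm table through the Iceberg REST catalog interface that Unity Catalog exposes. The endpoint path, token handling, and table name below are illustrative assumptions rather than exact values; check your workspace documentation for the precise URI.

```python
# Sketch: reading a UniForm-enabled table with PyIceberg via the Iceberg REST
# catalog interface exposed by Unity Catalog. Workspace URL, token, and table
# name are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "unity",
    **{
        "type": "rest",
        "uri": "https://<workspace-url>/api/2.1/unity-catalog/iceberg",  # assumed endpoint
        "token": "<databricks-personal-access-token>",
    },
)

table = catalog.load_table("main.sales.orders")  # catalog.schema.table placeholder
arrow_table = table.scan().to_arrow()            # reads the Iceberg metadata and Parquet files
print(arrow_table.num_rows)
```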
However, there are some limitations:
Deletion vector support: UniForm doesn't support tables with deletion vectors enabled, limiting its compatibility with certain table configurations.
Unsupported data types: Certain data types like LIST, MAP, and VOID are unsupported for Delta tables when UniForm is enabled, potentially restricting the types of data that can be managed.
Write operations: While Iceberg clients can read data from UniForm-enabled tables, write operations are not supported, which can impact the ability to modify data in such tables.
Client-specific limitations: Specific limitations may be tied to individual Iceberg reader clients, regardless of the UniForm feature, potentially affecting the behavior of certain client applications.
Delta Lake features: Although some advanced Delta Lake features like Change Data Feed and Delta Sharing are operational for Delta clients with UniForm, they may require additional support when working with Iceberg.
Iceberg time-travel: Limited to select snapshots, with possible inconsistency.
The table below summarizes how the two methods compare:

| Feature | Method #1: Native Apache Iceberg | Method #2: Delta Lake UniForm |
|---|---|---|
| Catalog versioning | Flexibility with tools like Project Nessie | Not possible; must use Unity Catalog |
| Transactions ('MERGE INTO') | Not permissible with Databricks' custom Spark | Supported |
| Driver resource consumption change | No change | Might increase |
| Deletion vectors | Merge-on-read for Iceberg | Not supported |
| Unsupported data types | All Iceberg types supported | LIST, MAP, VOID |
| Write operations to the Iceberg table from other engines/tools | Supported | Not supported with UniForm |
| Advanced Delta features (CDC, Delta Sharing) | N/A | Limited support in Iceberg |
| Metadata consistency with the latest data | Instant | Asynchronous, might be lagging |
Navigating the intricacies of table formats in data analytics can be challenging. The Databricks platform provides a formidable setting for machine learning and data science applications, with Delta Lake being its flagship table format. However, as data landscapes continuously evolve, organizations must remain flexible and forward-thinking. In this light, Apache Iceberg emerges as a significant contender.
Its neutral stance, broad ecosystem compatibility, and unique features offer compelling advantages over Delta Lake, especially for those keen on avoiding vendor lock-in. But as with every technology decision, there are pros and cons to weigh. While Apache Iceberg's native integration offers a seamless experience, certain limitations of Databricks' customized Spark version might be a deal-breaker for some. On the flip side, while Delta Lake's UniForm feature provides broad compatibility, it comes with its own set of constraints, particularly around data types and metadata consistency.
Our deep dive into both methods reveals that there's no one-size-fits-all answer. The decision hinges on your organization's specific needs, the existing tech stack, and long-term data strategy. Whether you lean toward the native integration of Apache Iceberg or opt for the UniForm feature of Delta Lake, ensure that the choice aligns with your overarching business goals. As data becomes increasingly pivotal in decision-making, ensuring you have the right infrastructure to manage, analyze, and derive insights from it remains paramount.
Tutorials for Trying Out Dremio (all can be done locally on your laptop):
Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop