7 minute read · March 19, 2024

Top Reasons to Attend the Subsurface Conference for Apache Iceberg Fans


Alex Merced · Senior Tech Evangelist, Dremio

If you're a data engineer, data scientist, or data analyst, the Subsurface conference is an unmissable event, held May 2nd and 3rd, live online and in person in New York City. This premier gathering shines a spotlight on the innovative world of data lakehouses, offering a deep dive into the latest trends and solutions. The event is pivotal for professionals implementing or considering a data lakehouse, and a must-attend for Apache Iceberg enthusiasts.

A Convergence of Minds and Ideas

The Subsurface conference stands out as a pivotal gathering for data professionals. It’s not just an event; it's a nexus where the brightest minds in data engineering, science, and analytics converge to share knowledge, insights, and experiences around the data lakehouse. Attending this conference provides a unique opportunity to network with peers, learn from industry leaders, and gain insights into the future of data technology.

Apache Iceberg: A Central Theme

For fans of Apache Iceberg, the Subsurface conference is particularly compelling every year, as it hosts several talks on this foundational data lakehouse technology. Here's a glimpse into the Iceberg-centric talks that make this event indispensable:

Beyond Tables: What's Next for Apache Iceberg in Data Architecture

Presenter: Ryan Blue (Apache Iceberg co-creator, co-founder of Tabular)

Iceberg's core purpose is to enable multiple engines to use the same table simultaneously, with ACID guarantees and full SQL semantics. While building Iceberg-based data architecture, the community has added new specifications for use cases like catalog interaction, views, remote scan planning, and encryption. This talk will cover the standards and projects the community is working on and how they unlock better patterns in data architecture.

Data Contracts for Apache Iceberg

Presenter: Chad Sanderson (CEO and Co-Founder of Gable.ai)

This session tackles the necessity of modern data management in an age of hyper-iteration, experimentation, and AI. Sanderson will explore why traditional data management practices fail and how the cloud has fundamentally changed data development. The talk will cover a modern application of data management best practices, including data change detection, data contracts, observability, and CI/CD tests, and outline the roles of data producers and consumers.

Lessons Learned from Running Merge-on-Read Iceberg Pipelines at Scale

Presenter: Anton Okolnychyi (Software Engineer at Apple)

Organizations are leveraging merge-on-read Apache Iceberg operations to efficiently handle sparse updates. This talk will share insights from running such operations on tables with tens of petabytes of data. You'll learn when to choose merge-on-read over copy-on-write execution mode, how to optimize write performance, and best practices for maintaining such tables with Apache Iceberg's built-in tools. This presentation will benefit engineers considering Apache Iceberg adoption, as well as those who already use it and want to enhance their existing production environments.

You'll gain valuable insights from real-world experience managing large-scale merge-on-read Iceberg pipelines, including best practices and common pitfalls.
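For context on what that choice looks like in practice, the execution mode for row-level operations is set through Iceberg table properties. Here's a minimal sketch using Spark SQL from PySpark; the catalog name (`demo`), warehouse path, and table name are hypothetical, and the session config assumes the Iceberg Spark runtime JAR is on the classpath:

```python
from pyspark.sql import SparkSession

# Hypothetical setup: a local Hadoop catalog named "demo"; assumes the
# Iceberg Spark runtime JAR is available to the session.
spark = (
    SparkSession.builder
    .appName("iceberg-mor-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Switch row-level DELETE/UPDATE/MERGE to merge-on-read: writers emit
# small delete files instead of rewriting whole data files.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")
```

Copy-on-write remains the default and tends to favor read-heavy tables; merge-on-read shifts work to readers until compaction folds the delete files back in, which is why the maintenance practices the talk covers matter.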

Optimizing Data Lakehouse Performance: Leveraging Dremio’s SQL Query Engine, Lakehouse Management Features and Apache Iceberg for Scalable Analytics

Presenter: Balaji Ramaswamy (Advanced Support Director at Dremio)

At the heart of our discussion is the seamless integration of Dremio and Apache Iceberg, focusing initially on the ease of partitioning with Iceberg. This feature significantly enhances query performance by organizing data to align with how it is queried, thereby reducing the volume of data scanned. We then delve into the essential practices for Iceberg maintenance, ensuring that your data lakehouse remains optimized for current needs and future scalability.
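As a hedged illustration of that partitioning point, Iceberg's hidden partitioning lets you partition by a transform of a column, so queries that filter on the column itself get file pruning for free. A sketch reusing the `spark` session and hypothetical `demo` catalog from the earlier example (table name made up):

```python
# Partition by day of the ts column; readers never see a separate
# partition column and simply filter on ts.
spark.sql("""
    CREATE TABLE demo.db.page_views (
        user_id BIGINT,
        ts      TIMESTAMP,
        url     STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# This ts filter alone lets Iceberg skip files outside the day range.
spark.sql("""
    SELECT COUNT(*) FROM demo.db.page_views
    WHERE ts >= TIMESTAMP '2024-03-01 00:00:00'
""").show()
```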

Best Practices for Building an Iceberg Data Lakehouse with Dremio

Presenter: Alex Merced (Developer Advocate at Dremio)

A data lakehouse combines the flexibility and scalability of the data lake with the data management, governance, and analytics of the data warehouse. Open table formats like Apache Iceberg make it possible to efficiently manage and leverage data while maintaining complete control over your organization's most critical asset. In this session, we'll share best practices for building an Iceberg data lakehouse, including: ingesting data into Iceberg tables, automating table optimization for performance, and building and sharing virtual data products. We'll also share how Dremio's Git for Data capabilities enable data teams to apply a DataOps framework to their lakehouse management strategy, including version control, CI/CD, governance, and observability.
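As a small taste of the table-optimization piece, Iceberg itself ships Spark procedures for compacting small files and expiring old snapshots; Dremio layers its own automated optimization on top, which the session covers. A sketch with the same hypothetical `demo` catalog and table (the cutoff timestamp is arbitrary):

```python
# Compact small data files into larger ones to reduce scan overhead.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.page_views')")

# Expire snapshots older than a cutoff to reclaim metadata and storage.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'db.page_views',
        older_than => TIMESTAMP '2024-03-01 00:00:00'
    )
""")
```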

Syncing the Iceberg: Real-time Sailing at Terabyte Latitudes

Presenter: Antonio Murgia (Data Architect at Agile Lab)

Apache Iceberg, along with other table formats, promises ACID properties atop read-optimized, open file formats like Apache Parquet. But is achieving this promise feasible when synchronizing tables in near real-time? Will optimistic concurrency remain the optimal choice? What trade-offs will we encounter? Let's embark on a journey across glacial seas and find out!
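To make the optimistic-concurrency question concrete: an Iceberg commit succeeds only if the table's current snapshot is still the one the writer started from, so concurrent writers resolve conflicts by refreshing and retrying. A minimal, illustrative Python sketch of that loop with PyIceberg (catalog and table names are hypothetical, and the batch schema is assumed to match the table):

```python
import time

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import CommitFailedException

# Hypothetical names; assumes a catalog configured via ~/.pyiceberg.yaml.
catalog = load_catalog("default")
table = catalog.load_table("db.events")

batch = pa.table({"id": [1, 2], "payload": ["a", "b"]})

# Optimistic concurrency: attempt the commit, and on conflict refresh to
# the winning writer's snapshot, back off, and try again.
for attempt in range(5):
    try:
        table.append(batch)  # commits a new snapshot
        break
    except CommitFailedException:
        table.refresh()            # pick up the latest table metadata
        time.sleep(2 ** attempt)   # exponential backoff before retrying
else:
    raise RuntimeError("gave up after repeated commit conflicts")
```

At high commit rates, this retry loop is exactly where the trade-offs the talk explores show up: more concurrent writers mean more conflicts and more wasted work per retry.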

Iceberg Development at Apple

Presenter: Russell Spitzer (Software Engineering Manager at Apple)

Apache Iceberg has evolved from a niche project to a major industry player, boasting a global community of skilled engineers from companies like Amazon, Alibaba, Cloudera, Dremio, Tabular, and notably Apple. Apple, a significant open-source supporter, has been actively involved with Iceberg since its inception, with five members on the PMC. Our teams are pivotal in driving Iceberg's development, enhancing features like Metadata Tables, Z-Ordering, and Vectorized Reads. In our presentation, we'll discuss how Iceberg aligns with our needs and outline future plans for tackling upcoming challenges.

Havasu: A Table Format for Spatial Attributes in a Data Lake Architecture

Presenter: William Lyon (Developer Relations Engineer at Wherobots)

Havasu is an open table format that extends Apache Iceberg to support spatial data. Havasu introduces a range of features, including native support for manipulating and storing geometry and raster objects directly in data lake tables, and enables seamless querying and processing of spatial tables using Spatial SQL.
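The abstract doesn't show Havasu's syntax, but Spatial SQL over lakehouse tables generally looks like Apache Sedona-style ST_ functions; here is a purely hypothetical sketch (the table, columns, and registration of the ST_ functions are all assumptions, and Havasu's actual API may differ):

```python
# Hypothetical spatial query: find neighborhoods containing a point.
# ST_Contains / ST_Point follow Apache Sedona conventions.
spark.sql("""
    SELECT name
    FROM demo.db.neighborhoods
    WHERE ST_Contains(boundary, ST_Point(-73.98, 40.75))
""").show()
```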

Conclusion

Subsurface provides an excellent opportunity to learn more about the data lakehouse space, and for fans of Apache Iceberg, there will be plenty of content to keep you learning and engaged. Register for Subsurface today.
