
9 minute read · February 8, 2022

Top Videos for Learning about Open Source Apache Iceberg, Arrow, and Nessie

Alex Merced · Senior Tech Evangelist, Dremio

The most exciting innovation happening in the cloud data lake ecosystem is being driven by the open source software (OSS) community. 

These cloud data lake technologies liberate access to data from proprietary walled gardens, enable high-speed reads and writes, and provide robust security controls to comply with modern regulatory demands.

Some of those key technologies include Apache Arrow, Apache Iceberg and Project Nessie. If you’re working as a data engineer or data architect, or play another role on a data team, these are the technologies that are shaping the industry. How can you expand your knowledge of them and deepen your expertise? We’ve curated the top video presentations from past Subsurface LIVE conferences for you to check out (reminder: Subsurface LIVE Winter 2022 is happening March 2-3 and registration for this free virtual conference is open now). 

Apache Arrow

Apache Parquet provides a columnar file format that is quick to read and query, but what happens after the file is loaded into memory? This is the question Apache Arrow answers: it is an in-memory columnar format designed to maximize the speed of query processing once data is in memory. The Arrow Flight connector allows any JDBC-compatible data store to take advantage of Arrow.
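
To make the in-memory columnar idea concrete, here is a minimal sketch in Python using pyarrow; the file name "events.parquet" and the "amount" column are hypothetical.

```python
# Minimal sketch: read a Parquet file into an Arrow table and process it
# column-at-a-time in memory. Assumes pyarrow is installed and that a local
# file "events.parquet" with an "amount" column exists (both hypothetical).
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Parquet on disk -> Arrow in memory, both columnar, so no row-by-row
# deserialization is needed.
table = pq.read_table("events.parquet")

# Arrow compute kernels operate on whole columns at once.
total = pc.sum(table["amount"])
large = table.filter(pc.greater(table["amount"], 100))

print(table.schema)
print(total.as_py(), large.num_rows)
```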

Upcoming! Subsurface LIVE 2022 Arrow Sessions

ON-DEMAND SUBSURFACE LIVE SESSION:

Apache Arrow: A New Gold Standard for Dataset Transport

Check out this session from Subsurface LIVE 2020 to learn more about the role Apache Arrow and Arrow Flight play in disrupting previous approaches to creating data services that transport large datasets. You'll hear the technical details of why the Arrow protocol is an attractive choice, as well as specific examples of where Arrow has been employed for better performance and resource efficiency.

Apache Iceberg

Apache Iceberg is an open table format that creates a new paradigm for defining tables on the data lake. Not only does it separate storage of the data from the metadata that tracks the table, but it also tracks that metadata across a tree of files. This separation of metadata concerns allows query planning to be blazing fast while enabling features like hidden partitioning, partition and schema evolution, time travel, version rollback, safe update/delete transactions, and safe concurrent writes.
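
As a rough illustration of how these features surface to an engine, here is a hedged Spark sketch in Python; the catalog name "demo", the table "db.events", and the snapshot ID are all hypothetical, and it assumes a SparkSession already configured with the Iceberg runtime and catalog.

```python
# Illustrative sketch only: assumes an existing SparkSession ("spark")
# configured with the Apache Iceberg runtime and a catalog named "demo"
# (catalog, table, and snapshot ID below are hypothetical).

# Hidden partitioning: partition by a transform of a column, so readers
# and writers never deal with a separate partition column.
spark.sql("""
    CREATE TABLE demo.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution is a metadata-only change.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN source STRING")

# Time travel: read the table as of an earlier snapshot.
old_version = (
    spark.read
    .option("snapshot-id", 1234567890123456789)  # hypothetical snapshot ID
    .format("iceberg")
    .load("demo.db.events")
)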

Upcoming! Subsurface LIVE 2022 Iceberg Sessions

ON-DEMAND SUBSURFACE LIVE SUMMER 2020 KEYNOTE:

The Future of Intelligent Storage in Big Data

Watch this presentation to learn about the challenges and motivations behind building Apache Iceberg (incubating) as the next generation of big data analytical storage. You’ll hear about the current state and roadmap for production deployments. Lastly, you’ll learn about the future of automated storage optimization and compute-based enhancements built on machine learning algorithms at Netflix.

ON-DEMAND SUBSURFACE LIVE SESSION:

Lessons Learned From Running Apache Iceberg at Petabyte Scale

This talk from Iceberg PMC member Anton Okolnychyi describes how to keep Iceberg tables in optimal shape while running at petabyte scale. You’ll learn how to efficiently perform metadata and data compaction on Iceberg tables with millions of files without any impact on concurrent readers and writers.
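
For context on what that maintenance looks like in practice, here is a hedged sketch using Iceberg's Spark stored procedures; the catalog name "demo" and table "db.events" are hypothetical, and exact procedure arguments vary by Iceberg version.

```python
# Illustrative sketch only: assumes a SparkSession ("spark") with the Iceberg
# runtime and SQL extensions enabled, and a catalog named "demo" (catalog and
# table names are hypothetical; arguments vary by Iceberg version).

# Compact many small data files into fewer, larger ones.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")

# Rewrite manifest (metadata) files to keep query planning fast.
spark.sql("CALL demo.system.rewrite_manifests('db.events')")

# Expire old snapshots to prune metadata and unreferenced data files.
spark.sql(
    "CALL demo.system.expire_snapshots(table => 'db.events', retain_last => 10)"
)
```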

ON-DEMAND SUBSURFACE LIVE SESSION:

Iceberg Case Studies

In this talk, Iceberg co-creator Ryan Blue walks through use cases for Apache Iceberg tables that weren’t anticipated when the project was created, and explains the details so you can apply Iceberg to similar cases.

Project Nessie

While Apache Iceberg opens up new possibilities at the table level, Project Nessie unlocks a Git-like experience at the lake level. Using Project Nessie, you can create branches to isolate work on several tables in a catalog without affecting data consumers querying the main branch; when work on the branch is complete, it can be merged back as a single multi-table transaction. This creates truly new possibilities for effective data collaboration and workflows.
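
As a rough sketch of what that Git-like workflow can look like from Spark, here is a hedged example using Nessie's SQL extensions; the catalog name "nessie", the branch "etl", and the tables are hypothetical, and the exact syntax depends on the Nessie version.

```python
# Illustrative sketch only: assumes a SparkSession ("spark") configured with a
# Nessie-backed Iceberg catalog named "nessie" and Nessie's Spark SQL
# extensions; branch and table names are hypothetical and assumed to exist.

# Create a branch off main to isolate work across multiple tables.
spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")

# Point this session at the branch; writes below are invisible on main.
spark.sql("USE REFERENCE etl IN nessie")
spark.sql("INSERT INTO nessie.db.orders VALUES (1, 'new order')")
spark.sql("INSERT INTO nessie.db.order_items VALUES (1, 'sku-42', 3)")

# Publish both table changes to main as one multi-table transaction.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
```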

ON-DEMAND SUBSURFACE LIVE SESSION:

Distributed Transactions on the Data Lake with Project Nessie

While database concepts like transactions, commits and rollbacks are necessary for traditional data warehousing workloads, they’re not sufficient for modern data platforms and data-driven companies. Project Nessie is a new open-source metastore that builds on table formats such as Apache Iceberg and Delta Lake to deliver multi-table, multi-engine transactions. In this talk, you’ll learn about the transactional model of Nessie and how it can help improve the ETL workflow.

Find More Great Open Source Conversations at Subsurface LIVE Winter 2022

Amazing conversations about disruptive technologies occur at Subsurface every year. Make sure you don’t miss out: register for the Subsurface LIVE Winter 2022 conference, held online March 2-3.

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.