8 minute read · February 16, 2022
Top Videos on Talks about Open Data Architecture and Cloud Data Lake Best Practices

Senior Tech Evangelist, Dremio

The affordability of object storage on cloud providers like AWS, Azure, or GCP has transformed the value and practicality of data lakes. Cloud data lakes are becoming a fixture in the data architecture of an ever-expanding number of enterprises. The right architecture and best practices can help you maximize the value of your cloud data lake by making your data accessible, easy to use, and blazing fast to query. At the annual Subsurface conference, cloud data lake architecture and best practices are front and center.
We’ve curated the top video presentations from past Subsurface LIVE conferences for you to check out. Plus, we’ve made a list of sessions you can join at the Subsurface LIVE Winter 2022 event happening March 2-3. Register now if you haven’t already reserved your spot for this free, virtual event!
Open Data Architecture and Cloud Data Lake Best Practices
A big focus of the Subsurface conference is cloud data lake best practices that increase effective use of, and access to, your data lake while minimizing the burden on engineers and the cost to the enterprise.
Upcoming! Subsurface LIVE 2022 Sessions
Here are some of the talks from the upcoming Subsurface LIVE 2022 conference you should attend and watch live:
- Architecting the Data Lake at ForceMetrics (Breakout Session)
- Unsolved Challenges in Data Infrastructure (Breakout Session)
- An Open Data Architecture with Apache Iceberg (Breakout Session)
- Many More on the 2022 Agenda
On-Demand Subsurface Sessions
Let’s also take a look at some of the past talks that will give you insight into architecting your cloud data lake.
Keynote: The Future Is Open - The Rise of the Cloud Data Lake
The rise of cloud data lake storage (e.g., S3, ADLS) as the default bit bucket in the cloud, combined with the infinite supply and elasticity of cloud compute (e.g., EC2, Azure VMs), has ushered in a new era in data analytics architectures. In this new world, data can be stored and managed in open source file and table formats, such as Apache Parquet and Apache Iceberg, and accessed by best-of-breed elastic compute engines such as Dremio, Databricks and EMR. As a result, companies can now avoid becoming locked into monolithic systems such as cloud data warehouses and Hadoop distributions, and instead enjoy the flexibility of using the best-of-breed technologies of today and tomorrow. In this keynote presentation, Dremio Co-Founder and CPO Tomer Shiran discusses these trends and the building blocks that have come together to enable this new open architecture.
Building an Efficient Data Pipeline for Data-Intensive Workloads
Moving data through the pipeline in an efficient and predictable way is one of the most important aspects of modern data architecture, particularly when it comes to running data-intensive workloads such as IoT and machine learning in production. This talk breaks down the data pipeline and demonstrates how it can be improved with a modern transport mechanism that includes Apache Arrow Flight. This session details the architecture and key features of the Arrow Flight protocol and introduces an Arrow Flight Spark data source, showing how microservices can be built for and with Spark. Attendees will see a demo of a machine learning pipeline running in Spark with data microservices powered by Arrow Flight, highlighting how much faster and simpler the Flight interface makes this example pipeline.
Functional Data Engineering - A Set of Best Practices
This talk discusses the functional programming paradigm and explores how applying it to data engineering can bring a lot of clarity to the process. It helps solve some of the inherent problems of ETL, leads to more manageable and maintainable workloads and helps to implement reproducible and scalable practices. It empowers data teams to tackle larger problems and push the boundaries of what’s possible.
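As a rough illustration of what applying functional ideas to a pipeline can look like, here is a minimal Python sketch. It is not taken from the talk; it assumes one common reading of the approach, where transforms are pure functions and each load overwrites a whole partition so that reruns are reproducible. The names `Event`, `daily_totals`, `write_partition`, and `run_task`, as well as the `day=YYYY-MM-DD` directory layout, are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path
import json
import tempfile


@dataclass(frozen=True)  # immutable record: a task cannot mutate its inputs
class Event:
    day: date
    user: str
    amount: float


def daily_totals(events, day):
    """Pure transform: the result depends only on the arguments."""
    totals = {}
    for e in events:
        if e.day == day:
            totals[e.user] = totals.get(e.user, 0.0) + e.amount
    return dict(sorted(totals.items()))


def write_partition(root, day, totals):
    """Idempotent load: overwrite the partition for `day`, never append."""
    part = Path(root) / f"day={day.isoformat()}"
    part.mkdir(parents=True, exist_ok=True)
    out = part / "totals.json"
    out.write_text(json.dumps(totals, sort_keys=True))
    return out


def run_task(root, events, day):
    """One unit of work: compute a single partition and replace it."""
    return write_partition(root, day, daily_totals(events, day))


if __name__ == "__main__":
    sample = [
        Event(date(2022, 2, 16), "a", 10.0),
        Event(date(2022, 2, 16), "a", 5.0),
        Event(date(2022, 2, 17), "b", 1.5),
    ]
    with tempfile.TemporaryDirectory() as root:
        path = run_task(root, sample, date(2022, 2, 16))
        run_task(root, sample, date(2022, 2, 16))  # rerun leaves the same file
        print(path.read_text())
```

Because `run_task` replaces the partition rather than appending to it, rerunning or backfilling a given day leaves the output unchanged, which is the reproducibility property the paragraph above refers to.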
Build Lightning Fast Queries with Blazing Fast Object Storage
Organizations are increasingly leveraging analytics to turn data into insights for competitive advantage. However, the architectural considerations for platforms that support large data lake deployments of analytics applications change significantly as these efforts mature from small-scale to large-scale environments. One highly successful trend is the adoption of object storage in analytics, which lets data teams analyze data anywhere and everywhere. Watch this talk to learn how to build out an enterprise-scale data lake for lightning-fast queries with blazing-fast object storage.
Best Practices for Building a Scalable and Secure Data Lake on AWS
Watch this talk to learn about architectural patterns, approaches, and best practices for building scalable data lakes on AWS. You will learn how to build a data lake and then extend it to meet your company's needs using the producer-consumer and data mesh architectural patterns. You will also learn how AWS Lake Formation makes it simple to deploy these architectures by allowing you to securely share data between teams using their choice of tools, including Dremio, Amazon Redshift, and Amazon Athena.
Keynote: Data Mesh – Enabled with a Self-Serve Platform
Data Mesh is an alternative sociotechnical approach to managing analytical data. It is what comes after this inflection point. Its objective is to enable organizations to get value from data at scale with agility, in the face of organizational growth and complexity. It’s an approach that shifts the data culture, technology and architecture.
Learn Best Practices for Architecting Cloud Data Lakes at Subsurface LIVE Winter 2022
Amazing conversations about value-add best practices occur at the Subsurface conference every year. Make sure you don’t miss out: register for the Subsurface LIVE Winter 2022 conference on March 2 and 3.