9 minute read · February 16, 2022

Top Talks on Open Data Architecture and Cloud Data Lake Best Practices

Alex Merced · Senior Tech Evangelist, Dremio

The affordability of object storage on cloud providers such as AWS, Azure, and GCP has transformed the value and practicality of data lakes. Cloud data lakes are becoming a fixture in the data architecture of a growing number of enterprises. The right architecture and best practices help you maximize the value of your data lake by making your data accessible, easy to use, and blazing fast to query. At the annual Subsurface conference, cloud data lake architecture and best practices are front and center.

We’ve curated the top video presentations from past Subsurface LIVE conferences for you to check out. Plus, we’ve made a list of sessions you can join at the Subsurface LIVE Winter 2022 event happening March 2-3. Register now if you haven’t already reserved your spot for this free, virtual event! 

Open Data Architecture and Cloud Data Lake Best Practices

A big focus of the Subsurface conference is cloud data lake best practices that increase effective use of, and access to, your data lake while minimizing the burden on engineers and the cost to the enterprise.

Upcoming! Subsurface LIVE 2022 Sessions 

Here are some of the talks from the upcoming Subsurface LIVE 2022 conference you should attend and watch live:

On-Demand Subsurface Sessions

Let’s also take a look at some of the past talks that will give you insight into architecting your cloud data lake.  

Keynote: The Future Is Open - The Rise of the Cloud Data Lake

The rise of cloud data lake storage (e.g., S3, ADLS) as the default bit bucket in the cloud, combined with the infinite supply and elasticity of cloud compute (e.g., EC2, Azure VMs), has ushered in a new era in data analytics architectures. In this new world, data can be stored and managed in open source file and table formats, such as Apache Parquet and Apache Iceberg, and accessed by best-of-breed elastic compute engines such as Dremio, Databricks, and EMR. As a result, companies can now avoid becoming locked into monolithic systems such as cloud data warehouses and Hadoop distributions, and instead enjoy the flexibility of using the best-of-breed technologies of today and tomorrow. In this keynote presentation, Dremio Co-Founder and CPO Tomer Shiran discusses these trends and the building blocks that have come together to enable this new open architecture.

Building an Efficient Data Pipeline for Data-Intensive Workloads

Moving data through the pipeline in an efficient and predictable way is one of the most important aspects of modern data architecture, particularly when it comes to running data-intensive workloads such as IoT and machine learning in production. This talk breaks down the data pipeline and demonstrates how it can be improved with a modern transport mechanism that includes Apache Arrow Flight. This session details the architecture and key features of the Arrow Flight protocol and introduces an Arrow Flight Spark data source, showing how microservices can be built for and with Spark. Attendees will see a demo of a machine learning pipeline running in Spark with data microservices powered by Arrow Flight, highlighting how much faster and simpler the Flight interface makes this example pipeline.

Functional Data Engineering - A Set of Best Practices

This talk discusses the functional programming paradigm and explores how applying it to data engineering can bring a lot of clarity to the process. It helps solve some of the inherent problems of ETL, leads to more manageable and maintainable workloads and helps to implement reproducible and scalable practices. It empowers data teams to tackle larger problems and push the boundaries of what’s possible.
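To make the idea concrete, here is a minimal sketch of one way functional principles might look in a pipeline. It is not taken from the talk. It assumes that "functional" here means pure transforms over immutable inputs, plus writes that overwrite a whole partition so that reruns are reproducible. The `Event` schema, the function names, and the in-memory `store` dictionary are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Event:
    """Immutable input record (hypothetical schema)."""
    day: str      # partition key, e.g. "2022-02-16"
    user: str
    amount: float


def daily_totals(events: Iterable[Event], day: str) -> Dict[str, float]:
    """Pure transform: the result depends only on the arguments."""
    totals: Dict[str, float] = {}
    for event in events:
        if event.day == day:
            totals[event.user] = totals.get(event.user, 0.0) + event.amount
    return totals


def write_partition(
    store: Dict[str, Dict[str, float]],
    day: str,
    totals: Dict[str, float],
) -> Dict[str, Dict[str, float]]:
    """Return a new store in which the partition for `day` is fully replaced.

    The input store is not mutated, and nothing is appended to an
    existing partition, so running the same task twice is harmless.
    """
    return {**store, day: dict(totals)}


events = [
    Event("2022-02-15", "ana", 5.0),
    Event("2022-02-16", "ana", 2.0),
    Event("2022-02-16", "bo", 3.5),
    Event("2022-02-16", "ana", 1.0),
]

# First run, then a rerun of the same task over the same inputs.
store = write_partition({}, "2022-02-16", daily_totals(events, "2022-02-16"))
rerun = write_partition(store, "2022-02-16", daily_totals(events, "2022-02-16"))
```

Because the transform has no side effects and the write replaces the partition rather than appending to it, a retry or backfill for a given day produces the same stored result as the first run.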

Build Lightning Fast Queries with Blazing Fast Object Storage

Organizations are increasingly leveraging analytics to turn data into insights for competitive advantage. However, the architectural considerations for platforms that support large data lake deployments of analytics applications change significantly as these efforts mature from small-scale to large-scale environments. One highly successful trend is the adoption of object storage in analytics, which allows data teams to analyze data anywhere and everywhere. Watch this talk to learn how to build out an enterprise-scale data lake for lightning-fast queries with blazing fast object storage.

Best Practices for Building a Scalable and Secure Data Lake on AWS

Watch this talk to learn about architectural patterns, approaches, and best practices for building scalable data lakes on AWS. You will learn how to first build a data lake and then extend it to meet your company's needs using the producer-consumer and data mesh architectural patterns. You will also learn how AWS Lake Formation makes it simple to deploy these architectures by allowing you to securely share data between teams using their choice of tools, including Dremio, Amazon Redshift, and Amazon Athena.

Keynote: Data Mesh – Enabled with a Self-Serve Platform

Data Mesh is an alternative sociotechnical approach to managing analytical data. It is what comes after this inflection point. Its objective is to enable organizations to get value from data at scale with agility, in the face of organizational growth and complexity. It’s an approach that shifts the data culture, technology and architecture.

Learn Best Practices for Architecting Cloud Data Lakes at Subsurface LIVE Winter 2022

Amazing conversations about value-add best practices occur at the Subsurface conference every year. Don’t miss out: register for the Subsurface LIVE Winter 2022 conference, happening March 2-3.
