Subsurface LIVE Sessions

Session Abstract

At LinkedIn, our data lake and big data compute infrastructure grow continually. They must keep pace with the growing number of data applications and their domains, which span data curation, AI, deep learning, business analytics and system operations. They must also accommodate evolving compute engines and architectures, ranging from vanilla MapReduce to declarative compute engines like Spark, Trino (formerly Presto SQL) and Hive, as well as evolving storage and table formats, data and metadata, and business logic. We built the Dali Catalog to make our data lake highly agile, efficient and secure, while not impacting data consumers.

In this talk, we give a brief introduction to Dali and discuss two of its cornerstones in detail: Coral and Transport. Coral is an open source SQL translation, analysis and rewrite engine that we use to virtualize SQL logic and SQL-based compute in general. Transport is an open source framework for defining user-defined functions (UDFs) once and automatically translating them to native UDF versions for multiple engines (e.g., Hive, Spark, Trino/Presto), as if they had been written specifically for each engine in the first place. We discuss use cases of leveraging Coral and Transport in the Dali Catalog to resolve views with UDFs and make them queryable in a number of compute engines, even when they are not defined in an engine’s native language.

Moreover, we discuss further applications of the Coral and Transport virtualization features: abstracting data lake tables through relational algebra, and intelligently optimizing compute and storage in modern data lakehouses by analyzing cluster workloads and applying table and query rewrites accordingly.