At LinkedIn, our data lake and big data compute infrastructure grow continually. They must keep pace with the growing number of data applications and their domains, which span data curation, AI, deep learning, business analytics and system operations. They must also accommodate evolving compute engines and architectures, ranging from vanilla MapReduce to declarative compute engines like Spark, Trino (formerly PrestoSQL) and Hive, as well as evolving storage and table formats, data and metadata, and business logic. We built the Dali Catalog to make our data lake highly agile, efficient and secure without impacting data consumers.
In this talk, we give a brief introduction to Dali and discuss two of its cornerstones in detail: Coral and Transport. Coral is an open source SQL translation, analysis and rewrite engine that we use to virtualize SQL logic and SQL-based compute in general. Transport is an open source framework for defining user-defined functions (UDFs) once and automatically translating them to native UDF versions for multiple engines (e.g., Hive, Spark, Trino/Presto), as if they had been written specifically for each engine in the first place. We discuss how the Dali Catalog uses Coral and Transport to resolve views that contain UDFs and make them queryable in a number of compute engines, even though the views are not defined in any engine’s native language.
Moreover, we discuss further applications of the virtualization features of Coral and Transport: abstracting data lake tables through relational algebra, and intelligently optimizing compute and storage in modern data lakehouses by analyzing cluster workloads and applying table and query rewrites accordingly.
Walaa Eldin Moustafa is a Staff Software Engineer at LinkedIn, where he works on building big data infrastructure and solutions for enabling unified and performant data processing systems across different compute engines and language APIs. Walaa holds a PhD in Computer Science from the University of Maryland at College Park. He has co-authored a number of publications at database conferences including SIGMOD, ICDE, and IEEE Big Data, on topics that focus on modern applications of relational and deductive database management systems, such as graph query processing, machine learning, data integration, and probabilistic databases.