Serverless Cloud Data Lake with Spark for Serving Weather Data
The Weather Company (TWC) collects weather data across the globe at a rate of 34 million records per hour, and the TWC History on Demand (HoD) application serves that historical weather data to users via an API, averaging 600,000 requests per day. Users are increasingly consuming large quantities of historical data to train analytics models, and they require efficient asynchronous APIs in addition to the existing synchronous ones that use Elasticsearch.
This session presents TWC’s architecture, which uses a serverless cloud data lake running on top of Apache Spark, and shows how it enables a highly elastic and economical way of serving weather history data. We will explain our concept of data skipping indexes, which boost performance by orders of magnitude compared to an out-of-the-box Spark setup while also significantly reducing cost. This enables TWC HoD to triple weather data coverage from land only to the entire globe, while at the same time reducing costs by an order of magnitude.
We will also review serverless cloud data lake architecture in general and elaborate on the composition of serverless building blocks such as serverless storage, serverless ETL, serverless SQL and serverless data pipeline orchestration. In addition, we will review a set of major enhancements, including built-in geospatial and time series functions and a built-in multi-tenant Hive Metastore.
Finally, we will highlight how TWC was able to adopt the serverless cloud data lake platform for new applications by rolling out a brand-new global data collection pipeline and data lake for COVID-19 data in just a few weeks.
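For readers unfamiliar with the data skipping idea mentioned in the abstract, the following is a minimal sketch of one common form of it, a min/max index. It is not TWC's or IBM's implementation: the `ObjectStats` class, the `build_index` and `objects_to_scan` helpers, and the toy latitude values are all hypothetical and exist only for illustration. The point is that a small amount of per-object metadata lets a query ignore objects that cannot contain matching records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ObjectStats:
    """Min/max metadata for one stored object (hypothetical layout)."""
    name: str
    min_lat: float
    max_lat: float


def build_index(objects: Dict[str, List[float]]) -> List[ObjectStats]:
    """Record the min and max latitude of each object once, at index time."""
    return [ObjectStats(name, min(lats), max(lats)) for name, lats in objects.items()]


def objects_to_scan(index: List[ObjectStats], lo: float, hi: float) -> List[str]:
    """Keep only objects whose [min, max] range overlaps the query range [lo, hi]."""
    return [s.name for s in index if s.max_lat >= lo and s.min_lat <= hi]


# Toy data: object name -> latitudes of the records it holds (made-up values).
objects = {
    "part-0": [10.0, 12.5, 14.0],
    "part-1": [40.0, 41.2, 45.9],
    "part-2": [13.0, 39.0],
}
index = build_index(objects)

# A query for latitudes between 40 and 42 only needs to read one object.
selected = objects_to_scan(index, 40.0, 42.0)
print(selected)
```

In this toy case two of the three objects are never read. On object storage, where every object read costs time and money, skipping work in this way is where both the performance and the cost benefits would come from.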
Dr. Paula Ta-Shma is a Research Staff Member in the Cloud & Data Technologies group at IBM Research – Haifa and is responsible for a group of research efforts in the area of hybrid data, with a particular focus on high-performance, secure and cost-efficient data stores and processing engines. She is particularly interested in performant SQL analytics over object storage and leads work on data skipping, which is now integrated into multiple IBM products and services. Previously, she led projects in areas such as cloud storage infrastructure for IoT and continuous data protection. Prior to working at IBM, Dr. Ta-Shma worked at several companies on database management systems, including Informix Software Inc., where she worked on Apache Derby. She holds M.Sc. and Ph.D. degrees in computer science from the Hebrew University of Jerusalem.
Torsten Steinbach has a long track record as a database architect. He led the IBM Db2 performance tooling and worked on the workload managers in IBM Netezza and IBM Db2. He also led the deep integration of machine learning into IBM’s RDBMS. Over the past few years, Torsten built IBM’s cloud data lake platform from scratch; it is heavily based on open source software such as Apache Spark and Apache Kafka.