Covering Indexes in the Data Lake with Hyperspace
Thursday, July 22 2021
At Adobe, we use the Iceberg table format inside the Adobe Experience Platform Data Lake. Although Iceberg offers the capability for file skipping, when the data is properly laid out and used, in many cases this is not enough, and the queries executed over the data take a long time to complete. Similar to an RDBMS use case where high latency on queries can be alleviated with additional indexing at the cost of some extra storage, in data lakes the same pattern can be used. Hyperspace is an early phase indexing subsystem for Apache Spark that introduces the ability for users to build indexes on data, and together with Iceberg it can bring major improvements in query response time – up to 25 times faster in some cases. Hyperspace accommodates our two major data flow use cases – stale datasets and fast-changing datasets – and assures consistency when used.