Innovative data science, and the data volumes and varieties it requires, find a natural home in the data lake. But the latest generation of cloud-native data lakes also hosts a rising share of mainstream business intelligence (BI) projects.
Why did this happen? We’re entering phase 2 of the cloud disruption of the BI market. Phase 1, still underway, centers on the cloud enterprise data warehouse (EDW). When expanding BI workloads first outgrew traditional EDWs, many data teams migrated to replacement EDWs in the cloud. There they processed BI queries more rapidly and scaled volumes more easily, renting elastic cloud compute rather than managing their own hardware.
But cloud EDWs are not a silver bullet. To get the right performance, analysts and data engineers often must export data to cubes, BI extracts and aggregation tables that create administrative overhead and lock-in risk. Cloud processing methods can also result in higher-than-expected monthly compute costs. And cloud EDWs handle semi-structured data sets less efficiently than cloud data lakes.
This brings us to phase 2: the rise of BI on the cloud data lake. Enterprises now regularly extract and load operational data into cloud data lakes, and they want to support more than just data science there. What if BI analysts and data engineers could simply and economically run high-performance BI queries on structured and semi-structured datasets directly in the data lake?
Performance advancements make this a compelling possibility. Most data lakes now run Apache Spark’s in-memory processing rather than slow, legacy MapReduce software. Some data teams speed things up further by using Apache Arrow’s columnar format to focus in-memory queries on just the data columns that matter for a given job. Because Arrow defines a standard in-memory representation, multiple engines and processor types can also operate on the same data without serialization or deserialization overhead. New commercial tools further reduce latency and improve throughput by applying a distributed cache to parallel object storage in the data lake.
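To make the columnar idea concrete, here is a minimal pure-Python sketch of the principle, not the Arrow API itself: when data is laid out column by column, an aggregation touches only the column it needs and skips the rest. The table contents and function name are illustrative assumptions.

```python
# Toy columnar table: each column is stored as its own list, mirroring
# a column-oriented layout. This is a sketch of the principle only;
# Apache Arrow implements it as a standardized in-memory format.
sales = {
    "region":  ["east", "west", "east", "south"],
    "product": ["a", "b", "a", "c"],
    "revenue": [100.0, 250.0, 75.0, 310.0],
}

def column_sum(table, column):
    """Aggregate a single column without reading any of the others."""
    return sum(table[column])

total = column_sum(sales, "revenue")
print(total)  # 735.0
```

In a row-oriented layout, the same query would have to scan every field of every record; here the `region` and `product` columns are never touched.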
These capabilities usher in a new era of BI on the data lake. Analysts and data teams can meet latency and throughput requirements without moving data into unwieldy, hard-to-govern cubes, BI extracts or aggregation tables. They can quickly materialize data views to generate reports or answer ad-hoc requests and re-use those views in memory to avoid slow trips to disk. This could prove a powerful complement to the cloud EDW, which many organizations will continue to use for highly governed, read-write workloads on structured operational data.
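The view-reuse pattern described above can be sketched in plain Python. The cache structure and function names below are illustrative assumptions, not any particular engine's API; commercial query engines manage this internally.

```python
# Minimal in-memory "materialized view" cache: compute a view once,
# then serve later requests from memory instead of returning to storage.
_view_cache = {}

def materialize(name, compute):
    """Return the cached view if present; otherwise compute and cache it."""
    if name not in _view_cache:
        _view_cache[name] = compute()  # the slow trip to storage happens once
    return _view_cache[name]

# First call computes the view; later calls reuse the in-memory copy.
daily_totals = materialize("daily_totals", lambda: {"2024-01-01": 735.0})
same_view = materialize("daily_totals", lambda: {"never": 0.0})
print(same_view is daily_totals)  # True
```

The second `materialize` call never runs its compute function, which is the property that lets repeated reports and ad-hoc queries avoid slow trips to disk.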
BI on the data lake arrives none too soon. Business managers need more facts and figures to inform their actions, and they must move quickly to remain competitive, which drives demand for frequent ad-hoc reporting and real-time dashboards on common data sets. Enterprises want to democratize data consumption at all levels to improve decision making and innovation. With the right architecture in place, they can meet these requirements economically and at scale. They can also flexibly adapt their platform to absorb new data types, host new workloads and support new use cases.
The following diagram compares architectures for each cloud disruption phase to data warehousing.
Now that we’ve charted the forces driving the rise of BI on the cloud data lake, let’s explore guiding principles to help data teams evaluate tools and design effective architectures that make it happen.
Cloud data lakes have achieved mainstream credibility with the traditional BI reporting crowd. Enterprises that follow the guidelines here can ensure data lakes keep this newfound credibility, complementing or in some cases even replacing their cloud EDWs with data lakes. Expect the trend to continue as platforms converge further, for example with S3 data lakes feeding Snowflake EDWs and the new Apache Iceberg project furnishing SQL-like tables for data lake engines running read-write workloads. Our next blog will explore how data lake query engines enable BI on the data lake today by simplifying the life of the data engineer.
About the author
Kevin’s passion is to decipher what technology means to business leaders and practitioners. He has invested 25 years in technology, as an industry analyst, writer, instructor, product marketer, and services…