6 minute read · May 21, 2020

Business Intelligence on the Cloud Data Lake, Part 1: Why It Arose, and How to Architect For It

Kevin Petrie · Eckerson Group

Innovative data science, and the data volumes and varieties it requires, find a natural home in the data lake. But the latest generation of cloud-native data lakes also hosts a rising share of mainstream business intelligence (BI) projects.

Why did this happen? We’re entering phase 2 of the cloud disruption to the BI market. Phase 1, still underway, centers on the cloud enterprise data warehouse (EDW). When expanding BI workloads first outgrew traditional EDWs, many data teams migrated to replacement EDWs on the cloud. There they processed BI queries more rapidly and scaled volumes more easily, renting elastic cloud compute rather than managing their own hardware.

But cloud EDWs are not a silver bullet. To get the right performance, analysts and data engineers often must export data to cubes, BI extracts and aggregation tables, all of which create administrative overhead and lock-in risk. Cloud processing methods can result in higher-than-expected monthly compute costs. Cloud EDWs also handle semi-structured data sets less efficiently than cloud data lakes.

This brings us to phase 2, the rise of BI on the cloud data lake. Enterprises now regularly extract and load operational data to cloud data lakes. And they want to support more than just data science there. What if BI analysts and data engineers could simply and economically run high-performance BI queries on structured and semi-structured datasets directly in the data lake?

Performance advancements make this a compelling possibility. Most data lakes now run Apache Spark in-memory processing rather than slow, legacy MapReduce software. Some data teams speed things up further by using Apache Arrow, a columnar in-memory format, to focus in-memory queries on just the data columns that matter for a given job. Arrow also enables multiple processor types to work on the same data without serialization or deserialization. New commercial tools further reduce latency and improve throughput by applying distributed cache to parallel object storage in the data lake.
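To make the column-focused idea concrete, here is a minimal conceptual sketch. It uses plain Python lists as a stand-in for a columnar in-memory layout and is not the Apache Arrow API; the sample orders and the function names `to_columns` and `revenue_by_region` are assumptions made up for illustration.

```python
# Conceptual sketch only: plain Python lists stand in for a columnar
# in-memory format. This is NOT the Apache Arrow API.

# Row-oriented layout: every record carries every field.
rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0, "notes": "gift wrap"},
    {"order_id": 2, "region": "APAC", "amount": 75.5, "notes": ""},
    {"order_id": 3, "region": "EMEA", "amount": 42.25, "notes": "expedite"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def revenue_by_region(columns):
    """Aggregate revenue while touching only the two columns the query needs."""
    totals = {}
    for region, amount in zip(columns["region"], columns["amount"]):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


columns = to_columns(rows)
totals = revenue_by_region(columns)
print(totals)
```

The aggregation never reads `order_id` or `notes`. In a real columnar engine that is what lets a query skip most of the bytes in a wide table.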

These capabilities usher in a new era of BI on the data lake. Analysts and data teams can meet latency and throughput requirements without moving data into unwieldy, hard-to-govern cubes, BI extracts or aggregation tables. They can quickly materialize data views to generate reports or answer ad-hoc requests, then reuse those views in memory to avoid slow trips to disk. BI on the data lake could prove a powerful complement to the cloud EDW, which many organizations will continue to use for highly governed, read-write workloads on structured operational data.
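The reuse pattern can be sketched in a few lines. This is a hypothetical illustration, not the behavior of any particular data lake engine: `ViewCache`, `load_from_storage` and the `daily_sales` view are invented names, and a real engine would also handle invalidation and memory limits.

```python
# Hypothetical sketch: ViewCache and load_from_storage are illustrative
# names and do not come from any specific data lake engine.

class ViewCache:
    """Keeps materialized views in memory, keyed by view name."""

    def __init__(self, loader):
        self._loader = loader     # function that performs the slow read
        self._views = {}          # view name -> materialized rows
        self.storage_reads = 0    # how many times we went back to storage

    def get(self, name):
        """Return the view, materializing it only on first request."""
        if name not in self._views:
            self.storage_reads += 1
            self._views[name] = self._loader(name)
        return self._views[name]


def load_from_storage(name):
    """Stand-in for an expensive scan of object storage."""
    data = {"daily_sales": [("2020-05-20", 1800), ("2020-05-21", 2100)]}
    return data[name]


cache = ViewCache(load_from_storage)

# A report and a dashboard both use the same view; only the first call
# triggers a read from storage.
report_total = sum(amount for _, amount in cache.get("daily_sales"))
dashboard_peak = max(amount for _, amount in cache.get("daily_sales"))
print(report_total, dashboard_peak, cache.storage_reads)
```

Two consumers share one materialization, so the second request is served from memory rather than from disk or object storage.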

BI on the data lake arrives none too soon. Business managers need more facts and figures to inform their actions, and they need to move quickly to remain competitive, which creates the need for frequent ad-hoc reporting and real-time dashboards on common data sets. Enterprises want to democratize data consumption at all levels to improve decision making and innovation. With the right architecture in place, they can meet these requirements economically and at scale. They also can flexibly adapt their platform to absorb new data types, host new workloads and support new use cases.

The following diagram compares the architectures for each phase of the cloud disruption to data warehousing.

[Diagram: architecture comparison for each phase of the cloud disruption to data warehousing]

Now that we’ve charted the forces driving the rise of BI on the cloud data lake, let’s explore guiding principles to help data teams evaluate tools and design effective architectures that make it happen.

  • Performance. In addition to the performance features described above, data teams should seek performance-enhancing capabilities such as cache prefetching, parallel read operations and workload isolation. They can further improve performance and efficiency by automatically deactivating cloud compute clusters they no longer need.
  • Self-service. BI analysts and other data consumers should be able to discover, create and share data sets, all virtually and without copies, and with minimal help from IT. Consider graphical BI and semantic-layer tools that eliminate manual coding so that analysts with no programming skills can run ad-hoc queries and generate reports to inform their business decisions.
  • Governance. Data teams must ensure they meet governance policies with role-based access controls, data quality checks, lineage tracking and granular masking of fields containing Personally Identifiable Information (PII). They also should apply a metadata framework and semantic layer that enforces the creation and reuse of reporting templates to help maintain consistent results.
  • SQL compatibility. Structured Query Language (SQL) remains the lingua franca for structuring, manipulating, querying and controlling access to database records. BI analysts and data engineers should be able to execute familiar SQL commands on operational data in your data lake, and use SQL to join that operational data with clickstream data, GPS map markers, or other types of structured and semi-structured datasets. While data scientists use languages like Python for specialized processing, BI-oriented data engineers still need fully SQL-compliant tools for data preparation and querying.
  • Open architecture. Count on change – to your data sources, platforms, processors, cloud service providers, data pipeline tools and BI tools. Prepare for tomorrow’s changes by investing in open tools today that integrate with the widest possible range of architectural components. Steer clear of cloud lock-in wherever possible, to avoid getting “locked out” of innovation in critical new areas.
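Two of the principles above, SQL compatibility and field-level masking, can be sketched together. The example below uses Python's built-in `sqlite3` module purely as a stand-in for a data lake query engine. The tables, columns, sample events and the `customers_masked` view are all hypothetical. SQLite has no roles, so a real platform would pair a masked view like this with role-based access controls.

```python
import json
import sqlite3

# Illustrative data only: semi-structured clickstream events, one JSON
# document per line, as they might land in a data lake.
clickstream_lines = [
    '{"customer_id": 1, "page": "/pricing", "ts": "2020-05-21T10:00:00"}',
    '{"customer_id": 1, "page": "/signup", "ts": "2020-05-21T10:02:00"}',
    '{"customer_id": 2, "page": "/pricing", "ts": "2020-05-21T11:15:00"}',
]

conn = sqlite3.connect(":memory:")

# Structured operational data.
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT, segment TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "ana@example.com", "enterprise"), (2, "bo@example.com", "smb")],
)

# Parse the JSON events and expose them as a table for SQL.
conn.execute("CREATE TABLE clicks (customer_id INTEGER, page TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (:customer_id, :page, :ts)",
    [json.loads(line) for line in clickstream_lines],
)

# Governance: analysts query a view that masks the PII field.
conn.execute(
    """
    CREATE VIEW customers_masked AS
    SELECT customer_id,
           '***@' || substr(email, instr(email, '@') + 1) AS email,
           segment
    FROM customers
    """
)

# SQL compatibility: a familiar join across operational and clickstream data.
page_views_by_segment = conn.execute(
    """
    SELECT c.segment, COUNT(*) AS page_views
    FROM customers_masked AS c
    JOIN clicks AS k ON k.customer_id = c.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
    """
).fetchall()

masked_emails = [
    row[0]
    for row in conn.execute("SELECT email FROM customers_masked ORDER BY customer_id")
]
print(page_views_by_segment, masked_emails)
```

The analyst writes ordinary SQL and never sees the raw email addresses, while the join still works because the key column is left unmasked.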

Cloud data lakes have achieved mainstream credibility with the traditional BI reporting crowd. Enterprises that follow the guidelines here can ensure data lakes keep this newfound credibility, complementing or in some cases even replacing their cloud EDWs with data lakes. Expect the trend to continue as platforms converge further, for example with S3 data lakes feeding Snowflake EDWs and the new Apache Iceberg project furnishing SQL-like tables for data lake engines running read-write workloads. Our next blog will explore how data lake query engines enable BI on the data lake today by simplifying the life of the data engineer.

About the author

Kevin Petrie

VP of Research at Eckerson Group

Kevin’s passion is to decipher what technology means to business leaders and practitioners. He has invested 25 years in technology, as an industry analyst, writer, instructor, product marketer, and services…
