Data Virtualization seems promising. But does it scale with your data and BI needs?

Data engineers and data architects strive to provide data consumers with the data and analytics user experience they need. Data virtualization can look very promising — meet all my BI goals and not have to move data? Yes please!

However, as your data, user, and application scale grows, old and new problems arise that put you back where you started — or worse. Data lakes combined with Dremio can address BI needs at any scale.

The Promise of Data Virtualization, and Why It Fails as You Scale

It has long been a goal of organizations to provide decision makers with self-service access to data and analytics so they can make evidence-based decisions in order to better run the business. There are a number of capabilities organizations need to provide in order to fully achieve that goal:

  • Self-service access for data consumers
  • A single place to go to access data
  • Access to a wide range of datasets
  • Fast availability for new or newly-requested datasets
  • Canonical KPI definitions
  • The ability to provide all of these capabilities efficiently from a data engineering perspective

When trying to provide the above, things get tricky — enterprise architectures are complex. The number of different data sources can grow to be staggering. Further, different business units typically have some of their own systems, often including their own Data Warehouse or Data Mart. There are tools that data consumers can use to connect directly to these systems, letting you provide self-service access to all the data on demand without doing any additional work to move it. That seems like a fairly clean solution to a hard problem, so why wouldn’t you want to do it?

This approach can work at a smaller scale. For organizations or business units early in their journey to be data-driven or ones that are inherently at a smaller scale, the amount of data, users, and applications they have typically isn’t very large. In those situations, Data Virtualization works fairly well. Because the scale is small, organizations are able to provide self-service access to this data with fairly interactive response times to their data consumers. This allows them to make self-service data-driven decisions quickly and run the business better.


However, as the amount of data, users, and applications grows in an organization, the first problem encountered is performance. Users can’t get answers to their questions nearly as fast as they need them. These performance problems lead users to ask fewer questions, or in some situations to give up waiting for an answer altogether. The result is often poor decisions made on gut feel or guesses, sometimes not even data-informed guesses.

The causes of these performance problems fall into one or more of the following categories:

  • The source system is unable to send the data to the data virtualization platform fast enough at runtime, whether because of source system load or the design of the source’s storage system
  • The network speed between the data virtualization platform and the source system isn’t fast enough to transfer the data at runtime
  • The protocol used, generally JDBC, isn’t able to transmit the data from the source system to the data virtualization platform fast enough at runtime
  • The data is being transferred from the source system to the data virtualization platform over a single connection, i.e., serially
  • At a certain scale, the sheer amount of data that needs to be processed at runtime makes it impossible to do interactive analytics, simply due to the physics of reading that much data (a rough calculation below illustrates this)
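
To put that last point in perspective, here is a rough back-of-the-envelope sketch. The data volume and link speed below are illustrative assumptions, not measurements, but they show why streaming large datasets over the wire at query time can never be interactive:

```python
# Back-of-the-envelope: time spent just moving data at query runtime.
# The data volume and link speed below are illustrative assumptions, not benchmarks.

data_scanned_tb = 2        # data a single BI query needs to scan, in terabytes (assumed)
link_speed_gbps = 10       # network link between source and virtualization layer (assumed)

data_bytes = data_scanned_tb * 1e12
link_bytes_per_sec = link_speed_gbps * 1e9 / 8   # gigabits/s -> bytes/s

transfer_seconds = data_bytes / link_bytes_per_sec
print(f"Minimum transfer time: {transfer_seconds / 60:.1f} minutes")
# ~27 minutes of pure network transfer before any processing even starts,
# versus the seconds-level latency BI users expect.
```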

Because all of these problems stem from transferring the data at runtime, the solution is generally to transfer the data ahead of time on a case-by-case basis, by copying some datasets into a Data Warehouse and connecting that Data Warehouse to the Data Virtualization tool as well. This approach also addresses the need to offload analytic workloads from operational systems, so those systems can continue to serve their primary function.


However, as the data, user, and application scale continues to grow, generally one Data Warehouse for the whole organization ends up being insufficient, whether for flexibility, cost, complexity, or operational reasons. Typically at that point, each business unit gets their own Data Warehouse or Data Mart. Once again, these Data Warehouses are connected to the Data Virtualization platform, and data consumers have self-service access to a wide range of data.

However, the performance problems of transferring data at runtime when joining data across two Data Warehouses still apply, for exactly the same reasons as before. So, when data needs to be joined across the two systems at this scale, it is done via an IT request and an ETL pipeline that copies the data into the destination Data Warehouse.

This approach of Data Virtualization augmented with data copies to address the performance problems now causes a whole new set of problems at this scale:

  • Lack of self-service
  • Data engineering overhead
  • Slow turnaround time for requests
  • Data drift
  • Regulatory compliance issues
  • Infrastructure cost

The workaround of creating data copies to address performance problems associated with Data Virtualization is a big topic worth its own dedicated page. For more information, see this whitepaper discussing the unexpected cost of data copies.

Data Virtualization was supposed to help solve some of these problems. Instead, when used at scale, it exacerbates them and introduces new problems you didn’t have before.

Data Virtualization Breaks At Scale

All of these problems are downstream effects of a lack of performance in the Data Virtualization approach. So, if we can solve the performance problems, we can avoid the workarounds that cause these downstream issues.

Let’s recall the causes of the performance problems of Data Virtualization in the first place, and see how we can address them:

Data Virtualization performance problems, and how to mitigate them:

  • Problem: The source system is unable to send the data to the Data Virtualization platform fast enough at runtime, whether because of source system load or the design of the source’s storage system.
    Mitigation: The processing engine fulfilling user requests needs direct access to the storage and data.
  • Problem: The network speed between the Data Virtualization platform and the source system isn’t fast enough to transfer the data at runtime.
    Mitigation: We need a performant network between the processing engine and the storage engine. When the data gets too large to be feasibly addressed via network throughput, we need to bring the data closer via caching (to reduce the network bandwidth needed) or via pre-computation (to reduce the amount of data that has to be transferred).
  • Problem: The protocol used, generally JDBC, isn’t able to transmit the data from the source system to the Data Virtualization platform fast enough at runtime.
    Mitigation: The processing engine fulfilling user requests needs to access data stored in columnar format using a protocol that doesn’t needlessly serialize the columnar data to a row format only to deserialize it back to columnar format in the processing engine. In other words, the data should stay columnar from storage to processing.
  • Problem: The data is being transferred from the source system to the Data Virtualization platform over a single connection, i.e., serially.
    Mitigation: The processing engine fulfilling user requests needs to access the storage system with a high level of parallelism.
  • Problem: At a certain scale, the sheer amount of data that needs to be processed at runtime makes interactive analytics impossible, simply due to the physics of reading that much data.
    Mitigation: The processing engine needs to support pre-computations of the data, to reduce the amount of data processed at runtime.
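
As a concrete illustration of the "direct, parallel, columnar" requirements above, here is a minimal sketch using PyArrow to scan a partitioned Parquet dataset straight out of object storage. The bucket path, region, and column names are hypothetical, and any engine with these characteristics would behave similarly:

```python
# Minimal sketch: direct, parallel, columnar access to data in object storage.
# Bucket path, region, and column names are hypothetical examples.
import datetime
import pyarrow.dataset as ds
import pyarrow.fs as fs

s3 = fs.S3FileSystem(region="us-east-1")

# Discover a Hive-partitioned Parquet dataset laid out in the lake.
sales = ds.dataset(
    "analytics-lake/sales/", filesystem=s3, format="parquet", partitioning="hive"
)

# Column pruning + predicate pushdown: only the needed columns and row groups
# are read, and the reads are fanned out across multiple threads by default.
# (order_date is assumed to be a date-typed column.)
table = sales.to_table(
    columns=["region", "order_date", "revenue"],
    filter=ds.field("order_date") >= datetime.date(2023, 1, 1),
)
print(table.group_by("region").aggregate([("revenue", "sum")]))
```

Note that the storage layer here does nothing but serve byte ranges; all filtering and aggregation happen in the processing engine, which has direct, parallel access to the columnar files.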

If we look at the above requirements all together, the conclusion is that, in order to meet our business analytic goals, the majority of the data needs to be physically moved out of the source systems and into a centralized platform.

Data Warehouses are Limiting and Expensive

Data Warehouses become too expensive too fast, and result in unmanageable copies of data.

It’s now clear that in order to meet our business goals (self-service, comprehensive data access, fast availability, and a single pane of glass for consistent KPIs), we need to physically centralize at least a majority of the data in a central platform. The next logical question is: where?

Teams typically consider two options: centralizing data in a Data Warehouse or a Data Lake.

The Data Warehouse approach has been very popular. Given the tools available in the market and the previous limitations of SQL engines for the Data Lake, it has often been considered the most sensible solution.

The Data Warehouse approach solves some of the problems encountered as a result of implementing Data Virtualization:

  • Data is now centralized, so there is no longer a need to transfer large amounts of data across networks at query runtime
  • Performance is no longer bottlenecked by source system load or the design of the source system’s storage
  • Additionally, data is stored in a columnar format and optimized for analytics use cases. Users are generally able to get interactive performance across various workloads

However, as most organizations have experienced, whether through a single business unit’s use of a Data Warehouse, or a large-scale implementation with numerous Data Warehouses and data marts, this approach doesn’t fill every gap brought about by Data Virtualization, and introduces its own additional challenges:

  • Rising costs that are hard to predict, lack transparency, and can grow exponentially as you scale
  • Constant need to create and manage complex ETL pipelines
  • Proliferation of data copies for performance and to provide different views of the data
  • Security management burden and risk for all of these data copies
  • Data drift and KPI drift

With our Data Warehouse based architecture, we used to have to wait 3-6 weeks for even the smallest changes to our BI dashboards because our engineers were so backlogged with data requests. We simply couldn’t afford to wait that long to make critical business decisions.
- Multinational Technology Company

Further, these organizations suffer from data lock-in and the inability to use other engines like machine learning platforms. Looking forward, they’re locked out of introducing new engines and innovations that may come out in the future.

Many teams have chosen to live with these challenges, simply because there hasn’t been a better alternative that solves immediate business goals and alleviates the challenges above.

Improvements in the Data Lake ecosystem have finally brought about a viable alternative to this traditional approach.

Scale BI with Data Lakes and Dremio

Dremio’s no-copy data architecture offers query acceleration at near-infinite scale on a centralized data lake.

Let’s recall the analytic requirements we want to achieve, and that we tried to with Data Virtualization:

  • Self-service access for data consumers
  • A single place to go to access data
  • Access to a wide range of datasets
  • Fast availability for new or newly-requested datasets
  • Canonical KPI definitions
  • The ability to provide all of these capabilities efficiently from a data engineering perspective

The Data Virtualization approach checked a lot of boxes at a small scale. But small scale isn’t the goal: applications and users keep growing to meet the needs of data-intensive business processes. Working around these problems (extracts to a data warehouse, data copies) gets us stuck in the same old mess, so we need to take a different approach.

If we can get everything in a data lake working in tandem with the Dremio platform, then we don’t need data virtualization.
- A Fortune 50 High-Tech Company

The open data lake is the most efficient foundation to address the needs of modern data-driven teams. Object storage is so cheap that it is the fundamental, at-scale storage layer going forward.

Is there a way to land the data in object storage, move it as little as possible (ideally once), and still deliver it to downstream systems in a timely manner?

BEFORE: Data Teams were Stuck

  • Data consumers previously suffered because the source system was unable to send the data to the data virtualization platform fast enough at runtime
    • We need to use a storage system optimized for analytics (OLAP) — a system that can scan a large amount of data quickly. Optimally, the storage system doesn’t do any processing work, outside of responding to read requests
  • The data is being transferred from the source system to the Data Virtualization platform over a single connection, i.e., serially
    • We need a storage system that can send data to engines in a high level of parallelism
  • At a certain scale, the sheer amount of data that needs to be processed at runtime makes it impossible to do interactive analytics, simply due to the physics of reading that much data
    • To alleviate this, we need a solution that enables optimization through preprocessing without impacting the ease of use end users experience (i.e., materialized views + transparent substitution); a minimal sketch of this idea follows this list
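
To make "materialized views + transparent substitution" concrete, here is a minimal, engine-agnostic sketch. The table and column names are hypothetical, and a real engine such as Dremio performs the substitution at the query-plan level (Reflections) rather than with hand-written routing:

```python
# Minimal sketch of a materialized view with transparent substitution.
# Table/column names are hypothetical; real engines do this at the query-plan level.
from typing import Optional
import pandas as pd

# Raw fact table in the lake: billions of rows in practice, tiny here.
events = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "US"],
    "day":     ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01", "2024-01-02"],
    "revenue": [120.0, 80.0, 200.0, 50.0, 75.0],
})

# Precomputed once, ahead of query time: revenue rolled up by region and day.
daily_revenue_mv = events.groupby(["region", "day"], as_index=False)["revenue"].sum()

def revenue_by_region(raw: pd.DataFrame, mv: Optional[pd.DataFrame]) -> pd.DataFrame:
    """Answer 'total revenue per region'. If a matching aggregate exists,
    substitute it transparently instead of scanning the raw table."""
    if mv is not None and {"region", "revenue"}.issubset(mv.columns):
        source = mv      # substitution: roll up the much smaller precomputed aggregate
    else:
        source = raw     # fallback: full scan of the raw data
    return source.groupby("region", as_index=False)["revenue"].sum()

print(revenue_by_region(events, daily_revenue_mv))
```

The key point is that the question being asked never changes; only the data the engine reads does, which is why end users and BI tools see faster results without any change on their side.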

Data teams require:

  • A semantic/logical view of all the data, exposed to end users to enable self-service
  • Interactivity on that logical view, without needing to point users to other marts or areas

AFTER: With a data lake and Dremio, data teams get...

  • A low-ETL solution
    • You can analyze the data directly in the data lake; there is no need to write and manage ingest jobs to copy it elsewhere
  • A low-copy solution, with no performance or data mart copies
    • Dremio’s performance capabilities (Data Reflections, Apache Arrow, Gandiva, Caching) are sufficient to provide interactive BI directly on the data lake. Those, plus Dremio’s Semantic Layer, eliminate the need for performance and data mart copies (a brief client-side sketch follows this list)
  • Less reliance on IT (i.e., real self-service)
    • Many more business questions and reports can be addressed by LOB users directly, thanks to Dremio’s governed self-service capabilities
  • On-time data applications: execs no longer wait days or weeks for reporting changes
    • Data engineering can build the plumbing for reports much faster thanks to the ease of use and productivity gains Dremio provides
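
For example, because Dremio serves query results over Apache Arrow Flight, client tools can pull columnar data without JDBC-style row-by-row serialization. The sketch below uses PyArrow's Flight client; the host, port, credentials, and table name are assumptions, and the exact endpoint and authentication details depend on your deployment:

```python
# Illustrative sketch: pulling query results from Dremio over Arrow Flight.
# Host, port, credentials, and the table name are hypothetical; check your
# deployment's documentation for the exact endpoint and authentication flow.
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tcp://dremio.example.com:32010")

# Basic-auth handshake returns a bearer-token header used on subsequent calls.
token = client.authenticate_basic_token("analyst", "example-password")
options = flight.FlightCallOptions(headers=[token])

query = "SELECT region, SUM(revenue) AS revenue FROM lake.sales GROUP BY region"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Results stream back as Arrow record batches: columnar end to end, with no
# row-oriented serialization in the middle.
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.to_pandas())
```

Compared with a JDBC connection, the result set arrives already in the columnar layout BI and data science tools want, which is part of why interactive performance holds up at larger scales.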

You might think any massively parallel processing (MPP) SQL query engine that can query data lakes directly would achieve this. Not exactly.

Performance is only part of the equation when thinking about solving for big data analytics at scale.

What customers are saying on G2 (5-star reviews):

"Enabling fast and easy access to historically scattered enterprise data"

"User friendly self servicing Data lake engine"

"Efficient and User Friendly SQL layer on top of open file format"

The ease of optimization allows data professionals to unlock data consumers to get the most value from their data without proliferating copies of data everywhere. Only when an engine nails both query performance and ease of optimization will it have the desired impact on your organization's data challenges.

Dremio offers very simple optimizations to complex workloads (reflections). The downstream effects are massive (e.g. no app code changes, easily meet SLAs). The process of data exploration through productionalizing a data application is on a different level with a SQL platform like Dremio. Users are able to find data, define their logic and optimize the performance without moving any data out of the data lake.

Diagram: Query Performance + Ease of Optimization = Query Engine Value

Other tools can serve as the SQL layer, but they require many additional steps and potentially multiple copies of the data before users see the performance they want (especially for BI-style workloads).

Productionalizing interactive reports/dashboards

These optimizations and incredible performance gains require no changes at the application level (BI tools, dashboards, report logic, etc.). This makes it extremely simple for a wide range of users to benefit from these optimizations that are taking place at the query engine level.

Figure 1: Dremio provides faster response times than other query engines out of the box.

Figure 2: After creating reflections, Dremio provides subsecond response times with zero application changes.

There are many advantages to building your data foundation on a data lake with open formats (Parquet/Iceberg). These benefits are magnified when you introduce the most advanced SQL engine on top (Dremio). Set your team up for current and future data success: try Dremio today!

Graphic: Maximize the performance, scalability, and usability of your data, and minimize cost, copies, and confusion, with Dremio.

Take the next step!

Start getting more business value directly from your data lake today. Getting started with Dremio is easy.

  • Customer talk: Check out this video to see how RenaissanceRe modernizes BI with its centralized Cloud Data Lake and Dremio.
  • POC guide: Download this free POC guide to evaluate whether an analytic platform meets your business analytic requirements.
  • Hands-on tutorial: Want to see it for yourself? Try this free BI dashboard tutorial hands-on.
  • Data modernization consultation: Talk to a Data Analytics & BI expert about how you can modernize your data architecture with cloud data lakes and a lakehouse platform.

How can Dremio help?

Dremio is the SQL Lakehouse company, enabling companies to leverage open data architectures to drive more value from their data for BI and analytics.