Data Virtualization seems promising. But does it scale with your data and BI needs?

Data engineers and data architects strive to provide data consumers with the data and analytics user experience they need. Data virtualization can look very promising - meet all my BI goals and not have to move data? Yes please!

However, as your data, user, and application scale grows, old and new problems arise that put you back where you started — or worse. Data lakehouses combined with Dremio can address BI needs at any scale.

The promise of Data Virtualization, and why it fails as you scale

It has long been a goal of organizations to provide decision-makers with self-service access to data and analytics so they can make evidence-based decisions and run the business better. There are a number of capabilities organizations need to provide in order to fully achieve that goal:

  • Self-service access for data consumers
  • A single place to go to access data
  • Access to a wide range of datasets
  • Fast availability for new or newly-requested datasets
  • Canonical KPI definitions
  • Efficient management from a data engineering perspective

Trying to provide these capabilities can get tricky - enterprise architectures are complex. The number of different data sources can grow to be staggering. Further, different business units in an organization typically have some level of their own systems, often including their own Data Warehouse or Data Mart. There are tools that can connect directly to these different systems that data consumers can use, allowing you to provide self-service access to all the data for your end users on demand, without needing to do any additional work moving the data. This approach seems like a fairly clean solution to a hard problem, so why wouldn’t you want to do that?

This approach can work at a smaller scale. For organizations or business units early in their journey to be data-driven or ones that are inherently at a smaller scale, the amount of data, users, and applications they have typically isn’t very large. In those situations, Data Virtualization works fairly well. Because the scale is small, organizations are able to provide self-service access to this data with fairly interactive response times to their data consumers. This allows them to make self-service data-driven decisions quickly and run the business better.

See Why Data Virtualization Fails at Scale

However, as the amount of data, users, and applications grow in an organization, the first problem encountered is performance. Users aren’t able to get answers to their questions nearly as fast as they need to be effective. These performance problems result in users asking fewer questions or even being unable to wait for an answer at times. The result is poor decisions based on hunches or guesses lacking data.

The causes of these performance problems fall into one or more of the following categories:

  • The source system is unable to send the data to the data virtualization platform fast enough at runtime, whether because of source system load or the design of the source’s storage system.
  • The network speed between the data virtualization platform and the source system isn’t fast enough to transfer the data at runtime.
  • The protocol used, generally JDBC, isn’t able to transmit the data from the source system to the data virtualization platform fast enough at runtime.
  • The data is being transferred from the source system to the data virtualization platform over a single connection, i.e., serially (illustrated in the sketch after this list).
  • At a certain scale, the sheer amount of data that needs to be processed at runtime makes it impossible to do interactive analytics, simply due to the physics of reading that much data.
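To make the runtime-transfer cost concrete, here is a minimal sketch of the single-connection, row-oriented fetch pattern a Data Virtualization layer typically relies on at query time. The connection, query, and batch size are hypothetical placeholders; any PEP 249 (DB-API) source connection would behave similarly.

```python
# Minimal sketch (hypothetical connection and query): the single-connection,
# row-oriented transfer that a Data Virtualization layer performs at query runtime.
def fetch_over_single_connection(source_conn, query, batch_size=50_000):
    """Pull an entire result set through one cursor, one batch at a time.

    Every row is serialized by the source, shipped across the network, and
    deserialized here -- all at runtime, all over a single connection, so
    elapsed time grows with the amount of data the query touches.
    """
    cur = source_conn.cursor()
    cur.execute(query)
    rows = []
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        rows.extend(batch)
    return rows
```

A federated query that joins large tables from two sources has to run this loop for each of them before any join work can even start, which is why response times degrade as data volumes grow.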

Because all of these problems stem from transferring the data at runtime, the usual solution is to transfer the data ahead of time on a case-by-case basis: copy some datasets into a Data Warehouse and connect that Data Warehouse to the Data Virtualization tool as well. This method also offloads the analytic workload from the operational systems, so they can continue to serve their primary function properly.

See How Renaissance ReInsurance Discovered the Pitfalls of Data Virtualization

However, as the data, user, and application scale continues to grow, generally one Data Warehouse for the whole organization ends up being insufficient, whether for flexibility, cost, complexity, or operational reasons. Typically at that point, each business unit gets their own Data Warehouse or Data Mart. Once again, these Data Warehouses are connected to the Data Virtualization platform, and data consumers have self-service access to a wide range of data.

The performance problems of transferring data at runtime still apply when joining data across two Data Warehouses, for exactly the same reasons as before. So, joining data at this scale across the two systems is handled through an IT request and an ETL pipeline that copies the data into the destination Data Warehouse.

Augmenting Data Virtualization with data copies to address the performance problems now causes a whole new set of problems at this scale:

  • Lack of self-service 
  • Data engineering overhead
  • Slow turnaround time for requests
  • Data drift 
  • Regulatory compliance issues
  • Infrastructure cost

The workaround of creating data copies to address performance problems associated with Data Virtualization is a big topic worth its own dedicated page. For more information, see this whitepaper discussing the unexpected cost of data copies.

Data Virtualization was supposed to help solve some of these problems. Instead, when used at scale, it exacerbates the problems and introduces new issues you didn’t have before.

All of these problems are downstream effects of a lack of performance in the Data Virtualization approach. So, if we can solve the performance problems, we can avoid the workarounds that cause these downstream issues. 

Let’s recall the causes of the performance problems of Data Virtualization in the first place, and see how we can address them:

Data Virtualization Performance Problems and Mitigations

  • Problem: The source system is unable to send the data to the DV platform fast enough at runtime because of source system load or the design of the source’s storage system.
    Mitigation: The processing engine fulfilling user requests needs direct access to the storage and data.

  • Problem: The network speed between the DV platform and the source system isn’t fast enough to transfer the data at runtime.
    Mitigation: We need a performant network between the processing engine and the storage engine. When the data gets too large to be handled by network throughput alone, we need to bring the data closer via caching (to reduce the bandwidth needed) or via pre-computation (to reduce the amount of data that must be transferred).

  • Problem: The protocol used, generally JDBC, isn’t able to transmit the data from the source system to the Data Virtualization platform fast enough at runtime.
    Mitigation: The processing engine fulfilling user requests needs a protocol that keeps the data in columnar format end to end, rather than needlessly serializing columnar data to rows only to deserialize it back to columns inside the engine (sketched in the example that follows).

  • Problem: The data is being transferred from the source system to the Data Virtualization platform over a single connection, i.e., serially.
    Mitigation: The processing engine fulfilling user requests needs to access the storage system with a high degree of parallelism.

  • Problem: At a certain scale, the sheer amount of data that needs to be processed at runtime makes interactive analytics impossible, simply due to the physics of reading that much data.
    Mitigation: The processing engine needs to support pre-computation over the data, reducing the amount that must be processed at runtime.
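As a minimal sketch of that columnar-protocol point: Dremio exposes an Arrow Flight endpoint, so a client can pull query results as Arrow record batches that stay columnar from engine to client, instead of being row-serialized the way JDBC/ODBC results are. The host, port, credentials, and table name below are hypothetical placeholders.

```python
import pyarrow.flight as flight

# Hedged sketch: endpoint, credentials, and the query are placeholders.
client = flight.FlightClient("grpc+tcp://dremio-coordinator:32010")
token = client.authenticate_basic_token("analyst", "password")
options = flight.FlightCallOptions(headers=[token])

descriptor = flight.FlightDescriptor.for_command(
    "SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region"
)
info = client.get_flight_info(descriptor, options)
reader = client.do_get(info.endpoints[0].ticket, options)

result = reader.read_all()   # a pyarrow.Table -- columnar end to end
print(result.to_pandas())
```

Flight also allows a result to be split across multiple endpoints and fetched in parallel streams, which speaks to the single-connection problem above as well.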

If we look at the above requirements all together, the conclusion is that, in order to meet our business analytic goals, the majority of the data needs to be physically moved out of the source systems and into a centralized platform.

POC Guide for a Modern Approach to Business Analytics

BI and Analytics Paradigm

Data Warehouses become too expensive too fast, and result in unmanageable copies of data

It’s now clear that in order to meet our business goals - self-service, comprehensive data access, fast availability, and a single pane of glass to ensure consistent KPIs - we need to physically centralize at least a majority of the data in one platform. The next logical question is... "Where?"

Teams typically consider two options: Centralizing data in a Data Warehouse or a Data Lakehouse.

Employing the Data Warehouse approach has been very popular. Given the tools available in the market and the previous limitations of SQL engines for the Data Lakehouse, it has often been considered the most sensible solution.

The Data Warehouse approach solves some of the problems encountered as a result of implementing Data Virtualization:

  • Data is now centralized, so there is no longer a need to transfer large amounts of data across networks at query runtime.
  • Performance is no longer bottlenecked by source system load or the design of the source system’s storage.
  • Additionally, data is stored in a columnar format and optimized for analytics use cases. Users are generally able to get interactive performance across various workloads.

However, as most organizations have experienced - whether through a single business unit’s use of a Data Warehouse or a large-scale implementation with numerous Data Warehouses and data marts - this approach doesn’t close every gap left by Data Virtualization, and it introduces additional challenges of its own:

  • Costs that are hard to predict, lack transparency, and grow steeply as you scale
  • Constantly creating and managing complex ETL pipelines
  • Proliferating data copies with different views of the data
  • Security management burden and risk for all of these data copies
  • Data drift and KPI drift

Further, these organizations suffer from data lock-in and the inability to use other engines, such as machine learning platforms. They’re also locked out of new engines and innovations that may emerge in the future.

Many teams have chosen to live with these challenges, simply because there hasn’t been a better alternative that solves immediate business goals and alleviates the challenges above.

Improvements in the Data Lakehouse ecosystem have finally brought about a viable alternative to this traditional approach.

We used to have to wait 3-6 weeks for even the smallest changes to our BI dashboards because our engineers were so backlogged with data requests. We simply couldn't afford to wait that long to make critical business decisions.
Multinational Technology Company

Dremio's no-copy data architecture offers query acceleration at near-infinite scale, on a centralized data lakehouse

Let’s recall the analytic requirements we want to achieve, and that we tried to with Data Virtualization:

  • Self-service access for data consumers
  • A single place to go to access data
  • Access to a wide range of datasets
  • Fast availability for new or newly-requested datasets
  • Canonical KPI definitions
  • Efficient management from a data engineering perspective

The Data Virtualization approach checked a lot of boxes at a small scale, but it cannot meet the needs of data-intensive business processes as the number of applications and users grows. Working around these problems (extracts to a data warehouse, data copies) gets us stuck in the same old mess... we need to take a different approach.

The Open Data Lakehouse is the most efficient foundation to address the needs of modern data-driven teams. Object storage is so cheap...it's the fundamental, at-scale storage layer going forward.

If we can get everything in a data lake working in tandem with the Dremio platform, then we don’t need data virtualization.
A Fortune 50 High-Tech Company

Is there a way to land the data in object storage, move it as little as possible (ideally once), and still deliver it to downstream systems in a timely manner?

BEFORE: Data Teams Were Stuck

Slow Processing

Data consumers suffered because the source system was unable to send the data to the Data Virtualization platform fast enough at runtime.

We need to use a storage system optimized for analytics (OLAP) - a system that can scan a large amount of data quickly. Ideally, the storage system doesn’t do any processing work, outside of responding to read requests.

Single Connection

The data is transferred from the source system to the Data Virtualization platform over a single connection, i.e. serially.

We need a storage system that can send data to engines with a high level of parallelism.

High Data Volumes

At a certain scale, the sheer amount of data that needs to be processed at runtime makes it impossible to do interactive analytics, simply due to the physics of reading that much data.

We need a solution that enables optimization by preprocessing without impacting the ease of use the end users are experiencing, i.e., materialized views + transparent substitution.
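As a conceptual sketch (not Dremio’s implementation), the idea behind "materialized views + transparent substitution" is that an aggregate is pre-computed once, and the engine silently answers matching queries from that smaller result instead of rescanning the raw data. The table and column names below are hypothetical.

```python
import pandas as pd

# Toy illustration only: a real engine (e.g., Dremio via Reflections) performs
# this substitution inside the query planner, invisibly to the user.
orders = pd.DataFrame({
    "day":    ["2024-01-01", "2024-01-01", "2024-01-02"],
    "region": ["EMEA", "AMER", "EMEA"],
    "amount": [120.0, 80.0, 200.0],
})

# Pre-computation: aggregate once, ahead of query time.
daily_sales = orders.groupby(["day", "region"], as_index=False)["amount"].sum()

def run_query(group_keys):
    """Answer an aggregate query, substituting the materialized aggregate
    whenever it can satisfy the request instead of scanning the raw data."""
    if set(group_keys) <= {"day", "region"}:
        # Transparent substitution: serve from the smaller pre-aggregate.
        return daily_sales.groupby(group_keys, as_index=False)["amount"].sum()
    return orders.groupby(group_keys, as_index=False)["amount"].sum()

print(run_query(["region"]))  # the user's query is unchanged; the engine picks the source
```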

Data teams require:

  • A semantic/logical view of all the data, exposed to end users to enable self-service.
  • Interactive performance on that logical view, without pointing users to other marts or systems.

AFTER Implementing a Data Lakehouse: Data Teams Deliver

A Low ETL Solution

Analyze the data directly in the data lakehouse. No need to write and manage ingest jobs to copy it elsewhere.
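For instance, here is a minimal sketch of reading Parquet files straight from object storage with PyArrow; the bucket path and column names are hypothetical, and S3 credentials are assumed to be configured. Query engines on the lakehouse work the same way at scale: they scan the open-format files in place, in parallel, with column pruning and predicate pushdown, rather than ingesting a copy first.

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Hypothetical bucket/path and columns; no ETL copy into a warehouse required.
orders = ds.dataset("s3://analytics-lake/orders/", format="parquet")

table = orders.to_table(
    columns=["order_id", "region", "amount"],   # read only the needed columns
    filter=pc.field("region") == "EMEA",        # push the filter into the scan
)
print(table.num_rows)
```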

A low copy solution with no performance or data mart copies

Dremio’s performance capabilities (Data Reflections, Apache Arrow, Gandiva, Caching) are sufficient to provide interactive BI directly on the data lakehouse. Those, plus Dremio Spaces, eliminate the need for performance and data mart copies.

Less reliance on IT

Many more business questions and reports can be addressed by LOB users directly due to Dremio’s governed self-service capabilities.

On-time data applications

Execs no longer wait days or weeks for reporting changes. Data engineering builds the plumbing of the reports much faster due to the ease of use and increased productivity.

Couldn’t you achieve this with any massively parallel processing (MPP) SQL query engine that can query data lakehouses directly? Not exactly. Performance is only part of the equation when solving for big data analytics at scale.

Easy optimizations (Dremio Data Reflections) let data professionals give data consumers the most value from their data without proliferating copies of data everywhere. Only when an engine nails both performance and this kind of copy-free optimization will it have the desired impact on your organization's data challenges.

Dremio offers very simple optimizations for complex workloads (Reflections). The downstream effects are massive: no application code changes, and SLAs are easily met. The process from data exploration through productionalizing a data application is on a different level with a SQL platform like Dremio. Users can find data, define their logic, and optimize performance without moving any data out of the data lakehouse.

Other tools can serve as the SQL layer, but they require many additional steps, and potentially multiple copies of the data, before users see the performance they want - especially for BI-style workloads.

Productionalizing interactive reports/dashboards

These optimizations and incredible performance gains require no changes at the application level, i.e., BI tools, dashboards, report logic, etc. This makes it extremely simple for a wide range of users to benefit from these optimizations taking place at the query engine level.

There are several advantages to building your data foundation on a data lakehouse with open formats (Apache Parquet and Apache Iceberg). These benefits are magnified when you introduce the most advanced SQL engine on top (Dremio). Set your team up for current and future data success... try Dremio today!
