Next-Gen Data Analytics - Open Data Architecture
The Rise of the Data Warehouse
The variety of data has changed dramatically in the last few years, and the arrival of self-service discovery and analytics tools, along with new methods for fast and easy access to data lakes, has come not a moment too soon. In this blog, I’ll review classic solutions for collecting and consuming data, how things have changed, and how Dremio can work directly with your data at lightning-fast query speeds.
Here is a classic example of how a data warehouse can be built.
We have all seen such solutions for collecting, processing and consuming data. The model works fine as long as the data is structured and of reasonable size, in the hundreds of gigabytes. Once it grows to terabytes or petabytes, you need to invest more time in understanding how to partition the data and precalculate cubes and BI extracts.
What Has Changed in the Data Analytics World
- The data. Data is the new oil, and it has become far more varied. Relational data no longer dominates; instead, semi-structured and unstructured data such as JSON, Parquet, voice, images and video make up a growing share. Modern object storage, such as AWS S3 and ADLS in the cloud or Scality and Dell EMC ECS on premises, solves the integration challenge by making it easy to store all of this data in a lake.
- The rise of self-service discovery and analytics tools. Tableau, Power BI and Jupyter Notebooks give analysts and data scientists the freedom to explore the data on their own.
Unfortunately, access to the data lake is not easy. The volume of raw data is massive, and retrieval from the lake is usually slow. To control compute resource usage, only a small group of people is typically allowed to access the data lake directly. Keeping data secure and closely governed is another problem: because of the nature of data lakes, there is no easy way to control access to data or to align with enterprise-wide master data management practices.
The next question is, how can the data in the lake be consumed with acceptable performance while also being governed?
First and foremost, the natural tendency is to reuse methods that have been in place for years: let’s build or extend a data warehouse. This gives us the required, aligned structure and a fast layer on which we can build data marts or cubes.
However, the old pattern cannot satisfy current requirements. Easy ingestion and self-service data access put stress on the data warehouse, which is by nature a monolith, and its inflexibility conflicts with the agile nature of the self-service layer. As a result, data engineers cannot keep up with analysts’ requirements.
What we need is:
- Fast data access without complex ETL processes or cubes
- An easy way to get access to the data lake without duplicating the data
- Governed, secured and audited data access
- An easily searchable semantic layer
Dremio’s Data Lake Engine delivers lightning-fast query speed and a self-service semantic layer operating directly against your data lake storage.
Dremio provides connections to S3, ADLS, Hadoop or wherever your data is. Apache Arrow, Data Reflections, C3 and other Dremio technologies work together to speed up queries by up to 1,000x. An abstraction layer enables IT to apply security and business meaning, while enabling analysts and data scientists to explore data and derive new virtual datasets.
Dremio works directly with your data lake storage. You don’t have to send your data to Dremio, or have it stored in proprietary formats that lock you in. Dremio is built on open source technologies such as Apache Arrow, and can run in any cloud or data center. Dremio’s powerful joining abilities mean that you can easily take advantage of other data sources as well.