The Apache Iceberg Advantage
An in-depth review of Apache Iceberg, an open table format for enterprise data lakes.
read moreAt its simplest, data warehouse is a system used for storing and reporting on data. The data typically originates in multiple systems, then it is moved into the data warehouse for long-term storage and analysis. This storage is structured such that users from many divisions or departments within an organization can access and analyze the data according to their needs.
A data warehouse contains data from many operational sources. It is used to analyze data.
Data warehouses are analytical tools, built to support decision making and reporting for users across many departments. They are also archives, holding historical data not maintained in operational systems.
Data warehouses work to create a single, unified system of truth for an entire organization. Unfortunately, as you might imagine, trying to maintain accuracy and thoroughness in such a system is incredibly difficult.
So, if data warehouses are so complex to build and maintain, why do organizations bother?
Data warehouses can provide:
Storing comprehensive data in structured relationships, means that data warehouses can also provide answers to a whole variety of complex questions, like:
Data warehouses are built in many different forms, attempting to account for and structure the complexity of the organizations that use them.
But the basic architecture is pretty consistent:
First the raw data is formatted, sometimes called cleansing and normalizing. You can think of this as a pipeline that moves raw data from sources to the warehouse, ensuring the data is appropriately named and formatted and in accurate relationships with the rest of the data that is stored. This is often called the integration layer, and is not necessarily considered part of the data warehouse itself.
The formatted data is then stored in the data warehouse itself. An access layer allows the tools and applications to retrieve data in a format that suits their needs.
There is another aspect to data warehouse architecture that governs the whole structure called metadata. Metadata is data about the data. The data engineers and data scientists who maintain a data warehouse collect information about data sources, naming conventions, refresh schedules, etc., and use this information to maintain data quality and to make sure that the data warehouse is serving its intended purposes.
Generally, data management systems can be considered either OLAP (Online Analytical Processing), or OLTP (Online Transaction Processing). In general, OLTP systems create or capture data in online applications, and OLAP systems analyze data that has been collected from one or more systems. A data warehouse is an example of an OLAP system.
Purpose |
Examples |
Designed To |
Storage Term |
As such, OLTP and OLAP systems have different technical attributes and requirements.
Storage Type |
Query Type |
Query Scope |
Query Latency |
Concurrent Queries |
Data Size |
Physical Storage |
Vendor Products (relational) |
Vendor Products (non-relational) |
Vendor Products (cloud) |
ETL systems govern the movement of data between the systems of source data and a data warehouse (i.e.: the pipeline mentioned in the section on data warehouse architecture), as well as movement from a data warehouse to data marts. Data must first be extracted from a source, then transformed according to the standards of the next layer of storage, and finally the formatted data must be correctly loaded into this next layer.
Strictly speaking, a database is any structured collection of data. An Excel spreadsheet, Rolodex, or address book would all be very simple examples of databases. Software such as Excel, Oracle, or MongoDB is a database management system (DBMS) that allows users to access and manage the database. It is common for people to call a DBMS a database. A data warehouse then is a type of database. It is specialized in the data it stores - historic data from many sources - and the purpose it serves - analytics.
A key attribute of databases, and therefore data warehouses, is that they contain structured data. The way that data is stored – from what fields are available, to date formats, and everything in between – is agreed upon in advance and the entire database follows this structure, or schema, rigorously. Their relative consistency and stability, mean that data warehouses can serve queries from many types of roles in the organization. This process is very structured, very predictable, and very efficient, but it’s also hard to do well.
Data lakes are another means of storing data, but without the rigorous schema of a data warehouse. In contrast to a data warehouse, schema in the data lake is applied by the query. This means it is much easier to load the data into a data lake, but queries are more complex to construct. Data lakes usually take much longer to return results when compared to data warehouses.
If a data warehouse holds and integrates data from across an organization, a data mart is a smaller subset of the data, specialized for the use of a given department or division. Often data marts are built and controlled by a single department, using the central data warehouse along with internal operating systems and external data. Data marts typically hold just one subject area, for example marketing or sales. Because they are smaller and more specific, they are often easier to manage and maintain, as well as having more flexible structures.
OLAP systems often contain tremendous quantities of data, which can make running specific queries on them slow. To speed up and simplify querying, OLAP systems can be broken down further into sub-databases called cubes, which contain a limited set of dimensions and therefore provide faster querying time.
Business intelligence software is a critical layer on top of a data warehouse that allows the information within it to be used to make business decisions. Business intelligence software has many different types of functionality, but it usually includes some sort of engine for building and executing queries, as well as a way to store and visualize results for incorporation into documents containing business analysis.
When considering which tools to use, it’s important to be sure that they meet your requirements in terms of scalability (can it grow with your needs), access (which users and how many need access to your warehouse and from what locations), and integrations (does this system integrate with your data sources and BI tools).
Relational databases are the systems that store the majority of business data. They have been around for several decades and are very mature. Relational databases exist for OLTP and OLAP use cases. They are very well understood and provide extensive ecosystems of complementary technologies for management and data access. Popular relational databases include:
MPP databases are a data warehouse technology that focuses on hardware and software to support parallel processing. That is, query processing is broken up into many smaller parallel tasks that are executed together across multiple servers. This approach speeds up query and ingestion times dramatically. However, MPP databases can be expensive, as the most popular vendors have closed-source products.
Popular MPP databases include:
Similar to the MPP approach, Hadoop is also able to distribute computational tasks across a cluster of computers. However, Hadoop is open source. Some data lakes use Hadoop because of its ability to cheaply and quickly handle large volumes of structured and unstructured data.
Every company manages a data warehouse to capture the history of their business. But when it comes to general purpose analytics, data warehouses are frequently used because they are the only tool available. Data warehouses are powerful and useful, but require large amounts of work both upfront and on an ongoing basis to use and maintain. Because data warehouses have to be centralized and standardized in order to serve many users across an organization, they always include assumptions about what should be stored, and how, that cannot be optimal for all users.
Dremio is complementary to your data warehouse, and opens up a broader set of data to a larger set of data consumers for diverse analytical needs. Learn more about how Dremio complements your data warehouse.
An in-depth review of Apache Iceberg, an open table format for enterprise data lakes.
read moreWatch Andres Bogsnes, Master Expert at Nordea Asset Management on how Nordea Asset Management journey to implement Data Domains with Dremio globally.
read moreGalp data strategy is built upon the pillar of democratizing data access and analytics, promoting decentralization when it comes to data product development.
read more