Dremio Jekyll

Data Warehouses Explained by Dremio

At its simplest, data warehouse is a system used for storing and reporting on data. The data typically originates in multiple systems, then it is moved into the data warehouse for long-term storage and analysis. This storage is structured such that users from many divisions or departments within an organization can access and analyze the data according to their needs.

Diagram showing how data warehouses are connected to other business systems

A data warehouse contains data from many operational sources. It is used to analyze data.

Data warehouses are analytical tools, built to support decision making and reporting for users across many departments. They are also archives, holding historical data not maintained in operational systems.

Data warehouses work to create a single, unified system of truth for an entire organization. Unfortunately, as you might imagine, trying to maintain accuracy and thoroughness in such a system is incredibly difficult.

Why Use A Data Warehouse?

So, if data warehouses are so complex to build and maintain, why do organizations bother?

Data warehouses can provide:

  • A single point of access for all data, rather than requiring users to connect to dozens or even hundreds of systems individually
  • An assurance of data quality
  • A history of the data they store
  • Separation between day-to-day operational systems and analytical systems, for security reasons
  • A standard set of semantics around data, for example: consistency in naming conventions, codes for different types of products, languages, and currencies, and so on

Storing comprehensive data in structured relationships, means that data warehouses can also provide answers to a whole variety of complex questions, like:

  • How much revenue has each of our product lines brought in per month over the past ten years, broken out by city and state?
  • What is the average transaction size at one of our ATMs, broken out by time of day and total customer assets?
  • What is the percentage employee turnover for the past year in stores that have been open for at least three years? How many hours did those employees work per week?

Data Warehouse Architecture

Data warehouses are built in many different forms, attempting to account for and structure the complexity of the organizations that use them.

But the basic architecture is pretty consistent:

First the raw data is formatted, sometimes called cleansing and normalizing. You can think of this as a pipeline that moves raw data from sources to the warehouse, ensuring the data is appropriately named and formatted and in accurate relationships with the rest of the data that is stored. This is often called the integration layer, and is not necessarily considered part of the data warehouse itself.

The formatted data is then stored in the data warehouse itself. An access layer allows the tools and applications to retrieve data in a format that suits their needs.

There is another aspect to data warehouse architecture that governs the whole structure called metadata. Metadata is data about the data. The data engineers and data scientists who maintain a data warehouse collect information about data sources, naming conventions, refresh schedules, etc., and use this information to maintain data quality and to make sure that the data warehouse is serving its intended purposes.

Key Concepts

OLTP vs. OLAP

Generally, data management systems can be considered either OLAP (Online Analytical Processing), or OLTP (Online Transaction Processing). In general, OLTP systems create or capture data in online applications, and OLAP systems analyze data that has been collected from one or more systems. A data warehouse is an example of an OLAP system.

  OLTP OLAP
Purpose Capture and create data Analyze data
Examples Systems that run your business: CRM, finance, etc. Data warehouses and data marts
Designed To Deal efficiently with large volumes of transactions Allow analysts to efficiently answer questions that require data from multiple source systems
Storage Term Short-term or medium-term; the most recent year or most recent few years of data Long-term, typically many years or the full history of the business

As such, OLTP and OLAP systems have different technical attributes and requirements.

  OLTP OLAP
Storage Type Optimized for small reads and writes to maximize throughput and to minimize latency. Extensive use of indexes. Optimized for large reads of many rows of data. May use no indexes or just a few indexes.
Query Type Reads and writes of one or a few rows Mostly reads of many rows
Query Scope Small, discrete pieces of data, such as updating the address for a customer Large numbers of records, eg, what was the average purchase across all sales over the past 5 years
Query Latency 10ms - 200ms 1 sec - several minutes
Concurrent Queries 100-10,000 qps <50
Data Size 10GB-1TB 1TB-1PB
Physical Storage Row-oriented – meaning that a system scans all values in a given row, discarding what it doesn’t need. Good for retrieving all information about a specific object, but less efficient at performing operations on a column for all rows in a table. Column-oriented – meaning that a system scans all values in a given column. Good for performing aggregations across large numbers of records. Less efficient for looking at all attributes for a given object (the whole row).
Vendor Products (relational) Oracle, SQL Server, DB2 Teradata, Vertica, Greenplum, Netezza, Sybase IQ
Vendor Products (non-relational) MongoDB, Elasticsearch, Cassandra Hadoop
Vendor Products (cloud) Amazon RDS, Amazon DynamoDB, Azure DocumentDB, MongoDB Atlas Amazon Redshift, Azure SQL Data Warehouse, Google BigQuery

Extract, Transform, Load (ETL)

ETL systems govern the movement of data between the systems of source data and a data warehouse (i.e.: the pipeline mentioned in the section on data warehouse architecture), as well as movement from a data warehouse to data marts. Data must first be extracted from a source, then transformed according to the standards of the next layer of storage, and finally the formatted data must be correctly loaded into this next layer. Learn more about ETL tools.

Data Warehouse vs. Database

Strictly speaking, a database is any structured collection of data. An Excel spreadsheet, Rolodex, or address book would all be very simple examples of databases. Software such as Excel, Oracle, or MongoDB is a database management system (DBMS) that allows users to access and manage the database. It is common for people to call a DBMS a database. A data warehouse then is a type of database. It is specialized in the data it stores - historic data from many sources - and the purpose it serves - analytics.

Data Warehouse vs. Data Lake

A key attribute of databases, and therefore data warehouses, is that they contain structured data. The way that data is stored – from what fields are available, to date formats, and everything in between – is agreed upon in advance and the entire database follows this structure, or schema, rigorously. Their relative consistency and stability, mean that data warehouses can serve queries from many types of roles in the organization. This process is very structured, very predictable, and very efficient, but it’s also hard to do well.

Consensus is still forming around the precise definition of a data lake. But, generally speaking, data lakes are another means of storing data, but without the rigorous schema of a data warehouse. In contrast to a data warehouse, schema in the data lake is applied by the query. This means it is much easier to load the data into a data lake, but queries are much more complex to construct, which limits the use of the data lake to sophisticated roles like data engineers and data scientists. Data lakes usually take much longer to return results when compared to data warehouses. Most data lakes are built on Hadoop.

Data Warehouse vs. Data Mart

If a data warehouse holds and integrates data from across an organization, a data mart is a smaller subset of the data, specialized for the use of a given department or division. Often data marts are built and controlled by a single department, using the central data warehouse along with internal operating systems and external data. Data marts typically hold just one subject area, for example marketing or sales. Because they are smaller and more specific, they are often easier to manage and maintain, as well as having more flexible structures.

OLAP Cubes

OLAP systems often contain tremendous quantities of data, which can make running specific queries on them slow. To speed up and simplify querying, OLAP systems can be broken down further into sub-databases called cubes, which contain a limited set of dimensions and therefore provide faster querying time.

Business Intelligence Tools

Business intelligence software is a critical layer on top of a data warehouse that allows the information within it to be used to make business decisions. Business intelligence software has many different types of functionality, but it usually includes some sort of engine for building and executing queries, as well as a way to store and visualize results for incorporation into documents containing business analysis.

Data Warehouse Technologies

When considering which tools to use, it’s important to be sure that they meet your requirements in terms of scalability (can it grow with your needs), access (which users and how many need access to your warehouse and from what locations), and integrations (does this system integrate with your data sources and BI tools).

Relational Databases

Relational databases are the systems that store the majority of business data. They have been around for several decades and are very mature. Relational databases exist for OLTP and OLAP use cases. They are very well understood and provide extensive ecosystems of complementary technologies for management and data access. Popular relational databases include:

  • IBM DB2. IBM bundles their core relational database, DB2, with their application server, data mining technologies, Cognos for cubes and BI, and other technologies. They call this DB2 Warehouse. DB2 is most popular with IBM shops who use IBM’s hardware, software, and services.
  • Microsoft SQL Server. Microsoft has built out the features of its core relational database over many releases to support large data warehousing deployments. They have also built out integrations to their cloud offerings and Hadoop to provide a more comprehensive offering for structured and unstructured data. Microsoft also has the best management tooling and integrations with other Microsoft products, including Windows. SQL Server is most popular with Microsoft shops and tends to be more cost competitive.
  • Oracle Exadata. Like IBM, Oracle is also a hardware vendor. Exadata is an appliance that is built on the core Oracle database, with hardware and software that has been optimized for large scale, high performance systems. Oracle is the most popular database in the market. Exadata is very expensive.

MPP (Massively Parallel Processing) Analytical Databases

MPP databases are a data warehouse technology that focuses on hardware and software to support parallel processing. That is, query processing is broken up into many smaller parallel tasks that are executed together across multiple servers. This approach speeds up query and ingestion times dramatically. However, MPP databases can be expensive, as the most popular vendors have closed-source products.

Popular MPP databases include:

  • Teradata. Teradata is the gold standard. It is the most widely used and most mature. Unlike the relational vendors listed above, Teradata has always focused on the data warehouse exclusively. It is available as an appliance only and is very expensive.
  • IBM Netezza. Netezza is another data warehouse appliance offering that grew quickly as a result of aggressive pricing and ease of use. The software is based in part on the open source relational database PostgreSQL. One of the key Netezza innovations was the use of field programmable gate arrays (FPGA) to speed up query processing. Netezza was acquired by IBM in 2010.
  • HP Vertica. Vertica was an innovative column-oriented MPP database that came from research at MIT. Unlike competing products at the time, Vertica was available as software that supported multiple hardware platforms and provided compelling performance as a result of a novel SQL execution engine and strong compression capabilities. Vertica was acquired by HP in 2011.
  • Pivotal Greenplum. Greenplum is an MPP database based on PostgreSQL. Because it is based on PostgreSQL, it has a broader ecosystem of complementary tools. And because it is row-oriented it makes a different set of tradeoffs compared to column-oriented products. It was acquired by EMC in 2010, then open sourced by Pivotal in 2015.
  • SAP Sybase IQ. Sybase IQ was one of the first column-oriented databases and entered the market in the late 1990s. It has a long track record, especially in Sybase-centric markets like Financial Services. IQ was well known for strong performance, but was viewed as being expensive and difficult to manage and scale. Sybase was acquired by SAP in 2010.

Hadoop

Similar to the MPP approach, Hadoop is also able to distribute computational tasks across a cluster of computers. However, Hadoop is open source, and as such is easier and cheaper for enterprises to adopt. Most data lakes use Hadoop because of its ability to cheaply and quickly handle large volumes of structured and unstructured data, and because it is well-suited to storing data in its native formats so that analysts can apply whichever transformations they need. Hadoop is not technically a database, but it is overlaps in a number of functional areas.

Alternatives to Data Warehouses

Data warehouses are powerful and useful, but require large amounts of work both upfront and on an ongoing basis to use and maintain. Because data warehouses have to be centralized and standardized in order to serve many users across an organization, they always include assumptions about what should be stored, and how, that cannot be optimal for all users.

There is a better way. Dremio is an alternative to data warehouses.