At its simplest, a data warehouse is a system used for storing and reporting on data. The data typically originates in multiple systems and is then moved into the data warehouse for long-term storage and analysis. This storage is structured so that users from many divisions or departments within an organization can access and analyze the data according to their needs.
A data warehouse contains data from many operational sources and is used to analyze that data.
Data warehouses are analytical tools, built to support decision making and reporting for users across many departments. They are also archives, holding historical data not maintained in operational systems.
Data warehouses aim to create a single, unified source of truth for an entire organization. Unfortunately, as you might imagine, maintaining accuracy and completeness in such a system is incredibly difficult.
So, if data warehouses are so complex to build and maintain, why do organizations bother?
Data warehouses provide benefits that individual operational systems cannot. Storing comprehensive data in structured relationships means that a data warehouse can answer a whole variety of complex questions that cut across departments and source systems.
Data warehouses are built in many different forms, attempting to account for and structure the complexity of the organizations that use them.
But the basic architecture is pretty consistent:
First, the raw data is formatted, a process sometimes called cleansing and normalizing. You can think of this as a pipeline that moves raw data from its sources into the warehouse, ensuring the data is appropriately named and formatted and stands in accurate relationships with the rest of the stored data. This stage is often called the integration layer, and it is not necessarily considered part of the data warehouse itself.
The formatted data is then stored in the data warehouse itself. An access layer allows tools and applications to retrieve data in a format that suits their needs.
There is another aspect to data warehouse architecture that governs the whole structure called metadata. Metadata is data about the data. The data engineers and data scientists who maintain a data warehouse collect information about data sources, naming conventions, refresh schedules, etc., and use this information to maintain data quality and to make sure that the data warehouse is serving its intended purposes.
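As a rough illustration, metadata can be as simple as a catalog mapping each table to facts about its source and maintenance. The table names, fields, and schedules below are hypothetical, and real warehouses track this in dedicated catalog tools:

```python
# A minimal sketch of a metadata catalog. All table names, sources, and
# schedules here are illustrative assumptions, not from any real system.
metadata = {
    "sales_facts": {
        "source": "crm_orders",             # originating operational system
        "refresh_schedule": "daily 02:00",  # when the pipeline reloads it
        "naming_convention": "snake_case",  # column naming rule
    },
    "customer_dim": {
        "source": "crm_customers",
        "refresh_schedule": "hourly",
        "naming_convention": "snake_case",
    },
}

def tables_from_source(source_prefix):
    """Answer a simple maintenance question from the catalog:
    which tables are fed by a given family of source systems?"""
    return sorted(
        table for table, info in metadata.items()
        if info["source"].startswith(source_prefix)
    )
```

A catalog like this lets maintainers answer questions such as "which tables break if the CRM feed fails?" without digging through pipeline code.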
Generally, data management systems can be considered either OLAP (Online Analytical Processing) or OLTP (Online Transaction Processing). OLTP systems create or capture data in online applications, while OLAP systems analyze data that has been collected from one or more systems. A data warehouse is an example of an OLAP system.
| | OLTP | OLAP |
|---|---|---|
| Purpose | Capture and create data | Analyze data |
| Examples | Systems that run your business: CRM, finance, etc. | Data warehouses and data marts |
| Designed To | Deal efficiently with large volumes of transactions | Allow analysts to efficiently answer questions that require data from multiple source systems |
| Storage Term | Short-term or medium-term; the most recent year or most recent few years of data | Long-term; typically many years or the full history of the business |
As such, OLTP and OLAP systems have different technical attributes and requirements.
| | OLTP | OLAP |
|---|---|---|
| Storage Type | Optimized for small reads and writes to maximize throughput and minimize latency; extensive use of indexes | Optimized for large reads of many rows of data; may use no indexes or just a few |
| Query Type | Reads and writes of one or a few rows | Mostly reads of many rows |
| Query Scope | Small, discrete pieces of data, such as updating the address for a customer | Large numbers of records, e.g., the average purchase across all sales over the past 5 years |
| Query Latency | 10 ms - 200 ms | 1 second - several minutes |
| Concurrent Queries | 100-10,000 qps | <50 |
| Physical Storage | Row-oriented: the system scans all values in a given row, discarding what it doesn't need. Good for retrieving all information about a specific object, but less efficient at performing operations on a column across all rows in a table | Column-oriented: the system scans all values in a given column. Good for performing aggregations across large numbers of records, but less efficient for retrieving all attributes of a single object (the whole row) |
| Vendor Products (relational) | Oracle, SQL Server, DB2 | Teradata, Vertica, Greenplum, Netezza, Sybase IQ |
| Vendor Products (non-relational) | MongoDB, Elasticsearch, Cassandra | Hadoop |
| Vendor Products (cloud) | Amazon RDS, Amazon DynamoDB, Azure DocumentDB, MongoDB Atlas | Amazon Redshift, Azure SQL Data Warehouse, Google BigQuery |
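The contrast between OLTP-style and OLAP-style queries can be sketched with Python's built-in `sqlite3` module. The table and data here are purely illustrative:

```python
import sqlite3

# In-memory database standing in for both kinds of systems.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (customer, amount) VALUES (?, ?)",
    [("alice", 10.0), ("bob", 20.0), ("alice", 30.0)],
)

# OLTP-style: a small, targeted write that touches a single row.
conn.execute("UPDATE sales SET amount = 25.0 WHERE id = 2")

# OLAP-style: a read that scans many rows to compute an aggregate.
avg_amount = conn.execute("SELECT AVG(amount) FROM sales").fetchone()[0]
```

The OLTP query locates one row via the primary key index; the OLAP query necessarily reads every row, which is why OLAP systems favor column-oriented storage and large sequential reads.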
ETL systems govern the movement of data from source systems into a data warehouse (the pipeline mentioned in the section on data warehouse architecture above), as well as from a data warehouse into data marts. Data must first be extracted from a source, then transformed according to the standards of the next layer of storage, and finally the formatted data must be correctly loaded into that layer.
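A minimal ETL pass might look like the following sketch. The source field names, date formats, and the in-memory "warehouse" list are all illustrative assumptions:

```python
from datetime import datetime

def extract():
    # Two hypothetical sources that disagree on naming and date format.
    crm = [{"CustName": "Alice", "SaleDate": "03/01/2024", "Amt": "10.50"}]
    finance = [{"customer": "Bob", "date": "2024-03-02", "amount": 20.0}]
    return crm, finance

def transform(crm, finance):
    # Normalize both sources to one schema: lowercase names,
    # ISO dates, float amounts.
    rows = []
    for r in crm:
        rows.append({
            "customer": r["CustName"].lower(),
            "sale_date": datetime.strptime(r["SaleDate"], "%m/%d/%Y")
                                 .date().isoformat(),
            "amount": float(r["Amt"]),
        })
    for r in finance:
        rows.append({
            "customer": r["customer"].lower(),
            "sale_date": r["date"],
            "amount": float(r["amount"]),
        })
    return rows

def load(rows, warehouse):
    # In a real pipeline this would be a bulk insert into the warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
```

The transform step is where the "standards of the next layer" live: whatever shape the sources arrive in, only one agreed-upon schema reaches the warehouse.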
Strictly speaking, a database is any structured collection of data. An Excel spreadsheet, a Rolodex, or an address book would all be very simple examples of databases. Software such as Excel, Oracle, or MongoDB is a database management system (DBMS) that allows users to access and manage the database, though it is common for people to call a DBMS a database. A data warehouse, then, is a type of database, specialized in the data it stores (historical data from many sources) and the purpose it serves (analytics).
A key attribute of databases, and therefore of data warehouses, is that they contain structured data. The way that data is stored, from what fields are available to date formats and everything in between, is agreed upon in advance, and the entire database follows this structure, or schema, rigorously. This relative consistency and stability mean that data warehouses can serve queries from many types of roles in the organization. The process is very structured, very predictable, and very efficient, but it is also hard to do well.
Consensus is still forming around the precise definition of a data lake. Generally speaking, though, a data lake is another means of storing data, one without the rigorous schema of a data warehouse. In contrast to a data warehouse, schema in a data lake is applied at query time, an approach often called schema-on-read. This makes it much easier to load data into a data lake, but queries are much more complex to construct, which limits use of the data lake to sophisticated roles like data engineers and data scientists. Data lakes also usually take much longer to return results than data warehouses. Most data lakes are built on Hadoop.
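The schema-on-read idea can be illustrated in a few lines of Python: raw records land in the "lake" untouched, and each query supplies its own structure. The field names here are hypothetical:

```python
import json

# Raw, heterogeneous records as they might sit in a data lake.
# Nothing enforces a shared shape at load time.
lake = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "referrer": "ads"}',
    '{"event": "heartbeat"}',  # doesn't even match the other records
]

def total_clicks(raw_records):
    # The "schema" (which fields exist and their types) lives in the
    # query itself, applied as each record is read.
    total = 0
    for line in raw_records:
        record = json.loads(line)
        total += int(record.get("clicks", 0))
    return total
```

Loading was trivial, but every query must now defend itself against missing fields and mismatched shapes, which is exactly the burden the text attributes to data lake users.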
If a data warehouse holds and integrates data from across an organization, a data mart is a smaller subset of that data, specialized for the use of a given department or division. Often data marts are built and controlled by a single department, drawing on the central data warehouse along with internal operational systems and external data. Data marts typically hold just one subject area, for example marketing or sales. Because they are smaller and more specific, they are often easier to manage and maintain, and they can have more flexible structures.
OLAP systems often contain tremendous quantities of data, which can make running specific queries on them slow. To speed up and simplify querying, OLAP systems can be broken down further into sub-databases called cubes, which contain a limited set of dimensions and therefore provide faster querying time.
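A cube is essentially a set of pre-computed aggregates over a fixed set of dimensions. A toy version in Python, with hypothetical region and product dimensions:

```python
import itertools
from collections import defaultdict

# Fact rows: (region, product, sale amount). Values are illustrative.
facts = [
    ("east", "widget", 10.0),
    ("east", "gadget", 5.0),
    ("west", "widget", 7.0),
]

def build_cube(rows):
    """Pre-aggregate totals for every combination of the two dimensions,
    including 'ALL' rollups, so later queries are simple lookups."""
    cube = defaultdict(float)
    for region, product, amount in rows:
        for r, p in itertools.product((region, "ALL"), (product, "ALL")):
            cube[(r, p)] += amount
    return dict(cube)

cube = build_cube(facts)
```

After the cube is built, a question like "total sales in the east" is a single dictionary lookup, `cube[("east", "ALL")]`, rather than a scan of the fact rows, which is the speedup the text describes.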
Business intelligence software is a critical layer on top of a data warehouse that allows the information within it to be used to make business decisions. Business intelligence software has many different types of functionality, but it usually includes some sort of engine for building and executing queries, as well as a way to store and visualize results for incorporation into documents containing business analysis.
When considering which tools to use, be sure that they meet your requirements for scalability (can the system grow with your needs?), access (which users need access to your warehouse, how many, and from what locations?), and integrations (does the system integrate with your data sources and BI tools?).
Relational databases are the systems that store the majority of business data. They have been around for several decades and are very mature. Relational databases exist for both OLTP and OLAP use cases. They are very well understood and provide extensive ecosystems of complementary technologies for management and data access. Popular relational databases include Oracle, SQL Server, and DB2.
MPP databases are a data warehouse technology that focuses on hardware and software to support parallel processing. That is, query processing is broken up into many smaller parallel tasks that are executed together across multiple servers. This approach speeds up query and ingestion times dramatically. However, MPP databases can be expensive, as the most popular vendors have closed-source products.
Popular MPP databases include Teradata, Vertica, Greenplum, Netezza, and Sybase IQ.
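The scatter/gather pattern behind MPP query processing can be sketched in Python. Real MPP databases split work across many servers; this sketch uses threads purely to illustrate the shape:

```python
from concurrent.futures import ThreadPoolExecutor

# Pretend each chunk of this data lives on a different node of the cluster.
data = list(range(1, 101))

def partial_sum(chunk):
    # Each worker computes a partial aggregate over its own chunk,
    # standing in for one node's share of the query.
    return sum(chunk)

# Scatter: split the data into four partitions and aggregate in parallel.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))

# Gather: combine the partial results into the final answer.
total = sum(partials)
```

The key property is that each partial task touches only its own partition, so adding nodes (or workers) shrinks the per-task work instead of the total work, which is where the dramatic speedups come from.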
Similar to the MPP approach, Hadoop is also able to distribute computational tasks across a cluster of computers. However, Hadoop is open source, and as such is easier and cheaper for enterprises to adopt. Most data lakes use Hadoop because of its ability to cheaply and quickly handle large volumes of structured and unstructured data, and because it is well-suited to storing data in its native formats so that analysts can apply whichever transformations they need. Hadoop is not technically a database, but it overlaps with databases in a number of functional areas.
Data warehouses are powerful and useful, but require large amounts of work both upfront and on an ongoing basis to use and maintain. Because data warehouses have to be centralized and standardized in order to serve many users across an organization, they always include assumptions about what should be stored, and how, that cannot be optimal for all users.
There is a better way. Dremio is an alternative to data warehouses.