What is a Data Pipeline?
Businesses generate massive amounts of data that must be analyzed in order to derive business value. Unfortunately, analyzing data within the systems where the data is created is not ideal for the following reasons:
- Analytics is computationally taxing. If you use the same systems for analysis that you use for capturing your data, you risk impairing both the performance of your service, as well as slowing down your analysis.
- Data from multiple systems or services sometimes needs to be combined in ways that make sense for analysis. For example, you might have one system that captures events, and another that stores user data or files. Having a separate system to govern your analytics means you can combine these data types without impacting or degrading performance.
- You may not want analysts to have access to production systems, or conversely, you may not want production engineers to have access to all of your organization’s data.
- If you need to change the way you store your data, or what you store, it’s a lot less risky to make those changes on a separate system. This allows the systems that back your services to continue uninterrupted.
These two systems (data generating vs. data analyzing) are often called OLTP (online transaction processing) and OLAP (online analytical processing). Here’s a quick overview of the functions of each:
|Purpose||Capture and create data||Analyze data|
|Examples||Systems that run your business: CRM, finance, etc.||Data warehouses and data marts|
|Designed To||Deal efficiently with large volumes of transactions||Allow analysts to efficiently answer questions that require data from multiple source systems|
|Storage Term||Short-term or medium-term; the most recent year or most recent few years of data||Long-term, typically many years or the full history of the business|
So, generally speaking, you’ll have at least two systems for data management, but in order for these systems to work, you need a means of moving data between them. This is where the concept of data pipelines comes in.
Moving data between systems requires many steps: from copying data, to moving it from an on-premises location into the cloud, to reformatting it or joining it with other data sources. Each of these steps needs to be done, and usually requires separate software.
A data pipeline is the sum of all these steps, and its job is to ensure that these steps happen reliably to all data. These processes should be automated, but most organizations need at least one or two engineers to maintain the systems, repair failures and update according to the changing needs of the business.
The many automated transformations required for effective and thorough data pipelines mean that, even with careful planning, the pipes will sooner or later begin to leak. The following are some common problems that may arise:
- Changes that occur in the schema (or formatting) of your transactional data will necessitate changes to the script that makes it available to your analytical systems. These updates are difficult to automate and require meticulous attention. Failing to update schema, or missing just one or two details in your script, can cause your pipeline to fail or, possibly worse, create data that is incorrect.
- This issue also happens in reverse: if your analysts find that they need the data differently structured, you’ll have to remake the format, using the same meticulous care.
- Even if you set up the scripts beautifully, they will experience small failures from time to time. You’ll need a way to be alerted to problems, a way to locate and repair them quickly, and a means of recovering or reprocessing any data lost in the shuffle.
A good data pipeline functions like the plumbing it is named after: quietly, reliably and in the background. But like plumbing, you’ll want on-site or on-call professionals who can perform repairs in the event of a leak.
Moving a Dataset
Data pipelines can be built in many shapes and sizes, but here’s a common scenario to get a better sense of the generic steps in the process. Imagine an e-commerce system that needs to move operations data about purchases to a data warehouse. Say, for example, you’ve been selling pet rocks and accessories for some time, and you need to get a clear look at who is buying “Iggy the Pumice.” If you build a data pipeline to move all transactions including an Iggy to a data warehouse, you’ll be able to check in on past and current Iggies and plan for future inventory. So you set out, undaunted, to build your pipeline:
1 - In the beginning, an order is placed, creating an order record which might include customer_id, product_ids, total paid, timestamp and anything else that the system was built to record. Each of these items are collected into a record of the customer’s action. This is the source of your data.
2 - Next, this data needs to be combined with data from other systems. For example, you might need to immediately connect to a customer database to verify VIP membership for free shipping. You might also want a demographics system to pick up information about shipping ZIP code (population, median income, distance from major city, etc.), or a segmentation system to associate the customer with one or more customer segments. Likely, your source data will need to be combined with all of these systems and possibly more. This is called joining data.
3 - Additional processing may be necessary. For example, some of the fields in your source data might contain discrete elements, like a ZIP code in an address field that needs to be accessible on its own. Further, some of the data logged by your transactional systems will be unnecessary to include in analytical systems. Specific customer information like full addresses and payment details may need to be masked.
4 - Now these different types of data need to be standardized. You might want to map customer age into age ranges. Item color names might need to be standardized, or you might need to be sure you’re consistent about how different time zones are logged.
5 - Dirty records from source data must be filtered out. Even the most carefully built OLTP systems can log errors or blanks that will throw off any other systems in your pipeline. You’ll need to correct for things like missing fields or data that is clearly wrong (an incorrect ZIP code, a customer that has marked their date of birth as 1876, etc.). You need to be sure your data is correct.
6 - In the final step, data can be loaded into the destination. You’ll want to reconcile your records to be sure that all data logged by the source is accounted for. Even if some logs cannot be loaded (for example, due to errors that need a separate review), you’ll need to have some record of their absence or risk corrupting your dataset. To do so, you’ll need a system in place to notify other processes of the pipeline’s completion and its final outcome (total number of records moved, rejected, flagged for review, etc.).
7 - Most of these processes should be automated. You can decide whether to run this process on a schedule or if it should happen continuously.
Each of these end-to-end processes entails a data pipeline. The basic parts and processes of most data pipelines are:
Sources. Data is accessed from different sources: relational database (RDBMS), application APIs, Apache Hadoop, NoSQL, cloud sources and so on. As data is accessed, security controls must be observed, and best practices must be followed for optimal performance and reliability. Data schema and data statistics are gathered about the source to facilitate pipeline design. In the example above, the source of the data is the operational system that a customer interacts with.
Joins. It is common for data to be combined from different sources as part of a data pipeline. Joins specify the logic and criteria for how the data is combined.
Extraction. Some discrete data elements may be embedded in larger fields, such as a ZIP code in an address field. In some cases multiple values are grouped together, like categories for a business. Or, discrete values may need to be extracted, or certain elements of a given field may need to be masked.
Standardization. On a field-by-field basis, data may need to be standardized in terms of units of measure, dates, attributes such as color or size, and codes related to industry standards.
Correction. It is common for data to include errors. These could be invalid fields such as a state abbreviation or ZIP code that does not exist, or abbreviations that need to be expanded. There may also be corrupt records that need to be removed or reviewed in a separate process.
Loads. Once the data is ready, it needs to be loaded into a system for analysis. The destination is typically an RDBMS, a data warehouse, or Hadoop. Each destination has specific best practices that are important to follow for performance and reliability.
Automation. Data pipelines are normally performed many times, and typically on a schedule or continuously. Errors must be detected, and the status needs to be reported to established monitoring systems.
In terms of technology, the stages of your data pipeline might use one or more of the following:
Event frameworks help capture events from your applications more easily, creating an event log that can then be processed for use.
Message bus is hardware or software that ensures that data sent between clusters of machines is properly queued and received. A message bus allows systems to immediately send/receive data to/from other systems without needing to wait for acknowledgement, and without needing to worry about errors or system inaccessibility. Properly implemented, a message bus also makes it easier for different systems to communicate using their own protocols.
Data persistence stores your data in files or other non-volatile storage so that it can be processed in batches, rather than all at once, simultaneously.
Workflow management structures the tasks (or processes) in your data pipeline, and makes it easier to supervise and manage them.
Serialization frameworks convert data into more compact formats for storage and transmission.
Many companies have hundreds or thousands of data pipelines. Companies build each pipeline with one or more technologies, and each pipeline might be approached differently. Datasets often begin with an organization’s audience or customer, but they may also originate with given departments or divisions within the organization itself. It can be useful to think of your data in terms of events that are logged and then translated across a pipeline, and transformed according to the needs of your users and the systems they maintain.
Moving data from place to place via a data pipeline means that different end users can query more thoroughly and accurately, rather than having to go to a myriad of different sources. Good data pipeline architecture will account for all sources of events as well as provide support for the formats and systems each event or dataset should be loaded into.
Data Pipeline Technologies
The best tool depends on the step of the pipeline, the data, and the associated technologies. For example, streaming event data might require a different tool than using a relational database. Working in a data center might involve different tools than working in the cloud.
Some examples of products used in building data pipelines:
- Data warehouses
- ETL tools
- Data Prep tools
- Luigi, a workflow scheduler used to manage jobs and processes in Hadoop and similar systems
- Python/Java/Ruby, programming languages used to write processes in many of these systems
- AWS Data Pipelines, another workflow management service that schedules and executes data movement and processes
- Kafka, a real-time event streaming platform that allows you to move data between systems and applications, and can also transform or react to these data streams
Data engineers are responsible for using these tools and others like them to build and maintain data pipelines. As noted above, you need someone to fix pipelines if and when they fail, and it’s always nice to have a dedicated plumber, rather than someone frantically pinch-hitting in the midst of a flood.
A carefully managed data pipeline provides organizations access to reliable and well-structured datasets for analytics. Automating the movement and transformation of data allows the consolidation of data from multiple sources so that it can be used strategically. Dremio helps you deliver “last-mile” data pipeline needs, and picks up where traditional data pipelines leave off, providing a scalable, open source platform for interactive analytics for data in any format, at any scale. Learn more about Dremio.