If you've ever discussed data warehousing, you've probably heard the term “ETL.” It refers to processes that allow businesses to access data, modify it, and store it. Organizations use ETL for a variety of reasons, including the efficient management of data and the ability to run business intelligence (BI) against their data. There are several types of ETL tools available to make the process easier and to reduce ETL complexities.
ETL stands for extract, transform, and load. The term is an acronym for the actions an ETL tool performs on a given set of data in order to accomplish a specific business goal.
Extract:The ETL tool takes (literally “extracts”) data directly from wherever it lives. This is the first step in processing the data and tells the tool how the data is stored and which security controls are in place. From there, the tool can issue the appropriate queries to read the data and understand whether any data has changed since the last extract.
Transform:Here’s where the ETL tool does something to the data it just extracted. Maybe it changes information inside some table cells. Or perhaps it merges the data from one source with another source. In some cases, the tool might add a new entry in the same format as all the other data. Regardless, the ETL tool modifies the data and may interact with different systems and applications depending on the specific request.
Load:Now that the data has been transformed, the ETL tool loads it into the destination system. More often than not, this means storing it in a data warehouse or data lake, and the tool will optimize the load to make storage as efficient as possible. Destination systems are designed for analytical workloads, and bulk or parallel loads can reduce the total time required to load data.
How Do ETL Tools Work?
To better understand how ETL tools work, imagine working at a large department store during the holidays; your job is to take merchandise from under the counter, wrap it in green and red wrapping paper, and put it into a customer’s shopping bag.
First, you extract the merchandise from its original source: a shelf below the checkout counter. Let’s say it’s a baseball cap. Then you transform the baseball cap by wrapping it in sparkly paper covered with a candy cane pattern. After that, you load the wrapped present into a bag that the customer takes home.
ETL tools do the same thing – they transform data.
An ETL tool might take updated accounting information from your ERP system (extract), combine it with other accounting data (transform), and store the transformed data (load) in your organization’s data lake for analytical analysis. Ultimately, the ETL tool takes source data, performs some action upon it, and stores it so that you can run BI tools against it.
Where Does an ETL Tool Get Its Information? Common Extraction Sources
The sources for raw data are numerous, and they include:
Legacy computing systems
Sales applications, such as CRM systems
Any data storage platform
ERP systems of various kinds
Common Transformation Actions
Using the source information, ETL transforms the data in some way. Typically, the transformation will involve:
Data cleansing: Errors like misspellings and missing values in the data are resolved. Cleansing may also involve removing unreadable or unusable data.
Standardization and sorting: The transformation process usually standardizes the data so that the loaded data can be used after it’s stored away.
Removing duplicates: Duplicate information is removed from the data set.
Data quality rules: Many organizations have proprietary requirements that improve the quality of their data.
Common Load Actions
Then there’s the load process. Companies increasingly distinguish between “long haul” and “last-mile” ETL:
Long haul: These tasks often require multiple hours per job. In this ETL scenario, jobs processing might occur in large batches.
Last-mile: This is a more lightweight, incremental type of ETL. Here, latency impacts performance quite a lot, and it’s most used where a broad range of ETL needs are being addressed.
Why Use ETL Tools?
Reporting and analytics are two of the primary reasons ETL tools exist in the first place. The “load” function of ETL occurs in such a way that any structured and unstructured data is formatted so it can be analyzed with BI tools to yield insights that aid in business decisions. As the data is loaded into the data warehouse or data lake, the ETL tool sets the stage for long-term analysis and usage of such data.
Examples include banking data, insurance claims, and retail sales history. After the source information is transformed and loaded, organizations can access and use that data via the preferred applications they use for analysis. It’s not hard to see how this arrangement could be useful for, say, pulling the average number of checking account withdrawals in a given month or quickly calculating how many accident claims were filed by insureds in July.
There are even more modern uses for the structured and unstructured data processed by ETL tools. Internet of Things (IoT) devices pull loaded data for use in machine learning applications. The same goes for social media applications, which draw from data stored in data lakes to make a whole host of decisions — for example, what notifications to send a user or which ads to serve.
The use cases are virtually endless. As big data gets bigger and more lives are powered by applications that process increasingly large data sets, the utility of ETL tools is bound to expand even further.
What Is ETL Testing?
In certain cases, an organization might want to confirm that the data processed by their ETL tools are being managed appropriately. Basically, they want to know the data was extracted properly and completely, transformed correctly, and loaded into the right system in the right format.
Use Cases for ETL Testing
Here are some examples of when an organization might pursue ETL testing:
You just set up your data warehouse and need to confirm that everything is running smoothly
The organization just completed a major data migration or integration effort
There’s reason to suspect you’ve got data quality or performance issues
ETL testing isn’t something any organization with ETL tools needs to be doing all the time, but it definitely can and should be done in certain circumstances. As we explore the different types of ETL tools organizations use, keep in mind that there’s always a way to verify the integrity of the functions performed by any of these tools.
ETL Use Cases or Integration Patterns
ETL is necessary to turn data into action, and there’s a difference between long-haul and last-mile ETL needs. If we expand on that, it’s clear that different use cases call for different ETL approaches or integration patterns. Three of the most common use cases or integration patterns are:
ETL for Batch Processing
Batch processing requires ETL tools to extract batches of data from the source on a predetermined schedule before transforming and loading it into the data warehouse or data lake. It’s the traditional approach and is appropriate for large volumes of data collected over a period of time. A good example is in a retail setting where transactions take place during the course of the day and the data is batched for ETL processing once the stores close to calculate daily revenues.
ETL for Streaming Data
ETL for streaming data looks at continuous ETL processes performed on a real-time stream. Streaming ETL processes create better data latency than batch processing because data is transformed and loaded in real time, rather than waiting on a scheduled batch update. In addition, the continuous work means a smaller amount of processing capacity is required at any one time and spikes in usage can be avoided. However, the faster processing may also result in more errors and “messier” data than a batch process. ETL for streaming data is useful in circumstances where businesses need to monitor and adjust frequently such as in the use of IoT data in industrial processes and machine learning, on financial trading floors, or in e-commerce settings.
ETL for Change Data Capture (CDC)
ETL for CDC is a process for tracking changes made to the source data and ensuring those changes are replicated in the data warehouse or data lake so that everyone accessing the information has the most current data possible. Change data can be delivered either in a batch process or in real time depending on the needs of end users. Like ETL for streaming data, a CDC process can result in improved efficiency for ETL resources because it only deals with the portion of data that has changed, requiring smaller amounts of computing, network bandwidth, and storage. CDC is vital in settings like fraud detection where credit card companies need to know instantly if a card is being used in multiple places at the same time.
It’s important to note that having one kind of ETL pipeline or integration pattern does not preclude use of the others. Many businesses have batch, streaming, and CDC ETL pipelines running alongside one another to meet different data and analytic needs.
Types of ETL Tools
In recent years, the breadth of ETL tools has grown rapidly as organizations have embraced new data warehousing and data lake technologies and deploy more streaming and CDC ETL integration patterns. Organizations can choose from a range of ETL tools to meet their diverse ETL needs.
Established IT Company ETL Offerings
Major IT companies have offered ETL solutions for several decades, first as on-premises batch processing options and today with more sophisticated offerings with GUIs that allow users to easily design ETL pipelines that connect many sources of data. The ETL tools are often bundled as part of a larger platform and appeal to enterprises who may have older, legacy systems that they need to work with and build on. The central players in this space include Informatica, IBM, Oracle, and Microsoft.
Custom ETL Solutions
For businesses with in-house data engineering and support capacity, it is possible to design and create custom tools and pipelines using scripting languages like SQL or Python. With the appropriate levels of engineering and development expertise, organizations might also take advantage of a number of open source solutions like Talend Open Studio or Pentaho Data Integration that help improve the development, execution, and performance of ETL pipelines. While this allows for a high level of customization and flexibility, building and maintaining custom ETL systems also requires more administration and maintenance than out-of-the-box options.
Cloud-Native ETL Tools
Increasingly, companies are turning to cloud-native tools that can integrate with proprietary data sources and ingest data from different web apps or on-premises sources. Organizations use these tools to move data between systems and to copy, transform, and enrich data before writing it to data warehouses or data lakes. Many businesses use more than one cloud-native tool, as each includes connectors for different data sources and each has its own strengths and weaknesses. Examples include Segment, RudderStack, and Azure Data Factory.
How to Evaluate ETL Tools
So, which ETL tools are right for your organization? To effectively evaluate ETL tools, it’s important to consider what types are available, how you’re going to use them, and what data sources and/or data storage arrangements you’re going to use.
To get started in your evaluation, here are the main things to consider:
Complexity of your ETL needs
Existing cloud vendor relationships
In-house development capabilities
A lot of your evaluation will come down to how your organization uses — or needs to start using — data. In assessing the complexity of your ETL needs, you have to account for a wide array of factors, including:
Structured versus unstructured raw data
Integration with source data platforms
Legacy data storage platforms
Cloud service integrations
Long term versus long haul processing
Regulations that dictate where your data can live; legislation may dictate how organizations in banking, insurance, healthcare, and other industries store their data
Certain ETL tools are stronger in different areas. A large hospital, for instance, will have different data processing needs from a small software company. For the former, an on-premises enterprise ETL solution might not just be practical — it might be the only option when it comes to regulation surrounding patient medical records. On the other hand, the latter might have more flexibility when using traditional versus cloud-based tools for ETL.
Speaking of cloud tools, you should also consider existing vendor relationships. If your organization already stores data in AWS S3, uses Microsoft Azure for various cloud computing services, or is heavily invested in Google’s cloud ecosystem, it may pay to lean on the company you already use. Not only will you get the advantages of cloud ETL — you won’t have to spin up any complicated one-off integrations with a proprietary or third-party data platform. The tool will play nice with what you already have; after all, that’s how it was designed!
The same holds true for organizations using enterprise, on-prem solutions in other areas of their business. Longtime IBM customers, for example, may find it relatively straightforward to segue into DataStage and stay with IBM for their ETL needs.
For organizations with significant in-house development capabilities, there are additional options. If your team can effectively manage and take advantage of any open source ETL tools, that can be a very cost-effective way to fulfill your ETL needs. Depending on how ambitious you are and how much your ETL situation impacts your business, you could harness your considerable IT talent to create a homegrown tool for ETL.
Common ETL Challenges
While the possibilities for ETL may seem endless, each new iteration of ETL seeks to solve some problem that previous generations of software either didn’t account for or couldn’t anticipate. After all, the marketplace is always changing. New ETL platforms are constantly striving to accommodate new data challenges, such as the ever-evolving complexity of data sources and issues related to scaling.
Below are a few of the most common ETL challenges that platform creators are always working to address.
Growing Volume and Variety of Data
Organizations face an ever-increasing volume of data from a huge variety of new sources and an ever-expanding set of possibilities for using that data. This explosion of data drives increasing complexity in ETL pipelines, making them more cumbersome to design, execute, and maintain. Increased data volumes also have a knock-on effect on costs for intermediate storage, and fees for ETL pipeline execution and compute resources. Although the capacity of ETL tools is improving, especially with the development of more cloud-based tools, there is a risk that the growth in data variety and volume will eventually outstrip an organization’s capacity to process it effectively. Ensuring easy scalability will be essential to long-term growth.
Everybody Wants Access
As organizations gather and process more data, they also generate more requests for data — not just from data scientists and analysts, but from business users who are less familiar with data work. And, of course, everybody wants that data as soon as possible in a format they can work with. Additional requests mean additional work for your data engineering teams as they create new ETL workflows, and a delay in time to value for data scientists and business analysts. Providing the right data in the right format to the right people at the right time without compromising speed or performance can be a difficult balance.
New types of data often mean new types of errors to contend with. As discussed earlier, data cleansing is a key part of the transformation aspect of ETL, but the data cleansing protocol that works for one kind of source data might not be sufficient for data coming from another source and/or being loaded into a different warehouse environment.
You could end up with duplicates, misplaced data, or other inaccuracies that compromise the integrity of your data. Unsurprisingly, the more hard coding required for your ETL process (think ETL “tools” that are actually just manual Python commands), the more likely you will run into errors.
Cybersecurity, Privacy, and Regulatory Concerns
Whether facing the direct threat of an attack on the systems, or simply responding to the downstream effects of new legislation, businesses need their ETL tools to ensure that sensitive data is protected and only shared with appropriate users. In addition, stringent retention regulations and policies increasingly require organizations to be able to look back into historic data, and they need ETL tools that can work smoothly with legacy data formats.
Reducing the Burdens of ETL
The thing about ETL is that it’s time-consuming and error-prone. In many cases, it takes a long time to extract your data, perform the necessary transformations, and load it into the new location in its proper format. There’s also a lot that can go wrong during a multi-step process like ETL. The more steps in the procedure, the more opportunities you have to make mistakes. It’s often costly, too.
This means that for developers building new ETL tools, the focus is on reducing the amount of complexity in ETL pipelines even while they’re facing increasing data volume and variety and new requests. For that reason, easily scalable cloud-based ETL tools that can work with a variety of sources are growing in popularity. In combination with ETL for streaming data and traditional batch processing, these tools help organizations create a more tailored approach to ETL.
How Dremio Cloud Helps Reduce the ETL Burden
Dremio dramatically reduces the need for complex ETL processes by reimagining data optimization and query technologies with Dremio Cloud, a frictionless, infinitely scalable platform that helps organizations run all their SQL workloads and automate their data management operations. Included in Dremio Cloud are two services designed to help businesses reduce their ETL burden, among other benefits: Dremio Sonar and Dremio Arctic. Dremio Sonar is a SQL engine for open platforms that makes it possible to achieve interactive response times directly on data lake storage without having to copy the data into warehouses, marts, extracts, or cubes. Its integrated query acceleration technology, Reflections, speeds up queries behind the scenes so data applications and analysts can interact seamlessly with data without having to worry about optimizing either the data or their queries. Frictionless BI tool integrations allow business users to quickly and easily visualize their data without the need for any new ETL workflows from the data engineering team.
Dremio Arcticoffers a modern metadata and data management service. It helps simplify data workflows and incorporates background compute jobs that automatically optimize the data in the lakehouse, reducing the need for complex ETL processes before the data is stored.
By enabling queries and metadata management directly on data lakes, Dremio offers organizations the ability to reduce the complexity and cost of ETL processes and data warehouse storage, without compromising speed and business performance.
Ready to Get Started? Here Are Some Resources to Help
The Apache Iceberg Advantage
An in-depth review of Apache Iceberg, an open table format for enterprise data lakes.