Simplifying the Data Pipeline
It is always amazing how the time duration required to deliver an enterprise data pipeline connecting two different data sources. In many of these organizations, the elapsed time to create a data pipeline creation can take weeks, even months to complete a pipeline request.
Initially, this made no sense, after all, how hard is it to export the data as a CSV format from one database (i.e. MySQL) and then import the CSV formatted data into another (Oracle) and finish-up with a couple of column data type casts? However, when you consider how many data pipelines are required for organizations to support all of the active data analytic projects, these long data pipeline durations doesn’t seem so long. Then, when you finally factor the complexity of steps for a proper Data Quality review and certification of the newly created pipeline, the length of time often required to deliver a new pipeline becomes understandable.
However, just because it is understandable, does not imply that organizations can continue to afford such delays getting the right data to the right analyst in a timely fashion. The enterprise consequences resulting from data delays often includes:
- Loss of business opportunities because it can be weeks to months before the actual analysis can begin.
- Increased IT costs as enterprises must hire more Data Engineers to support the ever-growing volume of data pipeline creation and enhancement tickets.
- And finally, have been increasingly hearing about data scientists who have given up waiting for datasets and resigned from the company because they could not deliver the analysis in the requested time frame. This last problem results in additional higher costs, plus the overhead required to ramp-up on a regular basis new data scientists.
Historically, the solution was to create a Data Warehouse/DataMart where all data sources were glued together using a Star Schema or very wide normalized tables enabling the analyst access to the enterprises data. The process took a fair amount of upfront planning and implementation time but was somewhat workable because back then data volumes were reasonably small and quite often the source data came originated from the same database (i.e. DB2 or Oracle).
Traditional data pipeline
The problem with the traditional data pipeline to get data in the hands of the analysts included the following:
- The data warehouse by itself requires substantial servers and duplicates storage
- The time to create all the processes required to build the data pipeline can be lengthy
- Pulling together heterogeneous data sources into a star schema for the data warehouse is complex and building very wide tables with hundreds to thousands of fields can affect reporting performance.
To address the traditional data pipeline problems, the concept of a Data Lake consisting of data whose structure is not yet defined. Gone is the need to anticipate every possible JOIN and the integration of every field which conceivable could be used over the next couple of years. Instead, the data lake introduces the ability to create simpler reporting sets containing only the data you need now.
While the introduction of a Data Lake in most organizations unlocked whole troves of data which were not part of the existing data warehouses, there were still the following problems:
- Data replication from source system to the data lake is expensive and time consuming
- Pipelining efforts required to keep source and data lake in sync still required many data engineers.
Those benefits aside, quickly after the data lake was adopted, it was renamed the “Data Swamp”. Companies would have hundreds of thousands of ‘tables/directories’ in their Data Lake with only a couple of thousand actually getting used.
There has to be a better way
Dremio offers a different solution. If we take our three datasets from the image below, what we are going to do is register them into Dremio without writing a single line of code, and through their registration all three of them will now use the same dialect of SQL and as a result there is no programming required to get into a canonical model. From that point, users can join the data, curate data (calculate fields, filter data, etc) and when ready, they can go and consume the data from Tableau in run time.
The entire process, can be executed without any programming, and without making any copies of data or extracting and moving data out of its original storing source. One of the great things about this process is that if a new column is added to any of the data sources, it will automatically be registered with Dremio and it will be immediately available to the data consumer.
Dremio makes data consumers and data engineers more self-sufficient and more productive by providing a self service semantic layer that allows them to access data at the speed of thought when they need (without the intervention of other teams i.e I.T)
How does this work with Dremio?
This is an example of a simple scenario where I would need to join and work with data from HDFS and Oracle. I can join these data sources using Dremio’s UI, I also have the option to use the REST API and ODBC and JDBC drivers. Using the UI I simply can select two columns and create the join and the Dremio will automatically build the SQL sentence.
I have total control over this SQL, I can edit it as needed or type in new functions to get the output query just the way I want it. Notice that there is no “CREATE TABLE AS” in this SQL code, however this is supported if needed. You also may notice that there is no need to convert the data sources into canonical format or make use of any temp storage, Dremio simply optimizes the data to bring you lightning fast queries.
The value of speed, no data movement, open source flexibility and query optimization translates into not only large amounts of peace of mind but also from the query perspective you can accelerate your time to insight from days to just a few minutes.
The benchmark numbers shown above where observed when running the same sample query (or attempting to) in three different scenarios. Without Dremio the task was nearly impossible, with other popular virtualization tools in the market the results were somewhat acceptable but not the best and then using Dremio query responses where down to a sub-second scale.
It is more than just performance
Dremio offers more than just speedy queries, in the example above, through virtual datasets (VDS) Dremio can join over six different data sources covering two different tenants (Finance and Marketing) and we can create a base semantic layer where things are curated and values normalized for each one of those departments; Naturally, each one of these departments have different data needs and they consume data in different ways, Dremio’s semantic layer allows data architects to provide consumers with data that satisfy their business necessities.
In scenario we can join HDFS and ADLS together to come up with a dataset for Finances which will support their board analysis, revenue growth, cash flow, etc. and that data will be provided to the Finances team without any temporary stores or copies. On the other side the Marketing team will need data from Teradata, HDFS, Hive and S3 to perform sales analysis, lead and demand generation analysis and so on.
The benefit of this is that regardless of the team that your users belong to, they don’t have to comb through thousands of rows and columns, but the semantic layer allows them to focus and have immediate access to the data that matters to them.
What is happening to the data?
Once all the VDSs are created for the different teams, is important to keep track of how all these data sources are being used. Dremio allows you to gain this understanding through its data lineage feature. Here you can automatically track where data is coming from, how it is being used and who is using it. It is a central place to understand where data was generated and what is happening to it.
Data Curation Options
Dremio enables you to curate data by interacting with the dataset in a visual manner, resulting in a new virtual dataset. Note that Dremio does not create a copy of the data, so there is no overhead to creating a new virtual dataset. Some of the operations that you can execute against your dataset are: filtering, extracting, un-nesting, splitting, re-naming, creating a calculated field, just to name a few.
To simplify your data pipeline you need speed, security, flexibility and self-service in one package. If you miss the catalog, the data virtualization, security and the query performance, then you are once again adding friction resulting in continued complex pipelines and missed opportunities.
In this article we’ve gone from a very complex process where we really couldn’t tell exactly what is going on to a scenario where using Dremio and registering all the data sources that we need so we can execute highly performant queries at runtime without having to write any code, wait for the intervention of I.T or move data anywhere.