The Apache Iceberg Advantage
An in-depth review of Apache Iceberg, an open table format for enterprise data lakes.
read moreOne of the most common cases that we face when dealing with different data sources is “Data Inconsistency”. Data inconsistency exists when different and conflicting versions of the same data appear in different places. Data inconsistency creates unreliable information, because it is difficult to determine which version of the information is correct. As we all know, it is difficult to make correct and timely data-driven decisions based on conflicting information.
In this tutorial, we will take a look at how you can use a look-up dataset and joins in Dremio to fix inconsistencies in data, easy and fast.
For this tutorial you’ll need access to Dremio. This is a more advanced tutorial, so we suggest you first read Getting Oriented to Dremio and Working With Your First Dataset if you haven’t already.
For this example, We are going to use a version of the 100 Sales Records dataset that has been modified to contain inconsistencies in the “Region” field.
Next, let’s identify the problematic data by running the following query on the SQL editor
SELECT DISTINCT Region FROM "100_Sales_Records_inconsistency"
Here we have several versions of what should be the same region, i.e Eu should be Europe, Aus should be Australia and Oceania.
To fix this issue, first I’ll download the results of the query that we just ran, and save the CSV locally on my machine.
Now, I’ll open the file on a text or spreadsheet editor and add a second column with the correct names. This is the lookup table that we will use to correct the dataset.
I will name the file “region_lookup”, save it as a CSV and then upload it to my home space.
Now, I will open the “region_lookup” dataset and click the “Join” button, select “custom join” and then from the @consumers space, select 100_sales_records_inconsistency.
I’ll use “inner” for the type of join option, and then select and drag “Region” from the left table into the “add join condition” screen. From the right hand table, I’ll select the field that I want to correct, which in this case is “Region”, then click apply.
At this point we can drop the fields that we don’t need from the new dataset. Select the drop-down menu for “Region” and then click “Drop”
I’ll repeat the same process for “Region0” and then rename “region_lookup” to “Sales Region”. Then I can save the dataset.
Now, I’m going to slightly modify the original query that I ran to make sure that the Region field in the sales dataset now contains consistent information.
SELECT Distinct Sales_Region FROM sales
And here are the results
Data inconsistency is something that we all as data consumers will have to face more than what we would like. Thankfully, there are tools and procedures that help us identify and solve these issues in a rather quick manner so we can invest more time making timely data-driven decisions.
In this tutorial we walked through the steps of using Dremio to join an inconsistent dataset with a look-up table, this procedure helped us clean the data in the original dataset so it would be more adequate for consumption.
We hope you enjoyed this tutorial, to learn more about how Dremio can help you solve your data challenge and gain insights from your data faster, check out Dremio University and stay tuned for more tutorials like this.
An in-depth review of Apache Iceberg, an open table format for enterprise data lakes.
read moreWatch Andres Bogsnes, Master Expert at Nordea Asset Management on how Nordea Asset Management journey to implement Data Domains with Dremio globally.
read moreGalp data strategy is built upon the pillar of democratizing data access and analytics, promoting decentralization when it comes to data product development.
read more