Dremio Jekyll

Using Dremio to Fix Data Inconsistency

Intro

One of the most common cases that we face when dealing with different data sources is “Data Inconsistency”. Data inconsistency exists when different and conflicting versions of the same data appear in different places. Data inconsistency creates unreliable information, because it is difficult to determine which version of the information is correct. As we all know, it is difficult to make correct and timely data-driven decisions based on conflicting information.

In this tutorial, we will take a look at how you can use a look-up dataset and joins in Dremio to fix inconsistencies in data, easy and fast.

Assumptions

For this tutorial you’ll need access to Dremio. This is a more advanced tutorial, so we suggest you first read Getting Oriented to Dremio and Working With Your First Dataset if you haven’t already.

Loading Data intro Dremio

For this example, We are going to use a version of the 100 Sales Records dataset that has been modified to contain inconsistencies in the “Region” field.

image alt text

Identifying Inconsistencies

Next, let’s identify the problematic data by running the following query on the SQL editor

1
SELECT DISTINCT Region FROM "100_Sales_Records_inconsistency"

image alt text

Here we have several versions of what should be the same region, i.e Eu should be Europe, Aus should be Australia and Oceania.

Fixing Inconsistencies

To fix this issue, first I’ll download the results of the query that we just ran, and save the CSV locally on my machine.

image alt text

Now, I’ll open the file on a text or spreadsheet editor and add a second column with the correct names. This is the lookup table that we will use to correct the dataset.

image alt text

I will name the file “region_lookup”, save it as a CSV and then upload it to my home space.

image alt text

Joining Datasets

Now, I will open the “region_lookup” dataset and click the “Join” button, select “custom join” and then from the @consumers space, select 100_sales_records_inconsistency.

image alt text

I’ll use “inner” for the type of join option, and then select and drag “Region” from the left table into the “add join condition” screen. From the right hand table, I’ll select the field that I want to correct, which in this case is “Region”, then click apply.

image alt text

At this point we can drop the fields that we don’t need from the new dataset. Select the drop-down menu for “Region” and then click “Drop”

image alt text

I’ll repeat the same process for “Region0” and then rename “region_lookup” to “Sales Region”. Then I can save the dataset.

image alt text

Verifying the Results

Now, I’m going to slightly modify the original query that I ran to make sure that the Region field in the sales dataset now contains consistent information.

1
SELECT Distinct Sales_Region FROM sales

And here are the results

image alt text

Conclusion

Data inconsistency is something that we all as data consumers will have to face more than what we would like. Thankfully, there are tools and procedures that helps us identify and solve these issues in a rather quickly manner so we can invest more time making timely data-driven decisions.

In this tutorial we walked through the steps of using Dremio to join an inconsistent dataset with a look-up table, this procedure helped us clean the data in the original dataset so it would be more adequate for consumption.

We hope you enjoyed this tutorial, to learn more about how Dremio can help you solve your data challenge and gain insights from your data faster, check out Dremio University and stay tuned for more tutorials like this.