In the era of big data and analytics, the quality of data plays a critical role in making informed decisions and extracting meaningful insights. However, ensuring data quality can be complex, requiring thorough checks and validations.
In this blog article, we explore 10 essential data quality checks using three powerful tools: SQL, Pandas, and Polars. Whether you prefer working with traditional SQL databases, the popular Pandas library in Python, or the efficient Polars library, we've got you covered. By leveraging these tools, you can identify issues like missing values, duplicates, data inconsistencies, and more.
For the Pandas and Polars examples, we’ll assume we’ve loaded the data from a Parquet file into a Pandas DataFrame and a Polars LazyFrame, respectively, as shown below.
PANDAS
import pandas as pd

# Load the data from the Parquet file into a DataFrame
orders_received_df = pd.read_parquet('orders_received.parquet')
POLARS
import polars as pl

# Lazily scan the Parquet file into a LazyFrame
orders_received_lf = pl.scan_parquet('orders_received.parquet')
10 Data Quality Checks
The "orders_received" dataset represents a fictional collection of data that captures information about orders received by a company. This dataset provides valuable insights into the company's sales operations, enabling analysis and decision-making processes. It includes key attributes such as "order_number," a unique identifier for each order; "product_code," which indicates the specific product being ordered; "customer_id," used to associate orders with respective customers; "quantity," representing the quantity of the product ordered; and "order_date," indicating the date when the order was received. These attributes collectively offer a comprehensive snapshot of the orders received, facilitating business growth and customer satisfaction initiatives.
Checking for NULL Values
This example helps identify orders in the "orders_received" table that have missing order numbers. It allows you to find any incomplete or improperly recorded orders.
SQL
SELECT * FROM orders_received WHERE order_number IS NULL;
PANDAS
# Checking for NULL values
null_orders = orders_received_df[orders_received_df['order_number'].isnull()]
POLARS
# Checking for NULL values
null_orders = orders_received_lf.filter(pl.col('order_number').is_null())
Counting Distinct Values
This technique counts how many times each product code appears in the "orders_received" table. It allows you to identify multiple occurrences of the same product code, which could indicate data entry errors or duplicates.
SQL
-- Getting all values appearing more than once
SELECT product_code, COUNT(*) AS count
FROM orders_received
GROUP BY product_code
HAVING COUNT(*) > 1;
PANDAS
# Getting all values appearing more than once
product_counts = orders_received_df['product_code'].value_counts()
multiple_occurrences = product_counts[product_counts > 1]
POLARS
# Getting all values appearing more than once
distinct_product_counts = orders_received_lf.group_by('product_code').agg(pl.len().alias('count'))
multiple_occurrences = distinct_product_counts.filter(pl.col('count') > 1)
Validating Data Types
This example checks if the "quantity" column in the "orders_received" table contains only numeric values. It helps identify any instances where non-numeric or improperly formatted data is present.
SQL
SELECT * FROM orders_received WHERE quantity::text !~ '^[0-9]+$';
PANDAS
# Validating data types
non_numeric_quantity = orders_received_df[~orders_received_df['quantity'].astype(str).str.isnumeric()]
POLARS
# Validating data types: a non-strict cast turns unparseable values into nulls
non_numeric_quantity = orders_received_lf.filter(pl.col('quantity').cast(pl.UInt32, strict=False).is_null())
Identifying Duplicates
This query helps find duplicate order numbers in the "orders_received" table. It allows you to spot any orders that might have been recorded multiple times, ensuring data accuracy and preventing potential errors.
SQL
SELECT order_number, COUNT(*) AS duplicate_count
FROM orders_received
GROUP BY order_number
HAVING COUNT(*) > 1;
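The same check can be sketched in Pandas and Polars using the DataFrames loaded earlier; both snippets flag every row whose order number appears more than once.
PANDAS
# Identifying duplicates (keep=False marks every occurrence, not just the repeats)
duplicate_orders = orders_received_df[orders_received_df.duplicated(subset=['order_number'], keep=False)]
POLARS
# Identifying duplicates
duplicate_orders = orders_received_lf.filter(pl.col('order_number').is_duplicated())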
Checking Data Integrity with Foreign Keys
This example verifies that the "customer_id" in the "orders_received" table corresponds to valid customer records in the "customers" table. It helps maintain referential integrity by ensuring all orders are associated with existing customers.
SQL
SELECT *
FROM orders_received
WHERE customer_id NOT IN (SELECT customer_id FROM customers);
PANDAS
# Checking data integrity with foreign keys
foreign_key_mismatch = orders_received_df[~orders_received_df['customer_id'].isin(customers_df['customer_id'])]
POLARS
# Checking data integrity with foreign keys
customers_lf = pl.scan_csv("customers.csv")  # reference table of valid customers
foreign_key_mismatch = orders_received_lf.join(customers_lf, on='customer_id', how='anti')  # anti-join keeps orders with no matching customer
Verifying Data Range or Constraints
This query checks if the "order_date" falls within a specified range, such as a particular year. It helps ensure that the orders recorded in the "orders_received" table adhere to specific temporal constraints or business rules.
SQL
SELECT *
FROM orders_received
WHERE order_date < '2023-01-01' OR order_date > '2023-12-31';
PANDAS
# Verifying data range or constraints
out_of_range_dates = orders_received_df[(orders_received_df['order_date'] < '2023-01-01') | (orders_received_df['order_date'] > '2023-12-31')]
POLARS
# Verifying data range or constraints
from datetime import date

out_of_range_dates = orders_received_lf.filter(
    (pl.col('order_date') < date(2023, 1, 1)) | (pl.col('order_date') > date(2023, 12, 31))
)
Identifying Records with Invalid or Inconsistent Formats
This example identifies "product_code" values in the "orders_received" table that do not conform to a desired alphanumeric format. It helps identify any entries that might have been incorrectly recorded or contain invalid characters.
SQL
SELECT *
FROM orders_received
WHERE NOT REGEXP_LIKE(product_code, '^[A-Za-z0-9]+$');
PANDAS
# Identifying records with invalid or inconsistent formats
invalid_product_codes = orders_received_df[~orders_received_df['product_code'].str.match(r'^[A-Za-z0-9]+$')]
POLARS
# Identifying records with invalid or inconsistent formats
invalid_product_codes = orders_received_lf.filter(
    ~pl.col('product_code').str.contains(r'^[A-Za-z0-9]+$')
)
Detecting Outliers or Unusual Values
This query identifies any unusually high or low quantities by comparing the "quantity" values in the "orders_received" table with their statistical distribution. It helps detect potential data entry errors or outliers that deviate significantly from the norm.
SQL
SELECT *
FROM orders_received
WHERE quantity < (SELECT AVG(quantity) - (3 * STDDEV(quantity)) FROM orders_received)
OR quantity > (SELECT AVG(quantity) + (3 * STDDEV(quantity)) FROM orders_received);
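The same three-standard-deviation rule can be sketched in Pandas and Polars with the DataFrames loaded earlier.
PANDAS
# Flag quantities more than 3 standard deviations from the mean
mean_quantity = orders_received_df['quantity'].mean()
std_quantity = orders_received_df['quantity'].std()
outlier_quantities = orders_received_df[
    (orders_received_df['quantity'] < mean_quantity - 3 * std_quantity)
    | (orders_received_df['quantity'] > mean_quantity + 3 * std_quantity)
]
POLARS
# Flag quantities more than 3 standard deviations from the mean
outlier_quantities = orders_received_lf.filter(
    (pl.col('quantity') < pl.col('quantity').mean() - 3 * pl.col('quantity').std())
    | (pl.col('quantity') > pl.col('quantity').mean() + 3 * pl.col('quantity').std())
)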
Detecting Stale Data
Detecting stale data is an important aspect of data quality checks to ensure the timeliness and relevance of the "orders_received" dataset. Stale data refers to outdated information that no longer reflects the current state of the business. By performing stale data checks, you can identify records that may have become obsolete or irrelevant due to time constraints. This helps maintain the accuracy and reliability of the dataset, allowing for more informed decision-making and analysis. Let’s assume our data should only represent orders from the last 90 days.
SQL
SELECT *
FROM orders_received
WHERE order_date < CURRENT_DATE - INTERVAL '90 days';
PANDAS
# Detecting stale data
from datetime import datetime, timedelta

today = datetime.now().date()
stale_data = orders_received_df[orders_received_df['order_date'] < pd.Timestamp(today - timedelta(days=90))]
POLARS
# Detecting stale data
stale_date = today - timedelta(days=90)
stale_data = orders_received_lf.filter(pl.col('order_date') < pl.lit(stale_date))
Detecting Changing Data
Detecting changing data is an important data quality check that helps identify records in the "orders_received" dataset where the values have been modified or updated since a specific reference point. This check allows you to spot instances where data has changed unexpectedly, which could be a result of data corruption or manual errors. By performing this check, you can ensure the accuracy and consistency of the dataset for reliable analysis and decision-making.
To detect changing data, you would typically need to have a reference point or a historical dataset to compare against the current dataset. By comparing corresponding values between the reference dataset and the current dataset, you can identify any differences or changes in the data.
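The Pandas and Polars snippets below assume a reference snapshot of the orders has already been loaded. As a sketch, it could come from a hypothetical "orders_reference.parquet" file:
PANDAS
# Hypothetical reference snapshot of the orders data
orders_reference_df = pd.read_parquet('orders_reference.parquet')
POLARS
# Hypothetical reference snapshot of the orders data
orders_reference_lf = pl.scan_parquet('orders_reference.parquet')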
SQL
SELECT *
FROM orders_received
WHERE EXISTS (
SELECT *
FROM orders_reference
WHERE orders_reference.order_number = orders_received.order_number
AND (
orders_reference.order_date <> orders_received.order_date
OR orders_reference.quantity <> orders_received.quantity
-- Include other relevant columns for comparison
)
);
PANDAS
# Perform an inner merge to compare the datasets
changing_data = pd.merge(orders_received_df, orders_reference_df, on='order_number', suffixes=('_current', '_reference'))
changing_data = changing_data[
    (changing_data['order_date_current'] != changing_data['order_date_reference'])
    | (changing_data['quantity_current'] != changing_data['quantity_reference'])
]
POLARS
# Perform an inner join to compare the datasets
changing_data = orders_received_lf.join(
    orders_reference_lf,
    on='order_number',
    suffix='_reference'  # colliding column names from the reference frame get this suffix
).filter(
    (pl.col('order_date') != pl.col('order_date_reference'))
    | (pl.col('quantity') != pl.col('quantity_reference'))
    # Include other relevant columns for comparison
)
Conclusion
By implementing these data quality checks, you can trust the accuracy of your data and make reliable decisions based on high-quality information. Data quality is an ongoing process, and these techniques serve as essential tools to ensure the integrity and usefulness of your datasets.