16 minute read · August 16, 2023

10 Data Quality Checks in SQL, Pandas and Polars

Alex Merced

Alex Merced · Developer Advocate, Dremio

In the era of big data and analytics, the quality of data plays a critical role in making informed decisions and extracting meaningful insights. However, ensuring data quality can be complex, requiring thorough checks and validations. 

In this blog article, we explore 10 essential data quality checks using three powerful tools: SQL, Pandas, and Polars. Whether you prefer working with traditional SQL databases, the popular Pandas library in Python, or the efficient Polars library, we've got you covered. By leveraging these tools, you can identify issues like missing values, duplicates, data inconsistencies, and more. 

Keep in mind that you may want to isolate ingested data for doing these data quality checks using a “data as code” approach, and if you are working with Apache Iceberg tables, that can be done effectively by ingesting the data onto a branch with a Dremio Arctic catalog and then merging your changes after validation.

For the Pandas and Polars examples, we’ll assume we’ve loaded the data from a Parquet file into DataFrame and LazyFrame, respectively, as shown below.

PANDAS

#Load the data from the Parquet file into a DataFrame
orders_received_df = pd.read_parquet('orders_received.parquet')

POLARS

# Read the data from the Parquet file into a LazyFrame
orders_received_lf = pl.read_parquet('orders_received.parquet')

10 Data Quality Checks

The "orders_received" dataset represents a fictional collection of data that captures information about orders received by a company. This dataset provides valuable insights into the company's sales operations, enabling analysis and decision-making processes. It includes key attributes such as "order_number," a unique identifier for each order; "product_code," which indicates the specific product being ordered; "customer_id," used to associate orders with respective customers; "quantity," representing the quantity of the product ordered; and "order_date," indicating the date when the order was received. These attributes collectively offer a comprehensive snapshot of the orders received, facilitating business growth and customer satisfaction initiatives.

{
  "type": "object",
  "properties": {
    "order_number": { "type": "integer" },
    "product_code": { "type": "string" },
    "customer_id": { "type": "integer" },
    "quantity": { "type": "integer" },
    "order_date": { "type": "string", "format": "date" }
  }
}

Checking for NULL Values

This example helps identify orders in the "orders_received" table that have missing order numbers. It allows you to find any incomplete or improperly recorded orders.

SQL

SELECT * FROM orders_received WHERE order_number IS NULL;

PANDAS

# Checking for NULL values
null_orders = orders_received_df[orders_received_df['order_number'].isnull()]

POLARS

# Checking for NULL values
null_orders = orders_received_lf.filter(pl.col('order_number').is_null())

Counting Distinct Values

This technique helps you count the product codes in the "orders_received" table. It allows you to identify if there are multiple occurrences of the same product code, which could indicate data entry errors or duplicates.

SQL

-- Getting all values appearing more than once
SELECT product_code, COUNT(*) AS count
FROM orders_received
GROUP BY product_code
HAVING COUNT(*) > 1;

PANDAS

#  Getting all values appearing more than once
product_counts = orders_received_df['product_code'].value_counts()
multiple_occurrences = product_counts[product_counts > 1]

POLARS

#  Getting all values appearing more than once
distinct_product_counts = orders_received_lf.groupby('product_code').agg(pl.count('product_code').alias('count'))
multiple_occurrences = distinct_product_counts.filter(pl.col('count') > 1)

Validating Data Types

This example checks if the "quantity" column in the "orders_received" table contains only numeric values. It helps identify any instances where non-numeric or improperly formatted data is present.

SQL

SELECT * FROM orders_received WHERE quantity::text !~ '^[0-9]+$';

PANDAS

# Validating data types
non_numeric_quantity = orders_received_df[~orders_received_df['quantity'].astype(str).str.isnumeric()]

POLARS

# Validating data types
non_numeric_quantity = orders_received_lf.filter(pl.col('quantity').cast(pl.UInt32).is_null())

Identifying Duplicates

This query helps find duplicate order numbers in the "orders_received" table. It allows you to spot any orders that might have been recorded multiple times, ensuring data accuracy and preventing potential errors.

SQL

SELECT order_number, COUNT(*) AS duplicate_count
FROM orders_received
GROUP BY order_number
HAVING COUNT(*) > 1;

PANDAS

# Identifying duplicates
duplicate_orders = orders_received_df[orders_received_df.duplicated(subset='order_number', keep=False)]

POLARS

# Identifying duplicates
duplicate_orders = orders_received_lf.with_column(
    orders_received_lf.groupby('order_number').agg(pl.count('*').alias('count'))
).filter(pl.col('count') > 1).select(orders_received_lf.columns())

Checking Data Integrity with Foreign Keys

This example verifies that the "customer_id" in the "orders_received" table corresponds to valid customer records in the "customers" table. It helps maintain referential integrity by ensuring all orders are associated with existing customers.

SQL

SELECT *
FROM orders_received
WHERE customer_id NOT IN (SELECT customer_id FROM customers);

PANDAS

# Checking data integrity with foreign keys
foreign_key_mismatch = orders_received_df[~orders_received_df['customer_id'].isin(customers_df['customer_id'])]

POLARS

# Checking data integrity with foreign keys
foreign_key_mismatch = orders_received_lf.filter(
    ~pl.col('customer_id').isin(pl.scan_csv("customers.csv").select("customer_id"))
)

Verifying Data Range or Constraints

This query checks if the "order_date" falls within a specified range, such as a particular year. It helps ensure that the orders recorded in the "orders_received" table adhere to specific temporal constraints or business rules.

SQL

SELECT *
FROM orders_received
WHERE order_date < '2023-01-01' OR order_date > '2023-12-31';

PANDAS

# Verifying data range or constraints
out_of_range_dates = orders_received_df[(orders_received_df['order_date'] < '2023-01-01') | (orders_received_df['order_date'] > '2023-12-31')]

POLARS

# Verifying data range or constraints
out_of_range_dates = orders_received_lf.filter(
    (pl.col('order_date') < pl.lit('2023-01-01')) | (pl.col('order_date') > pl.lit('2023-12-31'))
)

Identifying Records with Invalid or Inconsistent Formats

This example identifies "product_code" values in the "orders_received" table that do not conform to a desired alphanumeric format. It helps identify any entries that might have been incorrectly recorded or contain invalid characters.

SQL

SELECT *
FROM orders_received
WHERE NOT REGEXP_LIKE(product_code, '^[A-Za-z0-9]+$');

PANDAS

# Identifying records with invalid or inconsistent formats
invalid_product_codes = orders_received_df[~orders_received_df['product_code'].str.match(r'^[A-Za-z0-9]+$')]

POLARS

# Identifying records with invalid or inconsistent formats
def does_not_match(s: str) -> bool:
    import re
    return not re.match(r'^[A-Za-z0-9]+$', s)

invalid_product_codes = orders_received_lf.filter(
    pl.col('product_code').apply(does_not_match).alias('invalid_code')
)

Detecting Outliers or Unusual Values

This query identifies any unusually high or low quantities by comparing the "quantity" values in the "orders_received" table with their statistical distribution. It helps detect potential data entry errors or outliers that deviate significantly from the norm.

SQL

SELECT *
FROM orders_received
WHERE quantity < (SELECT AVG(quantity) - (3 * STDDEV(quantity)) FROM orders_received)
   OR quantity > (SELECT AVG(quantity) + (3 * STDDEV(quantity)) FROM orders_received);

PANDAS

# Detecting outliers or unusual values
quantity_outliers = orders_received_df[(orders_received_df['quantity'] < (orders_received_df['quantity'].mean() - (3 * orders_received_df['quantity'].std()))) | (orders_received_df['quantity'] > (orders_received_df['quantity'].mean() + (3 * orders_received_df['quantity'].std())))]

POLARS

# Detecting outliers or unusual values
quantity_outliers = orders_received_lf.filter(
    (pl.col('quantity') < (pl.col('quantity').mean() - (3 * pl.col('quantity').std())))
    | (pl.col('quantity') > (pl.col('quantity').mean() + (3 * pl.col('quantity').std())))
)

Detecting Stale Data

Detecting stale data is an important aspect of data quality checks to ensure the timeliness and relevance of the "orders_received" dataset. Stale data refers to outdated information that no longer reflects the current state of the business. By performing stale data checks, you can identify records that may have become obsolete or irrelevant due to time constraints. This helps maintain the accuracy and reliability of the dataset, allowing for more informed decision-making and analysis. Let’s assume our data should only represent orders from the last 90 days.

SQL

SELECT *
FROM orders_received
WHERE order_date < CURRENT_DATE - INTERVAL '90 days';

PANDAS

# Detecting stale data
today = datetime.now().date()
stale_data = orders_received_df[orders_received_df['order_date'] < (today - timedelta(days=90))]

POLARS

# Detecting stale data
stale_date = today - timedelta(days=90)
stale_data = orders_received_lf.filter(pl.col('order_date') < pl.lit(stale_date))

Detecting Changing Data

Detecting changing data is an important data quality check that helps identify records in the "orders_received" dataset where the values have been modified or updated since a specific reference point. This check allows you to spot instances where data has changed unexpectedly, which could be a result of data corruption or manual errors. By performing this check, you can ensure the accuracy and consistency of the dataset for reliable analysis and decision-making.

To detect changing data, you would typically need to have a reference point or a historical dataset to compare against the current dataset. By comparing corresponding values between the reference dataset and the current dataset, you can identify any differences or changes in the data.

SQL

SELECT *
FROM orders_received
WHERE EXISTS (
  SELECT *
  FROM orders_reference
  WHERE orders_reference.order_number = orders_received.order_number
    AND (
      orders_reference.order_date <> orders_received.order_date
      OR orders_reference.quantity <> orders_received.quantity
      -- Include other relevant columns for comparison
    )
);

PANDAS

# Perform an inner merge to compare the datasets
changing_data = pd.merge(orders_received_df, orders_reference_df, on='order_number', suffixes=('_current', '_reference'))
changing_data = changing_data[changing_data['order_date_current'] != changing_data['order_date_reference']
                              | changing_data['quantity_current'] != changing_data['quantity_reference']
                              ]

POLARS

# Perform an inner join to compare the datasets
orders_received_lf = orders_received_lf.with_column(orders_received_lf['order_date'].alias('order_date_current'))
orders_received_lf = orders_received_lf.with_column(orders_received_lf['quantity'].alias('quantity_current'))

changing_data = orders_received_lf.join(
    orders_reference_lf,
    on='order_number'
).filter(
    (pl.col('order_date_current') != pl.col('order_date_reference'))
    | (pl.col('quantity_current') != pl.col('quantity_reference'))
    # Include other relevant columns for comparison
)

Conclusion

By implementing these data quality checks, you can trust the accuracy of your data and make reliable decisions based on high-quality information. Data quality is an ongoing process, and these techniques serve as essential tools to ensure the integrity and usefulness of your datasets.

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.