16 minute read · August 18, 2025
Column Nullability Constraints in Dremio
Note
The functionality described in this blog post requires Dremio version DC-211 or 26.1.0 or later.
Traditional data lake architectures suffered from, among other shortcomings, the inability to evolve table schemas safely and the lack of mechanisms for enforcing data quality, such as nullability constraints. In real-world applications, nullability enforcement proves essential, for instance in regulatory data (e.g., requiring user_consent to be non-null) or in machine learning pipelines (ensuring features are complete). Apache Iceberg emerged as a solution to bridge these gaps, enabling the modern lakehouse architecture that combines the scalability and cost-effectiveness of data lakes with the reliability and performance characteristics of data warehouses.
Schema evolution in traditional data lakes required expensive full table rewrites when structural changes were needed, making it operationally prohibitive to adapt to changing business requirements. Equally important, traditional data lakes lacked systematic approaches to enforce data quality constraints, particularly around column nullability, that is, the specification of whether fields can contain null values. This absence of nullability enforcement led to inconsistent data quality and often complex downstream data validation requirements.
Apache Iceberg addresses both challenges through its metadata-driven design. It enables schema evolution as metadata-only operations, eliminating the need for data rewrites [1]. Simultaneously, it provides robust nullability constraint enforcement at the schema level, ensuring data quality guarantees are maintained across table versions and schema changes. Additionally, nullability constraints enable query optimizations including predicate simplification by eliminating redundant null checks, enabling filter pushdowns for non-nullable columns of join inputs, and join type rewriting (converting outer joins to inner joins when columns are non-nullable). The interaction between these two capabilities—schema evolution and nullability constraints—creates a comprehensive framework for managing evolving data structures while maintaining data integrity.
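To make these optimizations concrete, consider the following sketch (personTable is defined later in this post; customers and orders are hypothetical tables in which orders.customer_id is declared NOT NULL):
-- Predicate simplification: because name is declared NOT NULL,
-- the IS NOT NULL check is always true and the optimizer can drop it.
SELECT name, age FROM personTable WHERE name IS NOT NULL;

-- Join rewriting: the null-rejecting predicate below lets the optimizer
-- convert the LEFT JOIN into an INNER JOIN; once the join is inner,
-- the NOT NULL declaration on orders.customer_id makes the remaining
-- null check redundant, so it can be eliminated as well.
SELECT c.name, o.total
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.customer_id IS NOT NULL;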
Nullability Constraints in Apache Iceberg
Nullability constraints in Apache Iceberg explicitly declare table columns and fields in nested types as either required (NOT NULL) or optional (NULL) [2]. Required columns and fields enforce that every row must contain a valid, non-null value, providing strong data quality guarantees. Optional columns and fields permit null values to represent missing, undefined, or inapplicable values.
This enforcement mechanism functions during data ingestion and write operations, where violations of nullability constraints result in write failures, and during schema evolution operations, where nullability changes must satisfy compatibility rules. This makes nullability specification a critical design decision for table schemas.
Schema Evolution
Apache Iceberg supports five types of schema modifications as metadata-only operations:
- Add: introduce new columns for the table or new fields in struct types
- Remove: delete existing columns from tables or fields from struct types
- Rename: change the names of table columns or struct fields while preserving their data
- Widen: widen data types for table columns, struct fields, map keys/values, or list elements
- Reorder: rearrange the table columns or struct fields without affecting data values
The first four operations directly interact with nullability constraints and require careful consideration of data integrity implications, as nullability can be specified for both top-level table columns and individual fields within nested types. The schema evolution mechanism relies on unique field identifiers to maintain referential integrity across schema versions, ensuring that nullability constraints are preserved and enforced consistently throughout the table's evolution.
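As a rough illustration, the first four operations map onto ALTER TABLE statements like the following (a sketch modeled on the ALTER TABLE syntax used later in this post; each statement is independent, and the exact syntax for renaming and widening varies by engine):
-- Add: introduce a new optional column
ALTER TABLE personTable ADD COLUMNS ( email VARCHAR );
-- Remove: delete an existing column
ALTER TABLE personTable DROP COLUMN email;
-- Rename: change the column's name while preserving its data
ALTER TABLE personTable CHANGE age age_in_years INT;
-- Widen: promote the column's type from INT to BIGINT
ALTER TABLE personTable CHANGE age age BIGINT;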
Table creation in Apache Iceberg involves defining one or more columns with specified names, data types, and nullability properties. When creating a column, the keyword NOT NULL designates it as required, while the keyword NULL designates it as optional. If neither keyword is present, the column defaults to optional. The following examples show how nullability constraints can be specified.
Example: NULL and NOT NULL Syntax
Let's start with a simple example: the following table defines two columns, name and age. The column name is defined as NOT NULL, while age is defined as NULL.
CREATE TABLE personTable (
name VARCHAR NOT NULL,
age INT
);
INSERT INTO personTable VALUES ('Alice', 25);   -- succeeds
INSERT INTO personTable VALUES ('Bob', NULL);   -- succeeds: age is optional
INSERT INTO personTable VALUES ('Charlie');     -- succeeds: the omitted age is set to NULL
INSERT INTO personTable VALUES (NULL, 30);      -- FAILS

The final insert operation fails because it attempts to insert a null value into the required column name, violating the NOT NULL constraint.
Example: CTAS
Consider the following create table as select (CTAS) query:
CREATE TABLE personTargetTable AS SELECT * FROM personTable;
This CTAS query succeeds, but note that the nullability constraints are not inherited by the target table. This behavior preserves the design principle that columns default to nullable when nullability constraints are not explicitly specified. Consequently, the nullability constraints from the source table do not influence the nullability of columns in the target table.
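Because the constraints are not inherited, a row that the source table would reject is accepted by the target. A small illustration, assuming personTargetTable was created by the CTAS above:
-- succeeds: name is nullable in personTargetTable, unlike in personTable
INSERT INTO personTargetTable VALUES (NULL, 42);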
To create a target table with required columns using CTAS, the nullability constraints must be explicitly specified:
CREATE TABLE personTargetTableWithAgeRequired (
  name VARCHAR NOT NULL,
  age INT NOT NULL
) AS SELECT * FROM personTable;
Note, however, that creating a new table with required columns using CTAS succeeds only when the source data contains no null values for the columns designated as required. In the above example, both personTable.name and personTable.age must be free of null values for the CTAS operation to succeed; given the earlier inserts, the rows with a NULL age would cause this CTAS to fail unless they are removed or backfilled first.
If there is a mismatch between the data types specified for the new table's columns and the data types of the source columns, type coercion behaves as it would for a CREATE TABLE followed by an INSERT INTO. For instance:
CREATE TABLE personTargetTableWithAgeRequired (
name VARCHAR NOT NULL,
age INT NOT NULL
) AS SELECT * FROM (VALUES('Bob', CAST(1.1 AS DOUBLE)));

In this example, the CTAS operation succeeds, but the DOUBLE value 1.1 is truncated to 1 as it is coerced to the INT data type.
Example: DROP NOT NULL
After defining nullability constraints during table creation, schema evolution may require modifying these properties. Apache Iceberg supports nullability changes in both directions: converting required columns to optional and converting optional columns to required. Converting an optional column to required, however, needs default values to handle existing null values (see the Future Improvements: Default Values section below). Converting a required column to optional is safe because it relaxes the constraint without affecting existing data integrity, and it requires an explicit ALTER statement with DROP NOT NULL syntax:
ALTER TABLE personTable CHANGE age age INT DROP NOT NULL;
Requiring the explicit DROP NOT NULL syntax also prevents inadvertently relaxing a column's nullability constraint. For instance, both of the following statements result in an error:

ALTER TABLE personTable CHANGE age age INT NULL;
ALTER TABLE personTable CHANGE age age INT;

Changing a column's nullability constraint from required to optional has to be done explicitly via DROP NOT NULL.
Note that using DROP NOT NULL on an optional column has no effect.
Example: ADD COLUMNS
For the same reason that converting an optional column to a required column is not currently supported, adding a new required column to an existing table is not supported. It is, however, possible to add an optional column. For instance:
ALTER TABLE personTable ADD COLUMNS (
  email VARCHAR NULL,
  date_of_birth DATE
);
Consequently, specifying a NOT NULL constraint on a column to be added results in an error.
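For example, the following statement fails (a small sketch using a hypothetical ssn column):
ALTER TABLE personTable ADD COLUMNS ( ssn VARCHAR NOT NULL ); -- FAILS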
Example: COPY INTO
The nullability constraints are also enforced during data loading operations. The COPY INTO command handles nullability constraint violations based on the source file format and the ON_ERROR mode. For Parquet files, only the default ABORT mode is supported, while CSV and JSON files can be loaded with two additional modes, CONTINUE and SKIP_FILE.
-- This will abort the entire COPY INTO operation if
-- any file contains null values for the 'name' column
COPY INTO personTable
FROM '@my_stage/person_data/'
FILES ('file1.csv', 'file2.csv', 'file3.csv')
ON_ERROR = 'ABORT';

With CONTINUE mode, only the rows with nullability constraint violations are skipped, whereas with SKIP_FILE mode, the entire file is skipped.
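For instance, to load all valid rows while skipping only the violating ones (a sketch reusing the files above; the mode value mirrors the ABORT example):
-- Rows containing null values for the required 'name' column are skipped;
-- all remaining rows from the listed files are loaded.
COPY INTO personTable
FROM '@my_stage/person_data/'
FILES ('file1.csv', 'file2.csv', 'file3.csv')
ON_ERROR = 'CONTINUE';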
While loading CSV and JSON files, NULL_IF can be used to replace specific string values with NULL during loading:
-- This converts the string values 'None' and 'NA' to NULL during CSV loading;
-- the operation aborts if such values appear in the required 'name' column.
COPY INTO personTable
FROM '@my_stage/person_data/'
FILES ('file1.csv', 'file2.csv', 'file3.csv')
(NULL_IF ('None', 'NA'))
ON_ERROR = 'ABORT';

Nullability Constraints on Nested Types
Nullability constraints in Apache Iceberg extend beyond simple column types to nested data structures including structs, maps, and lists. These constraints can be applied at multiple levels within the type hierarchy: the nested type itself can be nullable or required, and individual fields within nested structures can have their own nullability specifications.
For example, the struct type column itself can be declared as optional or required, and additionally, each nested field within the struct can be independently declared as optional or required, providing fine-grained control over data quality requirements within complex nested types.
The following examples demonstrate the various combinations of nullability constraints that can be applied to nested types, showing how these constraints interact at different levels of the type hierarchy.
For the LIST nested type, all four nullability variants are allowed:

CREATE TABLE listsTable (
  c1 LIST<INT NOT NULL> NOT NULL,
  c2 LIST<INT NOT NULL>,
  c3 LIST<INT> NOT NULL,
  c4 LIST<INT>
);

-- row 1: [1, 2, 3], [1, 2, 3], [1, 2, 3], [null, null]
-- row 2: [4, 5, 6], null, [null, null, null], null
INSERT INTO listsTable VALUES ([1, 2, 3], [1, 2, 3], [1, 2, 3], [null, null]);
INSERT INTO listsTable VALUES ([4, 5, 6], null, [null, null, null], null);
Although the nullability of a map key is not explicitly specified, map keys are always NOT NULL. In other words, MAP<INT, INT> is the same as MAP<INT NOT NULL, INT>.
CREATE TABLE mapsTable (
c1 MAP<INT, INT NOT NULL> NOT NULL,
c2 MAP<INT, INT NOT NULL>,
c3 MAP<INT, INT> NOT NULL,
c4 MAP<INT, INT>
);
-- Examples of rows:
-- row 1: {1: 1, 2: 2}, {1: 1, 2: 2}, {1: 1, 2: 2}, {1: null, 2: null}
-- row 2: {4: 4, 5: 5, 6: 6}, null, {4: null, 5: null, 6: null}, null
Again, for the STRUCT type, all nullability combinations are possible.
CREATE TABLE structTable (
c1 STRUCT<a: INT NOT NULL> NOT NULL,
c2 STRUCT<a: INT NOT NULL>,
c3 STRUCT<a: INT> NOT NULL,
c4 STRUCT<a: INT>
);
-- Examples of rows:
-- row 1: {"a": 1}, {"a": 2}, {"a": null}, null
-- row 2: {"a": 1}, null, {"a": null}, null

Future Improvements: Default Values (Iceberg V3)
The nullability constraint specification interacts with Iceberg's schema evolution capabilities in complex ways. When columns are added to existing tables, the nullability constraint determines whether historical data requires backfill operations or can rely on default null values. Similarly, changing a column from nullable to non-nullable requires validation that no existing null values violate the new constraint.
These interactions become particularly significant in production environments where schema changes must maintain backward compatibility while ensuring data integrity across multiple table versions. The Apache Iceberg V3 specification introduces "default values" to capture these interactions [3]. A default value of a column can be interpreted as an attribute of the column that defines the value to be used when no explicit value is provided during data insertion or when adding new columns to existing tables.
The V3 specification defines two types of default values: “write-default” and “initial-default”. The “write-default” value is used when inserting new rows where no value is specified for a column, providing a consistent fallback mechanism for data ingestion operations. The “initial-default” value is applied retroactively to existing data when a new column is added to a table, eliminating the need for expensive backfill operations while maintaining data consistency across all table versions.
Default values interact with nullability constraints in several important ways. For required (NOT NULL) columns, default values provide a mechanism to satisfy the non-null constraint when no explicit value is provided, preventing write failures that would otherwise occur. For optional columns, default values offer an alternative to null when representing missing data, allowing for more semantically meaningful defaults such as zero for numeric fields or empty strings for text fields. This interaction becomes particularly significant during schema evolution, where adding a required column with a default value allows the operation to succeed without violating existing data integrity constraints.
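As a purely hypothetical illustration (the SQL below is an assumption modeled on the standard DEFAULT clause, not syntax currently supported by Dremio), adding a required column with a default could look like:
-- hypothetical V3-style syntax: 'unknown' would act as the initial-default
-- for existing rows and as the write-default for future inserts
ALTER TABLE personTable ADD COLUMNS (
  country VARCHAR NOT NULL DEFAULT 'unknown'
);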
Conclusion
Column nullability serves as a safeguard for reliable data systems. Apache Iceberg's capabilities in enforcing and evolving nullability rules are crucial for ensuring data quality. Understanding nullability semantics, along with the specifics of engine support, is essential for constructing dependable data systems.
References
[1] Schema Evolution in Apache Iceberg: https://iceberg.apache.org/spec/#schema-evolution
[2] Schemas and Data Types in Apache Iceberg: https://iceberg.apache.org/spec/#schemas-and-data-types
[3] Default Values in Apache Iceberg: https://iceberg.apache.org/spec/#default-values