25 minute read · April 11, 2025

Pre-Computing Secure Materializations

James Starr

James Starr · Senior Staff Software Engineer, Dremio

Brandon Chong

Brandon Chong · Staff Software Engineer, Dremio

Federated queries on the lakehouse can be slow as terabytes of data is scanned and joined across distributed systems. A common solution to this performance problem is to build access layer views that contain the expensive operations and materialize them so that users access pre-computed data. However, a problem with this approach is that if the pre-computed data contained user or role specific security policies such as column masks or role based access filters, the pre-computed materialization could only be used by a specific user or role. Reflections are Dremio’s query acceleration technology that automatically creates and maintains materialized representations of your datasets. Dremio Reflections do not have this user specific limitation as we can analyze the query plan for a materialization, extract out the column masks and row filters and then re-apply them at query time while ensuring data correctness, security policy enforcement and governance requirements.

How Secure Materializations are Built?

In this blog we will be going over:

For the duration of the blog we will refer to this schema:

CREATE TABLE employees ( employee_id INT, name VARCHAR, hire_date DATE, job_title VARCHAR, department_id INT, manager_id INT, clearance_level INT, salary DECIMAL(10, 2) );

Meta Concept for RCAC

Suppose we create a column mask for the employees table as follows:

-- Only HR is allowed to see the salary 
CREATE FUNCTION protect_salary (val DECIMAL(10, 2)) RETURNS DECIMAL(10, 2) RETURN SELECT CASE WHEN is_member('hr') THEN val ELSE NULL END;

Next, we apply the column mask to the salary column:

ALTER TABLE employees MODIFY COLUMN salary SET MASKING POLICY protect_salary (salary);

Now, let's set up a row policy:

-- Only show employees that have been hired in the last 100 days 
CREATE FUNCTION is_recent_employee (hire_date DATE) RETURNS BOOLEAN RETURN SELECT DATEDIFF(NOW(), hire_date) < 100;

We apply the row policy to the table with:

-- Set the row policy 
ALTER TABLE employees ADD ROW ACCESS POLICY is_recent_employee (hire_date);

With these policies in place, when we run a query like:

SELECT department_id, UPPER(name), salary FROM employees WHERE department_id IN (10, 20, 30)

The query is semantically rewritten to:

SELECT department_id, UPPER(name), salary FROM ( SELECT protect_salary(salary) AS salary, employee_id, name, hire_date, job_title, department_id, manager_id, clearance_level FROM employees AS rcac WHERE is_recent_employee(hire_date) ) AS employees WHERE department_id IN (10, 20, 30);

Here, the row and column access policies are applied first to the table (represented by the subquery), and then the operations from the original query are applied afterward. This ensures data integrity by enforcing row-level security and column masking before the query logic is executed.

Meta Concept for Reflections

Suppose we create a view on the employees table like so:

CREATE VIEW recentEmployees AS SELECT * FROM employees WHERE DATE_PART('YEAR', hire_date) = 2024

We can optionally create a raw reflection on this view to speed up query execution:

ALTER VIEW recentEmployees CREATE RAW REFLECTION recentEmployeesAccelerated USING DISPLAY( employee_id, name, hire_date, job_title, department_id, manager_id, clearance_level, salary );

Now, imagine a user writes a query like this:

SELECT UPPER(name) FROM employees WHERE department_id IN (10, 20, 30) AND DATE_PART('YEAR', hire_date) = 2024

Semantically, the engine can rewrite the query to:

SELECT UPPER(name) FROM ( SELECT * FROM employees WHERE DATE_PART('YEAR', hire_date) = 2024 ) WHERE department_id IN (10, 20, 30)

At first glance, this might seem like a trivial transformation—simply pushing part of the filter into the subquery. It doesn't appear to save much computation. However, the key insight is that the subquery:

SELECT * FROM employees WHERE DATE_PART('YEAR', hire_date) = 2024

...is exactly the same as the recentEmployees view we defined earlier. So, instead of querying the raw employees table, we can semantically rewrite the query to:

SELECT UPPER(name) FROM recentEmployeesAccelerated WHERE department_id IN (10, 20, 30)

This approach is advantageous because the filter (DATE_PART('YEAR', hire_date) = 2024) has already been precomputed in the accelerated view (recentEmployeesAccelerated). This saves us from performing redundant computations, enabling the engine to simply filter the accelerated view and apply the transformation to the name column.

This technique, where we rewrite the query to leverage an already-accelerated view, is known as reflection matching. If a user enables autonomous reflections, this optimization will be automatically applied by the system.

RCAC + Reflections

Let’s revisit the rewritten RCAC query from the first section:

SELECT department_id, UPPER(name), salary FROM ( SELECT protect_salary(salary) AS salary, employee_id, name, hire_date, job_title, department_id, manager_id, clearance_level FROM employees AS rcac WHERE is_recent_employee(hire_date) ) AS employees WHERE department_id IN (10, 20, 30);

Now that we understand how reflections can create a materialized view for use as a subquery, we might consider accelerating this query by materializing a view for:

SELECT protect_salary(salary) AS salary, employee_id, name, hire_date, job_title, department_id, manager_id, clearance_level FROM employees AS rcac WHERE is_recent_employee(hire_date)

However, there's a slight challenge here: both protect_salary and is_recent_employee are context-sensitive functions. This means the results of the subquery can vary depending on factors such as:

  • The user executing the query
  • The timing of query execution
  • User-role group associations
  • The UDF definitions of protect_salary and is_recent_employee

Since these variables can change frequently, constructing a materialized view to serve future queries becomes unreliable. So, we need a different approach.

Rewriting the Query

To solve this, we can rewrite the original query to:

SELECT department_id, upper_name, protect_salary(salary) 
FROM ( SELECT department_id, UPPER(name) as upper_name, salary, hire_date 
      FROM employees 
      WHERE department_id IN (10, 20, 30) ) as acceleratable_employees 
WHERE is_recent_employee(hire_date)

Notice that in this refactored version, the subquery now depends only on static variables (i.e., department_id, UPPER(name), and salary). This makes it possible to create a materialized view on the subquery, as these variables don't change based on context-sensitive factors.

Creating a View and Reflection

Now we can create a view that encapsulates the static part of the query:

CREATE VIEW myViewForRCAC AS SELECT department_id, UPPER(name) as upper_name, salary, hire_date FROM employees WHERE department_id IN (10, 20, 30)

Next, we can establish a reflection on top of this view to accelerate the query:

ALTER VIEW myViewForRCAC CREATE RAW REFLECTION myViewForRCACAccelerated USING DISPLAY( department_id, upper_name, salary, hire_date )

Final Query

Now, the original query semantically becomes:

SELECT department_id, upper_name, protect_salary(salary) FROM myViewForRCACAccelerated WHERE is_recent_employee(hire_date)

Key Concept: Commutative Property of Projections and Filters

This transformation relies on the commutative property of SQL operators—specifically, projections and filters. When a query specifies WHERE A AND B, we can apply A first and then B, or vice versa, and the result will be the same.

This property is extremely useful for optimizing RCAC (Row-Level Security) and Reflections. By applying the row policy and column mask last in the query execution order (after all other projections and filters), we can accelerate the operations preceding these policies. The row policy and column mask will be computed at runtime, while the rest of the query benefits from the materialized view, reducing computational overhead.

Conclusion

By deferring the row policy and column mask to the final filter and projection steps, we can efficiently accelerate queries with reflections, all while maintaining the integrity of row-level security (RCAC) policies. This approach allows for faster query performance without compromising security or correctness.

Mathematical Limitations

Now that we understand how RCAC + Reflections leverages the commutative property of filters and projections to maximize computation pushdown to a materialized view, let's explore scenarios where this transformation isn't possible. These scenarios share the common limitation of non-commutative operators, which we'll delve into for educational purposes.

1. Aggregate on Column Masks

Consider a simple query like:

SELECT MIN(salary) FROM employees;

Expanding the RCAC policy, we get something like:

SELECT MIN(salary) FROM ( SELECT protect_salary(salary) AS salary FROM employees AS rcac WHERE is_recent_employee(hire_date) );

If we try to perform the commutative rewrite, transforming the query to:

SELECT protect_salary(min_salary) FROM ( SELECT MIN(salary) AS min_salary FROM employees ) AS acceleratable_view WHERE is_recent_employee(hire_date);

While the query can be rewritten using the commutative property of filters and projections, a major issue arises: the results differ. This is because protect_salary and MIN are not commutative—the order in which they are applied matters.

  • MIN(salary) computes the minimum salary across all rows, whereas protect_salary(salary) operates on each individual salary value and may mask the salary depending on the user's role.
  • Applying MIN(salary) first will yield a single result (the minimum salary), and then protect_salary will mask that value, while reversing the order leads to applying the protect_salary first, which affects all individual salary values before the MIN is computed.

Thus, in this case, the order of operations cannot be swapped, and the result may be incorrect if we attempt to apply the RCAC policy at the wrong stage.

2. NULL-Generating JOINs

Consider the following query with a RIGHT JOIN:

SELECT emp.department_id, dept.department_name, emp.name FROM employees AS emp RIGHT JOIN department AS dept ON emp.department_id = dept.department_id;

Expanding the RCAC policy, we get:

SELECT emp.department_id, dept.department_name, emp.name, salary 
FROM ( SELECT protect_salary(salary) AS salary, * 
      FROM employees AS rcac 
      WHERE is_recent_employee(hire_date) ) 
      AS emp RIGHT JOIN department AS dept ON emp.department_id = dept.department_id;

If we attempt to rewrite this query as:

SELECT emp.department_id, dept.department_name, emp.name, protect_salary(salary) FROM employees AS emp RIGHT JOIN department AS dept ON emp.department_id = dept.department_id WHERE is_recent_employee(hire_date);

We run into a problem: a RIGHT JOIN generates NULL values on the left side (i.e., employees) for all rows that don't meet the join condition. This creates issues because protect_salary and is_recent_employee are being applied to NULL values, which leads to incorrect results.

General Rule:

  • You can only defer the RCAC policy if it is applied on the "join side" of the join.
    • Left side of LEFT JOIN
    • Right side of RIGHT JOIN
    • Either side of INNER JOIN
    • Neither side of FULL OUTER JOIN

In this case, since the RCAC policy is being applied to employees (the left side of a RIGHT JOIN), the join introduces NULLs which disrupt the application of protect_salary and is_recent_employee.

3. SET Operations (UNION, INTERSECT, EXCEPT)

Consider a query with a UNION:

SELECT * FROM a UNION SELECT * FROM employees UNION SELECT * FROM c;

Expanding the RCAC policy, we get:

SELECT * FROM a UNION SELECT * FROM ( SELECT protect_salary(salary) AS salary, * FROM employees AS rcac WHERE is_recent_employee(hire_date) ) UNION SELECT * FROM c;

We may try to defer the application of protect_salary and is_recent_employee to the entire UNION query:

SELECT protect_salary(salary) AS salary, * FROM ( SELECT * FROM a UNION SELECT * FROM employees UNION SELECT * FROM c ) WHERE is_recent_employee(hire_date);

However, this approach fails because tables a and c do not have the same RCAC policies as employees. Applying protect_salary and is_recent_employee on a and c would lead to inconsistent results, as these tables are not subject to the same row and column-level security policies as employees.

Inconsistent application of policies across different tables makes this transformation invalid, and the results would be incorrect.

4. Trimming Projects

Consider the following query using a subset of columns:

SELECT protect_salary(salary) AS salary, name FROM employees AS rcac WHERE is_recent_employee(hire_date);

Now, if a user tries to expand this query into:

SELECT department_id, UPPER(name), salary FROM ( SELECT protect_salary(salary) AS salary, name FROM employees AS rcac WHERE is_recent_employee(hire_date) ) AS employees WHERE department_id IN (10, 20, 30);

If we attempt to defer the is_recent_employee filter, we get:

SELECT department_id, UPPER(name), salary FROM ( SELECT protect_salary(salary) AS salary, name FROM employees AS rcac ) AS employees WHERE department_id IN (10, 20, 30) AND is_recent_employee(hire_date);

Here, is_recent_employee(hire_date) depends on hire_date, but hire_date was not included in the subquery’s selected columns. This leads to a semantic issue, as hire_date is required for evaluating the row policy but is not available in the view.

This example demonstrates how RCAC (Row-Level Security) and reflections help maintain data integrity when views do not include all the necessary columns to enforce row policies and column masks.

Conclusion

In summary, while RCAC + Reflections benefits from the commutative property of projections and filters in many cases, there are notable limitations due to non-commutative operations, such as aggregates on column masks, NULL-generating joins, SET operations, and trimming projects. Understanding these limitations is critical for building optimized queries that still respect data security and integrity.

Investigating RCAC Pullup Limitations

Due to the limitations outlined earlier (Pre-Computing Secure Materializations | Mathematical Limitations ), RCAC Pullup cannot always be applied to integrate with reflections.

Let’s explore a practical scenario where RCAC Pullup fails to apply effectively, based on the query profile.

Example: Query Profile Analysis

Suppose you are executing a query with an RCAC policy and reflection that fails to accelerate. Upon investigating the "Special Pull Up" phase in the query profile we might observe something like this:

LogicalProject(n_nationkey=[$0], n_name=[$1], n_regionkey=[$2], n_comment=[$3]): rowcount = 6.25, cumulative cost = {131.25 rows, 506.25 cpu, 200.0 io, 200.0 network, 0.0 memory}, id = 5774 ExpansionNode(path=[spaceA.vds3]): rowcount = 6.25, cumulative cost = {125.0 rows, 481.25 cpu, 200.0 io, 200.0 network, 0.0 memory}, id = 5772 LogicalUnion(all=[false]): rowcount = 6.25, cumulative cost = {125.0 rows, 481.25 cpu, 200.0 io, 200.0 network, 0.0 memory}, id = 5770 LogicalFilter(condition=[isEvenNationKey_<spaceA.vds1>($0)]): rowcount = 6.25, cumulative cost = {56.25 rows, 225.0 cpu, 100.0 io, 100.0 network, 0.0 memory}, id = 5758 ExpansionNode(path=[spaceA.vds1]): rowcount = 25.0, cumulative cost = {50.0 rows, 200.0 cpu, 100.0 io, 100.0 network, 0.0 memory}, id = 5756 LogicalProject(n_nationkey=[$0], n_name=[UPPER($1)], n_regionkey=[$2], n_comment=[UPPER($3)]): rowcount = 25.0, cumulative cost = {50.0 rows, 200.0 cpu, 100.0 io, 100.0 network, 0.0 memory}, id = 5754 ScanCrel(table=[cp.nation_ctas.t1."0_0_0.parquet"], snapshot=[7278432390332716255], columns=[`n_nationkey`, `n_name`, `n_regionkey`, `n_comment`], splits=[1]): rowcount = 25.0, cumulative cost = {25.0 rows, 100.0 cpu, 100.0 io, 100.0 network, 0.0 memory}, id = 5707 LogicalProject(n_nationkey=[$0], n_name=[$1], n_regionkey=[$2], n_comment=[$3]): rowcount = 6.25, cumulative cost = {62.5 rows, 250.0 cpu, 100.0 io, 100.0 network, 0.0 memory}, id = 5768 LogicalFilter(condition=[starts_with($1, 'A':VARCHAR(1))]): rowcount = 6.25, cumulative cost = {56.25 rows, 225.0 cpu, 100.0 io, 100.0 network, 0.0 memory}, id = 5766 ExpansionNode(path=[spaceA.vds2]): rowcount = 25.0, cumulative cost = {50.0 rows, 200.0 cpu, 100.0 io, 100.0 network, 0.0 memory}, id = 5764 LogicalProject(n_nationkey=[$0], n_name=[$1], n_regionkey=[$2], n_comment=[$3]): rowcount = 25.0, cumulative cost = {50.0 rows, 200.0 cpu, 100.0 io, 100.0 network, 0.0 memory}, id = 5762 ScanCrel(table=[cp.nation_ctas.t1."0_0_0.parquet"], snapshot=[7278432390332716255], columns=[`n_nationkey`, `n_name`, `n_regionkey`, `n_comment`], splits=[1]): rowcount = 25.0, cumulative cost = {25.0 rows, 100.0 cpu, 100.0 io, 100.0 network, 0.0 memory}, id = 5707

In this profile, we can see that:

LogicalFilter(condition=[isEvenNationKey_<spaceA.vds1>($0)])

is stuck under the LogicalUnion operator:

LogicalUnion(all=[false])

This means we were unable to pull the isEvenNationKey row policy above the UNION operator, aligning with the limitation described here: Pre-Computing Secure Materializations | SET Operations

Other Solutions to Pullup Limitations

If RCAC Pullup fails due to these limitations, we have a few alternatives:

  1. Manual Optimization:
    • Refactor the query manually to ensure the RCAC policy is applied at the correct stage.
    • For example, instead of applying the RCAC policy after a SET operation, try applying it to the individual components before they are unioned.
  2. Leverage Reflections More Effectively:
    • Use raw reflections on smaller components of the query that avoid the problematic operators.
    • Reflections can still be beneficial in cases where the query logic doesn't involve non-commutative operators, like aggregates or unions.

Conclusion

Integrating row column access control with materializations enables Dremio Reflections to deliver high-performance query execution without compromising on security or flexibility, making it an ideal solution for scalable, secure data access in the lakehouse architecture. Furthermore, by enabling pre-compute materializations to be re-usable across users and roles, significant cost savings can be achieved through more efficient engine resource utilization.

Sign up for AI Ready Data content

Unlock the Full Potential of the: Power Your AI Initiatives with Trusted Data

Ready to Get Started?

Enable the business to accelerate AI and analytics with AI-ready data products – driven by unified data and autonomous performance.