Row and column level security is an important functionality for the majority of Dremio users. To implement this requirement, Dremio provides the following SQL functions:
is_member(ldap_group)
query_user()
USER IN (list of user names).
Our documentation provides basic information on how to utilize these functions using examples with hard-coded users and groups. In this article, we cover this topic at depth with complete data-driven row and column level permissions.
The Scenario
This is a common example of data tables that contain sensitive data. Let’s take SALES_REGIONS defined below as an example of sensitive data that requires restricted view permissions.
Column
Description
REGION
Non-sensitive column
SALES_VOLUME
Sensitive column. Only marketing people from that region can see that data
SALES_CONTACT
Sensitive column. All marketing people can see that data.
Region
Sales_Volume
Sales_Contact
WA
1,000,000.00
Sam
CA
1,100,000.00
Jon
This scenario takes into consideration the following conditions:
People in the Marketing team can view the sensitive data only if it belongs to their region.
There are three LDAP groups: Marketing that lists all people in the Marketing team, CA_Org *and *WA_Org that lists people in the entire organization that belongs to the correspondent region.
Try Dremio’s Interactive Demo
Explore this interactive demo and see how Dremio's Intelligent Lakehouse enables Agentic AI
What to avoid
The easiest way to implement row-level permissions, is to create a SQL statement that would allow users see only rows that they have access to.
SELECT REGION, SALES_VOLUME, SALES_CONTACT
FROM SALES_REGIONS
WHERE IS_MEMBER('Marketing') AND
( (REGION = 'WA' AND IS_MEMBER('WA_Org')) OR
(REGION = 'CA' AND IS_MEMBER('CA_Org')) )
In order to provide a more comprehensive view to that data, we can implement column-level permission we can create the following SQL statement:
SELECT
REGION,
CASE WHEN REGION = 'WA' AND IS_MEMBER('WA_Org') THEN SALES_VOLUME
WHEN REGION = 'CA' AND IS_MEMBER('CA_Org') THEN SALES_VOLUME
ELSE ' *** ' END SALES_VOLUME,
SALES_CONTACT
FROM SALES_REGIONS
WHERE is_member('Marketing')
However, this SQL code contains hardcoded LDAP groups and this approach is not encouraged since it is error-prone and hard to maintain.
Implementing data-driven permissions
For this example, let’s take into consideration the table SALES_PERM that contains all required permission information:
Column
Description
TABLE_NAME
Table in the scope of the permission record
COLUMN_NAME
Column in the scope of the permission record
REGION
Sensitive Data Key
LDAP_GROUP
Ldap group that has access to the sensitive data
TABLE_NAME
COLUMN_NAME
REGION
LDAP_GROUP
SALES_REGIONS
Marketing
SALES_REGIONS
SALES_VOLUME
CA
CA_Org
SALES_REGIONS
SALES_VOLUME
WA
WA_Org
SALES_REGION
SALES_CONTACT
Marketing
The new row level permission query now looks like this:
SELECT R.REGION, R.SALES_VOLUME, R.SALES_CONTACT
FROM SALES_REGIONS R
JOIN SALES_PERM P ON
P.TABLE_NAME = 'SALES_REGIONS' AND
P.COLUMN_NAME = 'SALES_VOLUME' AND
R.REGION = P.REGION AND
IS_MEMBER(LDAP_GROUP)
The column level permission query would look like this:
SELECT
R.REGION,
CASE WHEN P.REGION IS NOT NULL THEN R.SALES_VOLUME
ELSE ' *** ' END SALES_VOLUME,
R.SALES_CONTACT
FROM SALES_REGIONS R
JOIN SALES_PERM RP ON
RP.TABLE_NAME = 'SALES_REGIONS' AND
RP.COLUMN_NAME IS_NULL AND
RP.REGION IS NULL AND
IS_MEMBER(RP.LDAP_GROUP)
LEFT JOIN SALES_PERM CP ON
CP.TABLE_NAME = 'SALES_REGIONS' AND
CP.COLUMN_NAME = 'SALES_VOLUME' AND
CP.REGION = R.REGION AND
IS_MEMBER(CP.LDAP_GROUP)
This approach does not have any data hardcoded and is a much better approach from maintenance and security perspective.
Considerations and Best Practices
Designing a permission table is a challenging task and depends on the concrete requirements. So far, the following approaches have been considered and implemented:
As per the example above, a single table can contain row and column level permissions for all tables. While it may look like a simple design that would also be easy to maintain, the number of columns in this table may start growing quickly. This approach may work well when the number of sensitive columns is small and/or the column names are consistent across all tables and databases.
Another approach is to use separate permission tables for different tables and columns. This approach would allow for more flexibility but has the potential to increase maintenance overhead.
A great approach to consider is to utilize a single key/value table where key could be a table name or a combination of a table name and and a column name. Value would be a permission flag, such as Y/N. This table can be pivoted with views to make it easy to consume in the virtual data sets (VDSs).
Instead if making the permission flag a Y/N, we can support multiple values, such as:
100 - no access, completely mask the data
200 - partial access, show last 4 digits
300 - full access, show data as is.
That approach would allow for various levels of security as well as easy potential CI/CD automation.
More normalized/advanced design can support additive and inverse LDAP groups, such as Marketing and WA_Org from the example above. This approach would most likely require LDAP_GROUPS table with many-to-many relationship to other permission table(s).
What to look for
Row duplicates: when following any of the design options above, there is always the risk of generating duplicate rows. It’s possible that a query user belongs to multiple LDAP groups listed in the permission table. When joining a data table, it might produce duplicate rows. Here are a couple of options to consider:
Assign the DISTINCT clause. Depending on the data size, that option might be costly.
The other, better option, is to prevent duplicates by ensuring that the permission table does not return more than one record. For example, when using a permission flag. We can use MIN function to support the “least permission” approach or MIN to support “most permission” approach.
This approach has been tested on sub-second queries with datasets as large as 50M records without any impact on performance. However, performance is still a significant design consideration. Improper design (using DISTINCT for example) can drastically increase execution time. It’s important to validate design from a performance perspective.
Conclusion
In this article, we walked through the steps of implementing security on your data lake using data-driven permissions using Dremio. We have shared a sample scenario that is very similar to what many enterprises face when they need to secure data. We have also talked about design considerations, best practices and things to keep an eye out for when performing this implementation.
We hope you find this information useful, to learn more about how you can enhance the security of your data lake using Dremio, checkout these resources:
Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg
By unifying data from diverse sources, simplifying data operations, and providing powerful tools for data management, Dremio stands out as a comprehensive solution for modern data needs. Whether you are a data engineer, business analyst, or data scientist, harnessing the combined power of Dremio and Apache Iceberg will undoubtedly be a valuable asset in your data management toolkit.
Oct 12, 2023·Product Insights from the Dremio Blog
Table-Driven Access Policies Using Subqueries
This blog helps you learn about table-driven access policies in Dremio Cloud and Dremio Software v24.1+.
Aug 31, 2023·Dremio Blog: News Highlights
Dremio Arctic is Now Your Data Lakehouse Catalog in Dremio Cloud
Dremio Arctic bring new features to Dremio Cloud, including Apache Iceberg table optimization and Data as Code.