Using Data-driven Permissions to Secure Your Data Lake
Row and column level security is an important functionality for the majority of Dremio users. To implement this requirement, Dremio provides the following SQL functions:
- USER IN (list of user names).
Our documentation provides basic information on how to utilize these functions using examples with hard-coded users and groups. In this article, we cover this topic at depth with complete data-driven row and column level permissions.
This is a common example of data tables that contain sensitive data. Let’s take SALES_REGIONS defined below as an example of sensitive data that requires restricted view permissions.
|SALES_VOLUME||Sensitive column. Only marketing people from that region can see that data|
|SALES_CONTACT||Sensitive column. All marketing people can see that data.|
This scenario takes into consideration the following conditions:
- People in the Marketing team can view the sensitive data only if it belongs to their region.
- There are three LDAP groups: Marketing that lists all people in the Marketing team, CA_Org *and *WA_Org that lists people in the entire organization that belongs to the correspondent region.
What to avoid
The easiest way to implement row-level permissions, is to create a SQL statement that would allow users see only rows that they have access to.
1 2 3 4 5 SELECT REGION, SALES_VOLUME, SALES_CONTACT FROM SALES_REGIONS WHERE IS_MEMBER('Marketing') AND ( (REGION = 'WA' AND IS_MEMBER('WA_Org')) OR (REGION = 'CA' AND IS_MEMBER('CA_Org')) )
In order to provide a more comprehensive view to that data, we can implement column-level permission we can create the following SQL statement:
1 2 3 4 5 6 7 8 SELECT REGION, CASE WHEN REGION = 'WA' AND IS_MEMBER('WA_Org') THEN SALES_VOLUME WHEN REGION = 'CA' AND IS_MEMBER('CA_Org') THEN SALES_VOLUME ELSE ' *** ' END SALES_VOLUME, SALES_CONTACT FROM SALES_REGIONS WHERE is_member('Marketing')
However, this SQL code contains hardcoded LDAP groups and this approach is not encouraged since it is error-prone and hard to maintain.
Implementing data-driven permissions
For this example, let’s take into consideration the table SALES_PERM that contains all required permission information:
|TABLE_NAME||Table in the scope of the permission record|
|COLUMN_NAME||Column in the scope of the permission record|
|REGION||Sensitive Data Key|
|LDAP_GROUP||Ldap group that has access to the sensitive data|
The new row level permission query now looks like this:
1 2 3 4 5 6 7 SELECT R.REGION, R.SALES_VOLUME, R.SALES_CONTACT FROM SALES_REGIONS R JOIN SALES_PERM P ON P.TABLE_NAME = 'SALES_REGIONS' AND P.COLUMN_NAME = 'SALES_VOLUME' AND R.REGION = P.REGION AND IS_MEMBER(LDAP_GROUP)
The column level permission query would look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 SELECT R.REGION, CASE WHEN P.REGION IS NOT NULL THEN R.SALES_VOLUME ELSE ' *** ' END SALES_VOLUME, R.SALES_CONTACT FROM SALES_REGIONS R JOIN SALES_PERM RP ON RP.TABLE_NAME = 'SALES_REGIONS' AND RP.COLUMN_NAME IS_NULL AND RP.REGION IS NULL AND IS_MEMBER(RP.LDAP_GROUP) LEFT JOIN SALES_PERM CP ON CP.TABLE_NAME = 'SALES_REGIONS' AND CP.COLUMN_NAME = 'SALES_VOLUME' AND CP.REGION = R.REGION AND IS_MEMBER(CP.LDAP_GROUP)
This approach does not have any data hardcoded and is a much better approach from maintenance and security perspective.
Considerations and Best Practices
Designing a permission table is a challenging task and depends on the concrete requirements. So far, the following approaches have been considered and implemented:
As per the example above, a single table can contain row and column level permissions for all tables. While it may look like a simple design that would also be easy to maintain, the number of columns in this table may start growing quickly. This approach may work well when the number of sensitive columns is small and/or the column names are consistent across all tables and databases.
Another approach is to use separate permission tables for different tables and columns. This approach would allow for more flexibility but has the potential to increase maintenance overhead.
A great approach to consider is to utilize a single key/value table where key could be a table name or a combination of a table name and and a column name. Value would be a permission flag, such as Y/N. This table can be pivoted with views to make it easy to consume in the virtual data sets (VDSs).
Instead if making the permission flag a Y/N, we can support multiple values, such as:
- 100 - no access, completely mask the data
- 200 - partial access, show last 4 digits
- 300 - full access, show data as is.
That approach would allow for various levels of security as well as easy potential CI/CD automation.
More normalized/advanced design can support additive and inverse LDAP groups, such as Marketing and WA_Org from the example above. This approach would most likely require LDAP_GROUPS table with many-to-many relationship to other permission table(s).
What to look for
Row duplicates: when following any of the design options above, there is always the risk of generating duplicate rows. It’s possible that a query user belongs to multiple LDAP groups listed in the permission table. When joining a data table, it might produce duplicate rows. Here are a couple of options to consider:
- Assign the DISTINCT clause. Depending on the data size, that option might be costly.
- The other, better option, is to prevent duplicates by ensuring that the permission table does not return more than one record. For example, when using a permission flag. We can use MIN function to support the “least permission” approach or MIN to support “most permission” approach.
This approach has been tested on sub-second queries with datasets as large as 50M records without any impact on performance. However, performance is still a significant design consideration. Improper design (using DISTINCT for example) can drastically increase execution time. It’s important to validate design from a performance perspective.
In this article, we walked through the steps of implementing security on your data lake using data-driven permissions using Dremio. We have shared a sample scenario that is very similar to what many enterprises face when they need to secure data. We have also talked about design considerations, best practices and things to keep an eye out for when performing this implementation.
We hope you find this information useful, to learn more about how you can enhance the security of your data lake using Dremio, checkout these resources:
- Dremio Data security documentation
- Dremio Security Architecture Guide
- Dynamic security controls - masking sensitive data using Dremio
- Apache Ranger integration in Dremio
- How to set up HDFS and HIVE impersonation
- How to use inbound impersonation
- Configuring Dremio to read S3 files leveraging AWS STS tokens