May 2, 2024

Demystifying Data Governance: How Dremio Enables Governed Data Sharing

Strong data governance policies and approaches are essential for data-driven organizations. Data governance helps ensure the quality and reliability of data to drive accurate decision-making. It establishes clear roles and responsibilities, reducing the risk of data misuse. And data governance enforces compliance with regulations, mitigating legal and reputational risks.

Learn how Dremio helps centralize data governance. We’ll discuss how Dremio helps facilitate:

  • Governed data sharing across projects
  • Governing access to data sources and views
  • Regulatory compliance with row- and column-based access controls
  • Integrations with other popular security tools

Topics Covered

Governance and Management

Sign up to watch all Subsurface 2024 sessions


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Alex Merced:

This talk, I’m talking about demystifying data governance. So the goal here is going to be to show you all the different ways that Dremio allows you to share data and govern data. Because there are two things we want to do. We want to share our data, but when we share our data, we want to make sure we don’t accidentally give people access to all the data. We want to make sure we can govern that data. So what does that look like in the Dremio world? How does that become possible? 

Data Sharing

So again, why do we want to do data sharing? Data sharing refers to the practice of making data available to others. Bottom line, a lot of people use Dremio to share data internally within the company, but a lot of people also use Dremio to share data with partners, customers, and other outside parties. So how do we do that? And why do we do this? For collaboration, for analysis, and sometimes for transparency, when you’re sharing data with regulatory agencies and giving them access to your data so they can do the audits that regulations require, et cetera. So how do we make this possible across all the various places you have your data? 

So bottom line, Dremio at its core is built to be the way you query your data lake. The idea here is to basically turn your data lake into your center of data analytics. And Dremio works really well with open table formats: it’s able to read and write Apache Iceberg tables and to read Delta Lake tables. So you have different ways you can connect your data. It’s an open platform, and we try to give you a lot of different ways of connecting to your data. That way, wherever your data is, we can reach it somehow.

Connecting Data Sources

So again, first we’re going to talk about where all this data can come from. We can connect to your on-prem Hadoop data. So if you have data on-prem in Hadoop, or if you’re using some sort of S3-compatible layer like MinIO, VAST, or Pure Storage, we can connect to it. And right there, you’re starting to see some level of data sharing, because if I can connect all the data in one place, then we can share among users right there on the platform. Not only that, I can connect to other databases, like MongoDB, Postgres, and MySQL. And we can connect to data warehouses like Snowflake, Redshift, and Synapse. So there are lots of places Dremio can connect to to make data available to you. And all that data can be turned into views that you can then share with your users and grant access to. And that’s exactly what you would do next. 

Essentially, I would connect all my data sources, and this enables data sharing. Because guess what? Some of the data I might use to enrich my data sets, I might be getting from the Snowflake Marketplace. You want to use data from the Snowflake Marketplace? Go for it, and you can use that data to enrich all your other data sources, wherever they live. Instead of saying, hey, if I want to use Snowflake Marketplace data, I have to use it against Snowflake tables, you can use it against all your other tables, regardless of where they are. Or maybe you’re getting data from the AWS Marketplace, and that data is landing in your S3, or as a table in your Redshift. You can connect that data to Dremio and use it to enrich those shared data sets with your other data sets in other places. So it makes the data sharing story you get from other data sharing platforms much easier. Or maybe you’re getting Delta Lake tables through Delta Sharing. We can read those and let you enrich your other data sets with that data. So we make it easier for you to take advantage of the wide world of data sharing platforms that are out there, along with being a very good platform for data sharing ourselves. So you’d connect all your data sets and build out either your virtual data marts or your data products. 

And generally, to me, when I think of the distinction between marts and products, it’s more of a who’s-in-charge kind of thing. If I’m a central IT team building out these business unit products, I’ll probably think of it more as a virtual data mart, versus if I have a separate team that’s dedicated to that specific business domain, then that’s more of a data mesh data product. Otherwise, it’s the same thing. You’re creating the modeling of that particular set of data: starting with your raw data sets from all your data sources, then the virtual layers on top of them. And once I’ve created those layers, I can begin granting access to them, and there are different ways I can do that. Oh, OK, side point: I can also reflect that. For those who’ve been joining me for the last two sessions, I’ve mentioned reflections a few times. Bottom line, I can use reflections to accelerate those virtual layers, so that I don’t have to actually make every layer physical. But I can begin controlling access to those layers. So I can say, hey, only people with certain roles can access certain data sets. I can create a marketing role, and then basically say, hey, everything inside the marketing data product, if you’re in the marketing role, you have access to it. I could do that. 
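As a sketch, that role-based setup might look something like this in Dremio SQL. The role, view, and user names here are hypothetical, and the exact object keywords (for example, `VIEW` versus `VDS`) can vary by Dremio version, so check the docs for your release:

```sql
-- Create a role for the marketing team (role and object names are hypothetical)
CREATE ROLE marketing;

-- Let everyone in that role query the views in the marketing data product
GRANT SELECT ON VIEW marketing_mart.campaign_performance TO ROLE marketing;
GRANT SELECT ON VIEW marketing_mart.customer_segments TO ROLE marketing;

-- Add an individual user to the role
GRANT ROLE marketing TO USER "ana@example.com";
```

The point of the pattern is that access is attached to the role, not the person, so onboarding someone to the marketing data product is a single `GRANT ROLE` statement.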

I can do row-based access controls. Now, the way row- and column-based access controls work in Dremio is that you do it through a UDF. You would create a UDF, a SQL function, that returns true or false, and depending on that, Dremio determines, OK, does this row or this column get exposed? And then you apply that rule to the table. And the cool thing is, actually, this morning I got asked the question, does Dremio have cell-based access control, where you can limit access to specific cells? Technically, you could achieve that with row- and column-based access rules. You can imagine a world where, if you know the logic for a particular cell they shouldn’t have access to, all you have to do is take the part of the logic that determines the rows and make that a row-based access rule, then take the part of the logic that applies to the column and make that a column-masking rule. At that point, you’re effectively controlling their access to specific cells. So while there’s no specific “this cell cannot be accessed” feature, you can achieve that using a mix of row- and column-based access rules. OK? Cool. 
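As a hedged sketch of the row-level piece, assuming a hypothetical `sales` table with a `region` column: the built-in `is_member()` function exists in Dremio, but verify the exact policy syntax against your version’s documentation.

```sql
-- UDF returning TRUE when the current user may see a given row
-- (table, column, and role names are hypothetical)
CREATE FUNCTION region_filter (region VARCHAR)
RETURNS BOOLEAN
RETURN SELECT is_member('sales_managers') OR region = 'US';

-- Attach the UDF to the table as a row access policy
ALTER TABLE sales ADD ROW ACCESS POLICY region_filter(region);
```

Combine a policy like this with a column masking policy on the same table and the intersection of the two rules gives you the cell-level effect described above.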

Access Controls

So again, you have these fine-grained access rules to really make sure that only the right people have access. Any data that’s curated within these products or marts, anyone you give access to will only have access to what you wanted to give them. OK? And at this point, a lot of people will just create a user account on their Dremio for, say, a customer they’re sharing data with. They’ll create a user account for that customer and grant access to just the specific data sets that customer should have access to. And now they’ve shared that data with that user, limited to only those specific data sets. And those data sets could come from any of this stuff. So I could take internal data that might be sitting on my data lake, enrich it with data from sharing platforms like AWS or Snowflake, and then deliver it to my end user in one place they can easily access, perfectly governed, without me having to physically move everything into one place and pay all those ETL costs to unify it. OK? 
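Provisioning that kind of partner account might look roughly like this in Dremio SQL. This is a sketch: the user name and view are hypothetical, and SQL-based user management depends on your Dremio edition and version, so treat the exact statements as illustrative.

```sql
-- A dedicated account for an external partner (names are hypothetical)
CREATE USER "partner@customer.com" SET PASSWORD 'changeme-on-first-login';

-- Grant access only to the one curated view they should see
GRANT SELECT ON VIEW shared.partner_metrics TO USER "partner@customer.com";
```

Because the grant targets a single curated view, the partner sees exactly the governed slice you modeled, no matter which underlying sources the view draws from.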

But then that user who wants to use that data: if I’m the partner who now has access, I can access that data pretty easily through JDBC and ODBC, which means I can connect pretty much any BI tool, because pretty much every BI tool supports JDBC or ODBC. There’s also a REST API. And the cool thing about Dremio is that the REST API is one way to access the data, but more importantly, everything in Dremio can be done through SQL, including all the admin stuff, like granting permissions. So you can technically automate pretty much the entire operation of a Dremio cluster strictly by sending SQL through the REST API, which people do. But bottom line, that’s another way people can access the data you’ve given them access to. OK? And then there’s Apache Arrow Flight, which is going to be the fastest way to actually access the data, because you’re taking columnar data and pulling it in columnar form. With JDBC and ODBC, by contrast, you’re taking columnar data, turning it row-based, and then reconverting it to columnar on the other end, so it’s a little slower. 

Dremio to Dremio Connector

Now, here’s where another nice sharing aspect comes in. Maybe someone else has another Dremio cluster, and they want to share data with you. You can connect Dremio to Dremio. OK? Another place where the Dremio-to-Dremio connector comes up, where you’re connecting to somebody else’s cluster or to another one of your own, is when people run both a Dremio Cloud and a Dremio Software cluster. Now, why would they want both? Maybe they really like Dremio Cloud and want all its features, but they have some on-prem data sets, so they need a version of Dremio running co-located with that on-prem data. What they do is run Dremio Software next to their Hadoop, and then their Dremio Cloud can connect to that on-prem Dremio, so they can query their Hadoop data sets even from the cloud version through that Dremio-to-Dremio connector. But you could also have two companies that are both using Dremio, and they could use that as another vector for sharing data with each other. So in that way, this is another option. 

So again, I can grab data from all these different sources, which give me access to existing sharing platforms, and then I can share that data with users and govern it in a very granular way across all those data sets. OK? So the moral of the story here is that it doesn’t matter where the data is: I can share it with people, and I can make sure I share it only with the right access. I can hide PII. I can do whatever. One second. You can start seeing the effect of doing three talks back-to-back. Yes. OK. 


And then last, integrations. Dremio has lots of integrations, and a lot of them are built on these existing interfaces. For example, there’s a button right there in the Dremio UI where you can just click to open up Power BI or Tableau, and you’re automatically connected. So there are plenty of integrations that work that way, and you can use those as other ways to connect. And all of this works because, if I’m a user, I can generate my own personal access token. When I use that personal access token to send a query to Dremio through any of these interfaces, it uniquely identifies which user is accessing the data, so Dremio knows how to apply the right access rules. 

So in that case, once I give that user their account and I’ve granted them access, no matter how they access Dremio, I know they’re only accessing what I gave them access to, across all my data sets. OK? It gives me one place to manage it all. It’s not like before, where I had to think through, OK, let me give you a user on AWS so you can access these S3 buckets, and let me create you a user on Snowflake so you can access these particular tables, and you end up giving people five different accounts with five sets of access rules that you have to keep track of just to give them access to all the data. I can do it all from one place, making it really easy to collaborate, share data, and govern it. OK? 

Security and Governance

But again, just to summarize, Dremio has a lot of different governance levers. We have the fine-grained access controls. We have end-to-end authentication, because basically everything has to be accessed through that personal access token. We also have OAuth integration and lots of other SSO-type integrations, so you can keep using the levels of security you may already have in place. And we have all sorts of compliance certifications to make sure that you’re compliant with the relevant security standards; those certifications will come up later. Again, this is what the row-based and column-based access rules look like. You would start by creating a UDF, so here I create a function that returns the result. Then I alter the table and say, hey, apply this masking policy, and I just specify the function. It will then be applied to anybody who accesses the table. 
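That create-a-UDF-then-alter-the-table flow might look roughly like this. Again a sketch: the table, column, and role names are hypothetical, and the exact masking-policy syntax can vary by Dremio version.

```sql
-- Masking UDF: members of the admin role see the real value,
-- everyone else sees a redacted placeholder (names are hypothetical)
CREATE FUNCTION mask_ssn (ssn VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE
    WHEN is_member('admin') THEN ssn
    ELSE 'XXX-XX-XXXX'
END;

-- Apply the UDF to the column as a masking policy
ALTER TABLE customers
MODIFY COLUMN ssn SET MASKING POLICY mask_ssn(ssn);
```

From then on, every query against `customers.ssn` is evaluated through the function, regardless of which interface (UI, JDBC/ODBC, REST, or Arrow Flight) the query arrives through.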

So again, everything’s always very SQL-centric in the Dremio platform. And this makes fine-grained access control possible in a very SQL-centric, accessible way, because SQL is generally accessible to people across a much wider range of technical skill sets. 

End-to-End Authentication

And again, reinforcing that whole end-to-end authentication idea: no matter how you access Dremio, we know who you are. You can send SQL through the UI, but when you go to the UI, you’ve got to log in, so we know who you are. If you want to access the REST API, you’ve got to give us your personal access token, so we know who you are. That way, no matter where you’re coming from, you’ve authenticated, and we know what governance rules apply to you. So you’re never going to be able to sneakily access data you’re not supposed to, because we always know who you are before you access anything. And then you can take advantage of other existing tools like Privacera and PlainID, which all have integrations, and of single sign-on for Power BI and Tableau. So there are all sorts of security and authentication methods that you may already be using that integrate with Dremio, so you don’t have to change the way you’ve been authenticating your users. And again, we have all these certifications, depending on what industry you’re in. If you’re in the medical space, we’re HIPAA compliant. We’re AICPA compliant. We’re ISO certified. That’s just to give you the extra assurance that we’ve tested our software against the things that need to be tested, so you know it’s secure. And that’s pretty much the story here.
So again, just to summarize: the point of this was, one, to know that you can govern your data in Dremio, and two, that the data is shareable. And the beauty of that sharing is that you’re sharing it across so many different data sources, and you can take advantage of other sharing platforms within all of that. That gives us a really unique place in how we can coalesce and unify your data here at Dremio.