March 2, 2023

10:45 am - 11:15 am PST

Data Governance at Scale with Microsoft

Scaling data governance practices while continuing to provide efficient access to the data lake is a challenge all businesses face. No one knows this better than Microsoft, whose exabyte-scale data lake platform supports analytics that empower multiple billion-dollar lines of business. Join this session to hear Mike Flasko, VP & GM, Data Governance & Privacy at Microsoft, chat with Read Maloney, CMO of Dremio, about their journey and lessons learned, as they built one of the largest governed data lake platforms in the world.

Topics Covered

Enterprises
Keynotes
Lakehouse Architecture
Real-world implementation


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Read Maloney:

Guys, it’s a pleasure to be here to talk with Mike. Mike, thanks for joining us. I see an awesome Kraken jersey in the background there. For all of those not from Seattle, go Kraken. We’re having a great season. Mike, could you just briefly introduce yourself and your role at Microsoft, and we’ll jump in from there?

Mike Flasko:

Sure, it sounds good. First, thanks for having me. I really appreciate it. At Microsoft, I’m VP and GM for Data Governance and Privacy products that we deliver to our customers. The other part of our group is a team of practitioners and engineers that manage Data Governance and Data Privacy for Microsoft itself.

Read Maloney:

I know you guys are operating a massive lake at a huge scale. Maybe you can give us just some idea. Microsoft has all these different lines of business. You have Windows and Xbox, and you have Azure. Is all of that in the lake? What’s the scope of what you’re managing from a governance perspective?

Mike Flasko:

Sure. Yeah, great question. Across all of Microsoft, from Windows, to Office, to Azure, Dynamics, Xbox, and kind of everything in between, the company leverages a Data Lake, or a collection of Lakes, as the backing store for a lot of our data. Last I checked, that was running north of about 13 exabytes in scale. You’ve got thousands of teams and tens of thousands of users interacting with it in different ways, from data science through to BI and other workloads.

13 Plus Exabyte Lake

Read Maloney:

That’s just a massive scale. I assume you didn’t start with, Hey, there’s data everywhere, and now you’ve created a 13-plus-exabyte Lake. Maybe you can help me understand how and why the lake got formed? What was the evolution of that? How did it even get to where it is right now?

How Did It Get to Where It Is Right Now?

Mike Flasko:

That’s a great question. I think we probably started like a number of companies. It started quite a while ago for us, kind of before Hadoop was a word, but the idea was ultimately, how can we collect more at scale and derive new insights? That started in a couple of our major businesses, Bing and Windows and whatnot. As the business value of uncovering new signals from data started to be seen, it started to take off across the company. You started to see different organizations get curious about what they could do with data. Then Data Lakes formed this new opportunity to store data at scale at a cost that hadn’t been possible before.

Read Maloney:

Are you still seeing it evolve? Is it still growing, and are you seeing more and more adoption of the Lake as, let’s say, performance and technology have been improving? Or is it now the de facto standard for how analytics is done across the company?

De Facto Standard

Mike Flasko:

I would say across the company, it is the basis for collecting data. Especially as all the, let’s call it raw data comes into the organization and starts its journey through refinement or enrichment or model training or whatever it may be, that really is the basis for collection of data. It helps us scale that across the org. The challenges now are, of course, continuing to scale the infrastructure to keep up with the ever-growing desire for more data across the company. I think the new challenges have come in terms of making it more efficient to find the data you need, interact with others, and ensure we’re doing it in appropriate ways, whether that’s a privacy need or an expectation of our customers. It’s really about driving efficiency of use of the data in the Lake as we’ve matured on our journey. But it started out, as I think it does for many: how do you get established with the infrastructure? How do you get the data in? How do you get the business buy-in and evolve from there?

Biggest Challenges

Read Maloney: 

Maybe we’ll talk about the scale you’re dealing with. Probably one of the biggest challenges you have is the classification of data, just even the knowledge of what’s in the Lake. Maybe you can talk to me about how you guys are doing that, and then how you’ve started to solve the problems you’ve come across as you’ve gone on this journey. At the scale you’re operating, it’s not possible to just manually have people enter information and get the metadata of what’s landing. How are you doing this at such a scale every day?

Mike Flasko:

That’s a fantastic question. I think it’s one of the challenges where many of us have tried different approaches and tried to learn what works as this scales across your organization. For us it started out where we were collecting a lot of data, but we didn’t have that enhanced knowledge about it, classification and so forth. One of the approaches we’ve taken is, how do we apply techniques people use on the public internet against the Data Lake? We built a technology that allows us to look across the Data Lake and scan the data lake, if you will, and apply classification on an ongoing basis. We’re constantly understanding the Lake, or we’re understanding it in new ways as new business requirements come in. We’re scanning it at scale about twice a day to understand the semantics of what’s there.

From a maturity perspective, we constantly want to be driving this more upstream, right? As tools are collecting data wherever it’s born and bringing it together, we want to understand as much about that data on the way in as possible; that ultimately makes it more usable and discoverable for our users. But practical realities come in, right? There’s all kinds of tools and approaches for bringing data into a Lake. We’ve found over time that a two-pronged approach works: trying to classify and annotate data on the way in, but also making sure we have automation and data scanning so that for anything that gets in there, we have a better understanding of what it is and where it came from.
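To make that two-pronged approach concrete, here is a minimal Python sketch of the idea: classify data as it is ingested, and also sweep the lake on a schedule to catch anything that arrived unlabeled. The regex classifiers, catalog structure, and helper callables are hypothetical simplifications, not Microsoft’s actual scanner.

```python
import re
from dataclasses import dataclass, field

# Illustrative regex classifiers; a real scanner uses far richer detection.
CLASSIFIERS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

@dataclass
class CatalogEntry:
    path: str
    classifications: set = field(default_factory=set)

def classify_sample(sample_text: str) -> set:
    """Return the classification labels that match a data sample."""
    return {label for label, rx in CLASSIFIERS.items() if rx.search(sample_text)}

def ingest(path: str, payload: str, catalog: dict) -> None:
    """Prong 1: classify data on the way in, so metadata exists from day one."""
    catalog[path] = CatalogEntry(path, classify_sample(payload))

def scan_lake(list_files, read_sample, catalog: dict) -> None:
    """Prong 2: periodically sweep the lake and (re)classify whatever landed."""
    for path in list_files():
        entry = catalog.setdefault(path, CatalogEntry(path))
        entry.classifications |= classify_sample(read_sample(path))
```

The design point is that both prongs write to the same catalog, so downstream discovery and policy tools see one consistent view regardless of how the data arrived.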

Read Maloney:

It sounds like one of the first pieces of technology you had to create, and where you started, was: let’s look at everything new that came into the Lake since a certain time, scan that, apply some metadata to that data, and also apply the classification as part of that metadata. Then you’ve been evolving that to try to get even farther upstream into the data pipelines. Is that right?

Scale and Analyze

Mike Flasko:

Yeah. I think that’s a great articulation of it, but the one thing that we learned very quickly is: you can’t just start from here and move forward in time, because quickly you’re going to need to ask questions about historical data for deep analysis reasons or compliance reasons or whatnot. We ultimately had to build something that could scale to the size of our Lake and could analyze it, because if you decide that you want to classify data in a new way, oftentimes our users require us to classify all data in the Lake starting back from the beginning of time. So, we had to build something that could really scale and take advantage of the scale characteristics of a Data Lake to work across such a huge volume of data. And then, absolutely, the ideal is you get it upstream into the ingestion tools, and the ingestion tools are able to label and classify the data as it comes in so that you have, if you will, the most mature governance stance the moment the data comes in. That’s a continuing journey for us.
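The historical backfill Mike describes, reclassifying the whole lake when a new rule is defined, is naturally a distributed job. Below is a hedged PySpark sketch of that shape; the storage paths, the single text column named `value`, the rule, and the catalog sink are all invented for illustration, and a production scanner would sample files rather than read whole datasets.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("classification-backfill").getOrCreate()

NEW_RULE_NAME = "EMPLOYEE_ID"   # the newly defined classification (hypothetical)
NEW_RULE_REGEX = r"E\d{7}"      # illustrative pattern for the new rule

# Read the whole historical lake, not just recent arrivals.
df = spark.read.parquet("abfss://raw@mylake.dfs.core.windows.net/*/*")

# Flag every file containing at least one row that matches the new rule,
# assuming a single text column named "value" for simplicity.
flagged = (
    df.withColumn("path", F.input_file_name())
      .withColumn("hit", F.col("value").rlike(NEW_RULE_REGEX))
      .groupBy("path")
      .agg(F.max("hit").alias("contains_new_class"))
      .filter("contains_new_class")
      .withColumn("classification", F.lit(NEW_RULE_NAME))
)

# Append the results to a (hypothetical) governance catalog location.
flagged.write.mode("append").parquet(
    "abfss://governance@mylake.dfs.core.windows.net/classifications/")
```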

Read Maloney:

Yeah, that’s fantastic. This is obviously something you guys built to solve your own problems internally. Just from a scale perspective, is this something that the customers or people watching here at Subsurface are able to take advantage of as well, as they’re thinking about, Hey, how do I deal with governance across our lakes? How do we ensure we actually have knowledge of what’s in there, and get the metadata and the classifications in as quickly as possible?

Interacting With CDOs, CISOs, and CPOs Across Organizations

Mike Flasko:

Yeah, it’s a great question. Short answer, yeah. One of the fun parts of the role, as we were solving our own problems, is that I could interact with a lot of CDOs, CISOs, and CPOs across organizations, and it turns out a lot of folks had these common challenges. What we’ve been doing is productizing some of the work and generalizing it to work on different infrastructures, in different clouds and on-prem, to be able to bring this scanning and classification capability to market. Currently, we deliver this through a product we call Microsoft Purview, which is our governance product for Data Lakes and databases and BI tools.

Read Maloney:

That’s great. In terms of that, is moving into the pipelines part of that roadmap, or is that a different technology you’re developing, where Purview and something else will work together? Or are you trying to put all of that into one overall solution for customers?

Enriching Data With Purview

Mike Flasko:

That’s a great question. If I can step back and look at it from a general governance perspective, and put my practitioner hat on: all of us need visibility into however data’s being used, wherever it’s being used, whatever tool is interacting with it. With that thought in mind, and hearing that from our customers, the way we’re approaching this is really turning Purview into a platform. We think of it as: it’s your metadata, not ours. What we’re doing is creating as many open APIs and plugin points as we can, so that anybody can participate in enriching the understanding of data coming into the Lake, and ultimately the customer can get the most value from it. Short answer, no, it’s not a closed stack tied to our ingestion tools or whatnot. Instead, we’re thinking about how we can enable customers to create a data map of all the data they have, wherever it came from, and then allow anybody to participate in enriching and reading from that map. Ideally, ingestion tools and other tools can both enrich a customer’s understanding of their data in the Lake and also benefit from it, maybe by helping automate some of their experiences.
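As a flavor of what “open APIs and plugin points” can look like, here is a hedged sketch of an external tool enriching an entity through an Atlas-style catalog REST API such as the one Purview exposes. The account URL, type name, classification label, and token handling are illustrative assumptions, not a verbatim Purview recipe.

```python
import requests

# Hypothetical account endpoint and a pre-acquired Azure AD bearer token.
CATALOG = "https://my-account.purview.azure.com/catalog/api/atlas/v2"
TOKEN = "<azure-ad-access-token>"

# An ingestion tool annotating a lake path it just landed data into.
entity = {
    "entity": {
        "typeName": "azure_datalake_gen2_path",
        "attributes": {
            "qualifiedName": "https://mylake.dfs.core.windows.net/raw/sales/",
            "name": "sales",
            "description": "Raw sales events, enriched by the ingestion tool",
        },
        "classifications": [{"typeName": "MICROSOFT.PERSONAL.EMAIL"}],
    }
}

resp = requests.post(
    f"{CATALOG}/entity",
    json=entity,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
```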

Read Maloney:

So there are tools that’ll automate a portion of, let’s say, the metadata, but then individuals, or even other tools, can come in and enrich that metadata. And then that’s broadly available, or available through the different governance policies that you’re setting. That’s great.

That’s outstanding. One of the things that I want to touch on a little bit is, obviously you guys have so much data in the Lake, and that can get expensive. If you’re scanning all the records all the time, or you’re querying across these massive data sets, that can add a lot of cost. How have you guys thought about where the value is? Which data has the most value? And then, how do you deal with the trade-off between value and cost, and cost management, across a Lake that’s so big?

Value, Cost, and Cost Management

Mike Flasko:

Fantastic question. I think we all try to manage that trade-off all the time, right? What we’ve found as a governance team is that governance can at times be approached a little academically, but it’s so important to think about the analysis we’re doing of the data: how can we give the Lake owner and the Lake administrator more insight into where things are being used the most? What are people querying over the most? Where’s the most sensitive data that you should keep a closer watch on than the rest? We try to provide those insights back, both to our users across Microsoft and to the various domain administrators that we have, so that they can optimize. Optimize their investment from a governance and metadata perspective, but also optimize how that data is stored, when you think about tiering and what needs to be in hotter, warmer areas of the Lake versus cooler areas of the Lake.

What we found is that the governance layer, that enriched metadata layer, can really help both in cost optimization of queries and in cost optimization when you think about governance and compliance and that side of it. We think of that within Microsoft as one of our primary tasks. I think the important thing, which perhaps doesn’t get the rigor it deserves in certain organizations, is defining those KPIs for your business: showing how your investment in governing the Lake helps you improve both from a cost side and an efficiency side.
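A toy version of that usage-driven insight might look like the following: aggregate recent query counts per dataset and suggest a storage tier. The log format and thresholds are invented for illustration; in practice the same signal can feed both tiering decisions and the governance KPIs Mike mentions.

```python
from collections import Counter
from datetime import datetime, timedelta

def recommend_tiers(query_log, now=None, hot_hits=100, warm_hits=10, window_days=30):
    """query_log: iterable of (dataset_path, accessed_at) tuples.

    Yields (path, recent_hits, suggested_tier). Thresholds are illustrative.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=window_days)
    hits = Counter(path for path, ts in query_log if ts >= cutoff)
    for path, n in hits.items():
        tier = "hot" if n >= hot_hits else "warm" if n >= warm_hits else "cool"
        yield path, n, tier

# Example: one dataset queried heavily, one barely touched.
log = [("sales/orders", datetime.utcnow())] * 250 + [("hr/survey_2019", datetime.utcnow())]
for rec in recommend_tiers(log):
    print(rec)  # ('sales/orders', 250, 'hot') then ('hr/survey_2019', 1, 'cool')
```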

Read Maloney:

So, you let the business owners end up having the responsibility, and then, through some of the technology and work you’re doing, you’re able to provide them with the reporting or the analysis, saying, Hey, here’s essentially what your business is spending and here’s how your analysis is looking. I don’t know if you go so far as to say, here are some recommendations we’d suggest, so you can take ownership of how you’re managing your costs within the lake.

Take Ownership of Managing Cost

Mike Flasko:

Absolutely. The other key thing here, in addition to thinking about storage and whatnot, is some big insights we learned as we started to get more understanding of the data. Especially when you can start to understand the relationships between data sets in your data lake, through a lineage perspective and whatnot, you can quickly find where people across the company are doing duplicate work. You could be building on the work of others and expanding it, versus getting to that curated set for a second time. Especially as we see lakes adopted in mid-to-large organizations, things are so distributed that it’s sometimes hard to know there is interesting work being done in sections of the data. What we found is that understanding individual data sets, but also the relationships between them in the lineage, helped us shine a light for different groups: Hey, you know what? There’s duplicate work here, and you could build off the work over here, and you guys could join forces. That helped us really optimize how we thought about continually building on the work of others, versus always starting, if you will, from the raw state in the lake.
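One way to automate the duplicate-work detection Mike describes is to compare the upstream lineage of derived datasets: heavily overlapping inputs suggest two teams are refining the same raw data. Here is a small illustrative sketch with networkx; the graph contents and threshold are invented.

```python
import networkx as nx

# Lineage edges point from upstream source to derived dataset.
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw/clicks", "teamA/sessions"),
    ("raw/orders", "teamA/sessions"),
    ("raw/clicks", "teamB/user_journeys"),
    ("raw/orders", "teamB/user_journeys"),
])

def overlap(g, a, b):
    """Jaccard similarity of two datasets' upstream lineage."""
    ua, ub = nx.ancestors(g, a), nx.ancestors(g, b)
    return len(ua & ub) / max(1, len(ua | ub))

if overlap(lineage, "teamA/sessions", "teamB/user_journeys") > 0.8:
    print("Likely duplicate work: connect Team A and Team B")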

Read Maloney:

Is that something technology’s helping you identify, those overlapping areas, or is it something where the team just looks at patterns and you have analysts pulling that out? I don’t want to say necessarily manually, but with some help. Or have you really gotten to almost full automation on that?

Overlapping Data

Mike Flasko:

For us, at least at our scale of users, analysts wouldn’t fly. We’ve really tried to automate this. We think of it from a governance perspective: you’re always going to have people and process to drive the program forward, but we’re really trying to bring as much technology assist as we can. The more we can learn about the interaction with the data, how it moves, how it’s being transformed, the more we can suggest to folks: do you know what, there’s a piece of data that your colleague across the company worked on that may be a great starting point for you. What we’ve started to do inside the company is generate those reports for our different data owners, in terms of: what yours looks like, the heritage of the data you’re pulling in, matches a lot with what we’ve seen in other areas. And so we try to connect those folks across the company so they can collaborate. They can have a discussion and see if there’s work they can build on for each other, or at least share ideas.

Read Maloney:

As you find those in your reporting, are you sending an email to the lines of business and connecting the people, just saying, “Hey, you guys should get together, we noticed this”? Or does that collaboration happen in a different manner? I’m getting at how you’re bridging between the central team doing governance and the lines of business operating. When you guys find something out, how do you get that connective tissue of those insights working in your favor and in the business’s favor?

Collaboration

Mike Flasko:

That’s a great question. I think the key thing is really enabling collaboration in the tools that users are in and making it as seamless as possible. In the tools that we provide our users to discover and understand data, we’re trying to surface insights right there to them. It’s not like they have to start an email chain to interact and get connected. Our thinking is that the goal of the tools really is to help people be more productive with data. That means helping them find other interesting ideas that are similar to theirs, and helping them get connected with other folks in the organization, right in the tool chain they’re in. In our data catalogs and other data discovery experiences, we’re creating collaboration capabilities that surface these insights and help them get connected, and maybe start a Teams chat or a Slack chat with those other folks that are working on common problems they may not even know about. We think a lot about the time from when you discover data to when you’re using and interacting with it and collaborating with others working in similar domains: how can we shorten that and allow people to iterate as fast as possible?

Permissions For Data Use

Read Maloney:

That’s great. Mike, I’m going to shift, because we’ve been talking a lot about how the data comes in and you apply metadata and create this map. Let’s flip to the user side of the house, the data consumer. Let’s say somebody new starts at Microsoft and we’ve got to figure out, we want to give them as much access as possible, so they can be agile, move quickly, and drive their business forward, but that also has to adhere to a set of policies that you have. How does that work? If someone starts, there’s no way that can just be manual at that scale. You said you have tens of thousands of users. How do you know what permissions to give them? How do they get additional permissions? How are they going to get access to the data they need without a bunch of bottlenecks?

Mike Flasko:

I think that really gets at the crux of the challenge, right? For us, we think of governance as trying to enable as much agility around data as possible while always ensuring right use. Meaning, data is being accessed at the right time, per corporate standards or regulatory requirements. To get to the heart of your question, we’re approaching it in two ways. One is, we really want to empower the data administrators, the policy makers of your organization, to be able to articulate how data should be used and how data may be used in your organization. What we’ve found in talking to customers is everybody’s doing this today: defining the policies for data use so that when new people join, they can automatically be granted certain amounts of access and not others.

The challenge is taking those policies, which are often expressed in English or in natural language, and enforcing them on the data lake at scale. We’ve been spending a lot of time working on technology for articulating data policy that is automatically pushed down and enforced inside native data systems like data lakes and databases, so that we can be clear about what type of access should be granted. How it translates down isn’t a manual process of schema upon schema or views upon views; instead, think of it as compiling policy down to native enforcement. That’s half the equation. But of course, as people get into it and start working, they find other data that they’d like to access. So we’ve also been working on a self-service experience for our customers, and we’re trying to put that right into the data discovery tools, so that as you discover new data and think, ah, this could be fit for my purpose, right there you should be able to request access and get connected with the data owner. If the data owner finds that your access and the purpose of access and whatnot meet the requirements, they should be able to grant access as quickly as possible.

So, we’re doing this in a lot of our governance tools as well, because ultimately it’s about going from discovery, to understanding, to getting access as quickly as possible.
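The “compiling policy down to native enforcement” idea can be pictured as a small translator: one abstract rule emitted as both a database GRANT and a lake ACL entry, instead of hand-maintained views upon views. This is a hypothetical sketch of the pattern, not Microsoft’s policy engine; the rule shape and targets are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Policy:
    principal: str            # e.g. an AAD group name
    action: str               # "read" or "write"
    resource: str             # logical dataset name
    condition: Optional[str]  # e.g. "classification != 'PII'" (not compiled here)

def compile_to_sql(p: Policy, table: str) -> str:
    """Emit a native database grant for the rule."""
    verb = {"read": "SELECT", "write": "INSERT, UPDATE"}[p.action]
    return f"GRANT {verb} ON {table} TO [{p.principal}];"

def compile_to_lake_acl(p: Policy, path: str) -> dict:
    """Emit a POSIX-style ACL entry for the same rule on lake storage."""
    perm = {"read": "r-x", "write": "rwx"}[p.action]
    return {"path": path, "acl": f"group:{p.principal}:{perm}"}

rule = Policy("sales-analysts", "read", "sales.orders", "classification != 'PII'")
print(compile_to_sql(rule, "sales.orders"))    # GRANT SELECT ON sales.orders TO [sales-analysts];
print(compile_to_lake_acl(rule, "/raw/sales/orders/"))
```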

Read Maloney:

I can imagine that even with all that automation around what set of data this particular new person, or a person who’s maybe changed roles, should have access to, that’s probably changing literally every day. People changing jobs, people starting, people leaving, et cetera, all being managed. Then you have, I would imagine, thousands of requests coming in for, I want access to this data, et cetera. Is that something that’s all done manually, where you have somebody who goes through and says, yes, this person should get access based on the information you have? Or have you been able to use the technology you’re developing to automate a lot of that?

The Important Thing Is Understanding Your Data

Mike Flasko:

The important thing is understanding your data. If you understand the semantic meaning of the data and its connection to business purpose and sensitivity level, then what we’ve found is that establishing policy to automate access is possible. Short answer, yes, we’re trying to automate as much of that as possible. But if you don’t have a rich understanding of what that element of data in your data lake is, it becomes a risky proposition to automatically grant access. What we found is our maturity model is: get data in, understand it deeply from both a business and a classification context, and then use that context to drive access policy. Ultimately, that access policy is what automates things.
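A toy decision function captures the dependency Mike is pointing at: auto-grant only when the dataset’s classification context is complete and policy allows it, otherwise route to the data owner. All labels, fields, and rules here are invented for illustration.

```python
def decide_access(request: dict, catalog_entry: dict, policy: dict) -> str:
    """Return 'auto_grant' or 'route_to_owner' for an access request."""
    if not catalog_entry.get("classifications_complete", False):
        return "route_to_owner"  # never auto-grant poorly understood data
    sensitive = set(catalog_entry["classifications"]) & policy["restricted_labels"]
    if sensitive and request["purpose"] not in policy["approved_purposes"]:
        return "route_to_owner"
    return "auto_grant"

policy = {"restricted_labels": {"PII", "FINANCIAL"},
          "approved_purposes": {"fraud-detection"}}
entry = {"classifications_complete": True, "classifications": ["PII"]}
print(decide_access({"purpose": "marketing"}, entry, policy))  # route_to_owner
```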

Automatically Grant Access

Read Maloney:

I think what I hear you saying is you’re only able to deliver that level of automation because you guys have put so much work and effort over time into ensuring you know what’s in the lake, the sensitivity classifications in the lake, and who should have access to it, so that when requests come in, you can, in many cases, probably not all, but in many cases, actually just grant them automatically.

Mike Flasko:

That’s precisely it. I think gaining that level of understanding is so key.

Read Maloney:

Don’t grant access if you don’t understand your data and the semantics of the data. That’s step one, and then you can move to step two.

Solid Understanding

Mike Flasko:

I think early on, everybody got so excited, us included, about the potential data lakes offered, right? For the first time you could collect everything, and the cost profile made sense. Then it quickly became, well, if I’ve got everything, how do I get access? There was that missing step of a deep understanding of all sections of the lake. Then we started to hear, “How do I make use of it from the business? How do we turn this into business use?” What we found, and I think you articulated it so well, is it’s really about making sure you have that solid understanding, not just physically of what’s there, the files and the size and whatnot, but of its connection to business purpose and sensitivity level.

Read Maloney:

No, that’s great. One of the things I’m sure you’re dealing with is that you have more and more people within Microsoft who want to use data as part of their jobs. It’s not just the analysts anymore, it’s the business users. Assuming you have that knowledge and you’re then able to pass it on and say, “Oh, well now these business users can even get access to this data.” Let’s say they don’t write SQL, or they don’t have the ability to interact with data the way it’s been done in the past. Are you guys innovating in that way to make it more of a, let’s just say, search perspective or even a chat perspective? Has that been something you’ve been working on and trying to evolve, to help the organization become more data driven?

Finding Data In a Way That Makes Sense

Mike Flasko:

Short answer, absolutely. I think early on, I’m going to say lakes were for the technical half of the organization: data engineers, data scientists, et cetera. Increasingly, we’re trying to make everybody a data consumer in some form. One of the key things we’ve been working on is, how do we create an experience where our business analysts and whatnot can easily find data in a way that makes sense to them, through the language of the business. A lot of innovation has happened recently with large language models and others, so we’re spending a lot of time thinking about how we can bring that technology into our discovery experiences, so people can just reason about and look for data in a way that’s most natural for them, and then potentially also get a consumption and interaction experience created based on what they’re looking for. I think there’s a huge opportunity in connecting this with low-code and no-code experiences.
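A minimal sketch of business-language discovery follows. The bag-of-words “embedding” below is just a stand-in so the example runs anywhere; the approach Mike alludes to would use a large language model’s embeddings and far richer ranking instead. The catalog entries and query are invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an LLM embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

catalog = {
    "finance/ledger_v3": "monthly revenue and billing ledger by product line",
    "growth/funnel": "signup conversion funnel events by marketing channel",
}

query = "where can I find revenue by product line"
ranked = sorted(catalog, key=lambda k: cosine(embed(query), embed(catalog[k])), reverse=True)
print(ranked[0])  # finance/ledger_v3
```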

Read Maloney:

Got it. The idea is, right now they could probably go and do some sort of discovery via search, right? They’re searching, but they’re likely to be searching in their business terminology.

Mike Flasko:

That’s right.

Read Maloney:

The data still might be in its much more technical-level terminology. Are you working to try to bridge that more automatically into business semantics? Or is that something where different business groups will go and apply the semantics, so that as you’re layering in search and chat and these other things, it’s actually contextually aware of whatever that domain is?

Bringing Additional Information and Data Classification Together

Mike Flasko:

Short answer, absolutely. We talked a lot earlier about obtaining additional information about the data, classification and so on. We think of bringing this all together into a data map, which is effectively a knowledge graph about your data. Not just physically what’s there, its partitions and its classifications, but also its business purpose, its business terminology, the entitlements and consent you have to that data, the health of the data, and so forth. We think about feeding all of that in. Imagine you feed all that information into a model and then let people just reason about it with the language that makes sense to them, right? I think that’s ultimately where we need to get. It’s less about whether you have this facet or that facet on the data, and more about: you express what you’re looking for, whether you’re a data scientist or a business user, and then we’ve got all this rich context to reason about that request.
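In miniature, such a data map is just a graph whose nodes mix physical and business metadata. The sketch below shows the kind of traversal a discovery experience, or a model fed with this graph, could run; the nodes, edge labels, and query helper are all invented for illustration.

```python
import networkx as nx

g = nx.DiGraph()
# One dataset tied to its glossary term, classification, owner, and location.
g.add_edge("dataset:sales_orders", "term:Net Revenue", label="means")
g.add_edge("dataset:sales_orders", "class:FINANCIAL", label="classified_as")
g.add_edge("dataset:sales_orders", "owner:finance-team", label="owned_by")
g.add_edge("dataset:sales_orders", "path:/raw/sales/orders", label="stored_at")

def datasets_for_business_term(graph, term):
    """Answer a business-language question by walking the knowledge graph."""
    return [u for u, v, d in graph.edges(data=True)
            if v == term and d["label"] == "means"]

print(datasets_for_business_term(g, "term:Net Revenue"))  # ['dataset:sales_orders']
```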

Read Maloney:

That makes sense. It all starts with having all the knowledge of the data, because you’re almost saying: data comes in, we’re creating this rich set of, for lack of a better term, structured metadata around it. Then on top of that, you’re creating the knowledge graph.

Mike Flasko:

That’s right.

Read Maloney:

Then on top of that, what you’re creating are all these different user experiences to try to enable more and more people to get access to the data. Is that essentially sort of how it’s worked as part of your evolution?

Mike Flasko:

Absolutely. I think you put it really well. At the end of the day, for those of us that work in governance and privacy and whatnot, it’s so important to follow a model like you just described and connect it to the needs of the business, and to get away from the thought that governance is somehow an academic or compliance-only thing. Absolutely not. It provides the context so you can open up your lake to more users.

Read Maloney:

That’s great. Well, Mike, we’ve got just under a minute left. If there’s one thing you could tell everyone who’s managing lakes right now, one thing you want them to be aware of, like, Hey, this is how you should get started in getting better at your governance at scale, what would that be?

Do More With Data

Mike Flasko:

The biggest thing we’ve found is, you’ve got to think about what your maturity model is and how you can articulate an investment as enabling your company to do more with data. The moment we could do that, in the types of ways you were just describing, is when those investments really empowered things to take off inside of Microsoft.

Read Maloney:

Well, Mike, thank you so much for being with us today and sharing that awesome story. I’m sure you’re going to get a bunch of questions about it. If you have other questions, you can ask them online if you’re viewing right now, or you can find us and chat with us on social media, and we’ll try to get to those questions. Thank you so much.
