Subsurface LIVE Winter 2021
The 2021 State of Data Operations: Emerging Challenges in Expanding Cloud Data Ecosystems
More than half of organizations plan to adopt two or more cloud data platforms within the next two years, and the majority plan to use them for sensitive data analytics. Yet, despite the efficiency and deep insights that multiple cloud data platforms provide, what these organizations don’t know about data access governance in a cloud data ecosystem could introduce risk and void their data’s value.
Immuta’s recent survey of data professionals uncovered some of the biggest obstacles to unlocking the value of sensitive data in a data analytics environment comprising multiple cloud data platforms. What did they find? The convergence of sensitive data use, cloud data platform adoption and regulatory enforcement and evolution is approaching at breakneck speed—data teams that want to stay competitive can’t be caught off guard.
Join Sumit Sarkar, Senior Director of Product Marketing at Immuta, to learn more about what data professionals say about the current and future state of data access governance and sensitive data analytics, and how to avoid these challenges when creating a data governance strategy in a cross-cloud data ecosystem.
Sumit Sarkar, Director of Product Marketing, Immuta
Sumit Sarkar is a technology researcher, thought leader and speaker. He has worked in the data access infrastructure field for over 10 years enabling data engineers, data scientists, business analysts and app developers. Sumit's primary areas of focus include hybrid enterprise data management that supports open standards such as ODBC, JDBC, ADO.NET, GraphQL, OData/REST; and automation for privacy & governance for data analytics using privacy enhancing technologies such as differential privacy, k-anonymity, l-diversity, t-closeness and more. Sumit has presented 23 sessions live on stage to this audience at industry events such as Dreamforce, Oracle OpenWorld, MongoDB World, Modern Marketing Experience and Strata+Hadoop World.
All right. Hello everybody. And thank you for joining us for this session. I wanted to go through a little housekeeping before we start. First, we will have a live Q&A after the presentation. We do recommend activating your microphone and camera for the Q&A. There’s a little button on the top right of the dark black part of the screen where you can do that. Please do note that once we start the Q&A, we will be able to hear [00:00:30] and see you live. So just be aware of that. Also, before you leave the session today, we’d greatly appreciate it if you could fill out the survey. It’s super-duper short, maybe take you 20 seconds, but that’s the Slido S-L-I-D-O tab on the top right-hand side of your screen. So with that, I am delighted to welcome our next speaker, Sumit Sarkar, who’s director of product marketing at Immuta. So Sumit, please go ahead.
All right. Thank you, Louise. Hey, everybody. [00:01:00] I guess just a quick intro. I’ve spent a lot of time in the data access infrastructure space. So a lot of my background is wearing a lot of different hats, and I really have a passion for data engineering relations and working with open standards around data access. And so today I want to share some of the research my team has done with data engineers and data architects with the Subsurface community. And today’s national privacy day, so I’ll attempt to break Kendall Jenner’s [00:01:30] privacy or show you how we can do that to keep you guys glued, hopefully, or interested. Let’s just start with a high-level quote: by 2022, public cloud services will be essential for 90% of data and analytics innovation. You’re probably thinking that’s Captain Obvious, everybody knows that. From my experience, from my background, I’ve seen maybe two waves of disruption to cloud data teams.
First with the SaaS applications, if [00:02:00] you guys are old enough to remember the first analytics project with Salesforce data outside of Salesforce. And that really started that wave of disruption with lots of SaaS systems and APIs, which created a lot of headaches. And the second wave, which I believe we’re in now, is more cloud platforms and services. And so some of you are old enough to remember when you first started playing with Redshift and it cost pennies an hour. And I believe that really started this second wave that we’re in. And a lot of this research has expounded on some of the emerging challenges and disruption [00:02:30] to data teams.
All right. So first I’ll level-set on where we are in this cloud data journey. And the research we ran, it was the state of data engineering and operations survey. We ran it in the second half of 2020, and we had about 140 respondents and the concentration was in data engineering and data architecture. And so trend one was more diverse cloud platforms, and so 52% plan to [00:03:00] adopt two or more of these in the next 12 to 24 months. And we defined a cloud platform in the context of this survey, I’ll show you on the next slide what those options are, but it’s a platform or service for analytics like Snowflake or Presto or Dremio. And on the other side, 75% are collecting sensitive data with plans to use it. So that’s the second trend. I think they support that obvious stat about the 2022 prediction in public cloud services, which is at least directionally good.
[00:03:30] As we dig into trend one, the big trend is this expanding cloud data ecosystem, or collection of platforms and services. And when you start to look at some of the top answer choices, there are different types of cloud native services for querying and processing data. So a lot of popular ones for cloud data warehouses, like Synapse, Redshift, and Snowflake, data science platforms like Databricks, and then ad hoc query engines like [00:04:00] Athena, which is another one in the top five. There are other answers beyond that. There’s a lot of them. We got some write-ins for Dremio, so we’ll definitely add that to the canned answers when we run this in 2021. But I expect that many of you will be managing more types of these services going forward, and that’s one of the waves of disruption we’re in right now.
And when we start to look at the second trend, more sensitive data, 75% [00:04:30] of them are collecting it and have plans to use it. The top answer choice is around internal rules instead of privacy. What that means, or at least when I dug into some of these, is that a lot of times data engineers and architecture teams get functional requirements in some cases and may not know the driver in all cases. But the other lens of that is seeing a lot of requirements when you have centralized data lakes and warehouses that you want to isolate tenants by their geography or business [00:05:00] unit. And that is another rule that creates some things that make data sensitive or the conditions for sensitivity. Beyond the big three of data protection laws like GDPR, CCPA, and HIPAA, other examples that I didn’t necessarily realize were so common are data use contracts with maybe third parties. Sharing data with partners to create some analytical products, or maybe they’re sharing with me for some product or service.
Employment [00:05:30] laws. This is more popular, I think more common in the EU. They’re starting to emerge in the US. Some states have rules on how to use biometric data: Illinois, Texas, and Washington. And there’s a variety of industry specific rules. And what I found really interesting when we dug into some of them, after we looked at the data, we called a couple of folks and set up calls. Some of the data engineers working with sensitive data, as they start to unlock it, have a lot of anxiety. They go to bed thinking, did I filter [00:06:00] that dataset correctly? They feel like they may be personally liable under certain clauses of these laws, or they could even go to jail. I don’t know anybody who’s been to jail, but this is just anxiety that I heard.
And so with these trends we uncovered, what are the emerging challenges? With more diverse cloud platforms and more sensitive data, what could go wrong? And so first, this is a set of challenges we asked about [00:06:30] across data management processes. And the top ones were around masking and anonymizing, monitoring and auditing, and classifying and cataloging, for data management challenges. We dug into some of the rationale. I think there are two dimensions of this around maturity. One is from the cloud vendors. When digging in with practitioners, some of these challenges are attributed to a lack of maturity for data management tools, and things [00:07:00] may work okay when you move a couple of use cases, but the data management tools in general are still being adapted to the cloud. And some cloud native tools are still in nascent stages of supporting cloud native services. And so that was one dimension.
The second one was more around where you, or the data teams and analytics professionals and leaders, are with user adoption. So as you start to move one or two use cases to the cloud in the simplest form, moving an on-premise database, some tables to cloud, getting [00:07:30] up to 50 users up, that challenge is rated somewhat challenging to challenging. But as teams mature and they expand use cases with more big data, data science, and maybe data sharing with more classes of users, that’s where we start to see very challenging responses and they move to the right. And so that’s some of the digging in behind some of these chart numbers.
And then if we take that first challenge, masking and anonymizing, and take [00:08:00] a super simple example of what that looks like with one or two platforms. And so here, let’s take a scenario where we need to implement some controls around masking PII for everyone except HR users. We have two platforms, Databricks and Snowflake, for this example. And if you start to look at it, on the left you have an example of Databricks, and on the right it’s Snowflake, but at a high level, to mask the PII data across these, you have to first find the sensitive [00:08:30] fields across the platforms. Then you need a way to maybe tag them or flag them, so you can remember that the rules apply to these. And with Databricks, it’s more of a procedural code approach. You might write Python that transforms some of this data in sensitive columns and replicate that out and then grant users access to that de-identified set, and then HR can use the original copy.
And then with Snowflake you have more of a declarative SQL approach. You can create a masking policy, [00:09:00] apply the policy to the PII columns, and then create a secure view for the non-HR roles. And then you’ve got that working, but then how do you monitor new data assets that get ingested? And what is the strategy to do this across hundreds of tables? And so these are some of the challenges when we take a super simple example. At least in the follow-up interviews I had, these were where some of those challenges were presented. And then just looking at that and [00:09:30] maybe predicting a little bit into the future, some of those challenges really multiply with more cloud data platforms. Of course, the lens of user adoption is also important there, but here we take two platforms, and there’s a bunch of other dimensions that go from that super simple example.
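Editor’s note: the procedural approach described above (find the sensitive fields, flag them, mask them for everyone except HR) can be sketched in plain Python. This is only an illustration, not the code shown on the slide. The column names, the `hr` group name, and the helper functions are all hypothetical, and a truncated unsalted hash stands in for whatever masking function a real platform would use.

```python
import hashlib

# Hypothetical flag list: columns that have been tagged as PII (assumed names).
PII_COLUMNS = {"name", "ssn"}


def mask_value(value: str) -> str:
    """Replace a sensitive value with a truncated SHA-256 digest (illustrative only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_masking(rows, user_groups):
    """Return rows unchanged for HR users; mask flagged PII columns for everyone else."""
    if "hr" in user_groups:
        return [dict(row) for row in rows]
    return [
        {col: mask_value(str(val)) if col in PII_COLUMNS else val
         for col, val in row.items()}
        for row in rows
    ]


# Made-up sample records.
employees = [
    {"name": "Ada", "ssn": "123-45-6789", "dept": "eng"},
    {"name": "Bo", "ssn": "987-65-4321", "dept": "ops"},
]

hr_view = apply_masking(employees, {"hr"})
analyst_view = apply_masking(employees, {"analyst"})
```

Note that the flagged columns are hard-coded here, which is exactly the part that becomes hard to maintain once new tables are ingested or a second platform needs the same rule.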
And you can start to see that’s maybe two policies. And so when you start to look at things like the number of platforms, maybe the number of privacy rules, the geographies [00:10:00] you have to support, some of the number of tables and regulatory rules, regulatory concerns, things like data sovereignty and other things. And then lines of business, you have different customers, internal, external. It’s that combination of 50,000 as an example, but your multipliers may be different, but just to give you a scale of how some of these challenges are expanding from that research.
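Editor’s note: the multiplication behind a figure like 50,000 can be made concrete with a few lines of Python. The individual multipliers below are hypothetical and were chosen only so that the product matches the number quoted in the talk; the slide’s actual values are not in the transcript.

```python
from math import prod

# Hypothetical multipliers (assumptions, not survey data).
dimensions = {
    "platforms": 2,
    "privacy_rules": 5,
    "geographies": 10,
    "tables": 100,
    "lines_of_business": 5,
}

# Each dimension multiplies the number of policy combinations to manage.
policy_combinations = prod(dimensions.values())
print(policy_combinations)
```

Adding one more platform to this made-up mix would raise the total by half again, which is the scaling concern the talk is pointing at.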
So let’s take a challenge with [00:10:30] masking or anonymizing. So I told you I would bring in some celebrity power here. And so our CTO ran this demo just to show an example of the linkage attack, and it will attempt to break the privacy of Kylie Jenner. And so in this picture from some kind of an online magazine, you have some information about Kylie Jenner: her location, bike ID, and date. And I’m not an expert on celebrities, [00:11:00] but I Wikipediad her, and she’s an A-minus celebrity, I think. Something to do with Kim Kardashian, but re-identifying her, for people who know more than I do, is probably pretty cool, and probably something she doesn’t want. And if you start to look at how we do that with just those kinds of pieces of information, well, there are some public data sets out there, like the Citi Bike trip history in New York City, for example. That’s a dataset we can download and then start to analyze, nothing [00:11:30] too offensive in there. But as an example, we’ll look at some of these things like the details around ride history.
And so this is an example of a linkage attack, right? So you have information about Kylie Jenner and trip history location from this photo. And then you have a dataset with the transactions across the public data set without any linkage to the person, it’s just those fields that are more indirect identifiers. [00:12:00] But there’s a lot of people taking rides around New York. So how hard can this be? Let me take a look at the data set. And we start to query by the bike ID, which we saw in the picture. We get some number of records. We get 37 rows. And so it’s a little bit hard to make a reasonable guess, as there’s a one out of 37 chance that any of these records identifies the individual. So that’s too much to infer any information, [00:12:30] but if we start to dig into it, we have those other pieces of information, and then we can start to add in the date we learned, the gender, and the user type. We end up with three records.
And maybe if you’re familiar with this part of New York City, I think it’s in Brooklyn, you can kind of figure out the street. And so then the risk of re-identification increases from this data. Nothing crazy here. I don’t think Kendall Jenner cares if we know this information, but there are a lot of challenges here in terms of other [00:13:00] use cases. I think the governor of Massachusetts is a famous story: his medical record got re-identified, a Harvard researcher re-identified him. There are lots of stories we hear from organizations about HR attacks on finding people’s salaries. And there’s a whole host of these things. Insurance companies have celebrities and are protecting that PII, a whole bunch of areas that are more real to your organization. And so when you summarize, what do [00:13:30] you do about it?
There are really two challenges with these kinds of data protection strategies. One is you can share the data and just mask all direct and indirect identifiers. But like in this example, it’s not too useful anymore, so why even share it? And the second challenge is what happens if you share this and you do re-identify somebody, then are you violating some of the clauses of different data protection laws? [00:14:00] For example, did we properly de-identify that data since the linkage attack was possible? This gets into lawyer stuff, and I don’t like talking to lawyers. So these are just questions that are open in general about these challenges, but there are a lot of strategies to address these, using privacy controls and automation. I’m leaving this as an open-ended challenge. We’ll dig into some ways to solve it as we go on.
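Editor’s note: the narrowing step in a linkage attack can be sketched with a handful of records. Everything below is synthetic: the field names, values, and resulting counts are made up for illustration and are not the public trip data or the demo’s actual query.

```python
# Synthetic trip records; every value here is invented for illustration.
trips = [
    {"bike_id": 111, "date": "2021-01-05", "gender": "F", "user_type": "Customer", "start": "Station A"},
    {"bike_id": 111, "date": "2021-01-05", "gender": "M", "user_type": "Subscriber", "start": "Station B"},
    {"bike_id": 111, "date": "2021-01-06", "gender": "F", "user_type": "Customer", "start": "Station C"},
    {"bike_id": 111, "date": "2021-01-05", "gender": "F", "user_type": "Subscriber", "start": "Station D"},
    {"bike_id": 222, "date": "2021-01-05", "gender": "F", "user_type": "Customer", "start": "Station E"},
]


def candidates(records, **known):
    """Keep only records matching every indirect identifier the attacker knows."""
    return [r for r in records if all(r[k] == v for k, v in known.items())]


# Step 1: filter on the bike ID alone.
by_bike = candidates(trips, bike_id=111)

# Step 2: add the other indirect identifiers learned from outside the dataset.
narrowed = candidates(
    trips, bike_id=111, date="2021-01-05", gender="F", user_type="Customer"
)

print(len(by_bike), len(narrowed))
```

No single field identifies the rider, but each extra field shrinks the candidate set, and the remaining record exposes a field the attacker did not already know (here, the start station).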
So that was one example of emerging challenges and different lenses of that which may apply to you. [00:14:30] But after talking to some of the data engineers and architects who responded to these challenges and digging into why they responded the way they did, we learned a little bit more, some patterns, reflected in these buckets here. And things like dynamic access control, you’ve got discovery and classification, and data privacy, which is a subset example of the previous section. And so when we look at these and you [00:15:00] start to look at modern approaches to scale some of the data access security and governance, a good indicator of what people are doing is to look at big tech companies and their engineering blogs. So one I looked up was Uber Engineering, which published about their uMetric platform, and they published a case study on metric standardization.
Things like, how do we create standard metrics across different businesses like UberX, Pool, Eats, things like that. And so each team is creating their own pipelines and [00:15:30] making their own metrics. And one of the challenges they run into is inconsistent fine-grained access control. And so some of the things on the left and right, dynamic access control and consistent data privacy, were part of the challenge. And so they would have to figure out a way to replicate consistent policies across raw data, accurate data and across these teams. So it’s a good project to look at if you’re looking for another data point on how other organizations address some of these.
[00:16:00] When we look at digging into some of the more specifics around access control, more than 80% of the respondents are using role-based access or all-or-nothing access control. So these are things that we’ve been doing for 30, 40 years from Hadoop and relational databases. And so this is not something that’s new, and all or none is more coarse-grained. If you have a social security number in some resource, then you can’t see any of it, so it’s not very helpful. [00:16:30] And so some of the strategies that we’re suggesting and hearing from practitioners to scale is to look at different approaches, more like a dynamic way to manage policies. And so as an example, one of the organizations we talked to is looking at more Ranger-based approaches. And so that’s a popular Hadoop-based access control project that worked great for the Hadoop ecosystem and supports more of the static attributes to drive policies.
[00:17:00] For example, companies shared that they have a policy for each state and combination of states, and that ends up with a lot of policies to manage. And as they move to the cloud, this becomes very hard. And so the modern way to do this would be more variable approaches. It’s a variable attribute-based access control model. And you can see that it’s making decisions at query time. And so using a variable, you can write one policy that just says only show rows where state is in this user’s states. And so this is a way in the cloud, where [00:17:30] you have a lot more diverse users and users accessing different services, to scale adoption in a dynamic way. So you don’t have to manage these and have risk of keeping a log in or managing roles and leaks.
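Editor’s note: a minimal sketch of the variable, attribute-based idea, assuming each user carries a `states` attribute that is looked up at query time. The user names, attribute store, and function below are hypothetical and do not represent any particular product’s policy engine.

```python
# Hypothetical user attributes, e.g. as supplied by an identity provider.
USER_ATTRIBUTES = {
    "alice": {"states": {"OH", "MI"}},
    "bob": {"states": {"TX"}},
}


def rows_visible_to(user, rows):
    """One policy for all users: only show rows whose state is in the user's states.

    The user's attributes are resolved when the query runs, so changing a
    user's states changes what they see without adding a new role or policy.
    """
    allowed = USER_ATTRIBUTES.get(user, {}).get("states", set())
    return [row for row in rows if row["state"] in allowed]


# Made-up sample rows.
sales = [
    {"state": "OH", "amount": 10},
    {"state": "TX", "amount": 20},
    {"state": "MI", "amount": 30},
]

alice_rows = rows_visible_to("alice", sales)
bob_rows = rows_visible_to("bob", sales)
unknown_rows = rows_visible_to("carol", sales)
```

The contrast with the static approach is that a role per state, plus one per combination of states, is replaced by a single rule parameterized by the user’s attribute; an unknown user sees nothing by default.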
And so the users, it’s just a very different workload from the previous ecosystems, from databases and warehouses and Hadoop. That’s one modern approach to address some of these challenges. And then 45% of data teams either use a homegrown data catalog or don’t have [00:18:00] a catalog at all. A lot of platform leaders I talk to, especially in larger companies, these fantasy catalogs has being owned in a separate group sometime to pan or organize, or maybe they just really scrape some of the metadata using some API and try to operationalize some of that. So it’s not necessarily always something in the wheelhouse of the analytics team, they’re more of an influencer or collaborator. But when we look at how some data teams are managing it, they’re managing technical metadata in their platform, maybe in the service [00:18:30] or at a higher level metadata store, but it’s embedded into some specific technology.
And on the left is an example. In BigQuery, you can discover and classify some of these and you manually tag some of the sensitive fields you want to drive policies on, but that’s specific to BigQuery. And we saw the stat earlier that everybody’s expanding to multiple services. And so how do you tag this metadata and use it and extend it to other systems across [00:19:00] your cloud ecosystem and footprint? So the recommendation and approach that we see is to centralize some of the metadata stores so that it works across your platforms and services. In some cases that centralized store may feed in data from existing centralized metadata investments, like Alation or Collibra or Catalog. And so centralizing that, and operationalizing that across your cloud ecosystem, really reduces the challenge of managing all this in [00:19:30] an automated way.
And third, 64% say masking and anonymizing data is challenging to extremely challenging, probably not related to the Kylie Jenner example, but there are definitely challenges emerging in the cloud as you share more types of data with more types of people. So we take the example from earlier, the consistent data privacy approach, and you’re just masking one table and a couple of columns with Databricks, writing the ETL code in [00:20:00] Python using Spark UDFs, and kind of hard coding specific columns. That’s one approach, but then how do you take that logic and automate and replicate that policy logic across your platform, similar to the previous concept? Because it just doesn’t scale when you introduce more columns, and a lot of these data platforms and lakes have thousands of tables. And so you really need to think about more global policies and think more about cross-platform [00:20:30] security and privacy controls, using, I think, the popular single pane of glass concept of managing these across your data ecosystem.
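Editor’s note: one way to picture a global policy is a rule written once against a tag rather than against named columns. The sketch below assumes a central tag store keyed by table and column; the table names, tags, and functions are hypothetical illustrations, not the API of any catalog or governance product.

```python
# Hypothetical central metadata store: (table, column) -> set of tags.
COLUMN_TAGS = {
    ("customers", "email"): {"pii"},
    ("customers", "plan"): set(),
    ("claims", "patient_name"): {"pii"},
    ("claims", "amount"): set(),
}


def masked_columns(table, policy_tag="pii"):
    """Columns of `table` covered by a global 'mask anything tagged X' policy."""
    return sorted(
        col for (tbl, col), tags in COLUMN_TAGS.items()
        if tbl == table and policy_tag in tags
    )


def apply_global_policy(table, row, exempt=False):
    """Mask every tagged column in a row unless the caller is exempt."""
    if exempt:
        return dict(row)
    hidden = set(masked_columns(table))
    return {col: ("***" if col in hidden else val) for col, val in row.items()}


customer = apply_global_policy("customers", {"email": "a@example.com", "plan": "pro"})
claim = apply_global_policy("claims", {"patient_name": "Ada", "amount": 120})
```

With this shape, onboarding a new table means tagging its columns in the central store; the masking rule itself does not change, which is the scaling property the hard-coded ETL approach lacks.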
And to wrap up here with more of a public data architecture example. We talked about Uber a little bit here. We’ll talk about the Center for New Data, and they have a pretty greenfield cloud platform, and they needed to leverage some of the… Look at automation around the data access [00:21:00] governance capabilities. And so the COVID Alliance research platform, it’s a project of the Center for New Data, a non-profit group, and they empower really smart volunteers to accelerate research around the COVID-19 response, and their platform really democratizes that data for scientists and experts studying the disease. So this data platform is built on Snowflake and it serves a very diverse set of data consumers. And so that’s really what’s driving some of these challenges we talked about earlier from the survey, the anonymization, [00:21:30] classification and access control.
They happen to leverage Immuta, which is my employer, as part of the automation of the security and privacy controls. And so that’s really enabling them to onboard the processing of data use agreements and the workflows and the anonymization and security controls in more of an automated way, so with mathematical guarantees, they don’t have to worry as much about some of the risks that are outlined whenever you share data, whether internally or externally. [00:22:00] And so that’s more of a real-world example, and there’s a public article on Medium from their engineering team. There’s obviously more to their end-to-end story beyond the access layer that I focused on. It’s a good read if you’re looking at some of these same challenges as you work with sensitive data across the cloud.
And finally, to wrap up, I’m employed by Immuta. We were a sponsor of the event. This is just to give you a flavor [00:22:30] for some of the companies we work with, and some of our partners and recognition. The research we ran wasn’t specific to the Immuta problem space, so we don’t claim to solve all these challenges, but I do encourage you to explore Immuta as an event sponsor. And here’s my contact information. I thought it was clever. You may or may not. But you can email me, and I think there’s also a Slack channel where we can talk as well after this event. So I’m going to stop sharing [00:23:00] and pass this back over to Louise.
Yep. Thanks Sumit. Had a great presentation. So with that, we’re going to go ahead and open it up for Q&A. If you’d like to ask a question, please use the button in the upper right-hand side to share your audio and video. You’ll be automatically put in the queue. You can also, if you’re having problems with your audio video, feel free to go ahead and ask the question in the Q&A. [00:23:30] So let’s see here. All right. Any questions? Don’t be shy. A quick question for you while we’re waiting for those to queue up, Sumit. Where can we find the survey results? It was supposed to be you have it documented in a report somewhere. Is that on the Immuta homepage or where can we find that?
Yeah, I think there’s a link on the Immuta home page, and I think we have a booth that’s sponsoring this, so you might be able to grab it from there. Or you can go to the [00:24:00] Immuta site and go to resources and you can download it there. It’s not on the homepage, but you can find it.
Okay. Great. All right. Any questions, guys? Quiet group this time.
All right. Yeah, no worries. It’s a high level research project. So if you have questions afterwards in Slack and you have my contact info as well.
[00:24:30] Let’s give it another couple seconds here. Looks like we had a question from Jessica, but when I clicked on her she disappeared. So Jessica, if you do still have a question, please go ahead and put that into the chat. Sometimes the video doesn’t start. Great presentation. Good. So you’re getting great feedback. A question here from Jessica. All right, thanks Jessica for typing it in. Were respondents solely data engineers or a variety of roles? And then she just, in parenthesis, sorry [00:25:00] if I missed that.
Oh yeah. It’s good question. So it’s a mix of roles that leans heavily on engineers and data architects, but there’s other kind of data professionals, there’s some governance roles and security folks as well in there. I would say the majority of it was focused on data analytics teams, more on the supply side of the data value chain.
Okay. All right. That looks like it for questions. As a reminder, Sumit will be [00:25:30] available in the Slack channel if you have more questions. Sumit, it might be a good idea just to post your link to the survey. Or as Sumit said, you can go to the Immuta booth and go get your free copy there as well. I’m sure they have fun giveaways, just like all of our vendors do. The expo is currently open. So please go check that out. So just a reminder before you leave, if you could please fill out the Slido survey in the top right there, we’d greatly appreciate it. Should just take you maybe [00:26:00] less than 20 seconds. It’s super quick.
We’re now going to have our virtual lunch break. So it’s a great time to grab a snack or go meet with the vendors. As I said, a lot of them have great giveaways. They’re giving some fantastic demos, a chance to meet with their executives, that sort of thing. So big thank you to our speaker today. Sumit, really appreciate it. And thanks to Immuta for sponsoring. We will see you in the next session. And I think it’s in about 20 minutes. In the meantime, enjoy the conference.
[00:26:30] Thank you, Louise. Do I get a lunch ticket?
Lunch is free. You can go to the kitchen.
Free on you.
Miss those days, huh?
All right. Cheers. Thanks everybody for joining. Really look forward to engaging.
Cheers. Bye guys.