
Subsurface LIVE Winter 2021

Designing Performant, Scalable, and Secure Data Lakes

Session Abstract

In this session you will learn the do's and don'ts of building enterprise data lakes. We'll discuss the commonly used patterns, how to set up your pipelines to maximize performance, how to organize your data, and various options to secure access to your data. We will cover generalized best practices of data lake design, as well as specifics of how to implement using Azure Data Lake Storage.

Presented By

Rukmani Gopalan, PM Manager, Microsoft Azure

Rukmani Gopalan is a Principal PM Manager at Microsoft. She works on the Azure Data Lake Storage team where she helps customers build their big data analytics solutions on Azure.

Webinar Transcript

Emily Lewis:

Thank [00:01:00] you for joining us for this session this afternoon. A reminder, we will have a live Q&A after the presentation. We recommend activating your microphone and camera for the Q&A portion of the session. Please join me in welcoming our speaker Rukmani Gopalan, PM manager at Microsoft Azure. Let’s get started.

Rukmani Gopalan:

[00:01:30] Hello everyone. Great to be here with you all today. First off, a huge shout out to the Dremio organizers for making this a very smooth experience. What do you think the number 59 signifies? According to an IDC study, the amount of data that was created, captured or consumed in 2020 is estimated to be around 59 zettabytes. [00:02:00] And this number is estimated to grow to 175 zettabytes in just five years, by 2025. For reference, one zettabyte is one followed by 21 zeros, that many bytes. That is the kind of number we’re talking about when it comes to data proliferation and the opportunity that all this data presents. My name is Rukmani Gopalan. I’m a PM manager on the Azure Data Lake Storage team at Microsoft.

We work with our customers across various industries, [00:02:30] manufacturing, retail, financial services, consumer goods, and healthcare to name a few. Understandably, they have all come with slightly different problems to solve because they’re from different domains. But one thing remains the same: all of them want to gain valuable insights from their data, and they want to use these insights to both inform and transform their business. A large financial services customer built their big data platform on Azure to compute risk models over their entire [00:03:00] investment portfolio. One of our largest consumer goods companies uses our modern data platform offerings to track food every step of the way from the farm to the fridge, using these powerful insights to minimize wastage across their supply chain. Especially in these very strange times that we live in during COVID-19, our customers are relying on data even more to inform and drive their critical business operations.

As part of helping our customers [00:03:30] through their data analytics journey, we see some topics coming up repeatedly on how best to design their enterprise data lakes on Azure. In this session, I will be presenting a digest of these learnings that provide a framework to help you design your data lake. I’ve divided this talk into three sections. First, we’ll briefly go through the basic concepts of a data lake architecture. Then we’ll start diving deep into the specifics of designing a data lake from a storage [00:04:00] perspective. And finally, we’ll discuss strategies for optimizing your architecture to achieve the best performance and scale while maintaining cost effectiveness.

So first, let’s talk about what a data lake approach is. Most of our enterprises have now evolved from their traditional data warehouse-based architecture, which always started with a business problem first. Based on the business problem, we then went and got the data that was required to solve that business [00:04:30] problem.

The design was based on highly structured data that was purpose-built, and this was used by business analysts primarily to gain visibility into a business problem. They have now evolved into a data lake architecture that always starts with a data-first approach. The very premise of a data lake-based architecture is to ingest data in its raw, natural form, assuming that all data is potentially useful and having no restrictions on the source, size or [00:05:00] format of that data. This data is then schematized and processed via various computational processes to extract high-value insights. And these insights can now be used to power dashboards, be queried by business analysts, or used by data scientists for exploratory analysis or building machine learning models. This can even be fed back into applications to build intelligent behavior inside your applications.

At the heart of this architecture is Azure Data Lake [00:05:30] Storage. Azure Data Lake Storage is a highly scalable data store that can serve all the above scenarios without having to move data back and forth between different stores for different purposes. The way we accomplish this is, in addition to being a very robust storage layer, we also have very rich integrations with an ecosystem of partners, both Azure services and the very strong data analytics ecosystem that’s always growing.

We are tightly integrated with Dremio’s [00:06:00] data lake engine that allows you to get the query performance that you want on a data lake, without having to move data into a specialized data warehouse, enabling you to implement what is now popularly becoming known as the lakehouse pattern, running your data warehouse on a data lake. Having a single platform as your data store eliminates data silos, and it not only lowers your total cost of ownership, but it also enables you to correlate completely different datasets, enabling [00:06:30] scenarios that you might not have otherwise thought of. If you want to take away one thing from this session, it’s essentially: start your data lake journey by landing all data in your data lake, and use it for any purpose that you want after that. The possibilities are infinite.

Now let’s talk about the specifics of the various considerations in designing your data lake. As I said, a common set of topics comes up in most of our customer conversations. And these are: how [00:07:00] do I set up my data lake? How do I organize data within my data lake? How do I secure data? And how do I do all this while maintaining cost effectiveness?

Now, before we start talking about how to set up your data lake, let’s think about how organizations are typically structured. You have various business units, say HR or marketing or finance, and they have their own kind of data and their own scenarios they want done on their data. In addition, there is a central team, it [00:07:30] could be called a data infrastructure team or a data platform team, or a data engineering team, let’s just call it a data team, that is responsible for the data strategy for the whole organization.

Typically, these data teams take one of two approaches, or a hybrid of the two, which is: should I have a centralized data lake or a federated data lake? What happens in a centralized data lake strategy is that data management and administration, in addition to the infrastructure and architecture, is all done by [00:08:00] the central data platform team. They basically own all the resources, and they manage and administer those resources. And in some cases they also ingest data from the various sources and do the data preparation, so data is readily available to your various business units for consumption.

On the other end, you have a federated data lake, where the central data team basically defines the architecture and a set of governing principles in terms of technology [00:08:30] decisions, what services to use, what platform to use, what kind of access management or security needs to be implemented. Based on those, they define the blueprint, and the individual business units then go ahead and implement and operationalize the data lake. From an Azure Data Lake Storage perspective, we support either of these architectures just the same. Whether you choose a centralized or federated strategy depends on factors that are partly technology oriented and partly [00:09:00] about how your organization is structured and what the culture is. Do you have the right set of teams in your business units? Do they have the right expertise to run their data? Or is it better for the data team to do it all and essentially have it ready for consumption?

We talked about business unit boundaries. We also have regional boundaries where, especially in a global organization, every region comes with its own set of requirements and compliance needs. So you can also have a federated data lake architecture that is [00:09:30] based on a region boundary as well.

Regardless of which approach you pick, one key recommendation that we give all our customers is: promote sharing of data and avoid data silos within your organization. We are richly integrated with an ecosystem of data governance providers, including the Azure service that we released in public preview called Azure Purview. So, make sure that you catalog your data and enable data to be discovered [00:10:00] by all parts of your organization alike.

Next, let’s talk about how you organize your data within a data lake. Data within a data lake follows a life cycle, so let’s take a look at that life cycle. As we talked about, the very premise of a data lake is ingesting data in its raw, natural form. So this data lands in the data lake as raw data. Then there is a set of cleansing or preparation steps that needs to happen, say removing duplicates or [00:10:30] fixing or rejecting ill-formed data. This data is typically called enriched data. The enriched data then goes through a rigorous set of computational processes, with say Spark or Hive, where you are doing complex aggregations and filtering to extract the high-value data. We call that the curated data. And it’s the curated data, containing your insights, that is used to power your dashboards or do all the advanced scenarios that we talked about.

In addition to this natural [00:11:00] life cycle of data, typically, especially when you have data scientists in your organization, they want to start working with datasets that they bring as well. So there is also a set of data in your data lake that is kind of your scratchpad or workspace data, which enables the bring-your-own-data scenario.

When you think about organizing data within your data lake, you can think of organizing it into different zones. And for organizing data into different zones, Azure Data Lake Storage has [00:11:30] a hierarchy that you can leverage. At the very top, you have a storage account, which is a resource that you can manage. Within a storage account, you have different containers that are organizational units. And within a container, you have a file and folder structure that lends itself naturally to Hadoop File System-based workloads, which most of our open source workloads, say Spark, are built on. So, when you choose to organize data within your data lake, you can, say, organize your raw data [00:12:00] as a separate storage account, or you can have a storage account for your whole data lake and organize raw data as, say, a container or a folder within your data lake.
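To make the zone idea concrete, here is a minimal sketch of one possible path convention for the account/container/folder hierarchy just described. The zone names follow the raw/enriched/curated/workspace life cycle from the talk, while the source/dataset/date segments are purely hypothetical, one convention among many:

```python
from datetime import date

# Hypothetical zone layout: one top-level folder (or container) per zone,
# with a source/dataset/date convention underneath.
ZONES = ("raw", "enriched", "curated", "workspace")

def zone_path(zone: str, source: str, dataset: str, day: date) -> str:
    """Build a data lake path like 'raw/sales/orders/2021/01/28'."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{source}/{dataset}/{day:%Y/%m/%d}"

print(zone_path("raw", "sales", "orders", date(2021, 1, 28)))
# raw/sales/orders/2021/01/28
```

The same names could equally map zones to separate containers or even separate storage accounts, as the talk notes; the convention matters more than the level at which it is applied.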

Now, let’s talk about how you secure data within the data lake. We have a set of security mechanisms that help you secure data within the data lake, and we take a layered approach here. At the very perimeter, we have network security, which lets you determine what kind of traffic is even allowed to talk to your data lake. [00:12:30] Azure has this concept of VNets, or virtual networks, which are a logical representation of your network within Azure. A good practice here is having all your computational resources behind a VNet and using features such as service endpoints or private endpoints to allow traffic only from specific VNets into your data lake. So, a deny-by-default model of just not having your data lake exposed to all traffic is a good practice in general.

The [00:13:00] next layer of security is authentication, that is, who gets to talk to your data lake. We support authentication based on who you are as well as what you have. The who-you-are part is accomplished by a tight integration with Azure Active Directory, where you can use Azure Active Directory identities, both user principals as well as service principals and managed identities, to talk to the data lake. We also support authentication in the form of shared keys and SAS tokens, [00:13:30] which are essentially based on having a secret, which is a shared key, or a time-bound token, which is a SAS token, to authenticate yourself to the data lake. We highly recommend using Azure Active Directory based authentication because, one, you’re able to leverage the rich set of security features that come with Azure Active Directory, such as Multi-Factor Authentication. And also you have the auditability of being able to determine who exactly talked to your data lake.
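As a small illustration of the time-bound nature of SAS tokens mentioned above, the sketch below checks the `se` (signed expiry) field that Azure SAS tokens carry in their query string. The token here is a made-up fragment; real validation is done by the service itself, including verifying the signature, which this toy check ignores:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

def sas_expired(sas_token: str, now: datetime) -> bool:
    """Check only the time-bound 'se' (signed expiry) field of a SAS token."""
    params = parse_qs(sas_token.lstrip("?"))
    expiry = datetime.strptime(params["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    return now >= expiry.replace(tzinfo=timezone.utc)

# Hypothetical token fragment; a real SAS token also carries a valid signature (sig).
token = "sv=2020-08-04&se=2021-02-01T00:00:00Z&sp=rl&sig=..."
print(sas_expired(token, datetime(2021, 1, 28, tzinfo=timezone.utc)))  # False
```

This is also why Azure AD authentication is the stronger recommendation: a leaked SAS token is valid for anyone who holds it until `se` passes, whereas AD identities can be revoked and audited centrally.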

The next layer is authorization, [00:14:00] which determines who has access to what resources. We support two forms of authorization. We support a coarse-grained form of authorization with role-based access control (RBAC), at the container or the storage account level in the hierarchy that we saw just before. And continuing our friendliness with Hadoop File System friendly workloads, we also support POSIX ACLs, which most of our customers are used to having in their on-premises or Hadoop-based [00:14:30] big data analytics platforms. So, we support POSIX ACLs, which give you fine-grained access control for file- and folder-level security.
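A simplified sketch of how a POSIX-style ACL check resolves, owner first, then group, then other. Real ADLS ACLs also support named users, named groups, and a permission mask, which this toy model omits, and the names here are hypothetical:

```python
def allowed(acl: dict, user: str, groups: set, perm: str) -> bool:
    """Simplified POSIX-style ACL check: owner wins, then group, then other."""
    if user == acl["owner"]:
        return perm in acl["owner_perms"]
    if acl["group"] in groups:
        return perm in acl["group_perms"]
    return perm in acl["other_perms"]

# Hypothetical ACL on a curated-zone folder: owner rwx, group read-only, no public access.
folder_acl = {
    "owner": "rukmani", "owner_perms": "rwx",
    "group": "data-eng", "group_perms": "r-x",
    "other_perms": "---",
}
print(allowed(folder_acl, "alice", {"data-eng"}, "r"))  # True
print(allowed(folder_acl, "bob", {"marketing"}, "r"))   # False
```

RBAC then sits above this as the coarse-grained layer: a role assignment at the container or account level can grant access regardless of what the file-level ACLs say.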

Finally, we offer a set of data protection capabilities on the data stored in the data lake as well. All data in Azure Data Lake Storage is encrypted both at rest and in transit. And for encrypting data at rest, you can use either Microsoft-managed keys or you can bring your own keys and use those to encrypt the data.

At [00:15:00] the heart of any security system is having visibility into what is going on. We have diagnostic logs, which give you a lot of information on exactly what is going on in your data lake. You can use these diagnostic logs, which are now integrated with Azure Monitor, to query the data and also set up alerting, ensuring that you’re getting information from both a push and a pull model on the activities that go on inside your data lake.

[00:15:30] Finally, let’s talk about the very important topic of how you manage cost. We talked about the data lake strategy as essentially storing all the data under the assumption that all data is useful. At the same time, we run the risk of what we call a data swamp, where data grows to an uncontrollable extent, overwhelming us and outweighing the usefulness we get out of this data. Azure Data Lake Storage offers a tiered storage model to help manage [00:16:00] your cost. We have hot, cool and archive tiers, and the price of data at rest goes down as the tiers get cooler. You can have automated life cycle management policies: when you have data that you feel is useful at some point in time but not immediately, you can move that data into a cooler tier, and you can also set retention policies to automatically delete data after a certain time period to control this explosive data growth, avoiding data swamps.
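The hot-to-cool-to-archive-to-delete progression can be sketched as a simple age-based rule. The day thresholds below are hypothetical, not Azure defaults, and in practice this logic lives in a declarative lifecycle management policy on the account rather than in code you run yourself:

```python
from datetime import date

def pick_tier(last_modified: date, today: date,
              cool_after_days: int = 30, archive_after_days: int = 180,
              delete_after_days: int = 730) -> str:
    """Hypothetical lifecycle rule: hot -> cool -> archive -> delete, by data age."""
    age = (today - last_modified).days
    if age >= delete_after_days:
        return "delete"
    if age >= archive_after_days:
        return "archive"
    if age >= cool_after_days:
        return "cool"
    return "hot"

print(pick_tier(date(2020, 1, 1), date(2021, 1, 28)))  # archive
```

The retention branch (`delete`) is what keeps "store everything" from quietly becoming the data swamp described above.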

[00:16:30] Finally, let’s talk about strategies for optimizing your data lake for performance and scale. When it comes to a storage system, performance and scale go hand in hand. We look at the work, regardless of what computation happens on your data lake, as transactions. Understanding these IO patterns and optimizing the storage for the right set of patterns is going to get you great performance, because your transactions are highly optimized. And that also lets you scale [00:17:00] to running hundreds of thousands of nodes of compute on a single data lake account in a heartbeat. While there is no 12-step process to optimizing your data lake, there are a few basic considerations that can help you understand your IO patterns.

One, are you optimizing your transactions for high throughput? For your reads and writes, we want to target getting at least a few hundred MBs per transaction, the higher the better, so you’re maximizing the data [00:17:30] that you get out of storage with every transaction.

The second consideration is that you want to optimize for data access patterns. You want to get to the data that you want as quickly as possible, which means getting to the file that you want as quickly as possible, and also reading only the data that you want to read, even within a bigger file. How do you accomplish this? Let’s take a look at a few key strategies. This is not a comprehensive set, but it gives you a really good starting point.

First, pay attention to your file sizes [00:18:00] and formats. When you have a data lake with a lot of small files, you’re incurring a huge performance overhead. Why is that? Reading a file in a data lake involves a per-file overhead of what we call metadata operations. You have to scan the file system to find that file, that’s overhead one. And you’re also doing a set of checks, like whether you have the right access to the file, before you get to read the data. And given [00:18:30] that you want to maximize throughput, you want to minimize this metadata overhead, because for every file that you read, you want to get as much data as possible. So avoid small files in your data lake.
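A back-of-the-envelope model makes the small-file penalty vivid: every read pays a fixed metadata cost (lookup plus access checks) before any bytes flow. The overhead and bandwidth numbers below are made up purely for illustration:

```python
def effective_throughput_mb_s(total_mb: float, n_files: int,
                              per_file_overhead_s: float = 0.05,
                              transfer_mb_s: float = 200.0) -> float:
    """Model: each file read pays a fixed metadata cost before any bytes flow,
    so reading the same data as many small files dilutes throughput."""
    transfer_time = total_mb / transfer_mb_s
    metadata_time = n_files * per_file_overhead_s
    return total_mb / (transfer_time + metadata_time)

# The same 10 GB, stored as 40 x 256 MB files vs 1,000,000 x 10 KB files:
print(round(effective_throughput_mb_s(10_240, 40), 1))         # 192.5 MB/s
print(round(effective_throughput_mb_s(10_240, 1_000_000), 1))  # 0.2 MB/s
```

With hypothetical numbers the large-file layout runs near the wire speed, while the small-file layout spends almost all its time on metadata, which is the "avoid small files" guidance in a nutshell.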

And second, even within a file, picking the right format is going to give you better performance and lower cost. So, let’s take the example of Apache Parquet, which is gaining popularity as a columnar file format that has rich integrations into [00:19:00] pretty much all our analytics tools, including Azure Databricks and Azure Synapse Analytics.

Parquet is a columnar format where, at a very simplistic level, data from the same column is stored together. So you have data stored in row groups, and within them, data from similar columns is stored together. Because you’re storing similar data together, you get very high compression, lowering the cost of [00:19:30] the data that you want to store. In addition, when you’re reading data within a Parquet file, you’re not reading the whole file to seek the set of data that you want. You’re basically going through these row groups of data, skipping groups depending on whether they have the right set of data that you want or not. So when you are ingesting data into your system, make sure that you’re able to ingest data in the optimized format and with the right sizes. And even if you’re not able [00:20:00] to do that, remember we talked about the data preparation phase. You can take that as an opportunity to create better file formats with larger files, avoiding those small files, to get better throughput out of your system.
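The row-group skipping described above can be mimicked with a toy in-memory layout. As in a real Parquet footer, each group carries min/max statistics per column, so a reader can prove some groups irrelevant without touching their data and fetch only the projected column; the sensor schema here is invented:

```python
# Toy row groups with per-column min/max stats, like a Parquet footer holds.
row_groups = [
    {"min": {"temp": 10}, "max": {"temp": 19},
     "data": {"temp": [10, 15, 19], "humidity": [40, 42, 41]}},
    {"min": {"temp": 20}, "max": {"temp": 29},
     "data": {"temp": [20, 25, 29], "humidity": [50, 55, 53]}},
    {"min": {"temp": 30}, "max": {"temp": 39},
     "data": {"temp": [30, 35, 39], "humidity": [60, 61, 66]}},
]

def read_column(groups, column, lo, hi):
    """Read one projected column, skipping row groups whose stats rule out the range."""
    out, groups_read = [], 0
    for g in groups:
        if g["max"]["temp"] < lo or g["min"]["temp"] > hi:
            continue  # stats prove no row in this group can match
        groups_read += 1
        out += [v for t, v in zip(g["data"]["temp"], g["data"][column]) if lo <= t <= hi]
    return out, groups_read

values, groups_read = read_column(row_groups, "humidity", 20, 29)
print(values, groups_read)  # [50, 55, 53] 1  -- two of three groups never read
```

Compression works the same way in spirit: because each column chunk holds similar values, it encodes far smaller than row-oriented storage of the same data.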

Next, let’s talk about avoiding unnecessary scanning of data and getting to the data that you want as quickly as you can. One strategy that we use is partitioning. We’ve heard a lot of talk at the Dremio Subsurface conference [00:20:30] about Iceberg, and you’ve heard about Delta Lake, for example. All of these are semantic organizations of data, and when it comes to your storage system, it boils down to a file and folder structure. All of these offer some kind of partitioning strategy, and what you want to make sure is that you’re optimizing your partitioning strategy for the access patterns that are most common. Let’s take an example where you have a big data system that is ingesting [00:21:00] data from various sensors, and they emit metrics every so often, say every few minutes. There are different ways to organize this data with a partitioning strategy. You can either optimize for organizing this data per sensor, or by time of the day, or by metric.

Now let’s say your access patterns are more time-based, where you want to see some trends. Organizing the data by sensor ID is going to incur a lot of overhead, because for every time [00:21:30] of the day, you have to traverse multiple folders to get data from the different sensors. So make sure that you’re partitioning your data depending on the access patterns.
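The sensor example can be made concrete by counting how many top-level folders a time-based query must traverse under each layout. The sensor and hour names are hypothetical, and the count stands in for the directory listings a real engine would issue:

```python
from itertools import product

sensors = [f"sensor-{i:03d}" for i in range(100)]
hours = [f"{h:02d}" for h in range(24)]

# Two hypothetical folder layouts for the same sensor metrics:
by_sensor = [f"{s}/{h}" for s, h in product(sensors, hours)]  # /<sensor>/<hour>
by_hour = [f"{h}/{s}" for h, s in product(hours, sensors)]    # /<hour>/<sensor>

def folders_for_hour(paths, hour, hour_segment):
    """Distinct top-level folders a query must traverse to collect
    all sensors' data for a single hour of the day."""
    return len({p.split("/")[0] for p in paths if p.split("/")[hour_segment] == hour})

print(folders_for_hour(by_sensor, "09", 1))  # 100 -- one folder per sensor
print(folders_for_hour(by_hour, "09", 0))    # 1   -- a single hour folder
```

The asymmetry flips for a per-sensor history query, which is the point: pick the partition key that matches your dominant access pattern.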

And finally, I want to call out that we have inherent capabilities in Azure Data Lake Storage that help you with optimized access. One such feature that I want to highlight is query acceleration, where you’re able to push down a set of criteria into the storage system to be able [00:22:00] to get the data that you want. So, similar to a SQL query, you can specify a predicate, which is similar to a WHERE clause, and you can ask for specific columns with column projections. You can specify both of these to the storage system in your transactions so that you’re only getting the data that you want. This is going to highly optimize your performance, because the amount of data that’s going across the wire between your compute and storage is heavily reduced, with query acceleration doing the filtering in the [00:22:30] storage layer. And it also optimizes cost because, first of all, you’re optimizing the data transactions, and second of all, you would otherwise have to load all that data into a compute system and then do the filtering inside the compute system, so you’re able to minimize that cost as well.
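The effect of pushdown can be simulated by comparing client-side filtering with storage-side predicate-plus-projection evaluation. The row schema is invented and this models only the data volume crossing the wire, not the actual query acceleration API:

```python
# Hypothetical sensor readings sitting in storage.
rows = [{"sensor": f"s{i % 10}", "temp": 15 + i % 25, "ts": i} for i in range(1000)]

def naive_read(store):
    """Client-side filtering: every row crosses the wire first."""
    transferred = list(store)                      # all 1,000 rows move
    result = [r["temp"] for r in transferred if r["sensor"] == "s3"]
    return result, len(transferred)

def pushdown_read(store, predicate, projection):
    """Query-acceleration style: storage applies the predicate (WHERE-like)
    and the column projection, so only matching values cross the wire."""
    transferred = [r[projection] for r in store if predicate(r)]
    return transferred, len(transferred)

_, moved_naive = naive_read(rows)
vals, moved_pushed = pushdown_read(rows, lambda r: r["sensor"] == "s3", "temp")
print(moved_naive, moved_pushed)  # 1000 100
```

Both paths compute the same answer; the pushdown path simply moves a tenth of the rows, and only one field of each, which is where both the performance and the cost savings come from.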

These are some of the strategies that we’ve talked to our customers about, and we’ve learnt a lot from our customers as well. There is a lot more documentation, which I’ve attached links to, both with respect to [00:23:00] Azure Data Lake Storage as well as more detailed versions of this guidance. Do take a look at it. Thank you so much for your time. And now we will open it up for any questions.

Emily Lewis:

All right. I’ve got a few questions, it looks like they’re coming in. Let me try to share my screen so that we can all see these. While that’s happening, we did get one question in the chat that [00:23:30] says, how small is a small file in a data lake?

Rukmani Gopalan:

Yeah. That’s a good question. Typically, what we see, when you’re thinking about raw data ingested in its raw, natural format, is data that’s a few bytes to a few kilobytes per file. But typically the guidance that we give is a few hundred megabytes per file to store your data.

Emily Lewis:

All right. Looks like the videos are not [00:24:00] working. Do you guys have video? Excuse me. If you have questions, can you just ask them in the chat?

Rukmani Gopalan:

I want to take a question that I think somebody had asked earlier in one of our informal conversations, and I think it’d be useful for the audience as well, which is, “Hey, with all these, Apache Iceberg or Delta Lake or Dremio, how important is it for me [00:24:30] to look at all these storage transactions?” The answer is, it is important. Because, as we said, a data lake is basically a central storage layer that lets you run a plethora of compute, each coming with its own patterns. And what happens is, when you think about storage, you want to make sure that you understand your entire compute landscape and you’re organizing your data inside the data lake in a format that is helpful to serve [00:25:00] this plethora of compute.

In addition, with partners like Dremio, we’re also working together very tightly. So, while the parameters that we talked about, the storage primitives in the data lake, are completely invisible to the end user of Dremio, Dremio will be leveraging some of these capabilities to optimize the data access. So, think of these as building blocks: where you have storage access patterns, know about them and be informed, so that you’re able to optimize for them. And you can almost take it as a guarantee [00:25:30] that we are working with our partners very closely to make sure that they are taking advantage of these innovations.

Emily Lewis:

Great. We’ve got another question. Well, we have a few. It says, does Azure support hybrid data lakes?

Rukmani Gopalan:

When you’re talking about hybrid data lakes, are you talking about data lakes set up across on-prem and Azure? So, the answer is yes. We have [00:26:00] the right set of offerings that enable you to do this scenario. When we talk about Azure Data Lake Storage, it is an Azure-only offering. Having said that, there is a rich ecosystem of our tools that help support a hybrid architecture, for things such as data governance. And also, when our customers are thinking about a migration strategy for how they take their cloud journey, it’s not going to be a big-bang lift and shift. So they always start with having [00:26:30] some of that infrastructure on-prem and some of it on the cloud. And we have a set of both our tools and our partner tools that help customers with this journey while running a hybrid infrastructure in the meantime.

Emily Lewis:

Got it. Chris asks, “What is the difference between Blob store and Azure Data Lake?”

Rukmani Gopalan:

Oh my God. I am so excited that somebody asked this question. So, Blob Storage is object storage. Azure Data Lake Storage is a capability that’s built on Azure Blobs. [00:27:00] So, the one strategy that we took as Microsoft while designing the storage system was not to invent a siloed service just for analytics; we wanted to build our capabilities on a common storage foundation.

So, Azure Data Lake Storage is basically the capability of having a hierarchical namespace, which is the file and folder hierarchy, on Azure Blob Storage. And what this helps us do is leverage all the data management capabilities that Blob [00:27:30] Storage has, while having the specialized analytics-oriented protocol enablement that helps with integrating with Hadoop File System-friendly workloads. And the best part is we also have a rich set of partners who are already integrated with Azure Blobs. What we have is a common storage foundation layer with specialized protocol layers on top. So regardless of whether you use the Blob APIs or the Data Lake APIs, you’re talking to the same storage file [00:28:00] system.

Emily Lewis:

Got it. Bob asks, “Is Azure Data Lake the only Azure capability that is well suited for data lake implementation?”

Rukmani Gopalan:

I would say it is a fundamental part of a data lake implementation on Azure. But when we’re talking about a data lake implementation, we’re talking about storage and an ecosystem of compute capabilities. So, as Microsoft, we have a [00:28:30] ton of offerings, depending on what customers want; we meet them there. For example, our most recent offering that is gaining a lot of customer popularity is Azure Synapse Analytics, which is basically a single-pane-of-glass implementation of a modern data warehouse on Azure.

In addition, we also have Azure Databricks, which we offer as almost a first-class offering on Azure. For folks who are interested in running core open-source workloads [00:29:00] on Azure, we offer Azure HDInsight. And in addition, with partners such as Dremio or [Bandascore 00:29:08], Informatica, we also help them run on Azure. We want to help our customers wherever they want to be, and some of our customers run Cloudera Data Platform on Azure infrastructure services. So, the possibilities are infinite, and Azure Data Lake Storage is the fundamental platform of it all [00:29:30] as a storage layer.

Emily Lewis:

Benca asks, “Given the massive volume of data lakes, how critical is the metadata to the security strategy?”

Rukmani Gopalan:

I think it’s very, very critical. I cannot overstate the importance of metadata. It’s not just for security, but also for data discoverability and the elimination of data silos, right? Because data is going to come in at a volume that’s unprecedented, especially with COVID, where so many things are remote and a lot of decisions are based on data. [00:30:00] Having a way to describe your data and catalog your data is more important than ever. Also, enforcing some of your access policies based on what kind of data you have is important. And the metastore is going to be a foundational building block for all of this.

Emily Lewis:

Great. So it looks like we have time for one or two more questions, but before we get to those, I just wanted to mention before people leave that there is a tab at the top of the chat called Slido, and in there is a three-question survey. [00:30:30] If you could answer that before you leave, that would be great. The next question is, do I need to copy my table if I’d like to use the different partition strategies you mentioned?

Rukmani Gopalan:

Hmm, I’m not sure if I understand the intent of the question, but let me try my best. So, I believe the question is: if I have different read patterns, [00:31:00] do I need to make different copies based on different partitions? Assuming that’s the question that was asked, the answer is, hopefully not. This is where we have layers above the data lake that help you achieve this optimization. So Dremio is a great partner for us, right? They have data virtualization; they take care of this data by using the primitives that we talked about, and they manage these partitions so that you’re able to leverage that. There’s no silver [00:31:30] bullet really, but we definitely encourage you to find opportunities to avoid having to copy the data multiple times, and there are opportunities to do that. Happy to check with you depending on what your scenario is.

Emily Lewis:

The last question we’ll take is, how do you handle sensitive data at different layers of the lake, raw versus enriched versus curated?

Rukmani Gopalan:

Yeah. That’s a great question. Again, we have a set of solutions within Azure, [00:32:00] as well as rich integrations with our partners that provide data governance. One of the key capabilities that our partners provide today is data classification, helping scrub the data for sensitive data and classify it. But with respect to the core principles of managing sensitive data, we are a compliant data store. We have customers from financial services and [00:32:30] healthcare using Data Lake Storage for handling scenarios based on sensitive data. So we have a slew of certifications, HIPAA, SOC and FedRAMP; Azure Data Lake Storage is a compliant data store. And in addition, what matters here as a key point is how you are cataloging and classifying the sensitive data, and how you’re managing access to this data based on its sensitivity.

Emily Lewis: Great. Thank you so much, Rukmani. That was a really [00:33:00] great presentation. A reminder to the audience that she’ll be available in the Slack community to answer any questions for the next hour or so. I know we didn’t get to answer everybody’s questions, so follow her over there, and enjoy the rest of your sessions today. Thanks everyone.

Rukmani Gopalan: Thank you so much.

Emily Lewis: Bye.