Subsurface Summer 2020
Lessons Learned from Operating an Exabyte Scale Data Lake at Microsoft
Data lakes are the new best thing. Companies want to transform their business by harnessing the power of data in their data lake to get transformational insights and leapfrog the competition. Our team at Microsoft has operated possibly the world's largest data lake for more than a decade. In this session, attendees will learn about our journey in building and operating a multi-exabyte data lake, what users care about, and what went well and what didn't go so well. Attendees will also learn about what has changed since we started our journey, and how we've adapted as the big data ecosystem has evolved.
Raji Easwaran, Group Program Manager, Azure Data Lake Storage
Raji Easwaran leads the Product Management team responsible for the strategy, design, development and implementation for Microsoft’s data lake storage platforms and services. Raji and her team are focused on delivering core storage platform services for big data analytics that allow data scientists, data engineers and developers to successfully develop, deploy and manage big data applications on Azure. Prior to working on the product management team that builds analytics storage services, Raji led the teams responsible for capacity management, operations and cost efficiencies for Microsoft’s internal big data platform powering analytics for several Microsoft internal divisions such as Bing, Office and Windows.
All right, here we go. It is 11:35. Okay, welcome everybody. Really excited. We've got another great tactical session. I'd like you all to meet Raji Easwaran, she's from Microsoft. She's got a great talk to walk us through. So, let's dive in. Over to your Raji.
Hey, thank you. Welcome everyone. Thank you all for taking the time to spend the next 20, 25 minutes with me and hope you hang around, and let's make this a more interactive session with questions at the end as well.
A little bit about myself. Raji Easwaran, I'm a group program manager on Azure, working specifically on Azure Data Lake Storage. So, my team, we build the big data storage platform for Microsoft, and also for customers, like yourselves, to run big data analytics pipelines. On a little bit about myself, I joined Microsoft around 11 years ago, started off in the dynamics team, and then quickly switched over to the internal big data analytics platform, which was originally, at the time, under the Bing organization. And I've been in that team doing big data analytics stuff for the past decade. And so, super happy to share my experiences with you. So, let's just get right in.
Why is this session important to me? Why is it important to Microsoft? And what can I share with you on my team's experience as we built this? Like all organizations, Microsoft is no different. We own and operate a lot of platforms, systems, applications that either fuel our company, whether it's the HR systems, or the finance systems, or the products that we run and offer, which is like Bing, or Office, or Windows. All of these applications, they work on data, we get intelligence from that data, and they power the intelligence, and the decisions that we need to go and make. And so, it's super important for us.
How do we think about the data lake within Microsoft? We don't think of this just as technology. We think of this as the collection of teams and products that use the platform. We think of this as the ecosystem of developers, administrators, and engineers that use the platform. We look at this as the collection of workloads, including the diversity of workloads, whether it's batch processing, or streaming, or interactive. A collection of all of this is what we really think about as the big data platform for the company.
And, as you can see, we've come a long way. Like I said, we started off with Bing. We opened up the ecosystem, a few years after we started this platform, to the company. And we've pretty much got every team or a product within the company using the platform to run big data analytics. So, let's dive in on what our journey looks like over the time that we built this platform.
Let's start with the guiding principles that we used. The first principle that we wanted to use was simplicity in terms of the approach. We wanted to make this a big data platform as a service. And so, we, the analytics platform team owned, and operated the service on behalf of all of these teams. And we wanted to make this super simple and super approachable for all of the teams at Microsoft.
The second guiding principle was scale. High performance at massive scale, but no trade offs on cost. So, we wanted to give this service to our teams at the lowest cost possible, and scaling this platform to when we started off, it was a few petabytes and now we are a multi exabyte platform. And so, that was a very core principle that we used as part of building the platform.
And the third one, and this was initially maybe not a very core guiding principles, but we learned very quickly that data gravity and sharing of data assets within the data Lake is going to be very, very important. And so, we used that as one of our co-tenants and principles to say even if you're not democratizing the data, make access to that data super simple. And we have teams within the platform that do democratize data, if there's no privacy aspects to it. But even for the other kinds of data, the principle was make it super simple for the different teams at Microsoft to get access to each other's data, such that they can get the value or that data. So, that's how we started.
Let's start on what we did well, or what we think we did well. Number one was we made compute, big data compute, computation running on thousands of machines over petabytes of data, we made that super approachable. And the way we did this was we didn't put the burden of managing infrastructure, or machines on the developers that were actually trying to solve the problem. We made it super approachable, whether it was in the form of how do I create the resources? What's the language that I use? And I go into a little bit of detail on the language that we had built and actually the compute platform, I wouldn't even call it language. I go into a few details about why that made it so approachable.
The second one was the simplicity, whether it was the operational excellence of ordering capacity, or managing chargebacks to the different team, taking out the financials and the economics away from the teams that are using it. So, we built our own capacity management team and cost attribution team. We built tools to kind of look at how hot the data is, and how do people think about okay, do I retain this data? Do I get rid of it? We made that super simple for the teams. And so, as a combination of the approachability of the platform, and the simplicity with which they can onboard that platform we saw huge adoption on the platform.
And third, there is definitely this element of data culture, because you can offer the platform, at the end of the day, the quality of the data completely depends on the developers that are actually operating on that data. And how easy it is for them to bring the data in, and write the scripts that can compute against that data. And so, what we did was we had a vibrant community of developers that were using it. These developers would write scripts that can be used throughout the company because the data itself can be shared so easily. We had performance boot camps where not just our team, other teams within the company would actually come and share best practices. And so, the community was also very, very vibrant. And you see this pattern quite well adopted even in the industry today with the open source culture that's out there, but this is what we did well.
So, let's talk about the computation environment that we built. So, this was our secret weapon, our magic sauce, if you will. We called it SCOPE. And it stands for, as you can see on the screen, it stood for structure computations optimized for parallel execution. For people who've been in the big data space this is the improvement of MapReduce. So, it was a language, it was a runtime, and a parallel query execution framework that made it possible for people to run these big data jobs on the platform, very similar to the Spark engines of the world, and the MapReduce engines of the world today. And this is what we had started more than a decade ago.
It was inspired by SQL. And so, it made it super easy for people to onboard because they had to just write SQL like queries. Similar to select statement, we had extract statement to extract the data from the files that were stored on the platform. There were select statements that would select data from the extraction that was just done. Output statements, just write it.
The third thing we did also with SCOPE was a lot of .NET integration. We are Microsoft and we had a lot of developers that were very familiar with .NET And I'll touch on the last element of this as well, which is extensibility. And the combination of the framework that allowed developers to use .NET, and extend the language. So, people could write their own functions and kind of compliment the language and the capabilities that SCOPE already provided. So this was, essentially, what led to really brought adoption within the company. And now, as we've kind of progressed over time, we've seen similar things like Spark and all of the other engines, and we've tried to adopt some of that in our platform as well over time. And I'll walk you through what prompted us to go do that.
What are the challenges we faced? The first challenge is, like I said, one of our core principles was to make sure that we had scale as one of our primary goals. And, like I said, we started off with petabytes and we were growing so rapidly we needed to make sure that we were able to keep up with the demands of the Offices of the world and the Windows of the world. And so, we were growing in hundreds of petabytes to exabytes. And if you attended the keynote address earlier this morning, one of the things that people did, at that time, was build large clusters. And, at the time, even a 1,000 node cluster was a huge thing.
We actually were able to scale our clusters to tens of thousands of nodes, where our clusters are up in the 50,000 node range. And the reason we did that is because of the other principle, sharing data. So, we wanted to make sure that it was a huge multi-tenant system where different teams could access each other's data without increasing their costs. And we had to go and deal with that challenge of scaling to those exabytes.
Another challenge that we ran into was, because we were growing so fast and because we had to go and buy all of that hardware, we ran into data center limitations. There wasn't enough space in the data centers that we originally planned on going into. And, again, with all of the growth, back to the principal, we had to offer this at low cost to our customers. How do we keep the efficiencies going and keeping that low cost?
So, how do we do this? Number one, and this is a pattern that you've seen even in Dremio, for example. You have to understand the workloads and what they're trying to do. And so, we started looking at the workloads that were running on the platform. What were they trying to do? Was there repetition? Were there temporary or intermediate data sets that were being created? Such that if there were 10 jobs and the first job was creating an intermediate dataset, could we cache that, or store that such that the remaining nine jobs, or 20 jobs that wanted that intermediate data set didn't have to start again from that raw data? So, we had to understand all of these workloads. And then, we built these capabilities into the platform such that we could optimize and, again, for performance, and also for scale. So we, we did that.
The other one was we started looking at what it would mean to be a hyper scale platform. So, things that we did, number one, was a scale out. You've seen that concept in some of these HDInsight to Databricks type clusters. What we did differently, there was because we were owning and operating our own hardware. There were two elements that we had to keep in mind. There was the servers themselves. And do we want to stick with the co-located compute and storage, which was the trend back then? And then, how do we think of the hardware SKUs? Because you could pick SKUs based on high memory or high compute or high storage. And at the time everything was co-located.
And so, we kind of evolved our SKUs. That was the third element of how we did this. We evolved our SKUs, and we built heterogeneous clusters. And so, the same cluster could host a service from five years back, where it was more optimized compute. And then, if we saw that the storage demand was going up, in the same cluster, we could add more storage intensive SKUs, but we could still co-locate the compute and the storage to get that fast performance. What we also did, we worked with the data center team to give us the network infrastructure, such that we could go into the core, the distributed or disaggregated compute and storage. And so, we had to invest on that over time.
And the other thing is to avoid the data center space limitations. We really had to go and invest in, okay, where do we really want to be? Which regions do we want to be in? And then, we had to go and create plans to be in those regions, work with the data center planning group to make sure we had the power and the data center space to accomplish that.
So, those were the three top level things. There's a whole bunch of other technical challenges, but probably not too unique to us. It's probably the same challenges that any big data platform or data lake implementation crew would face. But these were probably the three that stood out in terms of what do we do outside of the core technology?
What did we learn from our customers? Like I said, SCOPE was a proprietary language. The formats that we used, we call them structured streams, very similar to what you've heard today, Parquet or Delta Lake or Iceberg. We had our own proprietary format, we called them structured streams. And we had the intelligence but, again, nobody could use that outside of our platform. And so, that was the one feedback, the top feedback that we got from our internal teams saying, "Give us open source, not just for the formats, but also for the other analytics engines that are out there." And we started getting developers who knew Spark, or knew even HBase, and they couldn't use those technologies against the data that we had on the platform. So, that was the first feedback that we got.
Another huge learning for us was two, three years ago when GDPR hit us governance was important to everybody. And for a team that was operating such a huge analytics platform, we had to support that on behalf of our customers. And so, on the platform we had to go and create capabilities to do data lineage, and create that catalog of what the data sets were, what that data lineage was, and how was data transforming over time. And, also, to even manage the DSR requests. And so, we had to go and create capabilities, such that if our teams had DSR requests, they could actually satisfy that on our platform. And so, that was a multi-year journey, and it was a huge undertaking for our team. And what we learned out of that is problems like this isn't solved within just the platform team, but everyone, the whole community had to come together. And that's what made our journey successful. It was us building the technology, but in collaboration and cooperation with the rest of the company.
And the last one was similar to the same feedback that we got around open source. People couldn't use Azure because we were a proprietary platform within Microsoft. And yes, we had the exabytes of data, but they couldn't use some of the other Azure services. They couldn't use partner technologies. Again, [Tom-ah 00:15:07] mentioned this earlier today, there is vendor lock out. That was the classic symptom that we experienced at the time. People couldn't use partner solutions, and they were asking for Azure integration.
This is just a very quick picture of where we've come over time. Like I said, we started off in 2008 with a few petabytes. Over the first few years, it was us working with Bing, and kind of helping them with things like building recommendation engines, or experimentation platforms, or understanding their user query patterns. We then, expanded that to include the Office team, building the big data platform for Office such that we can help them with their telemetries. How's the Office business doing? How do they do spam detection? Scenarios like that.
We then opened this up to the Windows team. And so, that's what essentially drew the hockey stick growth. And a lot of this was the teams within Microsoft trying to understand the value of each other's data sets. And that was essentially another reason why we grew so rapidly. And what you see at the tail end is, essentially, a result of what happened with the GDPR. And so, a lot of the teams, what they wound up doing was they had to really go back and look at the data they were storing in the system, and understand what they were storing it for. And that there was a lot of clean up, especially with the 30 day destruction policy.
A lot of teams wound up saying, "You know what? I don't think I need all of the raw data. Let me keep the curated data, which doesn't have the PII." And they started curating the data to what needs to be kept. And then, they started getting rid of some of the raw data. And some of the other teams did the opposite. They said, "Look, I need the raw data. I'm going to keep them very, very secured. And I will deal with all of the PII elements as needed." And so different teams different approaches, but there was a lot of cleanup as a result of GDPR.
So, where are we headed now? This is our vision and our mission. What we want to do, whether it's outside customers, or whether it's the teams that we support within the company, our goal is every organization needs to extract the maximum value for their datasets. We don't want them to just come and store that data. We want them to get the maximum value. How do you do that? And the way we are doing that is, essentially, our current strategy, which is we've built Azure Data Lake Storage. This is the central platform, a storage platform for building your data lakes. Everyone brings their data, diverse data sources, diverse formats, bring them into the data lake. And then, you get the rest of the Azure ecosystem. And, based on your use cases, you pick which analytics, or compute services you need. And so, we built that integration with all of the other compute services we have within the company. And we also took that outside and we said, "Look, we also want to make third-party solutions available to these customers.
And so, this is the vision that we are on. And we are slowly taking the internal big data analytics platform, and we're moving these customers over to the Azure world. And we've already started that journey. We've started moving exabytes of data already to Azure Data Lake Storage, which is the external facing service that all of our third-party customers use. And so, over time, we don't want to have these silos. We talk about data silos all the time. Within Microsoft we, unfortunately, have a huge data silo, where a lot of our intelligent data is stuck on a silo, which was historically more proprietary. We've made strides to make it accessible to the rest of Azure, but we're taking it all the way, and bringing that into our central data lake storage platform, which all of our customers use this one.
So, that's where we are going. And what I'd like to leave you with today is think about what is your data lake strategy. When I talk to a lot of customers, they are very focused on the technology. The technology is very important. You're hearing about that from all of the other speakers today. How do you think about hive? How do you think about data formats? How do you think about [inaudible 00:19:10] and optimizing the access to the data lake, all of that is super important.
But there is also another element. If you want to make your data lake successful, you need to make sure that the people who need the insights have accessibility, and there is no friction for them to onboard your platform. Make sure you're giving them the elements to manage their costs. And one of the reasons why it was so important for us to move to the Azure services was there was so many capabilities that we've built, over time, such as redundancy options, and tiering and things like snapshots. So, you have to give them all of those capabilities.
And so, evolution of your platform is also super important. And, again, the community. Are you leveraging your community? Within your company, you'll be surprised that there are pockets of expertise and experience within your company that you might not be even taking advantage of unless you open that community out, and allow your developers and your experts to come and teach the rest of your company on how best to build your data lake. So, with that, that's our experience. Would love to take the questions. Thank you all for taking time to come talk to us and that's open this up for questions.
Fantastic. Thank you. Can you hear me okay? Yeah, great.
Yes, I can.
So, I'm going to now start bringing people in one at a time. So, I have so far have one person in the queue. Adam, I'm going to click to bring you in, in just a sec. If anyone else wants to ask a live question, click that button in the upper right to add your audio/video. And then, just make sure, of course, you give access to hop in to your camera and to your microphone, and then I'll see you here. So, I'm going to give a try. Adam, let's see if this works, clicking to bring you in. It may not. We've had some trouble with this working before, so maybe while we're trying, I see Scott coming up next. Raji, there's some other questions that came in on the announcements. If you scroll up a little bit, you'll start to see, I think, maybe even want to hit one of those while we try to bring someone in.
I can read one off for you, if you want too.
So sorry, Adam, that didn't work. I'm going to try Scott too in the background here.
Let me see. There's one question in here and I'm not entirely sure. [J-am 00:00:21:40], I would love to take this offline with you and we can discuss this on Slack.
We have, at least in our platform, both on the internal platform side, and also on the Azure Data Lake Storage side, we do have petabytes of data, even in single files. We can do terabytes per second to put on the platform. And so, billions of rows per second is really not new to us. Because of the distributed nature of the platform, it's going to depend on how many compute nodes or VMs you have on the compute cluster, as well as how you have distributed your storage. But that's absolutely possible today. We run really, really large workloads within the platform. Now, 1 billion rows per second, I personally haven't tried that, but I can check with my colleagues, if anyone's actually tried that, but it doesn't seem too farfetched.
Great, okay. And we do have Scott. So Scott, if you want to go ahead with your question.
Hi, how are you? I, hopefully, have a little bit of time for this. Could you expand a little bit on what you meant by now supporting GDPR more with the platform?
So, let me give you the example of where we started. So, at the time, if you look at any analytics engine the way the data comes into the platform is an append only system. And so, you start off with a stream, and you keep appending it. And so, for example, if you wanted to go and support GDPR, we needed the ability for someone to take this humongous stream with bits and pieces of data inside that that stream, or file that needs to be removed. And so, we needed to provide the ability for someone, or for our customers in an append only platform to be able to do deletes. Because you, typically, can't update in between. And so that was the capability we built. And so, we created the capability of allowing updates on an append only system. That's essentially what we had to go and build.
And so, there were multiple things that we had to go and do. Number one, on the storage layer, or on the compute layer being able to give people the ability to say, "Hey, that's a delete, please delete these records." And then, another capability that says, compact these streams, such that if there was a delete, then when SCOPE, for example, was trying to read that screen, we knew that that element was deleted, and we will not return it back to the user. Then, we would have a compaction process to say, "We know there's a whole bunch of deletes that have already happened, let's go compact it." So things like that is what we had to go and build. And these had to be built as platform capabilities. And so, that's what we did in an append only system.
Great. Thank you.
Great. Thank you, Scott.
I don't see any other live Q&A questions coming in at the bottom. So, Raji, if you want to go just pick another one out of the chat.
Yeah, let me go look.
So, let's see. There's one from [inaudible 00:24:31]. "Do you use open source technologies to process data?" Like I said, when we originally started, it was a proprietary platform. We used things like structured streams, which was a proprietary format. We used SCOPE, which was proprietary. Over time, what we've done is even in that platform, we allowed some of the open source technologies, and languages to operate.
Now, what we're be doing is because we have a data silo problem, we are saying we don't want different platforms. We're actually putting all of this data on Azure Data Lake Storage. And Azure Data Lake Storage supports all of the open source technologies park, Databricks is integrated very well with Azure Data Lake Storage, HDInsights is. Take Dremio, we have actually been working very closely with Dremio. You can bring any open source platform, and just use the driver because we are an HDFS protocol on top of the data lake storage service. So, that's where we are going right now. And so yes, we do support all of the open source technologies to process the data.
Another one, give me one second. Where's my ...?
Yeah. Jerry, the name of the platform, we call it internally ... and there was a reason why I didn't really go into the details of this. We call it Cosmos and it could get a little confusing because of the overlap with Cosmos TV, but we actually call it, internally, Cosmos.
Let me see. Architecture diagram, I don't have a great architecture diagram right now. But think of this very similar to Hadoop, where you have a front-end layer, you have your compute layer. And similar to HDFS, you've also got the storage layer. They're all co-located on the same machine. And we also have metadata elements to this as well. There is a paper that you can probably look online on Azure storage, very similar architecture to that. But that's, essentially, what the architecture would look like, very similar to Hadoop.
There's another question, "How will you guys be consuming the data from ADLS Gen2?" ATLS Gen2 is going to be the storage platform, which is going to house all of these exabytes of data. And we consume that data through Synapse analytics, for example. If someone wants to go and upload this into their data warehouse, we can do that. If someone wants to just run Spark queries against that data, we can do that. You can hook up a HDInsight cluster to it. You can hook up Databricks clusters to it. If you want to use Dremio, or Presto, or any of that, essentially, the sauce is the driver. It's an HDFS driver, you use the driver, and you connect to the data in the data lake, and that's how we use it.
Let me see ... I don't fully understand this. "Any possibility and study was done in Amazon for GPU CUDA data processing?" I'm not aware of this. So, since I don't understand the question, I'm going to go to the next one. And I'm not aware of any GPU data processing study yet, but if we do one, we'll be happy to publish that.
"Are you building any enterprise data model or data warehousing capabilities on ADLS Gen2?" ADLS Gen2 is the storage part of the data lake. And you have things like Synapses analytics that actually have the data warehousing capabilities. And, if you've listened to, again, some of the sessions before, the trend now is, and it's amazing how we've been able to keep up with the platform with co-located compute and storage for this long. But we're actually moving away from that to the distributed or disaggregated compute and storage. There are definitely trade-offs, and benefits as well.
One of the challenges we had with the co-located model is we're always playing catch up with SKUs because there is a very fine balance between what's our compute demand, and storage demand. And if you don't get that right, you wind up stranding a lot of capacity, and it's always a race. You split that up, you can scale these two independently of each other. And in order for you to go do that, your service has to be super smart about, okay, which domain, which network domain is your data in? Which network domain do you want to put your compute resources in? And so, that's the investment that we're trying to do.
And so, going back, we do have these data warehouse capabilities. They will be in the disaggregated model because you've got the storage at the bottom, you've got Synapse analytics where your data warehouse can reside. You've got Spark that's also accessing the data here. You could run HDInsight which is, again, another compute engine accessing your data remotely because it's remote storage. So, that architecture is very much a pattern today. And a lot of customers are using it. And so, ADLS Gen2 itself doesn't offer the warehousing capabilities because it's part of the compute infrastructure that sits on top.
Great answer Raji. And, by the way, I'm going to jump in because we're basically out of time.
That open data Lake architecture, that's really the whole point of this. Bring those different compute engines to do all sorts of different things. Things we can do today, things don't even know about the future. So, with that, we're going to close down. But keep the questions, and the comments, and the discussion going over in Slack. So, I would invite everybody including Raji, go ahead over to Slack, and you can keep posting your questions there, and not just here at the event, but afterwards as well. Thanks everyone. And thank you, Raji.