Subsurface LIVE Winter 2021

Data Lakes Drive Decisions: A Virtual Fireside Chat

Session Abstract

Join us for a fireside chat with Mai-Lan Tomsen Bukovec, the global Vice President for Amazon Web Services (AWS) Block and Object Storage services, which include Elastic Block Store (EBS), Simple Storage Service (S3) and Amazon Glacier. More data lakes run on S3 than on any other provider, and Mai-Lan will discuss emerging trends in data lakes and how they are powering the next generation of business analytics, machine learning and other business-critical applications.

Presented By

Mai-Lan Tomsen Bukovec, Global Vice President, Block and Object Storage at AWS

Mai-Lan Tomsen Bukovec is the global Vice President for Amazon Web Services (AWS) Block and Object Storage services, which include Elastic Block Store (EBS), Simple Storage Service (S3) and Amazon Glacier. Mai-Lan has been an engineering and product leader of AWS storage and compute services since 2010. Prior to joining Amazon, Mai-Lan spent over 10 years in engineering and product leadership roles at Microsoft, as well as three years in early-stage startups. She served as a forestry volunteer in the Peace Corps in Mali, West Africa.

Mai-Lan lives in Seattle with her husband and three boys. When she is not working on Amazon cloud services and spending time with her family, Mai-Lan trains as a recreational boxer. She also holds a green glove ranking in the martial art Savate.


Tomer Shiran, Co-founder & Chief Product Officer, Dremio

Tomer Shiran is the CPO and co-founder of Dremio. Prior to Dremio, he was VP Product and employee no. 5 at MapR, where he was responsible for product strategy, roadmap and new feature development. As a member of the executive team, Tomer helped grow the company from five employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He holds a master’s degree in electrical and computer engineering from Carnegie Mellon University and a bachelor’s in computer science from Technion - Israel Institute of Technology, as well as five U.S. patents.


Webinar Transcript

Announcer:

Ladies and gentlemen, please welcome to the stage, Dremio co-founder and CPO, Tomer Shiran.

Tomer Shiran:

All right. Thank you everyone for being here with us this morning or afternoon or evening, wherever you are. I am very excited today to be joined by Mai-Lan, who is the VP of Block and Object Storage at AWS. That includes both S3 and EBS. Mai-Lan, thank you for joining me.

Mai-Lan Tomsen Bukovec:

Thanks for having me.

Tomer Shiran:

[00:00:30] Yeah. I’m curious, maybe just to get started: you’re responsible for S3, which was created back in 2006, and EBS, which I think was created in 2008. Together, those services serve millions of customers and companies on AWS and really store so much of the data that everybody here with us today has. I don’t know how you sleep well at night, given so much [00:01:00] responsibility. I’m curious if you could just share with us how you got to this point and what it is like running the world’s largest storage service?

Mai-Lan Tomsen Bukovec:

Well, it’s been amazing to work on something that is just so foundational for, as you called out, millions of customers. If you think about it, the patterns of computing are going through a fundamental transformation and it’s happening quickly. AWS has led the way since S3 launched in 2006. I’ve worked [00:01:30] on AWS for over 10 years now and have to say that I come into work every day just as energized as I did way back in 2010. We’re super mission-focused at AWS. We talk about how 90% or more of our roadmap is driven by customer requests. If you take that deep belief in helping customers transform their industries all over the world, you add pretty much the most interesting, [00:02:00] large-scale distributed systems and our Amazon culture, and it makes for this incredibly exciting opportunity to transform how customers use storage.

To be honest, part of it is that we know that what we’re building helps our customers make innovation leaps that frankly change people’s lives. For example, last year in 2020, when demand for ventilators spiked during the COVID-19 pandemic, one of our customers, [00:02:30] the medical device manufacturer Vyaire Medical, found itself with an order volume that, as you might imagine, went up by a factor of 20. Before the COVID-19 pandemic, Vyaire was producing 30 ventilators a week. And just a few months into the pandemic, the company was producing 600 ventilators a day. For Vyaire’s biggest project, a single government order for tens of thousands of ventilators, they had to create [00:03:00] a new, highly customized manufacturing process for those ventilators. They had to innovate on the spot and they had to do it with urgency because of how much the world really needed that supply. It was 20 times beyond what had ever been done before by this company and they did it in six or seven months.

That ability to scale was only possible because Vyaire had just modernized their analytics solutions [00:03:30] using, among other AWS services, Amazon S3 to store the data. That data lake helped Vyaire quickly analyze ventilator quality in real time so they could adjust and refine that new manufacturing process. And it was all to speed up the delivery of those ventilators.

So when you talk about how we sleep at night, when you know you’re coming in every day to help customers like Vyaire help the world, you know that what you’re building [00:04:00] is making a difference. That’s what we’re here at AWS storage, and AWS as a platform, to help customers do. You can’t develop a critical vaccine in record time or transform the different processes for technology or retail delivery without a modern data architecture built on AWS: a data lake in S3 and, as you know, the many purpose-built analytics engines offered by AWS or Dremio.

Tomer Shiran:

[00:04:30] What have you seen in general with this pandemic? Has it had an impact on how companies are migrating to the cloud or using data lakes? Obviously, the companies that are involved in developing vaccines and other medical devices, but maybe more broadly as well, in the industry.

Mai-Lan Tomsen Bukovec:

Yeah, we’ve seen an acceleration. It’s pronounced. We believe that in time, all companies will move to the cloud. [00:05:00] That is the only place where you’re going to be able to deliver what your business needs at the pace of today’s modern world. That has been true for years and we have many customers that are 100% all in on the cloud. But what this last year with the pandemic really showed is that agility matters. It matters in how you think about your business, not just now in these week-to-week changes that we see, but really, what are the implications [00:05:30] for your business in the future? It has accelerated many customers’ move into the cloud. Whether that is speeding up consolidation of data projects that were started before the pandemic, or really tipping a lot of thinking within the company to say now is the time. That has been a pretty significant change.

Tomer Shiran:

Yeah. I mean, it must be hard for companies, especially now. I mean, it’s always been hard to manage your own data center [00:06:00] or run that operation, but with all these additional complexities that come with this situation, that must be an accelerator for many.

Mai-Lan Tomsen Bukovec:

It’s complexity. When every business is managing their own data center, even in regular, normal times it’s hard, because you have to manage hardware, you have to manage networking. You have to be able to see into a crystal ball and predict the future because you have to sink capital into a set of appliances that are going to have a lifetime [00:06:30] of years. That’s hard. But in a pandemic, when you’re adding on different dimensions of safety in your data center, or a workforce that’s now either remote or wants to work remote, it just casts things in a whole different light. Not to mention, for certain industries or certain customers, the demand has spiked tremendously.

Tomer Shiran:

Yeah, that makes a lot of sense. I recently got my own Peloton as an example of that type of demand. [00:07:00] Back in 2010, I was one of the first employees at one of the Hadoop companies. Back then data lakes were the hottest thing. Then five years later, in 2015, I think some people started to declare them dead. That was the result of a lot of complexity and how hard it was. People started to say, “Well, maybe we should just go back to data warehouses,” and data warehouses in the cloud were becoming a possibility with unique capabilities. Fast forward to now, 2020, [00:07:30] and data lakes are the hottest thing again and everybody’s looking to build a cloud data lake. What do you think is driving this new interest that’s really taking off now?

Mai-Lan Tomsen Bukovec:

Fundamentally, every business today is a data business. If you look at it, business leaders in every industry all over the world know that a business decision, whether it’s refining a product like we talked about for the ventilators [00:08:00] with the pipelines, or pivoting a whole organization, those decisions need to be made quickly and they need to be right. They need to be correct. I think what’s happening now is you just have more companies realizing that the path forward to fast, correct business decisions is paved by data. A data lake is the foundation, not just for analytics, but also machine learning and artificial intelligence, because ML and AI [00:08:30] thrive on large and diverse datasets. So we’re seeing data lakes fundamentally change how customers think about data architecture, no matter what the application.

Rather than tying those individual legacy hardware appliances to a data center and a specific application, data lakes are breaking down those silos. They’re getting that different data into a single shared dataset that can be analyzed and extended with machine learning [00:09:00] and AI by dozens or hundreds of applications. So every business needs that shared dataset for agile decision making because it’s the only way to stay on top of a rapidly changing world. That means S3 is foundational for all of that data. As you know, customers have been building data lakes in S3 longer than any other provider, so we have more data lakes at scale running on AWS than anywhere else.

[00:09:30] If I were to try to illustrate the breadth of what customers are doing on AWS and S3, let’s take the example of something that is very top of mind for all of us, which is the COVID-19 vaccine. As you know, bringing a new drug to market can normally take up to a decade. Certainly we’ve been hearing a lot about vaccine development, but really the reason why it takes that much time, [00:10:00] and frankly costs often in excess of a billion dollars, is because scientists need years of basic research. They have to screen thousands of compounds and then they have to move on to lab testing and trials. If you look at that from a data perspective, when you have siloed data and you have limited scalability of those on-premises systems for this type of scientific and high-performance computing, that will slow down progress. That was [00:10:30] and is unacceptable when you’re talking about vaccine development last year and into this year.

We’ve seen this with Moderna. Moderna, just so you know, focuses specifically on a method: using messenger RNA, which is called mRNA, a specific science, to create their new medicines. Moderna is one of the leading companies that’s delivering the COVID-19 vaccine all over [00:11:00] the world. To build that vaccine and the other drugs that they have in research, they had to invent proprietary technologies, basically, and methods that run on AWS to create that mRNA construct so that cells can recognize it as if it was produced in your body. This helped Moderna experiment rapidly on virtually any mRNA sequence because their data lake gave them enough information in real time that they could switch between [00:11:30] research priorities.

They run their whole drug design studio, that’s the name of the system that they use to develop vaccines, on AWS’ compute and storage infrastructure. The whole goal there is to quickly speed up the design of that mRNA sequence for protein targets. It then takes both analytics and machine learning on their data lake, and they optimize their sequences for production so that the [00:12:00] company’s automated manufacturing platform can convert the results of that data into physical mRNA for testing. They, Moderna, depend on AWS to establish a single source of truth for data in the data lake that the company uses to speed up bringing this vaccine to the world in record time.

That’s one example of how data lakes speed up innovation. Another example [00:12:30] that’s quite different, but still highlights how data lakes shine a light and speed up innovation is Domino’s.

Tomer Shiran:

It’s a very different example.

Mai-Lan Tomsen Bukovec:

Very different example, but I’ll tell you, if you’re anything like my family, pizza has been part of our diet-

Tomer Shiran:

All right, fair enough.

Mai-Lan Tomsen Bukovec:

Over the last year. I have three teenage boys so I’m quite familiar with [crosstalk 00:12:57].

Tomer Shiran:

How does a pizza chain [00:13:00] use cloud data lakes?

Mai-Lan Tomsen Bukovec:

Well, I’ll tell you about Domino’s Pizza Enterprises Limited. It’s the largest Domino’s franchise holder. It’s basically the Domino’s that operates across countries like Australia, and New Zealand, and Belgium, and France, and Japan, and many other countries. More than 70% of Domino’s pizza sales come from online orders. So Domino’s wanted to differentiate, particularly with, as you might imagine, quite a spike in demand over the last [00:13:30] year or so from families like mine, by having a pizza ready for pickup within three minutes and safely delivered within 10. To deliver on that promise, Domino’s had to make some pretty significant changes. It had to create efficiencies in how they cooked, so make changes to their ovens, as well as transportation. They had to physically open more stores closer to customers.

Here’s the really interesting [00:14:00] part too. They had to come up with how to build predictive technologies that helped reduce time for pizza making and delivery. Obviously, you’re going to invest in technologies to increase the speed of ovens and make drivers more efficient with eBikes or scooters. But they also depended on a data lake, and their data lake consisted of key order information. It was stored in S3, and they used Amazon SageMaker to build and train [00:14:30] machine learning models to predict the likelihood that an order would be placed, so a store could begin making that order right before it’s placed. Because Domino’s doesn’t want to give you a pizza that’s been sitting around for 30 minutes. They want to start making that order using predictive reasoning right before it’s placed, so you can get your pizza delivered within 10 minutes or pick it up within three.
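(For readers who want to picture the pattern Mai-Lan describes, here is a minimal sketch of an order-likelihood model trained on historical order data in S3. The bucket, file layout, and feature names are hypothetical, and scikit-learn stands in for the SageMaker training job Domino’s actually uses.)

```python
# Hypothetical sketch: predict whether an order is about to be placed,
# from historical order data stored in S3. Bucket, key, and feature
# names are illustrative only, not Domino's actual schema.
from io import BytesIO

import boto3
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-orders-datalake", Key="orders/history.parquet")
orders = pd.read_parquet(BytesIO(obj["Body"].read()))

# Features a store could know before an order arrives (all hypothetical).
X = orders[["hour_of_day", "day_of_week", "recent_orders_per_hour"]]
y = orders["order_placed_within_5_min"]
model = GradientBoostingClassifier().fit(X, y)

# Score the current moment for one store; start prep if likelihood is high.
now = pd.DataFrame([{"hour_of_day": 18, "day_of_week": 5, "recent_orders_per_hour": 42}])
if model.predict_proba(now)[0, 1] > 0.8:
    print("High likelihood of an incoming order: start making the pizza.")
```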

Two very different examples, but it’s [00:15:00] illustrative I think, of both the breadth of what a data lake can do, as well as the innovation in two very different industries.

Tomer Shiran:

That’s really fascinating. Just the breadth of use cases, from vaccine manufacturing to accelerating pizza creation and delivery. There was a blog post from Werner Vogels that I read last year, where he talked about how Amazon, or amazon.com, internally is using [00:15:30] data lakes. One of the benefits being all these different compute engines and analytics services that they can take advantage of. We also at Dremio had the privilege of working with that team and enabling BI on some of those use cases. That’s clearly one of the benefits of the data lake: being able to use all sorts of different services that exist today, and that might be created in the future.

What are some of the best practices and maybe challenges that companies should be aware [00:16:00] of when they’re creating that cloud data lake or going down that path?

Mai-Lan Tomsen Bukovec:

I think when you’re thinking about cloud data lakes, the most important thing is to start with the core, which is S3, to allow for that rapidly expanding dataset. Then beyond that, you have a wide variety. You have more tools than ever to unlock insights into your data. Now what’s interesting about the design patterns that we’re [00:16:30] seeing for data lakes is that when you start with a core of petabytes or exabytes of data, you end up realizing that, wow, you can use that data in so many different ways. Whether it’s for compliance, or ad targeting, or predictive analytics for where your industry or your business is going in a six-month or three-year timeframe. The key there is that you have more applications than ever that you can build on that shared dataset [00:17:00] and customers are using many of them at the same time.

If you think about the growth of data, it all starts from making the right decisions about putting it in the right place, which is S3 for millions of customers. The amount of data under analysis has increased from terabytes to petabytes and now to exabytes. IDC says that digital data is growing at a rate of 40% year over year. [00:17:30] The first choice you make is where do you put that data? Traditional on-premises analytics approaches just can’t handle those data volumes, and they can’t handle them in a cost-effective way, because they don’t scale well enough and they’re so expensive. Then you have the hardware whether you need it or not.

So the first choice is where do you put your data? And customers start with S3 as the core of the data lake, and they are able to then tap into this incredible variety [00:18:00] of tools available across analytics, ML, and AI to unlock that insight. Certainly Dremio has been a part of that movement for years now. It’s not only more cost-effective than ever to store exabytes of data using AWS, but it’s now easier and faster than ever because of so many purpose-built solutions, like AWS analytics or powerful solutions like Dremio, to unlock that data.

Tomer Shiran:

[00:18:30] That’s very fascinating. I remember in those early days of S3, when AWS was in its infancy, S3 was created to store images, right? It was that thing that you would put your website images on. Obviously, it’s evolved so much and you’ve been a big part of that journey at AWS, of making S3 this, I’d say, infinitely scalable system. I use that term with so many of our customers and our partners [00:19:00] and nobody ever challenges me on that term. We all know nothing’s infinite in the world, but for all practical purposes, S3 is infinite.

I want to ask you, what are some of the innovations that you’ve been creating in that platform in S3, and maybe some of the things that you can share about what’s coming?

Mai-Lan Tomsen Bukovec:

Absolutely. Our innovations are driven by the wide variety of usage on S3. That’s the first thing to maybe keep in mind as a framing [00:19:30] reference. As you say, we are 100% committed to making sure that customers can trust S3 for their data. But it’s really not just trusting us with the data, it’s also trusting us to let you access your data at any time, whether that’s availability or, frankly, dealing with the scale of the egress of your data lake.

I mean, just as an example, S3 regularly peaks at far [00:20:00] over 60 terabits per second of traffic in a single day in a region. We have a huge amount of work that’s being done on S3. As you said, when S3 first launched, customers including the Amazon website stored a lot of images for product pages. But if you fast forward to where we are in 2021, what we’re seeing right now is all different kinds of datasets. Of course, Parquet files for many different forms of analytics, but also customer care [00:20:30] audio recordings, medical records, and other things. So if you look at what customers are doing with that, it’s a wide variety of interaction for all the industries that you can think of.

I’ll give you a few examples. Morningstar uses a machine learning model that incorporates the decision-making process and past rating decisions of their analysts, and they use that model to rate funds not covered by their analysts. What they found is [00:21:00] that with the Morningstar ML model they were able to rate funds quicker and more accurately: six times more funds than human analysts. It was more efficient and in some cases it was more accurate.

NuData, which is now part of MasterCard, uses historical data to analyze fraud patterns and find and predict fraud before it happens. For NuData, they had this growing area of fraud in the takeover [00:21:30] of accounts. So NuData was able to take a look at the historical data in their data lake to see how this was done, and then develop a model that was able to block almost 100% of fraudulent traffic passively. Another example: GE Healthcare is using ML that was trained on historical data of radiology images in S3 to figure out which patients should get attention first, based on their radiology results.

You think about that and you think about the fact [00:22:00] that customers store terabytes or hundreds of petabytes or exabytes. When we take that view of the customer and we sit down and we build S3, we go down to the guts of S3 and we think about how we can constantly reinvent what S3 is, to make it the best data store for all of those different varieties of use cases, as well as the original one [00:22:30] of backup and images for websites. We’re constantly thinking about how we can build not just better storage, but better storage for those customers.

An example of that is how we introduced strong consistency at the end of last year. Right now, every request you make to S3 is strongly consistent. You don’t have to make any change to your application. It doesn’t cost you anything more. There’s no change [00:23:00] to performance. It was a major change to S3’s architecture. We launched S3 in 2006 with eventual consistency. And that basically means that after an object is added to storage, in some rare cases the metadata for the object may take a little while to show up, even though the object’s there. That works just fine if you’re a human being pressing a button to refresh a webpage when your image happens not to show up once in a blue moon, but it doesn’t really work with applications.

[00:23:30] As you fast forward to now, much of the data in S3 gets accessed by applications like analytics and machine learning and data lakes, instead of people clicking on a page. Applications have a lot harder time reasoning about eventual consistency. We thought about that and we said, “Okay, we could build it as a feature,” but then we talked to all these customers, some of whose examples I’ve shared, and they said they just wanted it to be part of S3. So we went deep into the core of S3. We made changes to hundreds of microservices. [00:24:00] As you know, with building complex systems, it’s not just about changing the code. You also have to make sure it’s correct: that what we deliver as strong consistency is actually correct.

So we’re very focused on correctness in AWS storage, and we actually have mathematicians that sit in our engineering teams to help us validate correctness of protocols, or in this case, the correctness of strong consistency of requests. [00:24:30] These automated reasoning experts basically build mathematical proofs of formal validation to help us make sure that we’re capturing all the edge cases and race conditions that can show up when you have millions of customers building applications on your service. That’s the depth of how we go in under the hood for your storage, and build and innovate and constantly change what storage means because of the applications that are [00:25:00] using it.
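(The read-after-write behavior Mai-Lan describes is observable from any S3 client. A minimal sketch with boto3, using placeholder bucket and key names: with strong consistency, a GET or LIST issued immediately after a successful PUT always reflects that write.)

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "example-bucket", "logs/event-001.json"  # placeholder names

# Write an object...
s3.put_object(Bucket=bucket, Key=key, Body=b'{"event": "demo"}')

# ...and read it back immediately. With strong read-after-write
# consistency there is no window where the object exists but a
# read misses it or returns stale metadata.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
assert body == b'{"event": "demo"}'

# Listings are strongly consistent too: the new key shows up right away.
listing = s3.list_objects_v2(Bucket=bucket, Prefix="logs/")
assert key in [obj["Key"] for obj in listing["Contents"]]
```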

We also do that in terms of features and capabilities. We have a storage class called Intelligent-Tiering, which, by the way, is the default storage class used all the time by data lakes. Customers of data lakes love it because the pricing of an object is automatically adjusted based on the activity of the object within the last month. Late last year, we added a capability where you can configure the storage class so that if you have a [00:25:30] single object, whether that’s a transaction log or something like that, that hasn’t been touched in 180 days or more, it’ll automatically be archived to the deep archive tier, and you save 75% of the cost on the object. That type of innovation, again, comes back from looking at how data lakes work and all those different diverse datasets, and the fact that they all have different patterns, and then building storage classes for that.
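(For readers who want to wire up the archiving behavior Mai-Lan mentions, here is a minimal sketch with boto3; the bucket name and configuration ID are placeholders. Objects in the Intelligent-Tiering storage class that go untouched for 180 days or more are then moved to the Deep Archive Access tier automatically.)

```python
import boto3

s3 = boto3.client("s3")

# Opt an object into the Intelligent-Tiering storage class at upload time.
s3.put_object(
    Bucket="example-datalake",            # placeholder bucket
    Key="transaction-logs/2021-01.log",
    Body=b"...",
    StorageClass="INTELLIGENT_TIERING",
)

# Enable automatic archiving: objects untouched for 180+ days move
# to the Deep Archive Access tier with no further action needed.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="example-datalake",
    Id="archive-cold-objects",            # placeholder configuration ID
    IntelligentTieringConfiguration={
        "Id": "archive-cold-objects",
        "Status": "Enabled",
        "Tierings": [{"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}],
    },
)
```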

Another [00:26:00] example is Storage Lens, another thing that we launched last year. We have customers who have tens or hundreds of accounts, and they’re working with multiple S3 buckets across a dozen or more AWS regions. Customers managing those environments find it hard to figure out how their storage is growing. How is their larger organization using their storage, and how can you have a visibility lens across all of that? So we built [00:26:30] the first cloud storage analytics solution with support for AWS Organizations, so you have organization-wide visibility into your object storage, point-in-time metrics and trend lines, as well as actionable recommendations.

Now with Storage Lens, you can look at anomalies in growth, you can get recommendations for cost efficiencies, and every customer gets a default dashboard so they can access their own storage growth analytics. As you might imagine, we built it so [00:27:00] that we put all that raw data into Parquet files in your S3 bucket, so if you want to do machine learning on that same raw data, you can do that too.
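(Because the raw Storage Lens data lands as Parquet files in your own bucket, analyzing it is one read away. A minimal sketch; the export path and column names are hypothetical, so check the schema of your own export.)

```python
# Hypothetical sketch of analyzing a Storage Lens metrics export.
# Requires s3fs and pyarrow; the prefix and columns are illustrative.
import pandas as pd

df = pd.read_parquet(
    "s3://example-lens-destination/StorageLens/report/"  # placeholder path
)

# Example question: which buckets hold the most data?
top_buckets = (
    df.groupby("bucket_name")["storage_bytes"]  # hypothetical column names
      .max()
      .sort_values(ascending=False)
      .head(10)
)
print(top_buckets)
```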

Tomer Shiran:

Yeah, I think that’s very interesting, very powerful as well. In fact, internally at Dremio, we use our own engine on a variety of different data streams that come from AWS. Even the usage of our own product, we analyze it, and that gets dumped into S3. So many of these Amazon services now use S3 as that centralized store. In fact, AWS [00:27:30] Data Exchange now makes available thousands of publicly available datasets, ranging from financial services to COVID data. It’s something that our customers are benefiting from as well, in terms of data sharing and the ability to consume public datasets.

I think the kind of innovation that you’re talking about here really resonates with our customers because they’re all looking to simplify how they think about their data and just do more with that data lake. [00:28:00] We’re helping them serve BI use cases now, with technologies like Apache Iceberg coming into play. That ability to optimize cost, for example, as you said with intelligent tiering, is super valuable, because there is no such thing as cold and hot storage really. There kind of is, but you can’t really anticipate it, because there’s the right to be forgotten, and somebody submits a request to delete their information. Maybe it wasn’t touched for the last two years, but now it has to be touched. Having all of that automated, we’ve seen that really [00:28:30] resonate with our customers.

You touched on a few examples. It’s been really fascinating to hear about the diversity of use cases that you have with AWS data lakes, ranging from Moderna creating vaccines to how Domino’s is delivering faster pizzas. Actually, this afternoon we have JP Morgan Chase partnering with AWS and Dremio to create their own cloud data lake and power analytics. So [00:29:00] all sorts of different things here. This has been really, really fascinating. Maybe one last thing before we wrap up, since we’re running short on time here: if there was one thing you could leave the audience with, what would that be?

Mai-Lan Tomsen Bukovec:

It all starts with putting that first dataset in the cloud. You don’t want to wait to do that. A data lake is built one dataset at a time. The [00:29:30] thing I think is remarkable is, I’ve talked to a lot of customers and I’ve said how exciting it is to be part of their journey. And one takeaway that I have from a lot of these interactions is that once customers take that first step of putting the first dataset in the cloud, there’s this momentum that occurs. And they put the second and the third and the fourth, and often it’s the first time that [00:30:00] our customers have seen their different data in one place. I’m sure you’ve seen this with your Dremio customers, where once they get the data in one place and they figure out how to unlock those different insights, it is a spark of innovation. It is the beginning of the ideas that will start flowing, because it is the foundation for being able to connect the datasets and use them in a business application for the mission of the business.

[00:30:30] My strong recommendation, whether you’re starting a data lake or expanding a data lake, is to get the data into the cloud first. Even if you don’t think you’re going to use it today, you will. And whether you use it in six months, or you use it in two years for a new machine learning model or an AI app that you’re building, that data will be there for you at your fingertips whenever you need it. That [00:31:00] nucleus of data is the future of your decision making, so get the data going now.
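(In the spirit of that closing advice, getting a first dataset into the cloud can be a few lines of code. A minimal sketch with boto3; the bucket name, region, and file are placeholders.)

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Create a bucket for the data lake. Bucket names are globally
# unique, so pick your own (this one is a placeholder).
s3.create_bucket(
    Bucket="example-first-datalake",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Upload the first dataset. Analytics engines, ML, and BI can all
# build on top of this same object later.
s3.upload_file("customers.csv", "example-first-datalake", "raw/customers.csv")
```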

Tomer Shiran:

All right. Start with the data. Couldn’t agree more with that. Mai-Lan, thank you so much for joining us. Everybody, wherever you might be, please give Mai-Lan a virtual round of applause.

Mai-Lan Tomsen Bukovec:

Thank you.