Extracting Insights through Data Control and an Open Data Lake Architecture

Session Abstract

Join Jurgen Willis (VP, Product Management, Microsoft Azure Storage) and Tomer Shiran (Founder & CPO, Dremio) as they discuss how Microsoft Azure enables organizations to make more impactful data-driven decisions. From Azure Data Lake Storage to Dynamics 365, Jurgen and Tomer will explore how Microsoft empowers customers to leverage their data for meaningful insights while preserving the flexibility of an open data architecture.

Video Transcript

Speaker 1:    Ladies and gentlemen, please welcome back to the stage Dremio co-founder and CPO Tomer Shiran.

Tomer Shiran:    All right, well, thank you everyone for joining us this morning or afternoon, wherever you might be in the world. Today I’m super excited to have with me Jurgen Willis, who is the VP of Product Management for Azure Storage at Microsoft. Jurgen, thanks for joining.

Jurgen Willis:    Yeah, thank you for having me, Tomer.

Tomer Shiran:    [00:00:30] So maybe, just to kick it off, we’d love to hear more about your role at Microsoft, and also what it’s like being responsible for such a foundational part of Azure.

Jurgen Willis:    Yeah. So first off, my role is I lead product management for our object and data lake services, our Azure NetApp Files service, as well as the work we do with our ecosystem of partners on top of all of Azure Storage. And in fact, [00:01:00] as you might anticipate, Azure Storage is the most widely used Azure service. Everybody has durable data needs, so it’s a massively used service, both internally and externally. I think the latest stats were something like 500 trillion transactions per month running against our Azure Storage platform. If we look at an hour-long [00:01:30] session here in the conference, I think that’s on the order of 700 billion transactions. So it’s a very high-scale service. Obviously, having customers entrust their data with us is a responsibility we take very seriously, and customers are running their mission-critical apps. We have first-line responder applications running on the platform. [00:02:00] We have FedEx distributing vaccines using analytics capabilities on the platform. So we take the responsibility pretty seriously.
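
As a quick sanity check on those figures, the per-hour number follows directly from the per-month number, assuming a roughly 30-day month:

\[
\frac{5\times10^{14}\ \text{transactions/month}}{30\ \text{days}\times24\ \text{hours/day}}\approx 6.9\times10^{11}\approx 700\text{ billion transactions per hour}
\]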

Tomer Shiran:    That’s incredible, incredible scale, and billions of transactions, I guess, while we’re talking here. So that’s a lot going on. And you know, this is a cloud data lake conference, so everybody here is interested in data lakes. One of your services is called ADLS, Azure [00:02:30] Data Lake Storage. I’m curious if you can share a little bit of the history of where that came from.

Jurgen Willis:    Yeah. So the genesis of our analytics capabilities in ADLS was really some work that started over 12 years ago as internal work, providing an analytics platform for all of Microsoft to run on, essentially. That includes services like Bing and Office 365 and Xbox and Dynamics and Azure [00:03:00] itself. An internal analytics service that is used by all of those properties to really run the business. And of course, along the way, more and more conversations we had with our customers outside of Microsoft, who needed very similar capabilities in terms of limitless scale, in terms of hierarchical namespace capabilities, authorization mechanisms, and so on. And so really we took all of the run-the-business learning [00:03:30] from the internal work, together with all the customer conversations we were having in terms of what our external customers needed, to bring to life what we wanted to expose externally as an analytics storage platform.

Tomer Shiran:    Interesting. And I know, at least from our integration at Dremio, and we deal with a lot of companies on Azure and some of the world’s largest data lakes, that our integration is very similar from an Azure Storage [00:04:00] and ADLS standpoint. Right? There’s a lot of commonality there. What is the relationship between ADLS and blob storage?

Jurgen Willis:    Yeah. So that is the second part of our origin story, if you will. We also have been building out our Azure Blob, or object, storage service that we provide as a core part of the Azure Storage platform overall, building out the scale and the performance and the feature [00:04:30] sets and some new innovation on top of that service, and so on. And the way we brought ADLS and analytics capabilities to market is by building, in fact, on that object storage platform. So you can think about ADLS as essentially being a layer on top of, and well integrated with, the core blob storage capabilities, one that provides capabilities like hierarchical namespace access, ACLs, query [00:05:00] acceleration, data format awareness, and those sorts of things that are very valuable to our analytics customers. But it leverages all of the scale work, all of the data protection and redundancy, all of the programmability, everything that we invested in the blob storage platform.
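
To make the hierarchical namespace and ACL capabilities concrete, here is a minimal sketch using the azure-storage-file-datalake Python package; the account, container, and directory names are hypothetical:

```python
from azure.storage.filedatalake import DataLakeServiceClient

# ADLS Gen2 exposes a filesystem-style API (the "dfs" endpoint) layered
# on top of the same account that serves blob storage.
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",  # or an Azure AD credential
)

filesystem = service.get_file_system_client("sales")  # a container
directory = filesystem.create_directory("raw/pos")    # a real directory, not a name prefix

# POSIX-style ACL on the directory: owner rwx, group r-x, others no access.
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```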

And just to add to that a little bit, to give a little insight into one of the strategy elements we have: we want to add a lot of customer value by eliminating what historically, particularly [00:05:30] in the cloud, has been data silos that customers created. And that’ll probably be a theme through our discussion here. We observed that, from the cloud storage standpoint, customers were having to make choices in terms of, oh, is this file data, is this object flat-namespace data, is this analytics data, and make choices that really siloed that data. We’re really trying to break down those silos and provide a mechanism where customers can land their data once and use that data [00:06:00] over multiple different protocols, whether it’s flat namespace, hierarchical namespace, support over HDFS. Most recently we shipped support for the NFS v3 protocol. Customers can interoperably access the data in the way they want to, through the data life cycle, for all of their workloads, and not have these silos, not have data duplication and so on. And so this is really just one representation, one example of that strategy.
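
That “land the data once, use it over multiple protocols” point can be sketched in code: the same object read through both the flat-namespace Blob endpoint and the hierarchical DFS endpoint (the one the Hadoop/Spark ABFS driver talks to). NFS v3 access happens at the OS mount level, so it isn’t shown. The packages are azure-storage-blob and azure-storage-file-datalake; the account, container, and path are hypothetical:

```python
from azure.storage.blob import BlobServiceClient
from azure.storage.filedatalake import DataLakeServiceClient

key = "<account-key>"

# Protocol 1: flat-namespace object access via the Blob endpoint.
blob = BlobServiceClient("https://<account>.blob.core.windows.net", credential=key)
via_blob = blob.get_blob_client("sales", "raw/pos/2021-06.csv").download_blob().readall()

# Protocol 2: hierarchical-namespace access via the DFS endpoint.
dfs = DataLakeServiceClient("https://<account>.dfs.core.windows.net", credential=key)
via_dfs = (dfs.get_file_system_client("sales")
              .get_file_client("raw/pos/2021-06.csv")
              .download_file()
              .readall())

assert via_blob == via_dfs  # one copy of the data, no silo
```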

Tomer Shiran:    Yeah. It’s interesting, [00:06:30] having all these different ways of accessing the data. That plays into one of the reasons that people choose these open architectures: the ability to use different engines, different services. When you have different protocols, of course, I can run one app that uses an NFS mount to access the data, and another app might be running a big Spark job or a SQL query in Dremio or something like that. But beyond just the protocols, the data silos, of course, are something we’re very focused on helping companies eliminate. I know you’re very [00:07:00] focused on that as well. You obviously talk to a lot of customers. What are some of the reasons that cause them to end up in this world of silos?

Jurgen Willis:    Yeah. There are a number of factors. One contributing factor is segmentation of the data based on data type; whether the data is unstructured or semi-structured or structured has, as you know, been another cause of that data being siloed. And [00:07:30] so that’s another thing that we’re working on, really being able to capture it all within ADLS: unstructured data, and, using metadata and schema-awareness capabilities, semi-structured and structured data can also live in a common data lake, where datasets can rendezvous in, for example, a single ADLS account that can represent different kinds of data types, so that doesn’t become another form of data silo. [00:08:00] Probably another key factor that we’ve seen is that a lot of times ISV applications will write data in proprietary formats, maybe to their internal data stores. Even if they are writing to an external store, writing in proprietary formats automatically creates another sort of silo, because the customer can really only use that data in ways directly exposed by the ISV’s application. I would say that’s another form of how silos form.

Tomer Shiran:    Interesting. You know, I was actually [00:08:30] on the drive into the office today for this session, and we are back in the office now, at least somewhat, and I was thinking about how storage has changed so much over the last decade. Ten years ago, storage was really tied to one application, right? You had some kind of filer in your data center, and it was tied to some applications in that area. We didn’t have this concept of global storage services. Azure Storage, being in dozens of different regions [00:09:00] around the world, can be accessed with one API call regardless of what region it’s in. It just enables a new kind of architecture that really wasn’t possible with storage from a decade ago.

Jurgen Willis:    Yeah, no, that’s right. And I think this growing notion of, I’ll say, enterprise data lakes, and customers bringing all of their data sets together and being able to do the interesting correlations, is now allowing for much deeper insights that they can use to really drive [00:09:30] their businesses. It reduces some of the data duplication, reduces data management challenges, reduces all the challenges associated with what is the data of record, if you will, by virtue of replications and imports and exports and so on. So I think this notion of bringing the data together is both reducing some of those data management challenges and really driving a lot of business value for our customers.

Tomer Shiran:    [00:10:00] And I know also your team and other teams at Microsoft, including some of the application teams, are really looking at that data lake storage as a key part of their strategy around where data is located and how to break down these silos. Now, I know you’ve prepared an interesting demo here to show what that actually looks like.

Jurgen Willis:    Yeah. I think we have a demo here where we’ll see multiple different source [00:10:30] systems bringing in and accessing data against a single ADLS account, as an example of breaking down these silos and enabling new kinds of insights. So I’d like to invite Jason Seuss in from our product management team to walk us through the demo.

Jason Seuss:    Thank you, Jurgen.

Tomer Shiran:    Did we lose Jason? Is Jason there?

Jason Seuss:    Can you guys not hear me?

Jurgen Willis:    [00:11:00] I can hear you, Jason.

Jason Seuss:    Oh, okay. All right. Well, let me start again. In the sample scenario we’re going to show in the demo today, we’ll be working with some data from a fictitious coffee company, and they have two objectives they want to accomplish with their data. Number one, they have a bunch of disjointed customer data from many different sources, which they want to prep into one cohesive data set. Then, using the curated [00:11:30] data, they want to extract insights they can use to better understand those customers and feed direct marketing campaigns: things like customer segmentation and predicted-churn models. They also have some hypotheses about how weather data impacts their sales. So once they have their customer data in a good state, they want to connect to some weather data so they can explore those relationships. To get started, our coffee company has [00:12:00] created an ADLS account and has loaded some customer data into it, which we can see here within this container.

So they have a number of things like their customer contacts, which probably came from a CRM system. They have some point-of-sale data that’s coming from their brick-and-mortar stores. They have some online orders, survey results, customer service calls, et cetera. [00:12:30] Lots of great data that can fuel the insights they need, but right now it sits in many different islands of data. There’s quite a bit of data engineering work that needs to be done to first scrub, enrich, and join it together, then data science and data analysis work to extract the insights they need to drive their business.
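
For readers following along at home, landing extracts like these in an ADLS container takes only a few lines; a minimal sketch, with a hypothetical account name and file list:

```python
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://coffeelake.dfs.core.windows.net",  # hypothetical account
    credential="<account-key>",
)
container = service.get_file_system_client("customerdata")

# Upload each source extract as-is; downstream engines read it in place.
for name in ["contacts.csv", "pos_transactions.csv", "online_orders.csv"]:
    with open(name, "rb") as source:
        container.get_file_client(f"raw/{name}").upload_data(source, overwrite=True)
```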

Those tasks could be accomplished in a bunch of different ways. For instance, [00:13:00] they could build some Spark jobs to do that data preparation work. However, our coffee company has an extremely small IT group, so that’s really not possible for them. Instead, they’re going to use a managed SaaS application from Microsoft’s Dynamics suite called Customer Insights, or just CI for short. CI is highly optimized for customer data and has many low- or no-code capabilities to help [00:13:30] them prepare and enrich that specific data set. It also provides a number of out-of-the-box AI models and reports for extracting insights. In other words, CI is the analytics engine our coffee company is going to use specifically on the customer data portion of their lake.

Tomer Shiran:    And so I have a question about this before you keep going here for a second. Is this creating another kind of silo by [00:14:00] creating this data in the data lake?

Jurgen Willis:    Yeah, I think we’ll see here, Tomer, that in fact it’s not. Dynamics Customer Insights is going to be able to directly use, consume, and write back to this account. So it isn’t, again, sort of an import/export, a copy of data with the costs associated with that and the data management and so on. It is really the first step in taking a [00:14:30] much more open data architecture approach, where Customer Insights can be directly using and contributing back to the dataset.

Tomer Shiran:    Okay. That’s interesting. Keep going, Jason. Sorry.

Jason Seuss:    All right. So to get started with CI, the first thing that we need to do, again because we’re working with data that our coffee company has already loaded into their data lake, is essentially point it at that data. [00:15:00] So here you can see that customer support CSV file that we were just taking a look at in the data lake, and the work that we’ve done to attach to that CSV file and teach Customer Insights about that data. The next thing we need to do is teach it about some of the relationships between the various data sets. [00:15:30] So here we’ll take a look at the relationship between online orders and our customer contacts. That’s obviously a one-to-many relationship that we’ll teach Customer Insights about.
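
CI captures that relationship through its UI with no code. Purely as an illustration of what declaring a one-to-many relationship means, here is the equivalent check in pandas; the file and column names are hypothetical:

```python
import pandas as pd

contacts = pd.read_csv("contacts.csv")       # one row per customer
orders = pd.read_csv("online_orders.csv")    # many rows per customer

# validate="many_to_one" raises if customer_id is not unique in contacts,
# i.e. it enforces exactly the one-to-many shape being declared in CI.
enriched = orders.merge(contacts, on="customer_id", how="left",
                        validate="many_to_one")
```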

So now that we’ve pointed CI towards all of the coffee company’s data in their ADLS account, it can do its job. What it’s going to do is clean that data, enrich it with some other sources [00:16:00] if we’ve pointed it at any additional source systems, and then lastly join it all together into that cohesive data set that we need in order to extract insights. Since that can take several minutes to complete, we’ve done a bit of a Julia Child here so that we can show you some of the output of that work, which we’ll pop over to now. So using the curated data set, CI provides many out-of-the-box [00:16:30] customer insights. For example, it has built-in AI models for making predictions like customer churn, which we can see here.

So it’s showing us the customers that are likely to churn, as well as some of the attributes that, based on the modeling it’s done, are highly indicative of customers likely to churn: things like the number of products that they purchased, the frequency of their transactions, and the overall value of those [00:17:00] transactions. CI also has some AI models built in for making predictions about segments. So for instance, if I go over here and take a look at these segments that it’s produced, there are some pretty valuable insights. Our coffee company will definitely want to follow up with some direct marketing or some direct outreach to these 20 high-value [00:17:30] customers that it predicts are also highly likely to churn away from us.
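
CI’s churn model is built in, and its internals aren’t shown in the demo. Purely as a sketch of the general technique, a toy churn classifier over the same three attributes Jason calls out (product count, transaction frequency, transaction value) might look like this, using scikit-learn and synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the unified customer table.
rng = np.random.default_rng(42)
n = 1_000
customers = pd.DataFrame({
    "num_products": rng.integers(1, 20, n),
    "txn_frequency": rng.integers(1, 50, n),  # transactions per quarter
    "txn_value": rng.uniform(5, 500, n),      # average spend per transaction
})
# Toy label: infrequent, low-value customers churn more often.
churn_prob = 1 / (1 + np.exp(0.1 * customers["txn_frequency"]
                             + 0.005 * customers["txn_value"] - 3))
customers["churned"] = rng.random(n) < churn_prob

X_train, X_test, y_train, y_test = train_test_split(
    customers[["num_products", "txn_frequency", "txn_value"]],
    customers["churned"], random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
likelihood = model.predict_proba(X_test)[:, 1]  # churn likelihood per customer
```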

So there are a lot of great capabilities in CI for helping our coffee company gain key insights about their customers. But the key point for our discussion is that all of the work it does to prepare that data and extract insights is written back into the customer’s ADLS account in open formats.

So if we move back over [00:18:00] into Storage Explorer and we go to the Customer Insights folder, or excuse me, container, we can take a look at all of the data that it output back into the ADLS account. We can see the consolidated data set that it created through all the work it did joining the data. We can see some of the measures and the models that it created. We can see the segments that it created as well, including the one that [00:18:30] we were just taking a look at, high-churn customers. And so the important part here is that obviously it’s written all this data back into the data lake, but very crucially, it’s written it all back into the data lake in open formats like CSV and Parquet.
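
Because the output lands in open formats, any Parquet-aware tool can read it directly, with no export step. A minimal sketch using PyArrow with the adlfs fsspec filesystem; the account, container, and folder names are hypothetical:

```python
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name="coffeelake",  # hypothetical account
                         account_key="<account-key>")

# Read CI's written-back segment data straight out of the lake.
segments = ds.dataset("customerinsights/Segments",
                      filesystem=fs, format="parquet").to_table()
print(segments.schema)
```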

Tomer Shiran:    I think it’s worth pausing here for a second, because I think what we’re seeing now may represent the future, right? Like something that people aren’t necessarily used to seeing, [00:19:00] which is their SaaS applications working like this, outputting files directly, reading files directly. So maybe you can highlight why this is so significant.

Jurgen Willis:    Yeah, no, that’s exactly right, Tomer. You’re seeing here that the data isn’t locked inside the Customer Insights application. This is a general-purpose storage tool that we’re using here to look at the customer’s data account, their ADLS account. [00:19:30] This data is now available to them to use in the ways that they would like, to do more interesting things beyond just what they can do within the Customer Insights application.

Tomer Shiran:    That’s cool. So you’re basically taking advantage of open file formats, open table formats, having this globally accessible, infinitely scalable storage as the native store for this application, which then allows you to do other things with that same data.

Jurgen Willis:    [00:20:00] And leverage all of those best-of-breed tools available to them, great tools like Dremio, to take advantage of the data also.

Tomer Shiran:    Cool, cool.

Jason Seuss:    So I mentioned at the beginning that our coffee company has some hypotheses about the relationship between the weather and their sales. Now that CI has given them a much better understanding of their customers, including many [00:20:30] of their customers’ attributes, and all that curated data is now in their data lake, they can join it to other data to explore those relationships. Our coffee company’s data engineering team has loaded weather data from NOAA, which we can now see in this container here, and to make the join between the customer data and the weather data, our coffee company is indeed going to use Dremio. So what we’ve done here is we’ve [00:21:00] attached Dremio to the same ADLS account we’ve been using for Customer Insights, and within Dremio’s semantic layer we built up some tables that we’ll now use to make that join. So if I go in here now, I can see all the weather data from that CSV file that we were just looking at within the data lake.

So here we have information about weather from all the major cities in the United States: the dates, the [00:21:30] temperature on each date, precipitation, et cetera. And we also have the data from Customer Insights. In particular, we produced a table that has a row for each transaction, which customer made that transaction, and some key attributes about them: their age, their gender, all that kind of information. So what we’ll do now is go ahead and join this [00:22:00] data from Customer Insights to that weather data. I’m going to make a custom join here, and we’ll grab that weather data that we were just taking a look at. I’m going to join based on the date the purchase was made, and I’m going to join based on the city in which the purchase was made as well. So I go ahead and apply that join. Again, we can see all of [00:22:30] the transactions and the attributes of the customers that were making the transactions.
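
The same custom join can also be expressed as SQL and run against Dremio programmatically, following the pattern in Dremio’s published Arrow Flight client examples; the endpoint, credentials, and table paths here are hypothetical:

```python
from pyarrow import flight

client = flight.FlightClient("grpc+tls://<dremio-host>:32010")
token = client.authenticate_basic_token("<user>", "<password>")
options = flight.FlightCallOptions(headers=[token])

# Join each transaction to the weather on its purchase date and city.
sql = """
SELECT t.*, w.temperature, w.precipitation
FROM coffeelake.ci.transactions AS t
LEFT JOIN coffeelake.noaa.weather AS w
  ON t.purchase_date = w."date" AND t.city = w.city
"""

info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
reader = client.do_get(info.endpoints[0].ticket, options)
joined = reader.read_pandas()  # transactions with weather appended to each row
```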

But now we have information appended to each one of these rows about the weather that was going on outside when that transaction was made. And by using Dremio here to make this join, our coffee company will be able to get the kind of interactive query performance they need in order to [00:23:00] build a dashboard where their data analysts can really explore this data, look for those relationships, and glean some valuable insights. So what we’ve done here is we’ve prepared a Power BI dashboard based on the data that we saw in that join. And there are some interesting things that pop out based on sales of our cold drinks, iced coffees and things like that. You [00:23:30] would typically think that folks are drinking hot coffee when it’s cold and drinking cold coffee when it’s hot, but actually the data doesn’t really show that. It shows that our cold drink sales are pretty consistent up until the point where it gets pretty hot outside, up around a hundred degrees.
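
The dashboard cuts Jason describes are simple aggregations once the join is in hand. A sketch in pandas, continuing from the joined DataFrame in the previous snippet, with hypothetical column names; swapping in a gender column reproduces the other cuts:

```python
import pandas as pd

# `joined` is the transactions-with-weather table from the previous sketch.
joined["temp_band"] = pd.cut(joined["temperature"],
                             bins=[0, 40, 60, 80, 100, 120])

cold = joined[joined["product_category"] == "cold drink"]
cold_by_age = (cold.groupby(["temp_band", "age_band"], observed=True)
                   .size()
                   .unstack("age_band"))
print(cold_by_age)  # e.g. the 50-59 band spiking in the hottest bucket
```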

That’s one interesting insight. Between males and females, it stays pretty consistent, but there’s one interesting data point that really pops out here, and [00:24:00] that’s that folks of the very fine vintage of 50 to 59 are overwhelmingly coming in for those cold drinks when it gets hot outside. I guess those of us in that age category are particularly looking for refreshments that’ll cool us down when it gets hot. And then if we go down and look at the products they’re purchasing with different types of precipitation, you can see that, interestingly enough, and [00:24:30] I guess we could probably have predicted this, when it’s snowing outside people are buying a lot more hot chocolate than normal. But probably not as obvious is the fact that they’re pairing blueberry bagels with that hot chocolate. So when we see snow in the forecast, we can probably start doing more as far as the signage in our store, or what have you, to point people towards those preferred products when the snow’s flying. So that’s it for me, Jurgen. I’ll turn it back over to you.

Jurgen Willis:    [00:25:00] Great. Thank you, Jason, for walking us through that. So in short order, I think we saw a hopefully interesting example of this open data architecture approach: allowing customers to rendezvous their different data sets to do these interesting correlations, get more insights, and leverage multiple tools, Customer Insights for part of what they were doing, Dremio for additional, deeper [00:25:30] insights across the data set. Not having these data silos just lights up a lot more capability for our customers. And with all the work that we’re doing in ADLS from a scale standpoint, from a data management standpoint, being able to do these kinds of things at scale, scale from a data size standpoint, scale from a business standpoint, it just opens up many new opportunities for our customers.

Tomer Shiran:    That was fascinating. [00:26:00] I think the other thing we talked about yesterday morning is really how data lakes are becoming easier and easier. And one of the things that’s making it easier is the transition from just engines to really services, right? Different SaaS applications or other kinds of processing engines. And just looking at this demo here, in 10 minutes we’ve seen all these different things being done on a data lake at scale: Power BI dashboards, insights, [00:26:30] ingesting data, cleaning it up, correlations. I mean, that’s pretty cool. I think that really shows where this is going and what’s going to be possible for companies out there in all industries, all sizes. It’s mind-boggling what’s now possible with cloud data lakes.

Jurgen Willis:    Yeah, absolutely. And I think there’s going to be a real flywheel here. As we have these open architectures and customers are seeing the value of that, we’re kind of [00:27:00] building the ecosystem, if you will, in terms of like-minded applications and engines that, again, can just open up all kinds of power and opportunity for customers.

Tomer Shiran:    This is great stuff. I want to thank you, Jurgen, and you, Jason, for joining me today, and also for the great partnership we have with Microsoft, working together. Before we wrap up, any [00:27:30] pointers for folks that are joining us today, or listening to the recording, on where they can learn more?

Jurgen Willis:    Yeah. First, thanks, Tomer, for having us. Appreciate the partnership that we have, and appreciate everyone in the audience as well for attending the session. In terms of pointers, we do have a virtual Azure booth, and I’d encourage people to stop by. We’ll have members of the engineering team there, available to answer any questions you have, as well as access to a free Azure [00:28:00] trial to get started on this.

Tomer Shiran:    All right. Well, thank you. Thank you again. And please join me, everyone in the audience for a virtual round of applause for Jurgen and Jason. Thank you very much.

Jurgen Willis:    Thanks everyone.