Creating a Cloud Data Lake for a $1 Trillion Organization

Transcript

Robert Maybin:

All right. Hello everyone and welcome. We've got a really exciting webinar for you today. We're going to be talking about creating a cloud data lake for a trillion-dollar organization. Let me start off with some quick introductions. I'm Robert Maybin, Director of Professional Services here at Dremio. I'm responsible for a team of field engineers focused on getting our customers successful with Dremio quickly. With us today we have a couple of really interesting and exciting speakers and presenters.

Robert Maybin:

We've got Donghwa. Donghwa Kim. He is the Director of Application Engineering at NewWave. NewWave is one of the premier technology partners for the Centers for Medicare and Medicaid Services, known as CMS. He's also currently the lead technical architect implementing next generation enterprise data management platforms using cutting edge technologies, which include Dremio, Azure, Databricks, Looker and Snowflake. He's got over 19 years of IT experience with IBM, Lehman Brothers, JP Morgan Chase and FINRA. He also oversees NewWave's data science and data engineering practices. He's going to be taking us through a use case and implementation of using Dremio as a data lake engine for a customer.

Robert Maybin:

Joining us as well, we have Jeff King. He's a Senior Program Manager at Microsoft on the Azure Storage team and owns the big data and analytics partner ecosystem for Azure Storage. He's spent the last nine months or so onboarding over 20 ISVs onto ADLS Gen2, and he's currently working on a bespoke program to bring the best tooling, cloud platform, and SI expertise together to create a winning data lake migration experience for those customers who are poised to embark on their hybrid or cloud journey. So, very excited to have both Donghwa Kim and Jeff with us today.

Robert Maybin:

Let me begin. First, let's talk about how you can ask questions. You'll notice at the bottom of your Zoom interface there are three little icons. One is the chat icon, one is raise your hand, and the third one is Q&A. Just to keep things orderly and flowing well, we'd prefer that you ask your questions in the Q&A area. Let's leave those other two options untouched for today. That'll let us keep a smooth flow when we take in your questions and when we go to respond to them.

Robert Maybin:

One other point to note: if we don't get to your question today, we'll follow up. We'll have someone reach out to you with answers to all your questions. So feel free to ask, and if we miss your question, no problem. Even if we run out of time, rest assured we'll get back to you with an answer.

Robert Maybin:

Let me begin by quickly introducing Dremio as a company and telling you a little bit about us. We were founded back in 2015, a little over four years ago now. We were in stealth for a couple of years, and we came out of stealth and released Dremio 1.0 in 2017. Dremio is really focused on being the data lake engine, and we'll talk a bit more about what that means; I think when you see the presentations from Donghwa and from Jeff, you'll really understand what we mean by that. I'll introduce it briefly and we'll save the rest for later in the presentation.

Robert Maybin:

The company's headquartered in Santa Clara, California, and we've got companies using Dremio that cross a broad spectrum of industries, verticals, and sizes. We've got some really big customers, some in the Fortune 100, that are using Dremio for their data lake initiatives. A few notable ones are NCR in Atlanta, New York Life, Microsoft, and TransUnion, and then we also have some great customers in EMEA as well. We've got DB Cargo in Germany. We've got UBS. So we're really crossing a lot of geographies and a lot of verticals with the product.

Robert Maybin:

So, really the problem that Dremio is focused on solving is: how do we make data that's in a data lake queryable directly by analysts and data scientists? How do we get the data out of the data lake, ideally directly? A lot of our customers, and I think a lot of data lake initiatives generally, are bringing up data lakes in cloud storage such as Azure ADLS. They may have data lakes that they have built out in Hadoop. They may have both. The reality is that this data is really difficult to query. Typically you have to go through some machinations to get the data out of the lake. There are solutions to do this; if you try to query directly on something like Hadoop, you would use Hive or some solution like that.

Robert Maybin:

Think about all of the components that typically have to be put between data that's sitting out in, say, a cloud store and the actual end users of that data. They might be using Tableau or Power BI, or maybe on the data science side they're using notebooks or even just writing SQL queries. Typically there's a lot of movement that has to happen, and a lot of components that introduce a lot of complexity into the picture, when you think about having to load data into either a data model or a warehouse. So data is in the lake, maybe in CSV, JSON, or Parquet format, and typically there are a lot of components in the middle where we've got to load and transform the data into the warehouse.

Robert Maybin:

For further acceleration, maybe we also have to create cubes on top of that, or create extracts in something like Tableau or Power BI to make that workload actually be interactive from a dashboarding or analysis perspective. What we do here at Dremio, as the data lake engine, is give you the ability to sit directly on top of data lake storage and get lightning fast queries directly against your data lake. So if you have files on ADLS, or files in some other cloud or back end data lake like a Hadoop cluster, Dremio sits right on top and gives you lightning fast queries on that data lake.

Robert Maybin:

We give you the ability to build a self-service semantic layer that's really not so much like an ETL tool or the cube-building tools you'd be used to. We allow you to go directly at that data, and then, through a process of building up what we call virtual datasets that can refer to one another, you can create semantic meaning from your data and publish that to your end users very easily. There's no need to load the data. We can go after the data directly where it lives, so you don't have to make copies of the data, load them, and go through all of that pain to get the data into the tools you prefer to consume it in.

Robert Maybin:

One of the important things to note is that we use open file formats, so there's really no vendor lock-in. You're not loading your data into Dremio; Dremio is actually using the data where it lives, and that can be in open formats like Parquet and even CSV. So there is no vendor lock-in here, and when your data is already in your data lake or your cloud store, you can go after it with other tools. You could also run Spark jobs against the data, or any of these other tools that you might be using against your cloud store.

Robert Maybin:

The last piece that's really important is that in addition to being able to query your data lake directly, we allow you to join other relational and non-relational sources into the queries and the semantic layer that you're building inside Dremio. So if you have legacy relational data warehouses, other relational stores, or NoSQL stores, Dremio has connectors for those data sources that allow you to write queries spanning both your lake and those traditional relational sources, seamlessly and transparently to the user.

Robert Maybin:

I think that with that quick overview what I'd like to do is hand it over to Jeff King, who will take you through some of the features and the benefits of the Azure platform.

Jeff King:

Hey, thanks Robert. Good morning and good evening everyone. Yeah, really happy to be on this webinar and really excited about working with Dremio and NewWave. I've been working with both of them along some very interesting scenarios and customer journeys, and I'm really excited about what we're going to be doing in the next 12 to 18 months.

Jeff King:

Let me just dive into the concept of a modern data estate. Let's start thinking about this a little bit more. It's almost a modernizing data estate, in the sense that it's a work in progress. The industry is at this precipice where cloud has become mainstream, or specifically the cloud native capabilities for big data and analytics have become more mainstream. Given some of the industry signals, with some of the big names and the consolidation of various Hadoop based platforms, that is also becoming a catalyst for what we're seeing as a pretty sizeable wave on the near horizon of enterprises looking at cloud in the next six to 12 months and saying, "Hey, Microsoft, how do you help me get into the cloud and really take advantage of everything that you have to offer?"

Jeff King:

So we have that conversation with each customer, and each customer's journey is their own; there are similarities, but each is also unique. Nevertheless, this notion of a modern data estate is one where you have existing on-prem, perhaps even traditional, resources integrating directly with modernized cloud native services, whether to give the business a competitive advantage or to reduce cost. There are many reasons why someone would pick cloud, but the architecture and the mindset is: okay, we're going to have multiple data sources performing crosscutting capabilities, serving similar and disparate business organizations, and we just need to make sure we wrangle and make sense of it all.

Jeff King:

Let's actually talk about the data lake. Go to the next slide please. We're going to dive into this a little bit more; this is actually why it's really exciting for Dremio. Can we go to the next slide please? What makes a great data lake? Real quick: it's scalable. It's cheap. It's performant, in the sense that it's purpose built for analytics workloads. It provides enterprise level security with granularity all the way down to the file or folder level. And yes, I'm intentionally saying file and folder, not object like you would otherwise think of with a Blob store. Those are object storage. You'll see that ADLS Gen2 is not really object storage; a great data lake is supposed to understand files, file system semantics, POSIX permissions, ACLs and so on.

Jeff King:

And then cutting across all of that is data governance. Any successful data lake implementation in an organization has a whole bunch of data: some of it unstructured and raw, and some of it processed and ready to be ingested into your data warehouse of choice. Nevertheless, you're going to need some visibility into the life cycle of that data for various reasons, most of them operational or security in nature, and you're going to want some sort of taxonomy so you can organize the data and make it accessible. So you'll have some data governance strategy and various tooling to support that strategy cutting across these capabilities.

Jeff King:

All right, next slide. Let's talk about ADLS Gen2. Yeah, it scales like the Dickens. There is no amount of data that we will not support; I will take personal bets on that. It is secure because it is backed by Azure Active Directory, and as we think about that modern data estate, the vast majority of customers today use Active Directory as their on-prem LDAP solution. So guess what? Azure Active Directory has a synchronization capability called AAD Connect that allows you to synchronize and keep all of your identities and that whole security layer intact. You're going to need that as you enter into hybrid land.

Jeff King:

So yeah, it's fast. It supports atomic, ACID transactions. It is strongly consistent, unlike some other object storage systems out there that require extra tooling, and therefore the cost and complexity that come with that, in order to get strong consistency. Those who are running Hadoop based data lakes on-prem understand that you need strong consistency and you need ACID transactions. And guess what? ADLS Gen2 supports that out of the box.

Jeff King:

We really do want data lakes, the whole notion of a data lake, to truly take root and flourish within the Azure cloud ecosystem and customer base. What do I mean by that? Democratize data access. We will not force you to have a specific engine or environment just to access your data. You should be able to access your data any way you see fit. We allow that by having different protocols and different APIs to support that.

Jeff King:

It is cheap. It's cost effective: it's the same cost as Blob storage to store that data. There is a small uptick in the transaction costs, which just covers a little bit of extra compute on our side, but if anyone's seen the Azure transaction cost pricing, you can get a lot for very, very little.

Jeff King:

All right, go ahead, next one. I'm just going to skim through some of this other stuff here. Let's talk about the architecture real quick. When you think about how to get data into and out of ADLS, you've got two APIs to choose from. You've got the native ADLS API, which we also sometimes call the DFS API. That's going to understand all of these file system semantics. That is part of Gen2, but there are some parts of your whole workload that don't need file system semantics.

Jeff King:

I'm thinking a lot about the ingestion point. If you think about all of the different data sources, wherever they're coming from, some structured, some unstructured, IoT, whatever, they just need a place to land that data. So you don't necessarily need the ADLS API for that; you can just continue to roll along with the Blob API, which the vast majority of everyone already integrates with. Gen2 sits on top of Blob storage as a service capability, which is great because now you get the rich feature set of Blob storage that has been built over a decade. You get that for free with Gen2. You get various replication strategies, whether it's locally redundant, zone-redundant, or geo-redundant. There's a whole box of Blob features; I encourage you to check them out. You get those with Gen2.

Jeff King:

Now, sitting on top of that is the hierarchical file system, and long story short, this is what gives you the file system semantics. This is what gives you file and folder capability. This is what gives you atomic transactions, even when you're doing something as benign as a rename.

Jeff King:

Next slide. Just summarizing Gen2: put all your data in one place, and get multiple ways to access it in multiple environments, both inside Hadoop and outside Hadoop. We have a very rich partner ecosystem, first and third party alike, and that commoditized cloud economic model: you pay for what you use, pure opex, no capex. You should be pretty familiar with that.

Jeff King:

Real quick, just to round this out with the ecosystem. Go to the next slide please. Next slide please. Thank you. All right. Here are a lot, but not all, of the partners that I've been working with over the past eight to 10 months to onboard. As you can imagine there are actually a bunch more Azure services, but I just don't have enough space; I really wanted to focus on the third party and ISV solutions here. We have a bunch of partners across the board, and you see Dremio over there in the bottom right.

Jeff King:

Just a little insight for those wondering how I would bucketize Dremio. Sometimes I call them a data virtualizer, but to me that seems so 2008, because they're a heck of a lot more, so right now I'm calling them a data lake wrangler. They do a lot of things, as you'll hear about in the rest of this webinar: they provide querying, governance, and so on. With Gen2, we've got a vast ecosystem and I'm continuing to onboard more and more partners. So by all means, if you've got any questions about any of these partners, or anyone else you may have questions on, just give me a ring. Find me on LinkedIn. I think that's it. Thanks.

Robert Maybin:

All right, Jeff, thank you so much. What we'd like to do now is go ahead and bring on Donghwa Kim, and have him walk us through an actual use case that he's worked on with Dremio. So with that, Donghwa.

Donghwa Kim:

Yeah. Thanks Rob. Thanks everyone for joining this webinar. Today I'm going to be talking a little bit about how we are leveraging Dremio and Microsoft Azure services, including ADLS Gen2, which we're really excited to provide as a service for one of our government health customers. As you can see on the screen, these are some of the customers that we're serving: CMS, HHS, NOAA, FEMA and CDC. As you can see, most of our customers are government-centric. Next slide, please.

Donghwa Kim:

Just to talk a little bit about NewWave: we are a mid-size business. We have about 400 employees. We hold 11 prime contracts at CMS, supporting their seven centers. I'm really proud to say that we are CMMI Level 4 for both services and development. I know that there are only a few other companies out there who have achieved that kind of status. You can see some of the contract vehicles on the screen. Next slide please.

Donghwa Kim:

These are some of our technology partners. We are working closely with, as you can imagine, Dremio and Microsoft. They have been really supportive of our initiative for the customer that I'm going to be talking about today. And then we leverage more innovative tools such as Databricks, Looker and Snowflake as well. Next slide please.

Donghwa Kim:

Our customer is unique in that they oversee many other health care programs, including those that involve health information technology; they are the ones who govern that space. There are many different programs within the agency providing health care for a broad spectrum of the US population. Next slide please. We will go over more about our customer.

Donghwa Kim:

Some of the challenges our customer is facing: they have a lot of data. They store, analyze and disseminate a large amount of data, and they need to integrate data from different types of sources, including administrative, transactional and medical records of the beneficiaries. They host information for about 100 million beneficiaries of the US population. They collect over 2 billion data points per year in just one of their programs. So as you can imagine, they have very large data they need to manage.

Donghwa Kim:

Since they are in the healthcare space, security is a must; it's not an option. Their data contains PHI and PII, and the challenge they face is providing a secure mechanism for storing patients' information while, at the same time, not losing the ability to provide better services to their constituents, the citizens, by leveraging the information locked in their data. That's a constant challenge. And because they have different centers and different organizations, having a centralized view of data that is available in multiple systems is a constant challenge for them. We will cover how we can leverage tools such as Dremio and a data lake to address this problem. Next slide please.

Donghwa Kim:

Our objective in serving our customer is to provide simple and reliable technology. I think we covered some of this: center-wide shared data services; robust data governance that provides both data agility and security at the same time; and a cloud-native architecture, so that they get the scalability and agility that cloud services can bring. Next slide please.

Donghwa Kim:

I just want to put this slide up here to reflect on what system integrators must do. We're not here just to build a system that leverages innovative ideas; at the end of the day, the solution must be usable by our customers. It must be simple, yet it must meet the requirements that our customers are facing. Next slide please.

Donghwa Kim:

This is the bird's eye view of the solution that we're providing to our customer. As you can see on the right hand side, we have a data as a service layer, and Dremio is at the center of that. Dremio connects to different data sources, including Azure Data Lake Gen2, NoSQL databases such as MongoDB, and relational databases: SQL Server, Redshift, Oracle, you name it. It then provides virtual datasets, which I'll get into soon. On top of Dremio we run data analytics and visualization, and we do heavy data modeling utilizing Looker, which is a very unique tool that's been very instrumental to us. We do the data exploration utilizing the data as a service layer, which is there to democratize the data so that business analysts and data analysts can come together, explore data from different data sources, and come up with good business value and business propositions they can use to serve the beneficiaries.

Donghwa Kim:

Once that's done, we productionize it and then create a BI analytics layer utilizing Snowflake and again, ADLS Gen2 and we're running Databricks and Looker as our analytical layer on top of it. Next slide please.

Donghwa Kim:

This is the same graphic that we saw before, so I don't think I need to cover it in detail. Next slide please. Dremio has a very unique concept called the virtual dataset. What it means is that the underlying data sources are immutable physical datasets, and on top of those physical datasets you can layer a stack of data transformations to come up with a virtual dataset. Each virtual dataset is ultimately described using a SQL query, and SQL is a very popular language. Chaining of virtual datasets is also possible, so you can build virtual datasets on top of other virtual datasets, and Dremio provides data lineage so that you can see where the data comes from and what kind of transformation has taken place.
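
To make the chaining idea concrete, here is a minimal sketch of what stacked virtual datasets can look like in Dremio SQL. The source, space, and dataset names (adls.healthdata, Staging, Curated) are hypothetical, not taken from the webinar:

    -- Layer 1: a virtual dataset that tidies up a physical CSV file in the lake
    CREATE VDS Staging.patients_clean AS
    SELECT patient_id, UPPER(state) AS state, birth_date
    FROM adls.healthdata."patients.csv";

    -- Layer 2: a virtual dataset built on top of the first virtual dataset
    CREATE VDS Curated.patients_by_state AS
    SELECT state, COUNT(*) AS patient_count
    FROM Staging.patients_clean
    GROUP BY state;

Because each layer is just a SQL definition over the one below it, lineage can be traced from Curated.patients_by_state back to the physical file without any copies being made.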

Donghwa Kim:

In addition, Dremio provides a data catalog and curation capability, so it gives you a centralized view of what kind of data is available, along with a collaborative toolset so that members from different organizations and different technical backgrounds can collaborate with each other on those datasets. Next slide please.

Donghwa Kim:

This is an example of creating a virtual dataset. As you can see, it's pure SQL, and if you look at the query, what I'm doing is joining data that lives in ADLS Gen2, the data lake, as a CSV file. I'm joining that patient information data with allergies data that's in JSON format. There is no ETL necessary to have that data available: the only thing you need to do is drop the data files into the data lake, make a connection to the data lake using Dremio, and then you can start fully utilizing SQL, which is a very powerful concept. Next slide please.
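
As a rough sketch of the kind of join being described, assuming an ADLS source named adls with the two files promoted to datasets (all names here are illustrative, not the actual demo objects):

    SELECT p.patient_id,
           p.first_name,
           p.last_name,
           a.allergy_description,
           a.severity
    FROM adls.healthdata."patients.csv"   AS p   -- patient info, CSV in the lake
    JOIN adls.healthdata."allergies.json" AS a   -- allergies, JSON in the lake
      ON p.patient_id = a.patient_id;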

Donghwa Kim:

Since we're using SQL, everything is simple. Like I mentioned, you can join data from multiple data sources, including JSON, CSV, Parquet, relational databases and NoSQL. You can aggregate data from different data sources and then create virtual datasets on top of that, as in the sketch below. Next slide please.
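
A federated query can also reach outside the lake entirely. A hedged sketch, assuming a relational source registered in Dremio as sqlserver alongside the lake (source, schema, and column names are assumptions):

    SELECT pr.provider_name,
           COUNT(*)      AS claim_count,
           SUM(c.amount) AS total_amount
    FROM adls.healthdata."claims.parquet" AS c    -- Parquet file in ADLS Gen2
    JOIN sqlserver.dbo.providers          AS pr   -- table in SQL Server
      ON c.provider_id = pr.provider_id
    GROUP BY pr.provider_name;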

Donghwa Kim:

Yeah, I think I touched upon this, so there is no need to cover it again; skip to the next slide please. What we're excited about with Dremio is that it provides security. Since we're working with health care data, security is the number one concern, but while providing security we don't want to sacrifice data agility. We want to leverage tools like Dremio to create virtual datasets pulling data from different data sources, so that we can run other tools on top of Dremio such as Apache Spark and Databricks. We can even do exploration using SAS and other types of analytical tools over JDBC.

Donghwa Kim:

When we do that, we don't necessarily want to expose sensitive information to the data scientists or business analysts who may be looking at the data. We can do data masking using Dremio's virtual dataset technology, so that we're only giving out the information that's necessary to carry out those analytical workloads. The two functions we have leveraged here are the USER and IS_MEMBER functions that Dremio provides. These functions should not be unfamiliar: Transact-SQL from SQL Server and MySQL provide similar capabilities. Can you move on to the next slide?
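
A minimal sketch of that masking pattern, using Dremio's USER and IS_MEMBER functions inside a virtual dataset definition; the user ID, group name, and mask format are illustrative stand-ins for the demo's values:

    SELECT CASE
             WHEN USER = 'superduperuser'
                  OR is_member('privileged_users')
               THEN ssn                 -- privileged users see the real value
             ELSE 'XXX-XX-XXXX'         -- everyone else sees a fixed mask
           END AS ssn,
           first_name,
           last_name
    FROM adls.healthdata."patients.csv";

In the demo that follows, the first and last name columns are protected the same way, with a hash instead of a fixed mask.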

Donghwa Kim:

I know it's hard to see, but this is the type of transformation we can do. For example, if the person who's logged in has the user ID superduperuser, or is in the privileged user group, then that person can see the social security number. By the way, this is synthetic data, so these are not real datasets. If a person does not belong to the privileged user group and does not have that user ID, then what they see is masked data, as you can see on the screen. Next slide please.

Donghwa Kim:

This is a further view of how to create virtual datasets. Once we have come up with the SQL query, we can create a virtual dataset from it, and that virtual dataset is given to other members of the organization, such as business analysts and data scientists. As you can imagine, they don't have access to secure and sensitive information; those fields are masked when they query the virtual dataset. I'm going to get into this further when I do the live demo in a few minutes. Next slide please.

Donghwa Kim:

As I mentioned before, Dremio provides data curation and data catalog capability, which is very useful for providing that 360 view of the data that an organization has. It provides tools such as Wikis and tagging so that people from different organizations can collaborate with each other on a particular dataset. Next slide please.

Donghwa Kim:

Data lineage. This screen is pretty interesting. On one side is the source of the data, the physical dataset that's out in Azure Data Lake. From that, we're converting the physical dataset into a virtual dataset, and we're doing transformations on top of the virtual datasets from two different files: one is a CSV and the other is JSON. What you see in the middle, in that blue box area, is the virtual dataset that we created. It doesn't stop there. You can also create other virtual datasets from that central dataset by doing further transformation using SQL.

Donghwa Kim:

As you can see, this prevents a scenario where there is data proliferation. In some organizations, if they want to give out nonsensitive information for data scientists or other analytical workloads, they need to do a lot of copying of the data, which creates a scenario that's hard to manage: you lose data governance, different copies are passed around, and you don't even know the quality of the data at that point. It quickly becomes a data swamp. But since we are keeping the physical dataset immutable and just doing transformations on top of it by creating virtual datasets, the data is very much governable, and you can add different types of security transformations on top of those to provide the datasets that are necessary for different analytical work. Managing the data becomes very easy using such a tool. Next slide please.

Donghwa Kim:

As I told you before, you can run different analytical tools such as Apache Spark on top of it. You can build machine learning models and do further data engineering on top of it. This is just showing how it can be done: adding the JDBC driver into the environment and then being able to connect to the dataset from Apache Spark is truly valuable. Next slide please.

Donghwa Kim:

Just a little more configuration there. I think that's the end of my presentation. Rob, do we have time for a demo?

Robert Maybin:

Absolutely. We do. Yes. Let's go ahead and we'll stop sharing and let you drive.

Donghwa Kim:

Sure. Okay. Can you see my screen?

Robert Maybin:

We can.

Donghwa Kim:

Okay, great. Thank you. So, I'm logged into Dremio as two different users. I'm just refreshing the screen to make sure that my session hasn't timed out. On the left hand side I'm logged in as a privileged user, which is me, Donghwa Kim. On the right hand side, I'm logged in as a non-privileged user. As you can see, I have connectivity into this data source, which is ADLS Gen2. If I go to my Azure Storage Explorer, this is what I'm connected to. As you can see, I have uploaded different synthetic healthcare-related datasets, and those are reflected on the screen here.

Donghwa Kim:

A non-privileged user does not have the same access. What I'm going to do is create a virtual dataset so that the non-privileged user has access only to the data they should be able to see. Let me just copy and paste the query into a new query; I think we covered this earlier in the slide deck. I'm using the USER function: if it's Donghwa Kim, or if the user is in the privileged user group, then I'm going to display the social security number; otherwise I'm going to mask it. Same thing with the first name and the last name. So I'm just going to run this query.

Donghwa Kim:

The dataset is created. So when I come back here, there is my new virtual dataset. I click on it. Since I am logged in as a privileged user, I'm able to see the social security number, the first name, and so on. By the way, this is synthetic data, so there is no sensitive information here. Now I'm going to look at the graph to see the data lineage. As you can see, the source of the data comes from ADLS Gen2; I created a dataset out of it and then applied a transformation, so now I have a virtual dataset that includes some of the fields from the original data source, which is a CSV file, and a selected number of rows from that.

Donghwa Kim:

I come over here, and I have given access to that virtual dataset, so even the non-privileged user has access to the virtual dataset that I just created. When I click on the dataset this time, what I see is that the social security number field is masked, as are the first name and last name fields. They are hashed, so the user does not have access.

Donghwa Kim:

I'll look into the ADLS account so that we can be sure there is no other physical dataset that I created. As I mentioned before, you can also create further virtual datasets on top of this dataset by applying different transformations, and we can even go into the data catalog, add a Wiki page, and start collaborating with other members of different groups or within your group. That's pretty much the end of the demo. One thing I want to point out is that I've used only one data source for this demo, but as I mentioned, you can mix data coming from different data sources: relational databases, NoSQL, and different file formats such as Parquet and JSON. You can aggregate the data and come up with different virtual datasets. That's the end of my demo. Rob.

Robert Maybin:

Great. Donghwa, thank you so much for taking us through the application and also giving the live demo. It's a really interesting application of Dremio that you have there at NewWave, and I really appreciate you taking the time to show it off live as well. We have been receiving a number of questions in the Q&A section. We've also gotten a few questions on the chat; if you did submit a question on chat, please repost it in the Q&A area. That'll allow us to make sure that we have your question and that we can follow up with you later if we don't get time to answer it live here today.

Robert Maybin:

So just that quick administrative note, but if you did put questions in chat, you could instead post them in Q&A and that way that'll give us a way to keep a record of it and make sure that we get back to you with an answer.

Robert Maybin:

This has been a really great overview of both ADLS Gen2 by Jeff and the NewWave application by Donghwa, and I want to get to some of these questions. A number of them are for Dremio or related to Dremio features and functionality, and some have come in for Donghwa and Jeff as well. What I'd like to do is pull one from each of those categories, take a few minutes to address them, and then see how many more questions we have time to get to today.

Robert Maybin:

We did have a question come in for Donghwa: can you relate to us the experience that you've had as an ISV working with Microsoft and ADLS, as well as Dremio? Can you recount how these platforms and technologies have been to work with, from your position as a systems integrator and ISV?

Donghwa Kim:

Yeah, sure. We at NewWave are in a very unique position to provide services to our federal government customers, and we're also in the healthcare space, meaning that we get to work with a lot of different data sources, different datasets, and different types of data. At the same time, as I mentioned, we need to provide security and robust governance so that we're protecting the information of the beneficiaries. And at the same time, we provide the agility of being able to explore different datasets so that we can come up with, for example, a machine learning model that can predict health episodes and improve the delivery of healthcare.

Donghwa Kim:

That requires a lot of exploration and quick turnaround, and we cannot rely on traditional methods of doing ETL because, as you know, ETL takes a lot of time; traditionally there is a separate organization dedicated just to doing ETL, and it can take weeks and months before a dataset is available. But leveraging tools like Dremio and ADLS Gen2 allows us to pull data from different data sources, either storing it in ADLS Gen2 or pulling it without going through that whole ETL process, and gain insights from the information contained within the dataset. So it's been pretty exciting. Our customers are happy that we have this kind of capability, where they can see what information is available and what data sources they have, and then provide more business value in serving the citizens of the States.

Robert Maybin:

Great. Thank you Donghwa for that really neat answer. We've got another question here that came in on the chat channel, and I think Jeff replied via chat, but we want to pose this question so that everyone can hear it and hear his answer. I think it would be really beneficial for the group. Jeff, I'll pose the question, and if you could take a minute and a half and give a shortened version of the answer. I know it's a big topic, but the question is: what is rich data management and governance to you, and how does it relate to a data lake?

Jeff King:

Yeah. Thanks Robert. You're right, it is a big topic. Let me see if I can infer where that question's coming from. Rich data management means managing the data in its physical presence: it's blocks and bits and bytes on some storage substrate, and we need to make sure it's resilient and secure and all those things; managing the data for the sake of the data. Then there's also managing the data in the context where it really provides value to the business. That's really opening it up and saying: all right, what does all of this data actually mean, and how do I manage that meaning and ensure it's kept in alignment with the organization?

Jeff King:

Now, data governance in the sense of Azure and [inaudible 01:11:04] ADLS Gen2: we provide a lot of the foundational capabilities for that. All of the data governance solutions rely on the file system and granular ACLs on the files and folders, and on the ability to tag the data and tier it, say across hot, cool, and archive tiers, throughout its life cycle. A good data governance tool, whether it's first party or third party, should be able to take advantage of those native capabilities provided by the underlying data lake store.

Jeff King:

Tags should be able to carry GDPR nomenclature and support for that, plus all of the data lineage and scanning and profiling and everything else that's in data governance. Any solution worth its salt needs to be able to provide those capabilities at cloud scale and cloud economy.

Robert Maybin:

Great. Yeah, thank you Jeff. For such a big topic, that was very succinct, so I appreciate that answer.

Jeff King:

Shameless Microsoft plug: these manifest in Azure Data Catalog, and just generally the whole Azure data governance story. For anyone that's kept up to date with some of the latest acquisitions we've made, BlueTalon should be top of mind. I highly encourage everyone to look up BlueTalon; we're really excited that they are part of the Microsoft family. Matter of fact, I've been meeting with them for the last week or so on getting some nuanced BlueTalon love for ADLS Gen2. Of course, it won't stay branded BlueTalon; it will be part of the Azure governance story. All right. That's it. Shutting up now.

Robert Maybin:

Thanks Jeff. Really appreciate that. We have quite a number of questions left and, unfortunately, we're not going to be able to get to them all. But keep in mind, if we don't get to your Dremio question today, we will absolutely have somebody follow up and get you those answers. It looks like I've probably only got time for maybe two or three of these given where we are in the schedule. The first one that came in was: could we elaborate on the difference between Dremio and a data virtualization product? There are other data virtualization products out there. We don't think of Dremio as just data virtualization. We do data virtualization, so that is a component of what Dremio offers, but it's just a piece.

Robert Maybin:

So the entire offering of Dremio is this: we do allow you to connect to various backend data sources and query them virtually, as though they were all effectively part of a virtual data warehouse. That's one of the features. But we also offer acceleration, and we could talk in great detail about acceleration; there have been a few questions asked about that as well. We offer numerous ways of accelerating user workloads on the platform. We also offer a data catalog and data lineage, which are features that are not traditionally part of many strict data virtualization solutions.

Robert Maybin:

We also give you an environment, which you saw in the demo by Donghwa, to come in and build out entire virtual semantic layers, and to layer on top security, masking, and row and column level permissioning. Something that wasn't shown explicitly in the demo is that for each of the objects in each of the various layers, you can also set view and edit permissions, not just row and column; as you build out your semantic layer in Dremio, you can control permissions at a number of layers. So Dremio is really an offering that covers a lot of territory, and it isn't just data virtualization, although that is a component.

Robert Maybin:

There was another question that asked: is Dremio in the cloud or is it on-prem? The answer is either, or both. We have customers that exclusively use Dremio in the cloud; for one notable one, basically the entire platform is in Azure, so all of their data lands in the ADLS data lake. That customer is RCCL, which was on our title slide. They have deployed the entire Dremio cluster in VMs in Azure, and they're querying the data lake directly. So that's a complete cloud deployment.

Robert Maybin:

For those who don't want to do installs on VMs, we also have the ability to deploy in Kubernetes in cloud environments, pointed at cloud data storage. So it's a very strong cloud story, and we will also be coming forth in the future with a service offering. We're heavily focused on cloud and heavily focused on service. And yet we also have many very large customers who are on-prem. Among the financial customers we work with, we've got very large enterprises with large Hadoop deployments, data lakes on Hadoop, some clusters running into the 400, 500, 600 node range, deployed on-prem. There, Dremio typically runs in a YARN queue on top of Hadoop.

Robert Maybin:

So there are many ways you can deploy the platform. Cloud is a very common way and a place we're really focused, and we also have a number of very large enterprises that deploy on-prem.

Robert Maybin:

I think we've got time for one more question, and we'll close with this last one: does Dremio provide APIs for creating connections and virtual datasets? We do. In fact, we have a robust REST API that allows you to do basically all the things you can do in the UI with Dremio: creating virtual datasets, creating accelerations, working with the metadata catalog. Pretty much all of our big enterprise customers have some level of integration. This is something we do here in the professional services team at Dremio: we help our large enterprise customers integrate Dremio well into their various systems. It could be everything from hooking Dremio into the end of an ETL process to create some virtual datasets automatically, to integrating with the APIs to do monitoring.

Robert Maybin:

So there's a whole host of things that we use the API for in larger deployments. With that, I think that's all we have time for today. I want to thank our two presenters, Donghwa Kim and Jeff King. Thank you both very much for all the content that you shared and the questions you answered. I want to thank everybody for taking the time today to be a part of this webinar with us. It's really an honor and a pleasure to be able to talk about Dremio and our great partners on a call like this. Thank you all for your time and attention today. I think with that we'll close. Just keep in mind that if you did post questions in the Q&A, we will be sure to follow up with answers. Thank you everyone. Have a great rest of the day. We look forward to seeing you at another one of these events soon.