January 10, 2023
Getting Started with Dremio’s Data Lakehouse
Every organization is working to empower their business users with data and insights, but data is siloed, hard to discover, and slow to access. With Dremio, data teams can easily connect to all of their data sources, define and expose the data through a business-friendly user experience, and deliver sub-second queries with our query acceleration technologies.
Read is a cloud, data, and AI marketing executive, with a history of building and leading high-growth marketing teams at AWS, Oracle, and H2O.ai. Most recently at H2O.ai, he served as the SVP of Marketing, leading all elements of marketing for the late-stage startup. Prior to working in the technology industry, Read was a Captain in the United States Marine Corps, serving two tours of duty as a Platoon Commander in Iraq. Read holds a Bachelor’s Degree in Mechanical Engineering from Duke University and an M.B.A from the Foster School of Business at the University of Washington.
Brock Griffey has over 10 years of experience leading projects in the big data space. As a Master Principal Solutions Architect, he is helping customers build efficient data solutions with the open data lakehouse.
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Hey, everybody! My name is Alex Merced, and I am the host of Gnarly Data Waves. This is the inaugural episode of our new Gnarly Data Waves program. Again, this will be a weekly program where we bring you the latest data trends, data insights, [and] data knowledge with presentations from people from all across the industry. And in this first episode, we're gonna be talking about getting started with Dremio's Data Lakehouse. We'll be having a great presentation, talking about: what is the Data Lakehouse? What are the use cases for the Data Lakehouse? What problems does it solve? All sorts of really exciting details, and showing you examples of this architecture at work.
Pull Requests to Watch
But every week, what I plan on doing is bringing you some tidbits before our presentation about interesting things that are going on in the industry. And this week what I'd like to do is just kind of highlight some pull requests regarding some of our favorite technologies, especially regarding the Data Lakehouse. One of the key decisions you make when building a data lakehouse is your data lake table format. Again, that's going to be the format that lets your engines recognize all those individual Parquet files you have on your data lake as individual tables that you can do updates, deletes, etc. on. And there are different ones, including Apache Iceberg, Apache Hudi, and Delta Lake. A big topic is, how do you migrate between them when you're using one or the other? With Apache Iceberg pull request 6449, they are building out a module for Delta Lake to Iceberg migration, and this whole Delta to Iceberg migration is also going to be the topic of an upcoming episode of Gnarly Data Waves.
Another really exciting project is Project Nessie, because it enables something called data-as-code. So the idea is, with a table format I have the ability to treat the data in my data lakehouse as individual tables. But what's really interesting is that in the world of code, we have certain practices that we already use. We use Git, and we do things like version control, where we have different versions of our code, where we can create branches to isolate work on our code, and where, if we make a mistake, we can do rollbacks and things like that. And basically what Project Nessie is doing is bringing those sort of code-like practices to the world of data. This is called the data-as-code paradigm. The Project Nessie catalog allows you to have a catalog of your data, of your Iceberg tables, where you can do things like branching, merging, etcetera. And the number of engines that support Project Nessie is expanding. With pull request 11701, Trino will be added to that growing list of engines that support Project Nessie, which already includes Spark, Flink, Dremio, and Presto.
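To make that branching idea concrete, here's a tiny sketch in plain Python. To be clear, this is not the Nessie API and the class and names here are made up for illustration––it's just a toy catalog showing how a branch can isolate table changes until a merge publishes them:

```python
# A toy, in-memory sketch of the data-as-code idea: a catalog that tracks
# which snapshot of each table every branch points at. This is NOT the
# Nessie API -- just an illustration of branch/merge semantics over tables.

class ToyCatalog:
    def __init__(self):
        # branch name -> {table name -> snapshot id}
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A new branch starts as a cheap copy of the source's table pointers.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        # Writing on a branch is isolated from every other branch.
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        # Publish the source branch's table pointers onto the target branch.
        self.branches[target].update(self.branches[source])


catalog = ToyCatalog()
catalog.commit("main", "orders", "snap-1")

catalog.create_branch("etl")                 # isolate a batch of changes
catalog.commit("etl", "orders", "snap-2")

assert catalog.branches["main"]["orders"] == "snap-1"  # main is untouched
catalog.merge("etl")                         # publish the work in one step
assert catalog.branches["main"]["orders"] == "snap-2"
```

The real systems version-control commits the way Git does, but the payoff is the same: readers on `main` never see half-finished work.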
Now, in Apache Iceberg, in our data lakehouse table format, you use partitioning, and that goes for tables in general. Why do we use partitioning? Because dividing up our data in nice logical ways allows for faster queries. And Apache Iceberg has these really nice transforms when it comes to partitioning your data. Right now you can use them in your SQL, and you can use them through the Java API. But soon, with pull request 3450, you'll be able to use those transforms so that you can easily partition a table based on month or day, or do a bucketing partition.
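If you're curious what those transforms roughly compute, here's a sketch in plain Python. One caveat: Iceberg's real bucket transform is specified with a 32-bit Murmur3 hash; the crc32 below is only a stand-in to show the idea of hashing values into N buckets.

```python
# A rough sketch of what Iceberg-style partition transforms compute.
# Rows that map to the same transform value land in the same partition.
from datetime import date
import zlib

def month_transform(d: date) -> int:
    # Months since the 1970-01 epoch; all rows in a month share a partition.
    return (d.year - 1970) * 12 + (d.month - 1)

def day_transform(d: date) -> int:
    # Days since 1970-01-01.
    return (d - date(1970, 1, 1)).days

def bucket_transform(value: str, n: int) -> int:
    # Hash the value and spread rows across n buckets (stand-in hash, not
    # the Murmur3 hash Iceberg actually specifies).
    return zlib.crc32(value.encode()) % n

assert month_transform(date(1970, 1, 15)) == 0
assert month_transform(date(2023, 1, 10)) == 636
assert day_transform(date(1970, 1, 2)) == 1
assert 0 <= bucket_transform("order-123", 16) < 16
```

The nice part is that the table stores the transform, so the engine can prune partitions from an ordinary `WHERE` clause on the source column.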
The last pull request, or really series of pull requests, I'd like to mention are the Apache Arrow Gandiva UDF pull requests. So essentially, Apache Arrow is a standard for how to represent data in memory. That's basically what powers a lot of the performance improvements you've seen in the processing of data over the last several years. But one part of that project that Dremio brought to the table was Arrow Gandiva, which takes functions and, instead of processing them in Java, compiles them down to native code for better performance. And basically every time another Gandiva UDF gets added, that's another SQL function being brought down to native code for even better performance. And you tend to see more and more of those get added. That's always an exciting thing for that project, because it benefits all engines that support Arrow.
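To get a feel for why compiling an expression once beats re-interpreting it for every row––the idea Gandiva takes all the way down to native code with LLVM––here's a toy Python sketch. This is not Gandiva itself, just the compile-once-evaluate-many pattern in miniature:

```python
# Evaluate ("*", "price", "qty") over rows two ways: walk the expression
# tree per row (interpretation) vs. build a function once and reuse it
# ("compilation"). Real engines compile to machine code, not closures.

def interpret(expr, row):
    # Re-examine the tree for every single row -- the slow path.
    op, left, right = expr
    lv = row[left] if isinstance(left, str) else left
    rv = row[right] if isinstance(right, str) else right
    return lv + rv if op == "+" else lv * rv

def compile_expr(expr):
    # Pay the tree walk once up front; later calls skip it entirely.
    op, left, right = expr
    get_l = (lambda r: r[left]) if isinstance(left, str) else (lambda r: left)
    get_r = (lambda r: r[right]) if isinstance(right, str) else (lambda r: right)
    if op == "+":
        return lambda r: get_l(r) + get_r(r)
    return lambda r: get_l(r) * get_r(r)

expr = ("*", "price", "qty")          # i.e. price * qty
rows = [{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 2}]

fn = compile_expr(expr)
assert [interpret(expr, r) for r in rows] == [fn(r) for r in rows] == [6.0, 3.0]
```

Each new Gandiva UDF moves one more SQL function from the interpreted path to the compiled one.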
Dremio Test Drive
So these are all exciting things. They're just going to continue improving the speed, the performance, and the flexibility of the Data Lakehouse. So keep an eye out for all of those. But speaking of the Data Lakehouse, why don't you get your hands on it and take it out for a test drive? That's what Dremio Test Drive is for. If you head over to dremio.com, you can start the test drive, [and it] doesn't require you to put down a credit card or anything. It just gives you an opportunity to get hands-on with Dremio real quick, same day, and run a few queries. Try out a few things, so you can get a feel for what the data lakehouse provides. So head over to dremio.com and try out the Dremio Test Drive.
Subsurface Live 2023
Also very exciting, and I'm very excited about this: on March 1st and 2nd, our annual conference Subsurface Live is going to happen. And we're gonna be doing some new things this year. Subsurface Live is a great conference where we bring you all sorts of great talks on data lakehouse architecture, open data technologies, all sorts of really exciting things. But this year, along with having the virtual conference, we're gonna have some in-person locations in San Francisco, New York, and London. So if you're interested in being there, virtual or in person, make sure to register at dremio.com/subsurface/live/2023. There's the URL, right there at the bottom.
Dremio Data Hops Tour
Another series of events that we'll be doing is the Data Hops Tour. We'll basically be going to a city near you and creating an opportunity for you to socialize with your data colleagues, share strategies and techniques, and just have a good time. So join us at an event near you, whether in Dallas, Santa Monica, Boston, Chicago, or London––they're all listed there. Go to your local Data Hops Tour event. You're gonna have a good time, because it's always nice to get out and socialize, right?
Gnarly Data Waves Upcoming Shows
And the last bit: we're gonna be doing Gnarly Data Waves generally every week, and we have a lot of great content coming for you. Next week, we'll be doing Migrating your BI on the Data Lakehouse to Apache Superset––how can you use Apache Superset to do your Data Lakehouse BI? Then I will be presenting about migrating from Delta Lake to Iceberg, discussing different strategies for how you can do so. Then we'll have a presentation on January 31st about how to optimize your Tableau dashboards with Dremio. And then on February 7th, we're going to be having Apache Iceberg office hours, where, basically, if you've got Apache Iceberg questions, we've got Apache Iceberg answers, and we're going to be glad to bring them to you. It will be a fun time. We've done one before, and it was a great time, so I hope you come join us for our next one.
With that, we're going to kickstart our presentation, which again is about getting started with the Dremio Lakehouse, and for this presentation we're going to have Read Maloney, Chief Marketing Officer of Dremio, and we're gonna have Brock Griffey, Master Principal Solutions Architect here at Dremio. Again, make sure, during the presentation, to put your questions in that Q&A bar at the bottom. We'll be here answering them as those questions come in, and then we'll also take time at the end, after the presentation is done, to answer them live. So make sure you check that out––we'll be monitoring that. And with that, Read, the stage is going to be yours.
Getting Started with Dremio’s Data Lakehouse w/ Read Maloney
Thanks so much, Alex. Really appreciate it, and I'm very excited to talk to everybody about getting started with Dremio, what are customers doing with it, how customers are getting started, and what are their use cases. And so that's gonna be a big element of the focus as we get going and rip into this today. And so we're really gonna divide it out––we got Brock here, because he's gonna run through a demo. We're gonna go sort of into the major product stuff, the features, capabilities, etc., as we run through the demo.
And at the top here, I'm really gonna start setting the stage––where does Dremio fit in the overall market? How are customers using us? What are those use cases? And we'll sort of do that interchange and drop it in. So a few questions along the way––we'll get [to them] after. We're also gonna have a couple of poll questions for you. You know, we really wanna understand––what are you guys trying to do? Maybe we can tailor this thing on the fly a bit to what you're trying to do, or come back to you with another episode. You know, the show is gonna run every week. It's for the community. It's for you guys.
Whatever you guys tell us, we're gonna spit back at you to make sure we're getting the information you need.
Data Analytics - A History
All right. So just really running through this––you know, everyone attending this show live today is familiar with this, right? We've been trying to become more data-driven as businesses continue to move forward. How have we made that progression? A lot of this continues to track with performance: how do we actually get access to the data and query it faster? And then, how do we get more people to have that access? And then, as that's happened, how have we tried to secure it, and so forth? We've really gone through this, and where we fit is in the open data lakehouse movement.
We're really coming in after saying, look, we now have all of the performance, [and] most of the functionality of what you could do with a warehouse. But now you have this sort of infinite scale and flexibility from the data lake, and that combination is really what we've been calling the data lakehouse, and our vision is that your data should be in an open table format, and you own that. We'll talk more about that later. But that starts to bridge the best of both worlds between: you want the speed and performance and access, and then you also want the scalability and flexibility of the data lake. And that's where the lakehouse fits in.
Competing Data Priorities
And this really speaks to these competing data priorities that most teams go through. Whether you're on this call as a data consumer in a business team, or a data engineer in a business team, or an architect or engineer on a central team, you're probably feeling some form of this pain: the business wants to move faster all the time. They want more access. They want straight-up agility. They just want to be able to go, whether that's building a product and they need access to data to build it, or whether that's building an analytical data mart. In our case, we call that a departmental lakehouse, and they're trying to move really quickly. And then the centralized side is trying to say, well, I've got to govern that, I've got to make sure sensitive data doesn't go out, I've got to make sure that it's secure and the data is encrypted, and so forth.
Companies Want to Democratize Data…But How?
And so what a lot of companies are left with when they look at that is something like this: I've got all this data in all these places, I've got all these different business consumers, and all these different ways we're trying to use data––data science pulling the data, wrangling it, building the models, operating those models, all the dashboards we're doing in Tableau, Power BI, etc., and all the direct data-driven applications. And in the middle you're like, okay, well, how do I do this? How do I manage all the security risks in the middle of it?
Data Warehouse: Expensive, Proprietary, Complex
And so architectures often look like this––I was just emailing with a customer a couple of days ago, and we were going through their architecture, and their use of Dremio was this massive consolidation of all these ETL jobs you see down here, all these data copies. So data is landing in all these repositories, and then it's getting copied into these warehouses, typically for some sort of performance gain. And then there are extracts and things like that going on top to help with further performance gains, and you're managing all of that. So every time you wanna make a change, it has to cascade through the whole system. And then this [becomes] a mess, right? Costs skyrocket. It's hard to help the business move fast because everything has to go back to an engineer, so there's no self-service.
The Dremio Advantage
And this is where Dremio comes in. We're really working to bridge that. We started with ease of use in mind from the start, and that's why we have a semantic layer built into the query engine––that's all part of Dremio Sonar, and we'll talk a little bit more about that later. So you can get this unified view of the data, you can open up that access and discovery to all the different business units, and it's really easy for apps and dashboards and data science users to connect directly through. And so you don't have to have all these copies everywhere, and you don't need to do all these different extracts, for example, to manage performance. It just works through our query acceleration technology and a set of other technologies that we built. Brock will go into a little bit more detail as we go through the demo.
And so where are we different from, let's say, other data lakehouse providers or other query engines on the market? The first one comes down to this focus on ease of use and self-service. The business and the data consumers want to be able to move fast, and they want to be able to accomplish their projects, to get the information they need to either make decisions or build new applications. And so that starts with a really simple and easy-to-use interface. It also starts with a unified view of data, and having the semantic layer built in with the query engine allows us to do that. And then we worked really hard to have lots of different connections. You can federate queries across on-prem and cloud. You can federate queries across RDBMS, Hadoop, and cloud. And some customers that we have [are] really big––they're managing, you know, dozens of petabytes of data, and they have huge numbers of data sources. We'll talk about one of them later––I'm referencing TransUnion right now. All of that comes together, and then everyone on top––all the business users, the applications, etc.––can go and access that data. And so it enables all these different teams to self-serve.
And then, obviously, by the way, we have security and governance both baked in. So all of that competing-priority stuff starts to sort itself out. And with the buzz term data mesh, which probably most people have heard about, there's sort of a technical way to view it in terms of federated governance of that data. Our general view is that most organizations are just trying to deal with this: "I want the business to be able to move fast and have access," and then we have security and governance on the central side––how do I bridge that most effectively? And that's where we're focused, not just in building products now, but also in our vision going forward to help our customers.
The next one's open data, which we feel really strongly [about]. I think many, many organizations continue to be burned over and over because they're basically held hostage by their data––someone else essentially has their data. And it's your data! But it's in their format. And they say they're going to charge you more money. And then you start paying more money. Obviously that started largely with Oracle, and it's changed over time––I think a lot of people are starting to feel that with Snowflake today. What we want to do is give you a choice. You have the choice. So we're based on open standards like Apache Iceberg and Apache Parquet. It's your data, and we sit on top of that. And if we're no longer the choice for you, you can move. There's no vendor lock-in. You can use a different query engine. You can use a different technology with your data, but we're not holding your data hostage. We really strongly [believe] that that's the movement––that level of flexibility is required, so organizations don't have to continue to rebuild and migrate and do all this stuff with their data every 3 to 5 years.
And then we can't have a great self-service experience if we don't have world-class performance. So we have sub-second performance, we have a set of query acceleration technologies––Brock will talk more about that in the demo––and we do all of this at one-tenth the cost. You can imagine if you don't have all these data copies to manage, and you don't have all these warehouses to try to deal with performance, you don't have to pay for any of that. So those are the things that drop your cost way down, where you just have really inexpensive object storage, such as S3 or ADLS, and then you're using us on top. That's it. That's the full extent of the architecture. So it simplifies everything down, and because of that the costs are much, much lower.
Dremio Use Cases
Let's talk about customers and [the] use cases they use this for. There's a whole wide variety, as you can imagine. We are a data infrastructure technology, and so you can get started in a lot of ways. But there are some common patterns, and they usually fit into 2 buckets. The first bucket is: I've got a bunch of stuff today and I wanna make it work better. We'll call that the modernization and migration bucket. And then in the other bucket, there are new projects: "Hey, I wanna launch this new application. I need this new functionality. I need a new data mart for customer 360, or I need my quants to be able to get better access to trading data." All of those things happen. And we'll talk through customer examples for each of these. Now, there are even more potential use cases––the ones on the screen, the ones we're talking about now, are what our customers actually use us for and start with us for.
Data Analytics Modernization
So let's jump in. We'll talk about the modernization piece first. As you can imagine, there are lots of ways you could start, depending on where you are in your journey.
Migration to Cloud Journey
And when we talk about modernization, some customers might just say: "Look, I have all this data on prem, and I'm planning on staying there and modernizing." That's not typically what we see––we typically see modernization with us as some path to migration, whether they're gonna move everything to the cloud, or move parts to the cloud, or move parts to the cloud over a series of years. And with us, you can have a seamless experience for modernization starting right away, with the data where it sits. So what you do is you bring in Dremio, and then you can use that with HDFS, RDBMS, and whatever data you have in the cloud, if you have any. Across all these different data sources, you get that unified view of data, and you get the performance acceleration and speed. We think of that as modern data virtualization. And in the cases where you have Hadoop HDFS––we'll call that your existing on-prem data lake––you're getting a huge performance jump on that right away. And so a lot of customers start by saying, "Look, I just need to modernize Hadoop as I think about migration," and then, as part of that, they're like, "Oh, I can connect in SQL Server. Oh, I can connect in Oracle. This is amazing." And they start connecting all these other data sources, and then they have what data virtualization sort of promised could happen––they actually have that, and now customers can query that data, and they can query it really fast.
And we'll talk about some things we do to accelerate queries against systems that are not Hadoop, if they're on premises. And then, as customers move forward from there, they still get the benefits of the high performance and self-service analytics that start right in Stage 1.
Then in Stage 2, what ends up happening is they start migrating the data. And with Dremio, what you can do is have Dremio Cloud running for your cloud data and Dremio Software running on prem, so you now have compute close to the storage. And then we have a Dremio-to-Dremio connector––so that same view that your customers have in Stage 1, they maintain in Stage 2, with all of that happening under the covers. They don't even have to see the migration. It's totally seamless. And so now, as a data team, you're reducing that infrastructure management cost because you're offloading it to the cloud, and you don't have that impact on your users.
And then some customers stop somewhere in Stage 2, and they say, "Look, that's where it's gonna be." Others say, "No, all of this is gonna go to S3, ADLS, etc." And so then they go, and they actually have a full lakehouse running with a relatively simple architecture, because they've consolidated to do that. For many enterprises, that's typically a multi-year journey, and that's totally fine. Again, your data users get to modernize in Stage 1, and then it moves on from there.
And now you have the best of all of it, because you have the best performance when it's running out of cloud object storage, and there are many other things that customers start to layer on as they get there. We'll talk about that––such as moving from just open data files to open data tables like Apache Iceberg, which enables even more functionality and brings you closer to a full lakehouse, or warehouse environment on the lake, and then even some additional innovative things we're doing around managing data-as-code, a little later on.
The Hartford Unified Architecture with Dremio
Here's one example of The Hartford doing this. The Hartford––everyone's aware [that] it's a very large financial services institution––ended up saying, "Look, we have all this data in Oracle and Hadoop," and their plan was to move to Amazon S3. And so we're like, okay, that's no problem. You can start with us. And so they started with Dremio––in this case this is all Dremio Sonar––and we come in on top. We're able to rapidly expand the performance of what they're getting on Hadoop, so their business users get happier just from the performance gain. Then we're able to bridge together both Oracle and Hadoop, so now we've changed the access and discovery game, and then we put it in a really user-friendly environment, which has led to massive value for them. And then, while they do the migration to the cloud, that's seamless to the business user. In the end, they're offloading management work because it's going to the cloud, they're offloading costs because it's going to the cloud, and they're getting even better performance as it goes to the cloud. So that's just one example [of a] really large company doing it.
Another one's TransUnion––in this case, they have data from more than 90,000 sources, and they're managing 30 petabytes, so this is a massive outlay. It was very similar: their Hadoop experience was very slow, and you'll see this a little bit as a trend. If you have on-prem data, you probably invested in Hadoop at some point, and it's probably slow with whichever query engines you're using––whether that's Apache Drill, which is the case we'll talk about here, or Spark, or Presto, Trino, etc. You're gonna get an order-of-magnitude performance improvement using Dremio, so it's a logical starting place. And then you can just hook together other data sources, and you're off and running. And so, as they did that, they were able to empower their analysts with self-service––and you'll see that's a trend that keeps coming back, because we have the semantic layer built in, and we had the UI designed from the start. As you can see here, they got a 5 to 10x performance gain right away. And this is that Stage 1 we talked about.
Current Architecture at TU
Okay, so this is what their architecture looked like coming in––Hadoop, in this case a MapR implementation. They built this custom web app called PRAMA on top of their data lakehouse with Dremio, and that's what they ended up exposing to all their data consumers. And then they moved forward through their stages.
Migration to AWS Architecture at TU
They're now starting their migration to the cloud. So what ends up happening is [that] MapR and Hadoop are replaced by AWS and Amazon S3. And then we stay the same, right? Dremio is in there, and PRAMA stays the same, and so they get all the advantages of that migration. But they keep business-user continuity and that self-service that they were able to bring into the company very early on.
JPMorgan Chase Migrated Legacy
JPMorgan Chase was in a similar environment. Sybase IQ is what they had been using on premises, and they wanted to modernize and migrate. And by the way, this was all in their credit risk department, so this is largely to do with all the analysis around trying to determine who should get credit and who shouldn't, and other things they can flag, saying, "These are different credit products, or products we could deliver to individuals." And they just had a bunch of cost issues, etc., and they decided to migrate all of that into a cloud data lakehouse. So they basically did that all in one motion. And that's Amazon S3 plus Dremio, replacing a host of others, [but] mainly replacing Sybase IQ. And that provided a significant ROI by replacing that expensive EDW with a data lakehouse.
7-Eleven Modernized Customer
And then lastly, we also have customers––in this case, 7-Eleven––who were already in a cloud data lakehouse environment. They had ADLS already set up, all the data was already there, and their starting point was, "Hey, we've got Databricks and Pinot and Presto, and we're still managing all of these to make Power BI work." And we can simplify the whole architecture. This is really around their customer 360––they have 71,000 different convenience stores, and there are a lot of different ways they want to pull that information together to create a single view of the customer. They were able to do it really easily with Dremio. They could keep the data that was already there in ADLS, but provide that view, using our semantic layer, to the wide variety of data consumers they have across the business, in this case through Power BI.
And then, as I said, there's a whole set of new projects––I'm not gonna cover all of them. We have customers going through each and every one of these. The one that probably surprised me the most––I started at Dremio about 3 months ago––was customer-facing analytics apps. We actually have a blog out with Tata about a mobile app they built that runs directly on cloud object storage through Dremio, which I think is super cool. We're seeing that more and more as a trend.
Probably the most common one, though, is these departmental lakehouses. Take InCrowd––it's a business where they're basically working with sports clients on how their fans are engaging, interacting, and spending across all their touchpoints. And they basically had a legacy data warehouse that, you could argue, they were using as a data mart.
They replaced the data mart with a data lakehouse, and you see, they just get a lot better access to the data. They have more control over security and governance––again, Brock will talk a little bit about that in the demo, since it's baked into Sonar already. And with those improvements, they were able to increase revenue and loyalty.
Merlin Networks––this is a business where they're dealing with digital rights, so they have a lot of data coming in about who to pay when different people listen to different music. And they basically had this split where they had BigQuery and S3. So they were like, "Okay, well, we've gotta get everything on one of these," and they decided to go with AWS. As they did that, they started to look at Athena, but Athena was really having trouble when they were dealing with large numbers of small queries. And this is something that we really excel in––high-concurrency, high-performance, interactive queries. So when they evaluated us, they went with Dremio, and it enabled them to run this application––an internal application running directly on data brought into Amazon S3––that helps them operate their business.
Products and Ecosystem
I'm going to talk at a high level about the products and ecosystem before I turn it over to Brock. We keep building out our ecosystem––so you can imagine, there are parts that we're gonna do, but we're not Tableau, right? We're not gonna build the BI tool. So we want to have an ecosystem where you just have connectors to connect in––the same thing with all of the cloud and data infrastructure pieces that you'll see on the bottom. And so we're continuing to build and grow that ecosystem out, and really all of the main providers that you'll work with across the data analytics ecosystem, we have as partners. And then we have 2 main products––we have Dremio Sonar and Dremio Arctic.
Sonar is really our lakehouse engine, right? It has our simple UI, it has our semantic layer, and it has our SQL engine with query acceleration technology. And then there's Dremio Arctic––this is still in preview––our new product coming out that enables customers to manage their data as code. It does automatic data optimization, and it has a data catalog. We'll talk about both at a relatively high level, and then, in the demo, Brock will go into Sonar a bit more.
So this is what Dremio Sonar looks like. So again, just to reiterate, we have a bunch of ways to connect into BI tools, notebooks, and editors. This is ODBC, JDBC, REST, Arrow Flight. So again, Arrow Flight is what you'd use if you have a Jupyter Notebook, and you're writing Python, and you're trying to query the data and all these sources through Python.
We have an intuitive UI––our SQL Runner's awesome. Brock will show you that. And then we have a lot of different functionality that analysts or other consumers can use from a no-code perspective––for example, you can create calculated columns. We have a whole space where you can share data and work in a collaborative environment. And so that's all baked in.
We have the semantic layer, so you get this unified view of data in business terminology that the business can actually understand, across this huge, broad range of data sources.
And then we have a built-in data catalog that also has data lineage. That's all part of the semantic layer. And then we get into the query engine, where a lot of people are like, “Oh, it's a query engine.” Well, it's a lot more than a query engine, because of the way that we focus on the UI and that overall experience, and because of the semantic layer we have built in. We use Arrow––it's downloaded 70 million times a month now––for vectorized execution. It is really, really fast––arguably the highest-performance core an engine can have. And then we do accelerations on top of that, using something called data reflections––Brock will talk to you a little bit more about that––and then we're able to federate those queries across all these different data sources. We've also done some things around multi-engine architecture, workload management, etc., to make the experience even better, more efficient, and faster. And then we have a way to read a wide variety of data. So we can read from Iceberg and Delta Lake––and we'll talk a bit about how, with Iceberg, we can actually also do DML and all these write operations––and then, from a file perspective, Parquet, ORC, JSON, CSVs. And then we also have ARP connectors.
And then on the right-hand side, we have security and governance built in. We have RBAC, we have fine-grained access controls, authentication, auditing, and query history. That is all within Sonar. And that's why our customers are getting so much value, and we're seeing so many use cases explode with Dremio.
On the Iceberg side––this is what's gonna enable [you] if you're going into Iceberg from an open table format perspective; the other table format that's out there is Delta Lake. What we're seeing right now from a read perspective––you'll see we support both of them, and we know companies are gonna figure out which one's the best for them––but what we've seen with Apache Iceberg, which is the Apache open source project, is that we have more contributors going in, and a much more diverse range of people contributing to Apache. It's far more diverse. And this is actually what it looks like from the data perspective––people like Netflix, Apple, AWS, Tableau, ourselves––we're all in this, helping Iceberg, you know, develop and grow very quickly. On the Delta Lake side, you have Databricks. So it is our belief that Iceberg is going to become the open source format, the standard. And so you'll see that we have a sort of Iceberg-first mentality in terms of a lot of the different write operations and other functionality that we're building into Dremio. If you have more questions about Iceberg––it's a hot topic right now––we have a lot of content on our blog about it. We'll be doing additional sessions––Alex is super involved; we're very involved in the Iceberg community overall. And we'll talk about the––well, I'm talking about it now, it's important––if you want to get into the write functionality and into managing data-as-code with Arctic, you would have Iceberg as the table format, and we're gonna make it super easy for customers to have their open files turned into Iceberg tables.
Write Operations with Transactional Tables
Okay, here's just an example. You can do record-level data mutations, automatic partitioning, instant schema and partition evolution, time travel––all through the use of Iceberg with DML. So INSERT/UPDATE/DELETE with any of these engines. And so if you want to use that functionality that is available in Sonar, you need to have the data within Iceberg tables.
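[Editor's note: as a rough illustration, the record-level mutations described above look like ordinary SQL once the data lives in an Iceberg table. This is a sketch only––the table and column names (`trips`, `trip_id`, `fare_amount`, `pickup_date`) are hypothetical, not from the demo:]

```sql
-- Record-level mutations against an Iceberg table (hypothetical table/columns)
INSERT INTO trips (trip_id, pickup_date, fare_amount)
VALUES (1001, DATE '2023-01-10', 14.50);

UPDATE trips
SET fare_amount = 15.00
WHERE trip_id = 1001;

DELETE FROM trips
WHERE pickup_date < DATE '2015-01-01';
```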
Okay, Dremio Arctic––something [that] you also need to have Iceberg tables for, right now. This is really going to become the easiest way to manage your data lake, and that really starts with managing data-as-code. And so you think about a lot of people out there thinking about data products––like, how do I create data products? How do I create these hubs or sandboxes, if you will, for all the different business teams? Which really is a branch, right? You can think of it as a branch off the main core, where you can bring in data, and then it can get checked in, and then it can be merged back through. There's a huge amount that goes into governance and security
and flexibility. And so, as we're trying to bridge those divides between access and agility on one side––where businesses can bring their own data in and all of this––and the security and control on the other, all of that can come together with data-as-code. And so I think this is a big game changer for how organizations are going to look at sort of creating that environment within their organizations. Obviously, you're going through, and you have all these tables you want to be able to optimize. And so we will actually do that automatically––we'll do automatic data optimization across your Iceberg tables––and then you need a catalog to manage the whole thing. And so we have a data catalog that's all baked into Arctic. And this whole set of features is really designed to just make managing the Lakehouse very simple, improve data quality, improve governance, etc., through the whole process.
Okay, this is what it looks like. I talked about the pieces, but just so you can see a little bit more of what's in there: on the top, there's branches, tags, and commits––again, you can just think about it like Git for data. On the catalog side, you have a unified view of tables. We have data lineage, table views, hierarchies, data discovery––that's all in the catalog. On the governance and security side, similar to [what] we talked about with Sonar––fine-grained access controls, audit commit histories. And then on the operations side, automated data optimization.
There's also garbage collection and data ingestion. And this is a little bit more, sort of, going into data-as-code again––it's a git-like experience for the data lake.
A Git-Like Experience
So you have data version control. And then, on the other side, you have data branching. And this just enables a huge amount of, we'll say, innovation to happen in data management.
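[Editor's note: to sketch what that git-like experience looks like in practice, Dremio exposes branch operations through SQL against an Arctic (Nessie-backed) catalog. Exact syntax can vary by version, and the catalog, branch, and table names here are made up for illustration:]

```sql
-- Create an isolated branch off main (hypothetical catalog/branch names)
CREATE BRANCH etl_jan10 IN arctic_catalog;

-- Writes on the branch are invisible to consumers reading main
INSERT INTO arctic_catalog.sales.orders AT BRANCH etl_jan10
SELECT * FROM staging.new_orders;

-- Once validated, merge the branch back into main atomically
MERGE BRANCH etl_jan10 INTO main IN arctic_catalog;
```

The design point is the same as Git's: experimentation and ingestion happen on a branch, and consumers on main never see half-loaded data.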
Okay, with that I'm gonna turn it over to Brock, and Brock's gonna run through, again, like, how do you actually get started, hands-on keyboard? If you want, right now, if you're not already in a Test Drive, you can do that and follow along. It's gonna be a ride, and he'll talk you through some of the features that are enabling it as he goes. Thanks, Brock.
Awesome. Hey, everyone. Thanks for having me here today. So I'm gonna start off with just showing you Dremio Cloud as a whole, and then I'll jump into Test Drive so you can understand what Test Drive offers you.
So this is Dremio Cloud––for those of you not familiar with it, it's easy. Once you've set up an organization and signed up for Dremio Cloud, you'll be presented with something very similar to this view––this is our dataset view. When you first log in, you'll see a blank area for spaces and blank sources. You can easily start adding objects to Dremio by clicking the plus-source button, and this will give you the option here to add different sources. You can see different options around data-as-code––again, the Arctic catalog; you can start with Arctic right away today. If you already have existing data in Glue or S3, you can easily just add those, and now you have your data lakes available for you, and databases.
You can easily add any source by just clicking on it, going through, filling out your credentials, and saving it. One of the great things about this is you have full permissions on everything in Dremio, so you can set what permissions every user has within Dremio. We also have the ability to set roles. Those roles can be pulled in through different mechanisms, using LDAP [or] single sign-on, and we have SCIM support as well. So you can pull in all those roles and all those privileges from existing authorization/authentication mechanisms that you have today, and allow them to filter in here for you to set permissions on who can access what objects––again, that can be done on anything within Dremio. So you see here, I have these spaces. They're a logical grouping and organization of datasets within Dremio. So I have these spaces, and in any space, I can set who has permission to do what in those spaces. And that's a hierarchical type of permission, so it goes down to every object within Dremio. In this space, particularly, that we're looking at now, I actually have some folders for organization. But before I dive too much into this space, right into this view here, [I] just wanna show you guys that we do have a couple of different things here on the left-hand side. What we're seeing now is the dataset view; it allows you to browse datasets and manipulate those datasets. We also have a SQL Runner view that will show you the actual SQL, and you can start building out your own queries from scratch there.
You can, inside this view, do things like create scripts, save those scripts, and share them with other users as well. We have this Jobs view that shows you every job that's run within Dremio. If you're an admin, you see everything; if you're a user without admin permissions, you only see your queries. So here I can see everything––I'm an admin––so I can see all those things. The last thing is the project settings. I'll hit on that a little bit later.
But back here, inside this view, I wanna do something like actually start working with some data. I wanna pull up this business space. I'm gonna go into the transportation folder, and I'm gonna open up the New York trips dataset. If I open this up, you'll see here it shows me the DDL for this. I can do a quick run on it, see the data, and just get a brief little preview of what that looks like. There's a million records [that] just came back real quick.
Just to give me an idea of what this data is––now, this is a dataset with 1 billion records. We're gonna only show you a subset of that in the web UI, because no one can really work with a billion records in the UI; really, a million [is] more than enough. I can look at it and get an idea of what this data is. If I'm gonna start doing some calculations on top of that, Dremio will process the full dataset to do all your calculations. But the great thing about this is I could start working with this and pull it somewhere else. Before I do that, I'm gonna go ahead and look at a couple of different tabs we have here. We have this nice catalog tab that lets you see the Wiki information that we've created for this dataset. You can edit any dataset's Wiki page and add anything, like text, graphics, links, and more information about what this is. In fact, we have users today that automatically populate this through our REST API. So if you have an external data catalog, and you want to populate that information, you could do that. This next tab here is our data lineage tab. We call it the graph tab. And this shows you where that dataset is being used and what datasets it is using. So in this dataset, you can see it's coming from an S3 source. As we traverse backwards––you can actually click on and traverse backwards to see the parent and the parent's parent––this gives you a really great idea of where this data is really coming from. And you can see we have a lot of descendants using these datasets everywhere. But ultimately, it's coming from this physical dataset. You can tell it's a physical dataset because it's purple; if it's this green color, it's a view. So we can see here that the physical dataset is the raw, untouched dataset, and on top of it is just a view that we're going to work with. The last tab is our reflections tab. I'm not going to jump to the reflections tab yet; I'll show you that a little bit later.
So, coming out here into the trips dataset––if I highlight this, I can easily go to the query dataset view, and this will take us into the SQL editor. [There are] a couple of different things we can do in here. We can just run it and get an idea of the data. It looks like we can pull this out into other tools as well. We can even do calculations on this. So maybe I want to do a quick little calculation––let's just find the distance here, and we want to change this over to a calculated field. I want to go ahead and just do times 1.6 to give me
kilometers, and I'm gonna keep both fields, and I'm just going to click apply. So now Dremio is going to apply that, and I can see I have both the distance in miles and the distance in kilometers. And that just gives you a great way of [saying] hey, I want to keep both of those; I want to have both for my analysis. I can save this, then, as another view––I'm just gonna save [it in] my own personal space. I'll call it NYC Trips. And [we] just make sure there's nothing else there.
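[Editor's note: that same calculated-column step can be typed directly in the SQL Runner. A sketch, assuming hypothetical path and column names (`trip_distance_mi` is not the demo's actual column), and using the rough 1.6 miles-to-kilometers factor from the demo:]

```sql
-- Keep the original miles column and add a derived kilometers column
SELECT
  trip_distance_mi,
  trip_distance_mi * 1.6 AS trip_distance_km
FROM "Business"."Transportation"."NYC Trips";
```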
Alright. So now that I click save, you'll see I have these couple of buttons that pop up––a Tableau and a Power BI button. Any BI tool can really connect to Dremio; we have these native integrations with Tableau and Power BI from our web UI. You can just click it, [and] it's gonna download a file. That's it––it's just a link to Dremio. I'm gonna go ahead and open that up.
When that opens up, I'm gonna switch over to Tableau. So we're just authenticating with single sign-on into Tableau, so you don't have to fill in your credentials in both places. Now, Tableau is gonna connect and run every query in here directly against Dremio. We're not gonna use cubes or extracts or anything of that matter. So if I go ahead and do a full count on all this data, I should see how many records I've got.
See? Dremio came back at the speed of thought. It was so quick. We got a billion records. Now, I want to slice and dice this data. Let's look at it by the pickup date. I want to quickly see: what does that look like? And without waiting for the window to spin to get this information back, I was able to quickly see how many riders per day, or per year in this case, came back. And I can even drill into this data without using any kind of cubes or extracts; I can get down to more granular levels of this data. In fact, if you have a cube or extract, you might not be able to do that, because those cubes and extracts become too large. You can see how quick that was. I can break down to the year-month level. In fact, if I really want to, I can go to the day level. You cannot do that today with most tools and get this kind of performance.
So let's just back this up a little bit. Let's look at the month level, and let's add some other measures to this. Let's go ahead and look at some things, like maybe the tip amount, and find out: has the tip amount changed throughout the year? We're going to do an average.
Alright, there it is. It goes up a little bit towards the end of the year––people give a little more during the holidays, feeling a little better, giving bigger tips.
Let's see––has the fare amount affected that? Maybe I wanna see that as well, so I'm gonna just change that to an average as well. So you can quickly do this analysis and see: okay, well, actually, the fare amount's gone up. That might be the reason why the tips have gone up, percentage-wise. So this just gives you a really quick, easy, fast way to build these dashboards, apply anything you want to do inside them, and then, at the end of the day, you can share [them] with other users and [they'll] have access to that data.
Going back over to Dremio––let me just switch over here. You can see that we have, in the Jobs page, all these queries. These all came directly from Tableau and ran directly in Dremio. If I open up any of these queries, you'll see the query that was submitted, some information about the job, and how fast it ran. But something really important here is: how do we get this information back so fast across 1 billion records of data? Well, we're able to do that via reflections. Reflections are an acceleration technique within Dremio that gives you great performance on your datasets, even if they're very large datasets, as in this case. An aggregate reflection is like an index on the data––it goes through and indexes that data automatically for you, and will keep it up to date for you. So whenever you run a query against any dataset that has this reflection within its query path, Dremio does automatic substitution, meaning no user queries a reflection directly; this happens automatically for them. So as a user, all I know is my performance is great. That's all I care about.
So coming back over here––as we had mentioned earlier, you're able to join data between different data sources, and it's very easy in Dremio. I have this Postgres database down here, and I have some data in S3. I'm going to go ahead and go to this query editor. I'm going to go from my business layer, transportation––I'm gonna just do a SELECT * FROM this, pull this over here, and then I'm just going to run that. Again, we've seen this dataset before. Well, if you're SQL savvy, you might want to come in and just type in the SQL, [but] I can always use the UI and click the join button, browse down to the Postgres database, [to] the public [schema]––I'm going to grab some weather data, and it'll give [me] a nice little preview of what that looks like. Yep, that looks like what I want. [I] just go and click the next button, drag over the pickup date and the date here as well, and now I'm going to hit ‘apply’. This will generate the query for me, so I don't have to go and write it myself. And if I hit the run button, I will get some information back to see what this really looks like, and I should be able to then act on that data, save it, and share [it] with other users. Again, if I scroll over, I should be able to see that join.
So you can see there––there are the 2 dates. I now have the weather data, and I have the data from the New York City trips dataset. Again, I can save this view. I can save [it in] my space, just [as] NYC Weather…And now I have a dataset I can then go and use in any other tool. In fact, I can pull that out into Tableau and do some analysis there as well if I want to. So this makes it very easy to share this, create datasets, join between different sources, and get performance on them. There is one other cool thing here at the top. Again, we mentioned the graph tab. If I click on that graph tab, I can see both those datasets in that data lineage and see one's coming from Postgres and one's coming from S3. So I know where the data is coming from. If I had more sources in here joined together, you'd see all of them as well; if they're just a bunch of S3 sources, you'd see a bunch of S3 tables.
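[Editor's note: the query the join wizard generates is, roughly, a standard SQL join federated across the two sources. A sketch with hypothetical dataset and column names––the actual demo paths and weather columns may differ:]

```sql
-- Federated join: trips in S3 joined to weather in Postgres (hypothetical names)
SELECT
  t.pickup_date,
  t.fare_amount,
  w.precipitation,
  w.avg_temp
FROM "Business"."Transportation"."NYC Trips" AS t
JOIN postgres.public.weather AS w
  ON CAST(t.pickup_date AS DATE) = w.weather_date;
```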
Alright. So I touched base a little bit on permissions. I want to jump into how we can do fine-grained access control. So within Dremio, if I go to the security layer, and I have this employees table––this is just a view that has some information––and I click run, I can see here only 3 records. This has a lot more than 3 records, and you can notice that the social security number, the credit card number, and the credit card code have some kind of masking done to them, [which] makes [it] so I can't see the information in those columns. The way this is being done is not through something being stored inside of a view, but rather through our native policies. If I go in here, I can show you what those look like.
I have a couple of UDFs that define how we're doing this masking. So I have a ‘protect social security number’ masking [function], and what it's doing is looking at the user running the query: if it's, you know, [email protected] or the user dremio, or if they are a member of the role accounting, then they can see the full social security number. Otherwise, they're going to see what I see, which is that masked-out number––[and] something very similar for the credit card code. I'm sorry––I have 2 different ones, [one] for [the] credit card code. But yeah, something very similar for that credit card code. And then, lastly, once you've done this, you just apply the masking policy to the dataset, and that will then make sure that the policy gets applied, so anyone that runs a query against the dataset will always have this. And you can reuse those UDFs on any dataset within Dremio. So you just create that same policy and go out there and apply it everywhere, on everything that has a social security number, or whatever you need to do. It makes it very easy and very reusable within the environment.
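[Editor's note: a minimal sketch of what such a masking UDF and its attachment might look like. The function body, user, role, and table names are assumptions for illustration, not the exact demo code:]

```sql
-- Masking UDF: full SSN for privileged users, masked otherwise (hypothetical names)
CREATE FUNCTION protect_ssn (ssn VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE
  WHEN query_user() = 'dremio' OR is_member('accounting') THEN ssn
  ELSE 'XXX-XX-XXXX'
END;

-- Attach the policy so every query against the column is masked automatically
ALTER TABLE security.employees
  MODIFY COLUMN ssn SET MASKING POLICY protect_ssn (ssn);
```

Because the policy lives on the dataset rather than inside a view, the same UDF can be reused on any other dataset that carries the same kind of column.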
One of the great things is that masking––I'm sorry, the row-level policy––can be applied the same way as well, and you can do it on the base tables. That way, anyone coming in is gonna automatically, even at the base-table level, have those filters applied. So if there's any PII, and you don't want to share it with anyone, you can apply it right away.
Here's how you would do it with a row-level policy. You just look at the query user and match it up with some kind of column you want to filter on––maybe department IDs, [which] is what I'm doing here, and we're filtering [on] that department ID. And then you apply that policy as well. Once it's done, like you saw before, you select from that table, and you can see that I'm only seeing 3 records. And this actually has a lot more than 3 records in it––this dataset has…if I go back into the Postgres system, I can see in here the employees table. I [can] go to the raw table because I have permissions on the raw table. It has many more than that [3 records]––it's 107 records in there. So you can see all that information if you went to the base table, because there are no policies being applied on the base table.
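[Editor's note: a row-access policy follows the same pattern as the masking UDF––a boolean function evaluated per row, then attached to the table. Again a sketch; the function, role, table, and the department ID value are hypothetical:]

```sql
-- Row-access policy: only show rows the querying user is entitled to (hypothetical names)
CREATE FUNCTION restrict_department (department_id INT)
RETURNS BOOLEAN
RETURN SELECT is_member('hr') OR department_id = 30;

-- Rows where the function returns FALSE are filtered out of every query
ALTER TABLE security.employees
  ADD ROW ACCESS POLICY restrict_department (department_id);
```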
Dremio Test Drive
Alright. So how do you guys get started today? I'm gonna show you real quick how you can test out Dremio's functionality without using a credit card, without doing anything. You can just sign up on our website and do Dremio Test Drive.
So how do you do that? Go to Dremio.com––you'll see up here at the top, we have this button [that] says ‘Start Test Drive’. All you gotta do is go in here [and] fill in some information: your first name, your last name, your email. And I'm just going to click ‘sign up’.
So this will then send me an email that I will then be able to log in [with]. You could just click the ‘open Gmail’ button. I'm gonna go over here, [and] I'm gonna see if that got sent. You'll see here that you were invited by ‘gnarly narwhal’. You can click the join organization button, and it'll take you over here. You can create a password, however you want to set up the sign-in. I'm just gonna click ‘login with Google’ real quick. And then, once I'm in, I will now have access to the environment––it's going to pop up here, and we can go ahead and click [through]. It will give you an overlay, which I already dismissed, that will show you, “here's how you can get started today,” and it'll walk you through the process of creating your own view [and] querying the datasets, like I did: going into a dataset, querying that, running the information out of there, and quickly seeing, you know, the same thing I just did with you. It's gonna be a very read-only experience, but it'll let you see the information [and] look at the catalog view. You'll be able to see the graph and data lineage in there as well, and it'll let you create and save your own objects inside your own space here. You won't have the option to add sources into Test Drive, or add spaces, but once you've gotten familiar with the environment, [and] you really think, hey, I'm ready to go set up my own Dremio Cloud instance, you can go back to Dremio.com and sign up for Dremio Cloud at that point.
Alright. Thanks everyone. I'll head back over to Read here.
Dremio: The Easy and Open Data Lakehouse
That was great. Thanks, Brock. We're really gonna switch over here, but I think, just to emphasize that point, Test Drive's a great way to just get hands-on. And what we find is, as our customers go through that, they really start to experience the power of what you can do; they want to do it with their own data. And then we really have great options with Dremio Cloud and also Community Edition, depending on where your data is, to get up and running and just try it out––get to that next level of experimentation.
One of the things to note, and I'll sort of close on this as we start taking questions: all of this packaged together––[you're] just getting a sense of it from us in, we'll call it, a short 50 minutes here––has led us to be one of the fastest-growing companies. The Deloitte Fast 500 just came out, and we're number 23. So if you're interested as you look at this, you're not alone––we have 5 of the Fortune 10, and we just have a lot of companies coming in. So we really just say, like, “Look, try the product. Just try it out, use it.” That's really our call to action here. If you don't love it, don't buy it. If you do, talk to us, and we'll help you get set up.
If you want this presentation [delivered] directly to your organization, you can reach out via the ‘contact us’ on our website. We'll come in, [and] we'll give you a presentation just for your group if you want us to redo it. If you wanna reach out to me on LinkedIn and say, “Hey, will you come out and present?” or whatever––yeah, we can work through that. Okay? So [I'm] really excited to hear your questions. We've got time for a few more, and we'll go from there.
Read? Oh, sorry, no––
I was gonna jump in, just because of the time––is that cool, Alex? Like, specifically on the Merlin case study. So one of the questions was, like, hey, what was the advantage of Merlin going from BigQuery to Dremio? And the reality on that one is, it wasn't actually a head-to-head where they were going BigQuery to Dremio in that instance, because what happened is all their data was on S3, and [there was] an architectural concern. The issue is, they basically had the warehouse––what they were using BigQuery for––over [on one cloud], and [were] trying to actually run those queries directly against the data, so you have compute [and] storage in different areas. And you know, there's some cases where those cloud providers are so close together that there's not as much of a latency concern between them, but they were trying to have everything over on one provider. So when they were going to do that motion, they had already decided to say, “Look, we're gonna run this thing on Amazon.” Okay? And then what ended up happening is, [they] said, “Look, what we're gonna try––we're gonna go see what Athena does.” And Athena didn't meet the requirements, both from a usability standpoint, but also from a performance [and] concurrency standpoint. Dremio did. [I] just wanted to make sure I hit that question; we hadn't gotten to it in the chat yet.
And there should be a poll. So everyone [should] feel free to answer that.
Yeah, just related to this, ‘cause this came up [as] I was answering questions. A lot of different people are asking about this item going on with ETL––like, what do I need to do with Dremio in terms of this data environment, etc.? But they're not sharing a lot of details yet. Look, a lot of it depends on how you're landing data, where you're landing the data, what formats you're landing the data in––whether it's something you're gonna need to deal with upstream of Dremio, from more of a data engineering perspective, or it's something that you're gonna do with Dremio or a Dremio partner, you know, while you're still using the existing pipelines you have. And so in some cases, right now, we can't get into that level of detail. So we're gonna just have to say we'll reach out to you [with] an architect who will talk through those details [and] figure out your use case. So just click ‘yes’, and then that's our trigger to say, ‘okay, we'll follow up and go deep with you after the show.’
Awesome. And just about all the questions were answered in the chat box. So if you're looking at the Q&A, you should be able to see, under the answer section, all the typed-up responses to most of the questions. I think the remaining ones are probably in the process [of] being handled.
One last one came in: how do we talk to existing databases? Look, that's a push-down. Okay? So someone just answered it––we have optimized push-downs. And if you have a BI tool on top, reflections can help there. So we can actually do some work in Dremio to help speed up queries that you might be issuing to legacy databases or RDBMSs, and we're doing all that through the query acceleration technology that Brock talked about.
And next off, I just want to say: awesome presentation! Thank you to Read Maloney and Brock Griffey for presenting today. So thank you guys for coming onto the show.
Okay, we will talk about––so we'll get back to you guys. So someone asked, like, do we really perform [well] with a lot of on-prem data? Absolutely. It does depend: if it's just an ad hoc, exploratory question you're asking [of] the data, we're really just limited [by] the system you have. If it's something where you're gonna be hitting that query more often, like a BI-level dashboard, we're probably gonna be able to speed it up. We're gonna see the biggest performance gains when you're using a lake like Hadoop. If it's on-prem, you're gonna still see gains [on] other systems that [queries] go through. And that's all driven by reflections––[we got] a lot of questions on reflections. We're gonna do a whole other Gnarly Data Waves [episode] on reflections; that's what this is telling me, and we'll come back to the community with that. So, thanks for that feedback. I know we gotta drop for today, guys.
Awesome. Yep, thank you. Thank you guys for coming on. And again, everybody, we'll see you guys again next week with the presentation on migrating your BI over to Superset. Make sure to check out Dremio Test Drive, make sure to check on your local Data Hops event, and register for the Subsurface Live conference. Again, that's March 1st and March 2nd. But
thank you for attending the inaugural episode of Gnarly Data Waves. You guys all have a great evening, and we look forward to seeing you next week. I'll see you all there.