Gnarly Data Waves

Episode 46


February 29, 2024

Getting Started with Dremio

Dremio’s unified lakehouse platform for self-service analytics enables data consumers to move fast while also reducing manual repetitive tasks and ticket overload for data engineers.

Organizations aim to increase data access and lower the time it takes to gain insights, all while managing governance and controlling rising data costs.

In this video, you will learn:

  1. An overview of Dremio: what it is and why it is growing rapidly
  2. Proven use cases from some of the most demanding customers in the world
  3. A demonstration of how to rapidly get started and try it out



Read Maloney

Read is a cloud, data, and AI marketing executive with a history of building and leading high-growth marketing teams at AWS and Oracle. Most recently, he served as the SVP of Marketing for a late-stage startup, leading all elements of marketing. Prior to working in the technology industry, Read was a Captain in the United States Marine Corps, serving two tours of duty as a Platoon Commander in Iraq. Read holds a Bachelor’s Degree in Mechanical Engineering from Duke University and an M.B.A. from the Foster School of Business at the University of Washington.

Alex Merced

Alex Merced is a Senior Tech Evangelist for Dremio, a developer, and a seasoned instructor with a rich professional background, having worked with companies like GenEd Systems, Crossfield Digital, CampusGuard, and General Assembly.

Alex is a co-author of the O’Reilly Book “Apache Iceberg: The Definitive Guide.” With a deep understanding of the subject matter, Alex has shared his insights as a speaker at events including Data Day Texas, OSA Con, P99Conf and Data Council.

Driven by a profound passion for technology, Alex has been instrumental in disseminating his knowledge through various platforms. His tech content can be found in blogs, videos, and his podcasts, Datanation and Web Dev 101.

Moreover, Alex Merced has made contributions to the JavaScript and Python communities by developing a range of libraries. Notable examples include SencilloDB, CoquitoJS, and dremio-simple-query, among others.


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.


Alex Merced: 

Hey, everybody! My name is Alex Merced, and welcome to another Gnarly Data Waves, presented to you by Dremio. Today we're at Episode 46, how the time has passed! And we have an exciting episode of “Getting Started with Dremio.” Before we get started, I just want to remind everybody that Subsurface is right around the corner, on May 2 and May 3. It will be held live in New York and virtually online, so you can see it wherever you're at. Get registered today. It's going to be lots of great talks about data lakehouses, table formats, implementation patterns, and all of that great stuff. Make sure you're there!

Without further ado, let's get started with our feature presentation: “Getting Started with Dremio.” Our presenter today will be Read Maloney, CMO at Dremio, and I will join in a little later to present a short demo. But with that, Read, the stage is yours.

Read Maloney:

Great, thanks, Alex! One of the callouts, too, that we'll talk a little bit about today for Subsurface, is the number of sessions we'll have around Iceberg as well. It's a topic you know a lot about, Alex, with your work on the O'Reilly book, and how much you've done around the Definitive Guide on Iceberg and what's happening there, and that'll be an exciting part of the overall event.

One of the things we want to talk about is––I'm going to cover a real quick overview. We're going to do maybe 15 minutes or so of presentation, and then we're going to switch over to the demo. And we're big on that here––we want to show you that everything we're going to talk about is real. We like to run these sessions every couple of months, because we continually have new customers come in, and as customers come in and talk about their use cases, we have more use cases to develop, and so we like to come back to the market and say, hey, what is Dremio, how is it being used, and why is it being used? And so every quarter we update this episode, and the same thing with the demo. So Alex [has] some new stuff in the demo today that we're excited to showcase.

So, Dremio, what are we? We're a unified lakehouse platform for self-service analytics. We'll talk more about how that comes together, and what that could mean for your organizations or your teams that you're working on, but ultimately, unification––we're working across all of your data, wherever it is, any of the major types of platforms that you're storing it in. It could be on-prem, it could be in the cloud, it could be both––bringing that together for your organization. And then we're continuously working on making it easier and easier and easier to use for both engineers and analysts so that you can do more with the platform and bring more data into all the different lines of business and departments that either you're in, or that your teams are supporting.

Data Analytics - A History

Read Maloney:

So how do we get here? We've been going through these different changes and steps in terms of data platforms for a while. You'll see there are five different items on the slide; we'll talk through them real quick. I've seen a lot written to say we're actually in the sixth generation now in terms of data platforms, but how did we get here? We got here from enterprise data warehouses at the start, right? We started by saying, look, we want to have all this data, we want to get insights out of it, we want to start driving our businesses using this data and moving forward, and that was with Teradata, Oracle, Netezza, etc., but we couldn't scale. And so that led to the movement around Hadoop and big data, and what did we find there? It was too slow, and it got costly to manage. With the cloud data warehouses, they were able to separate compute and storage, and we started getting that scale, and we started getting speed with that, but we ended up with Teradata in the cloud. So now we have these proprietary, expensive systems that scale a lot better than they used to, but ultimately we end up with a relatively similar set of problems. The reason why people are continuing to move from Teradata is now also the same reason why people are starting to move from Snowflake: because we have Teradata in the cloud.

We'll talk a little bit more about some specifics as people move to open architectures––whether they're using any type of open data, or whether they're using an open table format like Delta Lake or Iceberg––moving on to the data lakehouse, where they then have best-of-breed componentry. They're able to have more of a distributed platform in terms of providing that to the customer, and that's been driving the rise. And so Iceberg and these table formats have been allowing you to do full-on warehousing on the lake.

And then, in the next instantiation that's getting pushed here, all of the advancements in AI are getting brought into the lakehouse. Something we'll talk a bit about later that's unique to Dremio is that you can start to manage your data-as-code. So you can have your main data asset, just think about it as a data product, you can branch that as you add more data in, you can then check it, make sure it's right before you merge it back, and you're doing all of that without actually creating physical copies of the data. And so doing that on the engineering side, and then enabling self-service from a platform perspective on the analyst side, is what gets us into more of the data mesh concepts, where we're enabling both governance and quality, but we're also enabling speed and agility.

Data Lakehouse Adoption

Read Maloney:

And so what we've seen because of this change, this movement in the market, is––we went out and surveyed 500 different organizations. These are not Dremio organizations, and we have this on our website: it's our “2024 Data and Analytics Trends and Predictions.” We've done some events with customers around this, and we also have a white paper. I'm just highlighting one stat here, which is that 70% of analytics will be run in a lakehouse in the next 3 years. I put out a blog recently saying I think this is actually the year where the lakehouse crosses the warehouse, that 50% mark––we will see the majority of analytics being run on the lakehouse by the end of the year. That's happened so quickly, and it's due to the value that lakehouses are providing over the current solutions from a data platform perspective.

Data Lifecycle Remains Complex, Brittle, and Expensive

Read Maloney: 

And this is one of them, which is––a lot of architecture still looks like this. You have a whole range of sources. You have a bunch of ETL pipelines moving data into the lake. You then have a whole other set of transformations that you're doing as you move the data into a warehouse, and then you're also still typically doing extracts into different client tools like Tableau. And if any of these items breaks, you have a problem, and you have to go troubleshoot it, and that's preventing a lot of teams from finding the data they need, creating the views they want to be able to query, and then moving quickly. It also adds a ton of cost, so it's expensive to manage. You have all these systems, and [it] also makes it hard to govern. And so lakehouses have been helping here, because you just remove the warehouse part, and you're able to start shifting the data consumer closer to the product, which we think about as shifting left in terms of the amount of ETL that you need to do. If you have fewer hops, you're getting closer to self-service and helping the business move faster.

But they're not sufficient in terms of driving self-service––you need to be able to remove the extracts. You need to be able to shift even closer over, and to do that, you need three things. You need an intelligent query engine, which is something that Dremio offers that nobody else does, and we'll talk about what that means a little later. You need an intuitive self-service experience, both for the engineers and the analysts. And then you need a set of next-generation DataOps capabilities, and I'll talk more about that, too. I just alluded to it, and Alex has a great demo here in terms of what that starts to look like as you start treating your data-as-code, and you're dealing with products that you're then branching, checking, and merging in terms of the data management. You're going to get cleaner results that are a lot easier to manage.

And so this is what Dremio looks like today––we have 3 main capabilities. We have the unified analytics capability––this is where you're able to look across all of your data sources, find the data and create the views that you need; we have a universal semantic layer; and then you're able to govern across all of those different sources as well. We have a SQL query engine––it's excellent on price performance, and we make that a huge pillar for us––and then we have this technology related to our intelligent query engine, called Reflections. You'll see that as Reflections acceleration. And then in lakehouse management, we have an Iceberg-native catalog, we've talked about git-for-data, and we do automatic data optimization. And then we have two options––you can run self-managed, or you can run in the cloud, or you can do both, which provides a very flexible architecture. And there are the connections––ODBC, Arrow Flight, REST––to all the different client tools that you may want to connect to Dremio.

The Dremio Difference

Read Maloney:

So what makes Dremio different? There are a lot of different providers out there that you could look at and say, hey, is this going to be a component of my data platform? And if so, where does it fit? Overall, we're going to deliver best-in-class TCO as the SQL engine in your platform. We do a set of other things I just talked about: you won't need to make copies of your data, and you won't need to set up separate environments––that's something that git-for-data helps you with as well. All of this contributes to being able to operate your data platform very efficiently, as well as having best-in-class price performance.

The next is––fastest time to insight. This is what we talk about in terms of shifting left: we're making it much easier for analysts to go and create the views they need in a governed manner, and query them, while the engineers can go and help optimize and ensure that the platform is still running very fast, but also running in a very cost-effective manner. And you don't bottleneck on one versus the other––I'll have a slide on that later related to the intelligent query engine and the reflections that we talked about.

And then we have––ease of use through self-service data. This is an intuitive UI with a universal semantic layer, and then we have a flexible and open architecture. Again, with open data and open table formats like Apache Iceberg, and being able to run in the cloud or on-prem, it gives you the most flexibility in how to set up your data platform, so you're not locked into proprietary formats, where we just repeat the cycle and end up with Teradata in the cloud with Snowflake. And I'll show you later some of the savings customers have seen from moving to a more flexible architecture while also saving a lot of money as they do it.

Dremio Use Cases

Read Maloney:

Here are some common use cases for us. We see customers that are modernizing from Hadoop––they have Hadoop still on-prem––migrating that to object storage, and then using Dremio with the object storage to do their warehousing on the lake, so the lakehouse takes over from what they were doing with Hadoop. We have unified data access for analytics and AI––again, bridging across multiple data sources and eliminating data silos.

Warehouse to lakehouse––we're seeing more and more of this, where if you have a traditional warehouse and you're saying, look, I need to reduce costs and I want scale flexibility, you're starting to see that move over into Dremio.

Data virtualization––again, this goes to helping the business move much faster. In the past, data virtualization usually slowed your system down. That doesn't happen with Dremio, so you can have fast performance while still allowing the business to create the virtualized data marts and views that help them move very quickly.

Data mesh––we support all the concepts around data mesh very well, including data products. As we said, we also have governance built into the platform as well.

And then lakehouse DataOps––and this will relate to how you're moving data into your lake, how you're building and managing your data products, and this is something that Alex will demo later.


Read Maloney:

Okay, thanks for launching the poll. We're going to get into one of the use cases we talked about, where we're seeing warehouse to lakehouse, specifically with SnowMelt––with Snowflake moving over to Dremio. But we have a question on the poll right now, which is: what use case are you looking at?

We'll give a moment for people to vote in terms of what you're looking at right now. You may not know; that's fine, too. That's the purpose of this getting-started session: to help you explore where we typically will fit into your overall data platform or data architecture, or how you're trying to support your business.

All right, so we have a lot around unified data access for analytics and AI, and, very commonly, that goes together with warehouse to lakehouse, and data mesh. Excellent, those are the real top ones, and then we will be talking about lakehouse DataOps. So if that's not what you were thinking coming in, that's fine, but we'll show you what it starts to look like when you begin managing your data-as-code and thinking more along the lines of data products overall.

This next part is focused on everyone in the poll who just said, hey, we're looking at warehouse to lakehouse, or who is looking at us from a unified analytics perspective. You may not know what we can do on the right side of the house––so instead of just reading from multiple sources, it's also writing data into the lake, often using a combination of dbt and our SQL engine, and we'll show what that looks like.

Fortune 10 Customer - 75% TCO Savings, $3M Savings in Just One Department

Read Maloney:

We call this our SnowMelt initiative. This is one of our Fortune 10 customers, and this is specific to one of their departments. When we're brought in alongside Snowflake, it's often not just us or just Snowflake––and I'll show you the architecture––it's typically us and Snowflake together, as customers start to cost-optimize their platform. Or they've been sitting there with a bunch of overages, Snowflake's getting expensive: how are you going to manage and do cost control? You'll often use both of us, and so we have a reference architecture for that, based on what we're seeing from quite a few customers, and an increasing number of customers here. So in this one, this is just one department––they saved 3 million dollars; it was 75% savings. And so that was a huge change. One of the other ones––because we support BI directly on the lake without pre-computation and extracts (this is the reflection technology that I'll have a slide on later)––they went from weeks or days to make changes to the reports they had in Tableau to doing it in minutes or seconds. And when we talk about self-service, that's what we mean: we see these huge changes. Specifically, this one is twenty times faster in terms of the customer's timelines.

91% Lower TCO with Dremio Compared to Snowflake

Read Maloney:

And this isn’t isolated. Just so you know, we have a TCO report comparing Dremio and Snowflake. These are customer examples, but we also did one based on the TPC-DS benchmark, as well as some other customers we have, aggregated into groups, and we have that paper online as well. So you can go check it out, see specifically how we did the evaluation, look at each item, and go replicate the evaluation and do it yourself.

Here's one customer who did it with their own workloads rather than TPC-DS––when they ran this comparison, we were 91% lower for them, in terms of what they saw with Dremio versus Snowflake. This is a global leader in manufacturing commercial vehicles.

Warehouse and Lakehouse - Better Together

Read Maloney:

So often this is what we'll see from a combined architecture. Customers will cost-optimize with us. They'll also use us––as you saw, a lot of people here are looking to eliminate data silos and do unified analytics, so you can do that as well. And then you're also setting yourself up with an open architecture, so as you continue to move and ask, hey, how do I ensure that I have control over my platform and that I'm not beholden to one proprietary system, I can continually evaluate across this entire ecosystem of technologies that make up my platform. This is a great way to get there, and so this is often the reference architecture we'll see as customers use both Dremio and Snowflake together.

Hadoop Migration - Rapid Hadoop Modernization to Drive Value

Read Maloney:

A couple of examples on Hadoop––we didn't see a lot of this in terms of the group that's here today, but you'll see similar metrics from other types of migrations when you're moving to Dremio. So in this case TransUnion had a 10x gain, NetApp had a 22x gain, and NCR had a 30x gain when they moved from Hadoop over to Dremio, and on average this self-funds in about three months. So in terms of saying, hey, we're moving this, and we're going to go to an architecture where we're going from HDFS to object store, it's about 3 months––you're able to pay for the project itself in about three months, and then the savings from there just accelerate. Again, if you're at 30x price performance, you can think about the infrastructure difference that you need, how much savings that brings to the organization, and then how much that simplifies management overall.

Hadoop Modernization: Install, Query, Save

Read Maloney:

So this is often what we see as people start the movement, and I'll show a couple of specific customer examples. They'll run Hadoop and object storage, and they'll put us over the top, querying both, and you'll see immediate benefits from that. Then over time they continually move all of the data into object storage, and it can be transparent to the business user. The business user is just writing queries and, over time, just getting faster and faster performance, and all of the data consumers are sitting there with their tools connecting into Dremio, so this part of the movement can be transparent to them as that process takes place.

Architecture Before Dremio

Read Maloney:

Here's an example specifically from NetApp. NetApp is not only a partner whose architecture we'll show, they're also a customer: they moved from Cloudera to a combination of NetApp and Dremio. And this is what they looked like beforehand, with 7 petabytes of data and 4,000 cores that they moved over.

Architecture After Dremio

Read Maloney:

And so they used their product called StorageGRID, and when they consolidated and moved over, they now have 3 petabytes of data in their Active IQ data lake, with about 9,000 tables that they're running. And so this is how they've set up their overall architecture to support their business.

Business Outcomes:

Read Maloney:

Some of the differences they saw: they had 22 times faster query times, a greater-than-60% reduction in TCO, and a reduction in compute. And their TCO savings in the first year––this was just the first year––were already at 30%, even as they were going through the migration. And so this typically will end up around the 70% to 80% mark, almost 8 to 10 times less expensive.

Dremio and VAST

Read Maloney:

And then one of our other partners in this space is VAST Data. VAST Data is also involved when we see customers migrating from Hadoop, or when they're looking at platforms that are going to run multi-cloud, hybrid, or on-premises.

Reflections - Transparent Query Acceleration

Read Maloney:

So there are two pieces I want to talk about. The first is Reflections––this is a technology that only Dremio has. An element of what reflections do is that, as part of our query engine, we're able to rewrite SQL on the fly. So what'll happen is we have the data sources come in. Purple, in this case, is physical data, green is virtualized views of the data, and this little lightning bolt is a reflection––I'll explain what that is. So the data comes in––you have voter survey data in this case––and you may go and create a view off of that that's enriched voter data, and then you may want specific views for Ohio and Illinois. Well, as you're querying that data, you're doing schema-on-read, and eventually you're hitting back into these tables, and that can create a performance problem. And so what we do is we will recommend and tell you, hey, look, we think you should have a reflection on this view.

And we will then go and help create that, and then we will help maintain that for performance. What's secret here––or what's unique here, I should say––is that the queries written against Ohio voters, for example a dashboard based on that view, never have to change to reflect the materialization of that view delivered by reflections. The SQL will be rewritten on the fly by the query engine, and that's what I mean when I talk about the query engine being intelligent. It will recognize the data, and then the reflection, and it will rewrite the SQL to take advantage of it. And so you get both the setup where the users can go create views, put dashboards on those views, and move quickly, and the engineers working with reflections to help manage system performance and cost as those dashboards keep running directly on the views. So this is a unique item that only Dremio has.
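The transparent rewrite described here can be pictured with a small sketch. This is not Dremio's planner, just a toy model (all names invented) of the idea: a dashboard keeps querying the same view, while the engine silently substitutes a materialization when one covers that view or one of its ancestors.

```python
# Toy sketch of the idea behind Reflections: the planner transparently
# substitutes a materialization when one can satisfy a query. The names
# and the matching logic are invented for illustration; Dremio's real
# planner does cost-based matching against maintained materializations.

# View definition tree: each view points at its parent dataset.
views = {
    "enriched_voter_data": "voter_survey_raw",
    "ohio_voters": "enriched_voter_data",
    "illinois_voters": "enriched_voter_data",
}

# A "reflection" materializes one node in the tree.
reflections = {"enriched_voter_data": "materialization_042"}

def plan(target: str) -> str:
    """Walk from the queried view toward the raw table; stop at the first
    node covered by a reflection and read the materialization instead."""
    node = target
    while node is not None:
        if node in reflections:
            return f"scan {reflections[node]} (reflection)"
        node = views.get(node)  # climb toward the physical table
    return f"scan {target} from raw data"

print(plan("ohio_voters"))       # hits the reflection on the parent view
print(plan("voter_survey_raw"))  # no reflection on the path -> raw scan
```

The dashboard's query (`plan("ohio_voters")`) never changes; only the plan's scan target does, which is exactly why the materialization is invisible to the user.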

To Simplify Pipeline Management, We’ve Created Git-inspired Version Control

Read Maloney:

And the next one––again, Alex is going to show this in the demo we talked about––is git-inspired data versioning; it's related to next-gen DataOps. What'll happen is, you have a main branch of data––think about that as a data product––and you're going to add, in this case, data to that data product, so you're going to branch it. This is not a physical copy; it's a pointer back to the actual data. You're going to then ingest the data and run quality checks against it. If the quality checks pass, you then merge that data, and you have the next version of your data. This helps ensure you have clear data ownership and clean data coming in, and if something does go wrong, it's now much simpler and faster to troubleshoot where the problem happened. This gets us into a much more automated way to deliver data to the business, which is why we say this is related to next-gen DataOps.

And then on the other side, from the data science perspective, from experimentation, I can now just create a branch of that data to experiment. Again it's not a physical copy! I can then work on that, do feature engineering, bring other data into that branch, and then, as I have features, I can bring those back into my data lake and store them, or I can go and just shut down the branch if I didn't discover or learn what I wanted to do.
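The branch, ingest, check, merge flow just described can be sketched in a few lines. This is a conceptual model only, with invented names; in Dremio the same flow runs through catalog commands, not Python. The key point it illustrates is that a branch is just a new set of pointers to immutable snapshots, never a copy of the rows.

```python
# Minimal sketch of branch -> ingest -> quality check -> merge.
# A branch copies only pointers to table snapshots, not the data itself.

catalog = {"main": {"orders": ["row1", "row2"]}}  # snapshots main points at

def create_branch(name: str, from_branch: str = "main") -> None:
    # Copy only the pointers (a shallow dict copy), never the row data.
    catalog[name] = dict(catalog[from_branch])

def quality_check(table: list) -> bool:
    # Stand-in check: no null rows made it in.
    return all(row is not None for row in table)

create_branch("etl_2024_02")
# Ingest on the branch: main's pointer still references the old snapshot.
catalog["etl_2024_02"]["orders"] = catalog["etl_2024_02"]["orders"] + ["row3"]

if quality_check(catalog["etl_2024_02"]["orders"]):
    # Merge = fast-forward main's pointers to the branch's snapshots.
    catalog["main"] = catalog["etl_2024_02"]
    del catalog["etl_2024_02"]

print(catalog["main"]["orders"])  # ['row1', 'row2', 'row3']
```

An experimentation branch works the same way: create it, work on it, and simply delete the entry to throw it away, all without ever copying rows.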


Read Maloney:

So before we move over to the demo, just a quick summary. What we see from customers is they're trying to improve data adoption in their organization. They're trying to move a lot faster, so they're reducing their mean time to insight, and they're trying to lower their costs at the same time that they're trying to deliver a lot more to their business from an adoption standpoint, which we think about as shifting left. All these benefits come from shifting left––and to do that, you need an intelligent query engine, you have to have an intuitive self-service experience, and you have to have these next-gen DataOps capabilities, which is specifically what we're working on at Dremio. It's why we're in business, it's why we exist, and it's why customers are continually looking at and choosing Dremio to be a core part of their data platform.

So with that, Alex, I'm going to turn it over to you to show Dremio, so people can get started and use these capabilities for themselves, or try them out for free.

Dremio, The Unified Lakehouse Analytics Platform

Alex Merced:

Hey, everybody! Welcome to this demonstration of Dremio, the unified lakehouse analytics platform that aims to make data lakehouse architecture easy, open, and fast. It provides you with unified analytics to combine all your data sources, such as object storage or on-prem data lakes, along with the other data sources in your long tail of analytical sources, such as relational databases, NoSQL databases, data warehouses, and more. You can combine them all through Dremio's universal semantic layer, where you can govern the data and make sure you maintain top-level security. Not only can you curate the data, you can query it with a SQL query engine that has top price performance, use features unique to Dremio, such as Reflections, to accelerate those queries, federate those queries, and, again, operate across multi-cloud, hybrid cloud, and on-prem environments.

Shifting Left Reduces MTTI and Shortens ETL Pipelines

Alex Merced:

But oftentimes, the center of your analytical world will be your data lake as you build a data lakehouse, so being able to manage that lakehouse so that it just works is a big part of what Dremio delivers. Its lakehouse management features provide you with a modern data catalog with git-for-data features, along with automatic table optimization, so that those tables are always crisp and always fast when they're queried. Again, this platform can be brought to you either through a self-managed Kubernetes deployment or through a cloud-managed SaaS deployment. So again, Dremio allows you to shift left––to move more and more of your workloads away from the extra storage cost, the extra compute cost, and the extra egress cost of having to work in a data warehouse, and to do more of your workloads on the data lake. That way, Dremio can make your data lake the center of your analytical world, because you'll be able to connect your data sources directly to Dremio and then deliver the data directly to your analysts, data scientists, etc. Dremio provides that intuitive self-service experience, that intelligent and powerful query engine, and those next-generation DataOps capabilities with its integrated catalog, git-like features, table optimization, and more. So now let's begin with the demonstration.


Alex Merced:

Hello, and welcome to this demonstration of the Dremio unified analytics platform. In this demonstration, we're going to assume that I'm a data engineer, and I'm getting different types of data requests from data analysts and data scientists about curating our data platform. You can see here several different requests that I'm receiving in my backlog that I would like to see in that completed column. So what we're going to do is go through each of these one by one, demonstrate the powers of the Dremio cloud platform, and make life easier for your data engineers, data analysts, and data scientists as we go. So first, let's go over to our Dremio platform. When you're taking a look at the Dremio platform, the bottom line is you're going to have your default catalog, which is generally where you're going to curate your semantic layer, right over here. Then you can have any additional catalogs and any connections to object storage, databases, and data warehouses, all curated here, which you can add easily by just clicking “add source” to see the variety of different sources that you can connect to.

Where we're going to be working today is in this catalog––we have our Arctic catalog, we have our Demos catalog––and in here I have our Feb-get-started folder, where we're going to be doing some work. As you can see, what I have is a “departments” and an “employees” table that we're going to want to work with, but we have some requests regarding this raw data. Whenever you see purple tables, those are raw data tables––not, as we'll see later on, views. Views are not copies; they are just logical views on that data, so you're reducing your total storage footprint by not making copies, in Dremio’s no-copy architecture. So the first request that I'm getting is to create a view of employees and the departments they work for. If I take a look at this, I've broken it down into several steps. I'm going to create a view joining the “employees” and “departments” tables, because I need to have that all in one view, but at the end of the day, after I create that, I'm going to create another view that only shows me the employee and their department. So how would I go about that? Well, fortunately, a cool feature of Dremio’s is that you're able to save SQL scripts. So here I have a script that already does the join part of this. I have it right here. And what I can do is set this to the right context, to where that table is, so I'm going to set it here to Feb-get-started. And what's it going to do? It's going to create a view called “raw department employees” that is a join of those 2 tables. So now, I can just run that query.
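For readers following along, the script just described is shaped roughly like the following. This sketch uses standard SQL run through SQLite (so it executes anywhere), with invented table and column names; the actual Dremio script would reference the full catalog path and use Dremio's dialect.

```python
# Sketch of the "join two raw tables into a view" step, using SQLite as a
# stand-in for Dremio's SQL engine. Table and column names are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT,
                        last_name TEXT, department_id INTEGER);
INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Marketing');
INSERT INTO employees VALUES (1, 'Ada', 'Lovelace', 1),
                             (2, 'Grace', 'Hopper', 2);

-- A view is only a stored query: no rows are copied.
CREATE VIEW raw_department_employees AS
SELECT e.*, d.name
FROM employees e JOIN departments d ON e.department_id = d.id;
""")

rows = con.execute(
    "SELECT first_name, name FROM raw_department_employees ORDER BY first_name"
).fetchall()
print(rows)  # [('Ada', 'Engineering'), ('Grace', 'Marketing')]
```

The view carries every employee column plus the department's `name` column, which is exactly the "raw department employees" shape the demo then narrows down for the analyst.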

A couple of things to notice. One: each query gets its own tab, so you can run multiple queries in the same window and see each result separately. Two: I can have multiple query editor tabs, so we can have completely separate workspaces working on different queries and different datasets at the same time, making this a robust and easy-to-use SQL editor. As you can see, I've created the view. Now I can head back over here to Demos, 2024, Feb-get-started, and there's the view I've created. Notice how the icon is green: this means it's a view, not a copy of the data, so technically the only physical data I have stored are these two datasets right here. The view just lets me grant access to my data analyst who needs this data. But they don't want all of this; they want just the name and the department. Just to show you: when I run a query on this, I get the joined data. But that's not what the analysts wanted. They wanted just first name, last name, and the department name. Looking over the columns, that's going to be this name column right here; we'll just rename that as department.

So now, I'm going to run that query just to make sure it gives me the output I expect. And there we go: three columns, so that looks like the view my analyst requested. Now I can just click "save as view." This can also be done with SQL, and SQL can be sent not only through the Dremio UI but also through JDBC/ODBC, Dremio's REST API, and Apache Arrow Flight, allowing a lot of the aspects of Dremio to be automated by external code if you wish.

So now I can save this view, and again, I can save it where I want it: here in the Demos folder, in 2024, under Feb-get-started. We'll call this "name and department." I'll save that, and that completes the step. Now I can go back over here to see the results: if I head back to Demos, 2024, Feb-get-started, we see the name and department view. Again, I can then grant access to that user. I can click over here under settings, where it says privileges, and grant access to this view to a user of my choice, choosing their level of access to that view or table, so I get fine-grained access control. On top of that, you have built-in column- and row-access rules, so you can apply very granular access to ensure your users only see what they're supposed to have access to.
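Granting access like this can also be expressed in SQL; here's a sketch, assuming a hypothetical view path and user name (the exact privilege grammar may vary by Dremio edition):

```sql
-- Grant read access on the view to a single user.
-- The view path and user name here are placeholders.
GRANT SELECT ON VIEW Demos."2024"."Feb-get-started"."name and department"
TO USER "analyst@example.com";
```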

Cool, so we created that view and satisfied this ticket: I created the join, and we created the view with name and department. Wonderful. That means this ticket we've been working on is now completed.

So now let's bring over the next ticket. In this next ticket, it turns out the analyst wasn't happy with the view we created. They didn't want separate first name and last name fields; they wanted a single field. So they filed a ticket requesting that field. Normally, with other systems, this would require me, the data engineer, to go back and change the view definition to provide this for them. But Dremio is very self-service, so the analysts can deliver this for themselves. If I'm the analyst, I can just log into my account, where I have access to the name and department view, and make the changes necessary to get the view I want. And actually, there's something that makes this even easier: I can use the text-to-SQL feature over here. I grab this field here, and I go grab the dataset I want to work on. That's going to be under Demos, 2024, Feb-get-started; here is the name and department view. I put that right there and say, "Hey, I want the first name and last name field to be a single field called full name. And in this view also include the department field."

So I can click generate, and it's going to generate the SQL for me to do that and let's see here, I have exactly that. So now I can copy this SQL, or I can just insert it. And see, it'll do that for me. I like to always get rid of some of this extra white space there, but yeah, so then I can run this to see if this gives me the result that I want…name, department. 

Oh, I have to set the context. Right now, since I have no context set, it's going to expect the fully qualified name, so I would have to write something like Demos.2024.Feb-get-started. I could do that, but the beauty of the Dremio engine is that it makes life a lot easier: I can click here on the context and set it to exactly where I expect this table to be, which is right here. And notice, these errors are pretty self-explanatory, so I can see that it's not finding the table it's looking for, and I can go back, adjust, and run my query again. Good errors make for good troubleshooting. Now I see the results I want, the view that I want, so I, as the data analyst, can just save that as a view for myself and call it full name department. Save it back in that same folder, Demos, 2024, Feb-get-started, and hit save. Now that view is available in that folder, so if I go over there... there it is! Full name, department. I have fixed it.

So there we go. I can say, hey, I have made a full name field, and this time the data engineer didn't have to do it. That's the beauty of the self-service aspect of Dremio. There are going to be a lot of times when data analysts and data scientists can handle many of these requests themselves, with the data you've given them access to, instead of constantly waiting, sometimes weeks, for the request to reach the data engineer, who already has a huge backlog of requests coming in.

The next thing we want to do is update that original view, because they realized they would also like the budget column in that view, and that is going to require the data engineer. First, let me move this ticket over to completed, and move this one over here. Essentially, we're going to update the view we already created and make sure it includes that column. I'll close out some of these extra tabs. I want to update this name and department view, so I'm going to click on this pen so that I'm in edit mode. See here, it's showing me the original SQL instead of just the view's namespace, and this is going to allow me to update that view. Now I can go in here and say, hey, I want to add budget. Then I can pop this out, just so I can see any information I want here, because usually I can see the different columns here. Let's type in budget... let's just run the query and see if it gets me the result I wanted...

And I see here that first name, last name, name, and department come through, but department is showing up as a number. The reason is the same mistake I've made before: this should be aliased as department, and then I need to add another comma for budget. You see, that's why I run the query first; I can quickly confirm what I did with a visual validation and then go back and correct it. Catching this now is a lot easier than finding out you made a mistake way down the line. And there we go, name as department. Although, I see I have both a name and a department field, and I should only be seeing one. Examining first name and last name... oh, I see, I have name twice, so let me take out name there... and let's run that again.
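This kind of missing-comma mistake is easy to make in SQL, because a bare identifier after a column expression silently becomes an alias. A sketch of the pitfall, with hypothetical column names:

```sql
-- Buggy: the missing comma makes "budget" an alias for "name",
-- so the budget column never appears and headers come out wrong.
SELECT first_name, last_name, "name" budget
FROM raw_department_employees;

-- Fixed: alias "name" to department and select budget as its own column.
SELECT first_name, last_name, "name" AS department, budget
FROM raw_department_employees;
```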

And there we go: there's the view I want, and now I can just click save view. I've just saved a new version of the view, and again, these views are not copies, so nowhere have I duplicated data up to this point. I'm creating assets that let people see the data as they need to see it, and only the data they need to see. Another cool thing about Dremio is that since I'm using the Dremio catalogs, currently called Arctic catalogs, I can go into this catalog and see all the commits. Every change is a commit to the catalog, so I can see, hey, look, there was that ALTER VIEW statement done 30 seconds ago; that was me updating the view. I can see when I created these views, and I can see who did it. This gives me a very high level of observability into what's going on in my catalog, what changes are being made, and who is making those changes. It also gives me the ability to roll back to any of these commits, so just like git, if something goes wrong, I can roll back to a commit where things were good. I can create tags, so I can tag particular commits that I want to easily return to; I can say, hey, I want to scan or run a query against the catalog as it was then. And I can create branches, which we'll be showing you shortly. But going back over here: we did that, we updated the view to include the budget. Wonderful.
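As a sketch of those catalog operations in SQL (the tag, catalog, and table names here are hypothetical, and the exact grammar may vary by Dremio version):

```sql
-- Tag the current state of the catalog so it can be queried later.
CREATE TAG monthly_snapshot IN demos_catalog;

-- Query a table as it existed at that tag (time travel on the catalog).
SELECT * FROM employees AT TAG monthly_snapshot;
```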

And next, we'll be ingesting some new data into the employees and departments tables. Actually, first let me move this ticket to completed, and then move this one over here: ingesting new data into the employees and departments tables. What steps am I going to take? One thing I'm going to do is take advantage of those git-like features, because I don't want to accidentally ingest the data, discover a mistake with it, and then be stuck plucking out bad records and backfilling, all that tedious work, while also affecting production queries. What I want to do is isolate that work from my production branch, the main branch, and get it done there. So I'm going to create a branch to isolate the changes, do all the work on that branch, and then validate the changes: whatever queries I run, I'll double-check to make sure there's nothing wrong. Once all those queries have run successfully and I'm happy with my results and the changes, I can merge those changes in. And again, while I'm going to do all this in the UI, it's all done through SQL, and like anything SQL on Dremio, it could be automated using external scripts that connect via JDBC/ODBC, our REST API, or Apache Arrow Flight. So it's all very, very flexible.

So now let's go do it. Let me go back to our query editor; I have a script for just this purpose. Here's the script. Let me make sure I set the context to the folder we're working out of today, which is this Feb-get-started folder. With my context set, if I were to walk through this: first, I'm going to query the tables as they are, just to show you how many records are in each table before we do anything. Then we're going to create and switch to another branch, which, as you can see, is very simple SQL, so taking advantage of this feature is pretty easy. I will then ingest the new records into the respective tables. We'll then query the count of the tables, so you can see that on this new branch there is going to be new data. On that note, I want to make sure we use a unique new branch name, so I'm going to call this Feb 19, and I'm just going to change that everywhere the script refers to the branch, so I'm not using a branch that already exists; the idea is that you create a new branch to isolate each particular set of changes. So again: I'm creating the branch, switching to that branch, and ingesting the data; then we're going to count the records, so you can see the data has changed; then we're going to switch back to the main branch and query it, so you can see the main branch has not changed and the count is still the same. Then we will merge the changes and query the main branch again, so you can see the counts now reflect the new data. Theoretically, before that merge is when you would do your validations, whether you're using dbt, Monte Carlo, Great Expectations, whatever your preferred way of validating data is, or your own queries against your business rules or constraint tables.
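The steps above can be sketched in SQL roughly like this (branch and catalog names are assumptions, the INSERT values are elided, and Dremio's branching syntax may vary slightly by version):

```sql
-- 1. Check the starting counts on the main branch.
SELECT COUNT(*) FROM departments;
SELECT COUNT(*) FROM employees;

-- 2. Create an isolation branch and switch to it.
CREATE BRANCH feb19 IN demos_catalog;
USE BRANCH feb19 IN demos_catalog;

-- 3. Ingest new records on the branch; production is untouched.
INSERT INTO departments VALUES (...);
INSERT INTO employees VALUES (...);

-- 4. Validate on the branch, then switch back to main.
USE BRANCH main IN demos_catalog;

-- 5. Merge the validated changes into production.
MERGE BRANCH feb19 INTO main IN demos_catalog;
```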
So with that, I'm going to go right over here, run this query, and let it run. Notice how many queries are running right here in this one editor: I have the editor in this tab, and each query down here gets its own tab, making it easy to get a lot of work done very quickly. I can already see the results of completed queries. The original query against the tables shows a count of five in the departments table and a count of ten in the employees table; that's before we make any changes. Query three creates a branch, so I can confirm the branch has been created, and query four shows that we have switched over to that Feb 19 branch. Then I can see that from that branch I've inserted 5 records into departments, and notice how it highlights the query I'm referring to as I go: if I click on query 6, it highlights over here, because that's the query this particular tab corresponds to.

Here I ingested another 5 records into employees. So now, when I query the ETL branch, I'll see that departments has 10 records, versus the 5 it had before, and employees has 15 records, versus the 10 it had before. Now I'm going to switch back to the main branch; right here we switched back, so that switches our context. Then I query the tables on the main branch before the merge, and notice we still have 5 records. My main branch, my production data, has not changed: it's still only 5 records for departments and 10 records for the employees table. That's the beauty of this: we've isolated the changes, and the changes are not hitting our production queries.

But once I do the merge, and we can see here that the merge has been confirmed, when I query the main branch I do see the additional records, 10 and 15. So I can isolate my changes and create branches for experimental purposes; there are a lot of use cases for this particular feature, and it can be very powerful. We've seen customers use the pattern of creating a branch every day: at the beginning of each day they create a branch, all the ingestion work throughout the day is done on that branch, they validate it at the end of the day, and then they merge into their mainline production dataset if all validations have passed and the data has been cleaned up, making it easy to ensure the mainline production branch is always top quality. So we've completed that: we created the branch, we ingested the data, validations can occur at that point, and then we merged the data back in. Easy peasy.

Now, the last ticket I have is from a data analyst who is building a BI dashboard, and the problem is they need to improve the performance of that dashboard on weather data. So let's take a look. If I go look for that weather dataset, it should be here in our trusty folder, 2024, Feb-get-started, and we see this weather dataset. Theoretically, we have a user building a BI dashboard on this. Usually, BI dashboards are the result of a lot of aggregation: group-by statements, counts, and those kinds of things.

So what I want to do is make this faster. Dremio has a very unique feature called Reflections. By clicking here where it says edit dataset, it takes me to this screen where I can see the original table, or the original query if it's a view, but I can also see the section called Reflections.

Notice there are two types of reflections: raw reflections and aggregate reflections. Raw reflections essentially replace the need for materialized views: any time you need to speed up queries on the raw rows of the table, you can just use a raw reflection instead. We just turn it on. I'll quickly show you that I can turn on raw reflections, and I can even tailor them to only materialize certain columns, sort by certain columns, or partition by certain columns. And I can create multiple reflections. Typically, you would create multiple materialized views for different query patterns, and users would have to know which one to query. Here, you create the reflections, and Dremio does the figuring out for you: it'll decide, of these 5 reflections, this one makes the most sense to complete this query the fastest on this particular table. So instead of taking one table and turning it into 5 additional namespaces, you have just one namespace, and users simply notice that their queries are faster.

So now, if I go back to reflections, we have our aggregate reflections. These are going to be more important for BI dashboards. In this situation, we have a user building a BI dashboard on the weather data, so theoretically, I can turn on aggregation reflections and choose which dimensions and which measures they're optimizing for. Maybe they're measuring precipitation and snowfall by station and name; those are the types of queries their dashboard runs. Then I hit save, and what it's going to do is create a materialization, a pre-computed version of those results. The user doesn't have to worry about any of this; they'll just notice that their BI dashboard is much faster, with sub-second queries, because all those results are precomputed. Dremio handles both knowing when to substitute in these materializations and updating them on a cadence, and you can also trigger refreshes manually via the REST API, and soon SQL. Bottom line, it gives you an easy way to accelerate, because building a BI cube or extract is often much harder than just flipping a switch like I just did, and it would require a lot more manual maintenance work from the data engineer. Now I get those same benefits at the flip of a switch, and the same goes for replacing materialized views with raw reflections. All these things make the Dremio platform much, much easier.
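Reflections can also be defined in SQL; here's a sketch, assuming hypothetical column names on the weather dataset (the exact ALTER grammar may vary by Dremio version):

```sql
-- Raw reflection: accelerates queries over the raw rows.
ALTER DATASET weather
  CREATE RAW REFLECTION weather_raw
  USING DISPLAY (station, "name", prcp, snow);

-- Aggregate reflection: precomputes the aggregations a BI dashboard needs.
ALTER DATASET weather
  CREATE AGGREGATE REFLECTION weather_agg
  USING DIMENSIONS (station, "name")
  MEASURES (prcp (SUM), snow (SUM));
```

Dashboard queries that group by station and name can then be answered from the precomputed aggregate rather than scanning the raw data.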

Just a couple of other things I'd like to point out. When you're using Dremio Cloud, you also get this jobs page that makes it easy to monitor all the jobs coming into your Dremio cluster, to see who's running what queries, how they're performing, and their breakdowns. On top of that, you have this engines section where you can control the compute you make available to your users: you can have different engines of different sizes and use engine routing to make sure you're routing workloads to the engines you want. For example, I might have high-priority workloads that I want to go to particular engines that will execute them faster, and lower-priority workloads that don't need the expensive compute. I have very flexible control to make sure I'm scaling and managing my workloads in a way that gets me the performance I need without breaking the bank, because the engines can scale up and down based on the rules you set. So you have a lot of flexibility as a data engineer to build what you need, it's easy to do, and it unifies all your data, because again, you're connecting data across several databases, data lakes, and data warehouses.

So hopefully, you guys enjoyed this tour of the Dremio unified lakehouse analytics platform. I'll see you later, and Ciao!


Alex Merced: 

Hey, everybody, welcome back. We have a few questions, so I'm just going to go through them. A few of them we've already answered, but I will speak to them. Oh, Read, did you want to add anything before we answer the questions?

Read Maloney: No, let's just take some questions for the last 4 minutes here.

Alex Merced: Cool, cool. The first question we have is: do you need to use the web-based editor, or can you use VS Code or another IDE? You can edit the code wherever you like. The Dremio web UI is just a nice place to do it, but you don't have to, because you can send SQL to Dremio through the REST API, JDBC/ODBC, or Apache Arrow Flight. Using any language, you can access those, and most tools will connect through one of those avenues. So, using a tool like dbt, you can orchestrate the whole deal, do all of that in your favorite IDE, and keep it in git version control with all those benefits.

The next question: will the connection from the IDE to Dremio be configured automatically, like when targeting SQL Server? Right now, there aren't any built-in IDE integrations, like a Visual Studio extension. But connecting to Dremio is pretty easy, and there is a library on PyPI that helps. You can connect with Apache Arrow Flight directly, but a little library called dremio-simple-query makes it easier: with that, literally within one line of code you'll be connected to Dremio, whether self-managed software or SaaS, and you can ask for the results to be returned as a pandas or Polars DataFrame, an Apache Arrow table, or a DuckDB relation. Whichever way you like to write your local scripts, you just send the SQL, get the results back packaged the way you like, and move on with what you're doing. So that makes it pretty convenient.

Next question, thanks. Basically, this is being positioned as a unified analytics alternative to OneLake and Fabric from Microsoft. The way I would put it is that one of the problems we solve is similar, in the sense of creating a unified hub for your data world. A couple of differences: we're cloud agnostic, so you don't have to marry any of the three big cloud vendors to use Dremio as your data hub. We also have a variety of connectors that connect you to sources across multiple clouds, plus reflections and cutting-edge DataOps features that are unique to the Dremio platform and make us stand out in this space.

Read Maloney:

Yeah, Alex, the only other thing I'll add, going back to what we were talking about at the beginning, is the TCO angle. One of the things those differences help with is not just ease of use, but helping customers shift left, which is a big focus for our company; that's how we talk about being a unified lakehouse platform for self-service. All of those things make the organization more efficient, and because we're so focused on performance and price-performance, we end up having the lowest total cost of ownership of any of the platforms as well. We're coming at it from both angles: the core engineering side, being efficient on behalf of our customers, and the usability side, making sure that different personas, like the consumer analyst and the engineer, are both much more productive. When you add those two things together, you get the highest efficiency with a platform like Dremio.

Alex Merced:

We have a couple more questions to go. The first one, thank you: are reflections effectively materialized views? The quick answer is that they're not views in the sense that you're not creating a separate thing the data engineer has to separately manage and that creates a separate namespace the analyst has to be aware of. Instead, you're asking Dremio to optimize a query. It does create a materialization, a physical construct, in this case in the form of an Apache Iceberg table, but Dremio will intelligently substitute it, so it won't just use it for that particular view; it'll also use it for views created from that view. So you get more bang for your buck, and on top of it, it's easier to manage and easier for the analyst to take advantage of.

Read Maloney:

Yeah, one thing to add, Alex: the big difference is that the query engine can rewrite the SQL on the fly with reflections. So if the analyst creates a view and puts a Tableau dashboard on that view, all self-service, that's great, but normally you might run into some performance issues. In this case, the engineer can go and create a reflection for that dashboard without ever speaking to the analyst, just to accelerate it. It's transparent, so those groups can go off and move fast, or the engineers can say, okay, I need to create a reflection here, and the query engine will take care of the rest. That's unique to Dremio.

Alex Merced:

Cool, and one more, and then we'll have to wrap up, just as a heads up. The question: how does the cloud-agnostic approach compare to OneLake shortcuts, which allow consuming data across clouds without duplicating or mirroring the data? The bottom line is that a lot of aspects of Dremio's architecture help you avoid having to duplicate data, and Dremio interfaces with many of the same tools. So many of the things you would share using OneLake within Azure, you can share within Dremio, so you often get that same feel plus the benefits of Dremio's features and cross-cloud capabilities.

With that, I just want to say thank you to Read for coming onto the show this week. Also, come join us again soon! We have some pretty exciting episodes coming up regarding Snowflake TCO and some other great topics, and make sure you register for Subsurface, May 2-3. With that, thank you very much. I'll see you all next time.
