Dremio Fall 2020 Release Webinar
Enable High-Concurrency, Low-Latency BI on a Cloud Data Lake to Shrink Your Data Warehouse Cost
The data lake has emerged as the preferred data repository in the cloud. Yet BI analysts have benefitted little from the cloud’s infinite supply of on-demand compute resources and inexpensive storage. Instead, they’ve been forced to attempt to analyze data stored in cloud data lakes by either:
- Querying the data in the cloud data lake using SQL engines far too slow for modern BI tools
- Waiting for their data team to move subsets of data into data warehouses and data marts
But Dremio’s Fall 2020 Release solves these issues—and can save you a significant amount of money and waiting as a result.
Join us for this webinar where we’ll explore how innovative, new Dremio features enable high-concurrency, low-latency BI queries directly on Amazon S3 and Azure Data Lake Storage.
We’ll discuss and demonstrate:
- New Apache Arrow-based capabilities that deliver sub-second query response times directly on cloud data lakes
- How you can support thousands of concurrent users and queries
- How Dremio accelerates performance >100x for data warehousing workloads that use star schemas
- A new, built-in Power BI integration that lets users immediately create beautiful dashboards on S3 or ADLS with Power BI
Tomer Shiran, Co-founder, CPO, Dremio
Tom Fry, Director of Product Management, Dremio
Hello everyone. Thank you so much for being here with us today. We will start the webinar in approximately one minute. Thank you. All right everyone, thank you so much for being here with us today. Good morning, good afternoon, or good evening, depending on where you are joining us from. I want to welcome you to this webinar titled, Enable Low-Latency BI on a Cloud Data Lake to Shrink Data Warehouse Cost. Before we start, there are a couple of housekeeping items that I want to run by the audience. First, if you have any questions, we will have a Q and A session at the end. However, I encourage you to place any questions that you may have on the questions panel on the webinar console, and that way we'll have them ready for you by the time we get to the Q and A session.
Also, this presentation will be recorded and made available along with the transcript on our website later this week. Today, we have with us Tomer Shiran, our co-founder and CPO here at Dremio. Hey Tomer, how are you?
Good. Thank you Lucio.
Great, great to have you here and we also have Tom Fry, the director of product management here at Dremio as well. Hey Tom.
Hi Lucio, it's great to be here today.
Glad to have you. All right. Without further ado, the time is yours and take it away.
Thanks Lucio. Thanks everybody for joining us this morning or I guess depending on where you are. We have a big audience today, so I expect a lot of questions at the end. Feel free to type them in here. What we'll talk about today is just we'll cover a little bit of the trends that we're seeing in that cloud data lake market, and then dive into some of the new technologies that Dremio's released in its fall 2020 release. The biggest trend in the market here is that data lake storage has emerged as the default data repository in the cloud.
If we think about what companies are doing, and we see this across every company, whether it's large enterprises that are moving to the cloud or technology startups that were maybe born in the cloud, S3 on Amazon and ADLS on Azure have become basically the default bit bucket. It's now where everybody is dumping their data or creating their data, and really that's where it's landing before it goes anywhere else. Why is that? Well, there are actually lots of reasons, but among other things, it's the low cost and the infinite scale, because we can put as much data as we want into something like S3 or ADLS, and not have to worry about whether we have enough capacity or not.
We also largely don't have to worry about costs so much, because it's really inexpensive, right? We can now store the data for $20 per terabyte per month more or less. It's available in many different regions worldwide, so we can have our choice of dozens of different regions where we want to store the data. We have all sorts of options around availability and durability. We only pay for what we use and then it's very, very highly available and durable, so we don't have to worry about that. It's really changed the game in terms of storage management and the fact that we just have this service that we can store as much data as we want at a very low cost. It is something that never existed before.
We never had this kind of an experience with any storage system on-premise, or even other types of storage in the cloud, right? Hadoop was never like this. You had to start a cluster, you had to manage it, you had to worry about uptime, you had to worry about capacity, which really inhibited adoption of that platform, but here with things like S3 and ADLS, we're just putting our data there. It's a no-brainer. The second big trend that's happened, and we've seen this really evolve over the last decade, is the rise of open file and table formats. It's not enough that we have a standard place to put the data. It's also that we now have a standard way to represent the data in those systems.
Of course, originally we had delimited text files. Those have been around for many, many decades, and then came the rise of formats like JSON that can be parsed and read by many different systems. Then more recently, the rise of columnar formats like Parquet and ORC, where the data is not just stored in an open format or an open source format, but it's also stored in a format that's very highly optimized for analytical workloads. That's one thing that's new, or relatively new. The second thing is that we now have table formats that are both open and open source and widely accessible.
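The advantage of columnar formats mentioned here can be sketched in a few lines of plain Python. This is only an illustration of the layout idea, not how Parquet or ORC actually store data on disk:

```python
# Why columnar layouts suit analytics: an aggregate over one column
# only touches that column's values, instead of every field of every
# row. (Illustrative sketch only; real formats add compression,
# encodings, row groups, statistics, etc.)

rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 30.0},
]

# Row layout: an aggregate must walk every full record.
total_row = sum(r["amount"] for r in rows)

# Columnar layout: each column is stored contiguously, so an
# aggregate reads just the one column it needs.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 75.5, 30.0],
}
total_col = sum(columns["amount"])

assert total_row == total_col == 225.5
```

The same answer comes out either way; the difference is how much data a scan has to touch, which is what makes columnar formats "highly optimized for analytical workloads."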
If you think about Hive Metastore, of course, being the original one here, and Amazon has the Glue catalog, which is a hosted version of that, but more recently with the rise of Delta Lake and now Apache Iceberg, we have ways to represent tables that even support things like transactions and inserts and updates and deletes and time travel. Apache Iceberg, for those that don't know, is a project created by Netflix with a lot of involvement from many of the big tech companies. Developers from Apple and Netflix, Airbnb, Stripe, et cetera, so a lot of momentum around that project as well. With both of these things, we now have a standard place to put data, and also a standard way to represent data. That should be great, right?
You should be able to now have all your data in these open formats and be able to use it. In fact, that's not the only thing that's great about what we have today as an industry. We also have infinite compute capacity at our fingertips. If you think about the cloud and the infinite supply of compute, not just the storage capacity, that has enabled the rise of these best-of-breed, decoupled compute engines, right? We have different systems. We have Dremio for SQL queries, and we have Databricks for Spark queries, and EMR for Spark queries or jobs. We have data warehouses like Redshift and Snowflake that can query external tables, right? Maybe not their default mode, but they have that support, right?
We have Hadoop engines in the cloud. We have all these different types of compute systems that can run, and all that's enabled by the fact that we have Infinite Compute Capacity from the cloud providers and the ability to rent that by the second, right? We don't have to go to Dell or HP and buy servers, and wait several months for those to arrive. We're basically living in this magical world where we have all our data being centralized in one place, in open formats, and we have these dynamic and elastic best of breed compute engines available to us. Well, what's the challenge then, right? One of the biggest challenges that we've had for a long time is that BI users were not able to take advantage of all this innovation, for two reasons really.
The first being that the typical SQL engines, whether it's open source SQL engines or serverless SQL engines that are out there, really just don't provide anywhere near the performance that a BI workload requires, right? Something like running dashboards, for example. What ends up happening is that you basically can't query the data using things like direct query (that's the Power BI terminology) or live query (the Tableau terminology), and of course, the data in these systems in S3 and ADLS typically is too big for you to import or extract. So that's one option that doesn't work. The second option, which of course a lot of companies have had to adopt, is moving the data out of this infinitely scalable repository and into data warehouses and data marts.
Of course, this requires a bunch of engineering work. It takes time. It means that any changes to your dashboards, or any need to update the datasets requires weeks of work typically and weeks of waiting. If you look at this picture on the right here, you can see a very typical scenario where data has to get moved into data warehouses. Of course, you don't want to move the majority of your data into those systems, because then you're paying twice to store that data and the complexity of having this pipeline running 24/7, and getting it to work and to keep working is very difficult. Oftentimes, you can't get the performance you want directly on the data warehouse.
You create data marts and you create extracts and aggregation tables on top of that, just so that BI users can have decent performance, but by the time you've built this really complex stack, the user can't get data in a timely manner. They can't do things on their own, and they're very limited in terms of the analysis they can do, right? They can look at a very small sliver of data and get fast speed on that, but anytime they want to do something different, or join it with something else, it has to go back to the data team and the users need help, and of course nobody's happy about that situation. This has been a problem for a long time, really, that we have not had a great solution for BI users in this new world of cloud data lake storage like S3 and ADLS.
What we've always had at Dremio and what we've been very focused on is providing best-in-class SQL on data lake storage, so on S3 and ADLS primarily. We do that by providing a very fast query engine based on Apache Arrow. It's an open source project that we created I want to say about four or five years ago, and now has over 15 million downloads a month basically used by every data scientist, but internally, we use Apache Arrow to provide really fast execution and that columnar in-memory processing, as well as an acceleration engine, where we use a number of different technologies ranging from NVME caching to reflections, which allow us to accelerate queries.
Then we also provide that virtual dataset provisioning layer or a semantic layer on top of that, so that it's very easy to make data accessible to analysts and data scientists in a self-service manner. What we've done now with this new release, this fall 2020 release is we've really pushed the envelope in terms of what's possible with SQL on data lake storage. We've specifically focused on the BI workloads, which are quite different from the typical ad hoc exploratory workloads. What I'll do here is I'll hand this over to Tom and he'll talk about the requirements for supporting BI workloads and providing the performance that people expect with BI, and then talk about the new functionality in Dremio that enables that.
Then we'll come back to a quick demo showing you Power BI on top of Dremio and what that experience is like. Tom already...
Thanks a lot for that, Tomer. I appreciate that. What we want to talk about and start off with are some of the requirements that we typically see in environments to support BI dashboards, and then get into a bit about some of the great, exciting features in this release that are particularly focused on these types of patterns. BI dashboards are not new. They've been around for a long time and typically, they've been run on more relational systems or enterprise data warehouses.
When people think about migrating to a data lake architecture, or developing a new cloud data lake architecture, there are certain expectations and requirements that have been served for some time and that now need to be served on that data lake architecture, and that are necessary in order to make BI dashboards successful and productive in a cloud data lake. The first one is obviously concurrency and the ability to scale the number of users on the system. Typically, the number of users on a system will actually be very dynamic, and it'll scale not just throughout a day, but even across different time periods throughout a year.
Think about end-of-quarter, end-of-year type scenarios. There are definitely periods throughout the year where certain dashboards need to be run and visible and executed in real time by literally thousands of individuals within an organization. Think about quarter-end numbers, people taking a look at them, performing analytics on them, et cetera. As a result, on this data lake architecture, BI dashboards need to be able to scale to thousands or tens of thousands of users concurrently processing that data at once. Users also require interactivity. This isn't necessarily ad hoc type work. It's done through a structured BI dashboard, but often users still require interactivity.
People might look at one form of the dashboard. They might make adjustments in terms of the filter that they're going to apply, look at another view of the dashboard, maybe slice the data a slightly different way, et cetera. In order for that to happen, you have to have the processing happen on the data lake at the speed of the individual's thought, where they ask a question of the dashboard, the dashboard presents results, and then they can ask another question. It's necessary to be able to provide very low latency queries in order to serve that interactive type of model. The other aspect is being able to support very efficient processing on star and snowflake style schemas.
Data models have been developed over many years for efficient processing in relational systems, which typically involves structuring things as either a star or a snowflake style schema, and that type of data model typically carries over to the data lake. People are replicating similar types of structures because they are efficient. In order to have efficient processing of BI dashboards on the data lake, it's necessary to be able to perform efficiently on these styles of data models, and we'll get into that in a bit as we talk about some of the features that we have to make that more efficient. Lastly is obviously the consumption layer.
Most end users do not interact with relational systems or their data lake architecture directly; they interact through BI tools, and they interact with the data through other tools on top of the data processing system. As a result, it's necessary to have essentially seamless integration between the end tool and the data lake. We have a lot of exciting announcements in terms of some of the further things we're doing with BI partners in that space as well, because it's not just about Dremio processing, but about letting users efficiently and seamlessly take advantage of that processing through their tool of choice and through their dashboard of choice. If we go to the next slide.
We have several great features coming in this release that are specifically focused on making BI dashboard style workloads very efficient in cloud data lake architectures, both to improve the end user experience, but also to think about cost efficiency and make sure that we're reducing typical costs compared to a data warehouse model, or even costs in a data lake architecture model. We have a few different forms of this. Some are features that are really focused around enabling higher concurrency and lower latency on data lakes. One of these is called Apache Arrow caching, which is the ability to use Apache Arrow within reflections, and that's actually really exciting.
The other aspect is continuing to build out the scale-out story for Dremio. Dremio is a scale-out execution engine, but one of the things that we've done in this release is essentially remove any single point of bottleneck in the system, and we'll get into that. Plus some very efficient filtering features for much more efficiently processing data, particularly on very large, data lake scale datasets, particularly as they grow over time. Typically when people put data into a data lake, it remains there forever and people accumulate many years of data. We have features such as runtime filtering that enable much more efficient processing on very large scale datasets like that.
We also want to focus on connectivity, and that's connectivity on essentially both sides of Dremio: connectivity in terms of the types of tools that can connect to Dremio and how you can essentially get data out of Dremio, and then how Dremio connects to other systems as well. We have an exciting feature called external queries, which really unlocks a lot of the potential of Dremio working with other data systems. Something that I want to highlight as well is that this is not all of the features that we have in this quarter's release. We actually have numerous other features and improvements. I'd encourage everyone to go look at our release blogs.
And our release notes, for additional features that we've included as well, but these are essentially what we really wanted to focus on today in terms of how we're making BI dashboards better, improving the end user experience, and improving cost efficiency as well. With that, let's go to the next slide. To step back a little bit, a core feature inside Dremio to improve the end user experience is data reflections. Data reflections are used to accelerate processing of physical datasets and virtual datasets, and they come in a variety of different forms. It's easy, for example, for end users to say, "Here's a virtual dataset, just please reflect that and make that virtual dataset faster."
We also have other forms of reflections, such as aggregation reflections, where we can precompute measures on specific dimensions of different types of data, so that it's all precomputed and materialized, and ready for immediate processing. Now, the way reflections are built internally is they efficiently store essentially a materialized view of the data in Parquet format in the data lake, and this enables very efficient storage. We have some customers actually making use of a very large number of reflections, because we can so efficiently store those results within the system.
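The idea behind an aggregation reflection, precomputing a measure per dimension so dashboards read a small materialized result instead of rescanning raw rows, can be sketched roughly like this. The names and structures here are illustrative stand-ins, not Dremio internals:

```python
# Sketch: materialize SUM(amount) GROUP BY region once, up front,
# so later queries become cheap lookups against the small result.
from collections import defaultdict

raw_orders = [
    ("EU", 120.0), ("US", 75.5), ("EU", 30.0), ("US", 10.0),
]

def build_aggregation_reflection(rows):
    """Precompute the measure per dimension value (the 'reflection')."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)

reflection = build_aggregation_reflection(raw_orders)

# A dashboard query is now a lookup in the materialized view, not a
# scan over raw_orders.
assert reflection["EU"] == 150.0
assert reflection["US"] == 85.5
```

In Dremio the materialized result would be stored as Parquet in the data lake and substituted into queries automatically; this sketch only shows the precompute-then-lookup shape of the optimization.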
That works really well for analytical queries where maybe you're looking at an entire dataset, but for BI dashboards in particular, you might typically have an acceleration with a precomputed result, a certain aggregation, and that BI dashboard is essentially looking up those results. When we have those types of workloads, the Parquet format can be a little bit of a bottleneck at times because the data is compressed, it puts multiple rows into a block, et cetera. You might have needle-in-a-haystack patterns within a BI dashboard, because the BI dashboard is going to look up the summary of a customer's spend within a certain quarter, and that might exist as one row within a data reflection.
We want to be able to very efficiently get that data out. What we're introducing this release is essentially caching reflections in Arrow format. This technology essentially takes the reflections, that precomputed, materialized view of the data, and stores it locally within the execution cluster in Apache Arrow format. Now, one of the great things about this is it means that we can take that data that's cached locally within the execution engine of the system, and load it directly into the processor's memory with no decoding or decompression required at runtime. Again, this is very efficient for BI dashboards on top of the data lake.
This essentially means during operation that data is being read from a local cache, typically NVMe, directly into the processor, and then it's directly being passed to the end user. It is built utilizing our cloud columnar cache technology that we call C3. It has all the advantages that we announced last year with C3, particularly in terms of administration and behind-the-scenes operation. This type of operation is fully automatic behind the scenes. Users do not need to manage the caching at all. There's zero administration; it's fully automatic. Dremio essentially identifies during runtime the bits of information, the columns and the rows, that are more efficient to store and cache locally in Arrow format.
We promote that data automatically for the customer in Arrow format. This essentially happens transparently behind the scenes. One of the great things about it is it tremendously improves the efficiency of these types of workloads on the data lake. In the initial deployments we've had with it, we've seen over a 5 to 10X improvement in cost efficiency. This means that you can have a smaller amount of resources for a given amount of work, or you can put many more users onto the same system, and that's because the end user is seeing their data be processed much more quickly. Since their data is getting processed more quickly, we can have more users on the same amount of physical resources.
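The trade-off being described, compressed on-disk blocks versus an already-decoded in-memory cache, can be illustrated with stdlib Python. This is only a sketch of why decode-free lookups are faster; it is not Dremio's C3 or Arrow implementation:

```python
# Sketch: a needle-in-a-haystack lookup against (a) a compressed,
# serialized copy of a reflection, which must be decompressed and
# parsed on every read, versus (b) a cache kept in its final
# in-memory layout, which needs no decode work at query time.
import json
import zlib

values = [{"customer": i, "q3_spend": i * 10.0} for i in range(1000)]

# On-disk style: compressed, serialized blocks (cheap to store).
compressed = zlib.compress(json.dumps(values).encode())

def lookup_from_compressed(customer_id):
    # Every lookup pays decompression + decoding of the whole block.
    decoded = json.loads(zlib.decompress(compressed))
    return next(r["q3_spend"] for r in decoded if r["customer"] == customer_id)

# Cache style: decoded once, kept in memory keyed for direct access.
cache = {r["customer"]: r["q3_spend"] for r in values}

def lookup_from_cache(customer_id):
    return cache[customer_id]  # no decode or decompression per query

assert lookup_from_compressed(42) == lookup_from_cache(42) == 420.0
```

Both paths return the same answer; the cached path just skips the per-query decode work, which is the effect the Arrow reflection cache is after.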
This has a great benefit both in terms of lowering the cost structure and improving the end user experience. This is a great enabler, again, for accelerating typical patterns we see for BI dashboards in the data lake. Going to the next one. The next major feature is something that we call scale-out query planning. To step back a little bit, Dremio is a scale-out execution engine, and we have customers in production running Dremio across many hundreds of nodes. In addition, this summer we announced a technology called elastic engines, which supports Dremio with multiple separate execution clusters within the system.
This really enables efficient scale in terms of how we can scale the size and the number of workloads that we have. However, all users still utilized essentially a single coordinator node for some steps of query processing and in particular, for SQL planning purposes. A user submits a SQL query to be processed by Dremio. Dremio has to read that and perform some planning about how it will execute that SQL query, and that would be done by a single coordinator node. We did actually support multiple coordinator nodes from the perspective of high availability and failover, but there would still be a single node used during runtime for some aspects of query processing, including planning.
What we're including this release is the ability to add multiple coordinator nodes to scale the number of queries that the coordinator tier can handle. This essentially removes the bound on the number of users that you can put on the system, and it can help Dremio scale essentially linearly in the number of users by scaling the number of coordinator nodes. With scale-out coordinator nodes, we've essentially removed any bottleneck within the system, where all aspects of Dremio can scale in order to meet the needs of the customer at different time periods. This technology also enables essentially adding and removing coordinator nodes.
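As a toy illustration of the scale-out planning idea, here is a round-robin router over a pool of coordinator nodes that can grow and shrink with demand. The class and node names are hypothetical; Dremio's actual routing, planning, and failover logic is not shown:

```python
# Sketch: spread query-planning work across multiple coordinator
# nodes, and add/remove coordinators as load changes (e.g. quarter
# end). Purely illustrative of the scaling model.
class CoordinatorPool:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0

    def route(self, query):
        """Pick the next coordinator round-robin and hand it the query."""
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return (node, query)

    def add(self, node):
        """Scale up the coordinator tier for peak periods."""
        self.nodes.append(node)

    def remove(self, node):
        """Scale back down when demand drops."""
        self.nodes.remove(node)

pool = CoordinatorPool(["coord-1", "coord-2"])
assert pool.route("SELECT 1")[0] == "coord-1"
assert pool.route("SELECT 2")[0] == "coord-2"
assert pool.route("SELECT 3")[0] == "coord-1"
pool.add("coord-3")  # more planning capacity, no single bottleneck
```

The point of the sketch is only that planning capacity scales with the number of coordinators, which is what removes the single-node planning bottleneck described above.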
It's possible to add capacity at certain times of the year and reduce capacity at other times of the year, so that even if at the end of the year you need thousands of users on a certain dashboard, there's a lot of flexibility around the amount of resources you dedicate to that. It's highly resilient and available, with a passive failover model behind it, and with this technology, we can really unlock the capabilities of Dremio as a scale-out engine and support literally thousands of users on the system, and change that over time. Next slide. The next feature that we're really excited to announce in this quarter's release is runtime filtering.
To step back a little bit, typical data models structure data as joins between many different tables. Typically, what you might have are larger what are called fact tables, which are a large repository of data and then smaller tables that typically are called dimension tables that have attributes. Efficient processing of these type of data models requires the ability to utilize filters that are computed during runtime from a dimension table and apply that to a fact table. A typical example might be for example maybe you're looking at orders for a customer. What the query will actually do is perform essentially a filter on a smaller customer table.
Maybe by name or location or something like that, and get a single customer id or a small list of customer ids that are then applied towards a massive order table that might have 30 years of data inside of it. For efficient processing, it's necessary to take filters that are computed during runtime and apply them towards the larger tables. What we're introducing this release is runtime filtering, which offers dramatic, multiple orders of magnitude improvements in performance, and really enables Dremio to process data warehousing style workloads that are built around star and snowflake style schemas. This enables a true 100X performance improvement for some of these patterns, and what's even better is that performance typically scales with the data size.
Meaning, as your data gets bigger, the performance benefits of runtime filtering improve and become even more dramatic. This is because typically, most queries might be processing the most recent quarter of data, or the most recent year of data, et cetera. Maybe you have a dataset where data is being added every day, but it's only the most recent, let's say, month or quarter or year that is typically being processed by dashboards, but over time, these datasets develop many, many years of data. What runtime filtering enables is very efficient pruning of all of that historical data, whether it's partitions or row groups, from consideration.
This means that as your dataset grows from five years to 10 years to 30 years' worth of data, your query runtime is not actually impacted if it's processing the most recent set of information, which is typical. This really provides a key data warehouse capability in terms of performance on cloud data lakes, and it tremendously unlocks efficient processing on cloud data lake scale workloads. Typically in cloud data lakes, unlike with enterprise data warehouses, for example, people essentially never delete data. They'll keep data forever.
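The star-schema pattern being described, filter the small dimension table first, derive the matching keys at runtime, and use them to skip most of the fact table, can be sketched with stdlib Python. The tables and partitioning here are made up for illustration; Dremio applies this pruning at the partition and row-group level:

```python
# Sketch of runtime filtering: a predicate on the small dimension
# table (customers) yields a key set at runtime, which prunes
# partitions of the large fact table (orders) before any scan.

customers = [  # small dimension table
    {"id": 1, "region": "EU"},
    {"id": 2, "region": "US"},
]

# Fact table partitioned by year; decades of history accumulate.
orders_by_year = {
    1995: [{"cust": 2, "amount": 10.0}],
    2019: [{"cust": 1, "amount": 40.0}],
    2020: [{"cust": 1, "amount": 60.0}, {"cust": 2, "amount": 5.0}],
}

def eu_spend_since(year):
    # Runtime filter: keys computed from the dimension predicate.
    eu_ids = {c["id"] for c in customers if c["region"] == "EU"}
    total, scanned = 0.0, 0
    for part_year, part in orders_by_year.items():
        if part_year < year:
            continue  # partition pruned: historical data never read
        scanned += 1
        total += sum(o["amount"] for o in part if o["cust"] in eu_ids)
    return total, scanned

total, partitions_scanned = eu_spend_since(2019)
assert total == 100.0
assert partitions_scanned == 2  # the 1995 partition was skipped
```

As more historical partitions pile up, the number of partitions a recent-data query actually scans stays the same, which is why the benefit grows with dataset size.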
They'll keep that very long history of a table, and runtime filtering essentially enables people to be able to do that, so that you can keep a very long history inside your fact tables, but not have a degradation in performance, because Dremio can utilize these runtime filters to prune out a lot of that historical information. Next. Another aspect that we improved in this quarter's release is how Dremio works with external systems. To step back a little bit, Dremio has the ability not just to process data from cloud data lakes, but also to read data from external systems. We offer numerous connectors to relational systems, whether it's Oracle or Postgres, Teradata, et cetera.
We even have capabilities in Dremio Hub for people to be able to build connectors to any relational system that's out there. Essentially, Dremio takes care of many optimizations automatically behind the scenes on behalf of customers. Dremio will automatically take a SQL query and develop intelligent push downs into those underlying relational systems. We might push down filters. We might intelligently try to join different datasets within an underlying system, and we'll do this to essentially accelerate processing. Instead of pulling all data from a relational system, we'll let the relational system perform part of the processing on behalf of the customer, and then return a subset of results within Dremio that maybe we'll then join with other data lake scale datasets.
Now, that happens automatically, and it only works with the SQL functionality that Dremio supports. There are a couple of different scenarios where it's useful for customers to have some control over the exact nature of the workload that Dremio is pushing down into the underlying relational systems. This could be, for example, that there's a certain structure to the join that makes sense in the relational system. Maybe there's some proprietary functionality inside the underlying relational system that is not ANSI standard SQL, but that an end user wants to be able to take advantage of.
What we're creating in this release is a really exciting feature called external queries, which is the ability for end users to specify an exact SQL statement to run on an underlying relational system. Dremio will essentially take the block of SQL that an end user submits and run it directly on the underlying system. This enables users to take advantage of any functionality that may be present within that underlying relational system. This could be proprietary or custom database functions that might exist in that proprietary relational system. It could be certain forms of analytical operations that people need to run a certain way in the underlying system.
This is highly flexible and essentially lets people take advantage of any capabilities of that underlying system. Now, the way we designed it within Dremio is very flexible as well. It can be incorporated into virtual datasets. It can have reflections built on top of it, or it can be used by end users as they write SQL queries themselves. Essentially, what you're seeing here on the right is a table function called external query that we've introduced, which runs on a specific source. What you can do is essentially specify, for a source that has been configured within Dremio (again, that could be an Oracle system or a Postgres system or some custom system that the user connected), run this block of SQL in the underlying source.
The relational system will essentially return those results, and Dremio will process them in the rest of the query as if that was a table function. This is a table function that runs an operation in an underlying source and returns a set of rows from that source as a table, and that can then be joined to any other datasets. That could be joined to other relational systems, that could be joined to data lake data, et cetera. This can be very beneficial in letting people structure things as they need. One of the great things about it is that administrators can configure reflections utilizing external queries, build virtual datasets with external queries, and build reflections with external queries, while end users can also utilize it themselves.
If they end up finding some functionality that maybe Oracle has that's proprietary, they don't have to ping IT to set up VDSs. They can just make use of it themselves. It's also fully compatible with our Dremio Hub connectors. We announced the Dremio Hub framework last year, which lets users build connectors very easily, with just a YAML template file, to any relational system that's out there. This functionality is compatible with that as long as we're connecting to something over a JDBC driver. You can essentially utilize external query with that system. This really enables people to maximize the processing that they get from a relational system, and then also to easily join those results with large-scale datasets in the cloud.
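The external query idea, run a verbatim block of SQL on the source and treat the rows that come back as a table to join with other data, can be sketched with sqlite3 standing in for the underlying relational system. The `external_query` function below is an illustrative Python stand-in that mirrors the shape of Dremio's table function, not Dremio code:

```python
# Sketch: ship a SQL string to the "underlying relational system"
# unmodified, get rows back as a table, and join them with other
# (here, data-lake-style local) data.
import sqlite3

source = sqlite3.connect(":memory:")  # stand-in for Oracle/Postgres/etc.
source.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?)",
                   [(1, "Acme"), (2, "Globex")])

def external_query(conn, sql):
    """Run the SQL verbatim on the source; return rows as a table."""
    return conn.execute(sql).fetchall()

# The pushed-down SQL could use source-specific functions the engine
# itself does not parse; here it is plain SQL for the sake of the sketch.
rows = external_query(source, "SELECT id, name FROM customers WHERE id = 1")

# Join the returned "table" with other data, as Dremio would in the
# rest of the query plan.
lake_orders = {1: 120.0, 2: 75.5}
joined = [(name, lake_orders[cid]) for cid, name in rows]
assert joined == [("Acme", 120.0)]
```

The key property is that the source executes the inner SQL exactly as written, so any source-specific capability is available, while the result still participates in joins like any other table.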
The example that's being shown here on the right is an example of the syntax that you would use within Dremio, where you essentially specify a block of SQL to run in an external system. It returns as a table function, and that can then be joined within that SQL statement with any other dataset, maybe from another relational system or on the data lake, et cetera. As you can see, in the relational system, Oracle in this example, we're just going to run that SQL statement as is from the user. Next slide. One of the last things that we're really excited to announce are partnerships with BI tools within the ecosystem.
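To make this concrete, here's a rough sketch of what the syntax described above looks like. The source name, table names, and columns are hypothetical illustrations, not the actual demo schema; check the Dremio docs for the exact table-function signature.

```python
# Illustrative sketch of combining an external query with a data lake join.
# "oracle_prod", "datalake.trips", and the columns are assumed names.
inner_sql = "SELECT cust_id, region FROM sales.customers WHERE region = 'west'"

# external_query(<source>, '<sql>') runs the inner SQL in the configured
# source and returns its rows to Dremio as a table, which can then be
# joined with any other dataset, including data lake data.
dremio_sql = f"""
SELECT t.trip_id, eq.region
FROM datalake.trips AS t
JOIN TABLE(external_query(oracle_prod, '{inner_sql}')) AS eq
  ON t.cust_id = eq.cust_id
"""
print(dremio_sql)
```

The key idea is that the inner block is passed through verbatim, so it can use any proprietary functions the source system supports.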
One of the great things about Dremio is we support any BI tool that's out there. We're always working with BI partners to improve our connectivity. We're really excited to announce our partnership with Microsoft in terms of improving the experience with Power BI. What we've done in this release is made it much easier for end users to interact with Power BI from Dremio, or from within Power BI itself. Essentially, what we have here, as you can see in the screenshot, is an option to expose Power BI to the end user as an icon on their different datasets. What this button does is, when you click it, it automatically opens up Power BI on behalf of the user, and they can continue to explore that dataset within Power BI.
There are also other options: you can start within Power BI, and then utilize the native connector within Power BI to access datasets within Dremio. This enables a very frictionless experience for end users working within Power BI or within Dremio, in terms of being able to quickly move from one system to the other, and we're really excited to make this type of connectivity available. We're also excited to be continuing to build the partnership with Microsoft and the Power BI team, and we have single sign-on connectivity coming in an upcoming release, which we're really excited to announce. We're further improving the integration points between Power BI and Dremio as well.
Again, these were the major features in this release for accelerating BI dashboards on the data lake. They've improved the number of users that can be put on the system, improved the latency that users experience, and significantly improved the concurrency that people can achieve on the data lake, enabling organizations to get to essentially any scale for BI dashboards on the data lake. Now Tomer is going to walk through an example with Power BI, showing some of this integration and the interactivity that this set of features and capabilities is able to unlock.
Sorry, I was on mute here. It's demo time, so let's jump into the product, and we'll make this a little bit bigger here. What you're looking at is the Dremio user interface, and I'm just going to give you a quick orientation if this is new to you. On the left-hand side, we have the data sources that we're connected to. Right now, you see we have a number of different data lake sources, and that includes a Glue catalog, some S3 buckets, some other samples that we have here, and an ADLS file system as well. In this cluster we also have external sources: we can connect to other relational databases, which allows you to do joins and things like that. That's the data sources.
Here we have these spaces, and spaces are basically an area where the virtual datasets live. This is Dremio's semantic layer. It allows you to expose different views of the data to different users and groups, and then every user has their own personal space; this is my own personal space here. You can see the green virtual datasets, as well as some spreadsheets that I've uploaded, which I wanted to join with data lake datasets. What we can do here is jump into a large dataset that has over a billion records. It's one of our standard demo datasets, and it's in the business space. There's a folder here called transportation, and inside of that, we have this dataset of New York City taxi trips.
If I click on that, you'll see this dataset here. It has a number of different columns, like the vendor ID, the pickup time, the drop-off time, the number of passengers that were on that taxi trip, the trip distance, and a bunch of other things. What you'll see here is this Power BI button. If I click on this button, it opens up this file here. Just by clicking on that, it will open up Power BI Desktop on my laptop. This will just take one second, but what it's really doing is creating a live connection between Power BI and Dremio, so that you don't have to go through that context switch of going into Power BI, choosing a Dremio source, specifying where it is, and so forth.
You can see right here, it actually brings up this dialog with a preview of the dataset. Now we're looking at Power BI, and this is that taxi dataset, right? The pickup time, the drop-off time, number of passengers, trip distance, et cetera. I'm going to click on load here, though it's not actually loading the data, so that's maybe a little bit misleading. Really all it's doing is establishing a live connection, and in fact, if you look at the bottom right here, it says storage mode: DirectQuery. That means that every interaction now in Power BI is going to be a live SQL query to Dremio.
You can see here we've connected to this taxi dataset, and with really just a couple clicks of a button, I've gone from looking at a dataset in Dremio to looking at that same dataset in Power BI, and now being able to interact with it. I might do something like, let's look at the average tip amount. I'm going to drag the tip amount measure; this is one of the columns that we have. Well, this is actually the total tip amount, which is over a billion dollars in this dataset, so that's a lot of tips. Let's change that to the average. You can see that the average tip amount for a taxi trip in New York for this dataset was just over one dollar.
If I wanted to look at it by, say, year, I can go here and create a new column in Power BI. I'm simply going to take the year out of the time; let's use the pickup time. I've created a new column here. It's just a definition, and I can drag that onto the axis. Basically, what I've done is a group by on the year, and you can see that the average tip amount in New York has actually increased year over year in this dataset, which was surprising to me. This is something I discovered only a week or two ago: tips had gone up, and this dataset is from 2009 to 2014. It could be that the average tip went up because the economy was recovering and people were more generous, or perhaps the rates went up, right?
People generally tip as a percentage of the total amount. We could actually grab the total amount here and start interacting with that in the same way, right? Let's look at the column for this graph, and see if it changed similarly. You can see the total amount, which is actually interesting: it does not seem to have increased at the same rate. There's something going on beyond the rates going up; I'm just guessing that it might be due to the overall state of the economy at the time. What I can do now, if I go back to Dremio, is show you what's going on in the background. We go to the jobs tab. I'll open that up here, and these are basically the queries that have run.
You can see here all these queries that have come in from Power BI. In this case, we're looking at the total amount, and we were aggregating by the year. You can see that Power BI is using the YEAR function in SQL and fetching that data into Power BI. You can see all these different queries, and you can see they all ran in less than one second; that is really the key to being able to provide a great experience for BI users. When data scientists or more technical analysts interact with data, explore it, and do ad-hoc analysis, they understand that it might take some time for a query to come back; that is typically the workflow, right? They're running a query.
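For a sense of what those live queries look like, here's a sketch of the kind of SQL a DirectQuery tool might push down to Dremio for the "average tip by year" visual. The table path and column names follow the demo narration but are assumptions; the real generated SQL from Power BI will differ in detail.

```python
# A hypothetical example of a pushed-down aggregation query, similar in
# shape to what the jobs tab shows: a YEAR() extraction plus a group-by,
# computed entirely in Dremio so only the aggregated rows travel back.
query = """
SELECT YEAR(pickup_datetime) AS pickup_year,
       AVG(tip_amount)       AS avg_tip
FROM "business".transportation."nyc_taxi_trips"
GROUP BY YEAR(pickup_datetime)
ORDER BY pickup_year
"""
print(query)
```

Because the billion-row scan and aggregation stay in Dremio, the BI tool only receives a handful of result rows per visual.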
Maybe they're waiting a few seconds or a few minutes, and then they run another query to dive deeper into the data. But when we're talking about dashboards, the users are expecting it to load like a website, right? They want to click on it and have it load up like the front page of Google. That's where having a system that can provide sub-second response times on data lake storage is really important. What you can also do here, and this is not specific to this latest release, is if we go to the datasets tab, let's create another space, and we'll call it the webinar space, just so we can do some fun things in here. I'll pin that to the top.
You can see there's nothing in it, but back to my taxi dataset. If I wanted to, for example, create a new column: in addition to trip distance in miles, let's create the trip distance in kilometers, which is roughly multiplying by 1.6. I'll name this trip distance in kilometers, click on apply, and we now have a new column called trip distance in kilometers. You can see that here. A lot of users know SQL, so you wouldn't necessarily have to use the visual interface for that; you could just go in here and define the SELECT statement. I can now go in here and say save as. I'm going to save this as a new virtual dataset inside of my webinar space, and I'll call it trips in KM.
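The derived column is just a simple unit conversion. As a sketch, the math and the kind of SELECT that would define the virtual dataset look like this; the table and column names are assumptions based on the demo, not the actual schema.

```python
MILES_TO_KM = 1.60934  # the demo rounds this to roughly 1.6

def miles_to_km(miles: float) -> float:
    """Convert a trip distance from miles to kilometers."""
    return miles * MILES_TO_KM

# The virtual dataset is just a view definition over the physical dataset;
# no data is copied. Names here are hypothetical.
trips_in_km_sql = """
SELECT *, trip_distance_mi * 1.60934 AS trip_distance_km
FROM "business".transportation."nyc_taxi_trips"
"""

print(round(miles_to_km(10), 2))  # → 16.09
```

Because the virtual dataset is a definition rather than a materialized copy, queries against it are rewritten against the underlying data at execution time.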
Now we have webinar.tripsinkm as a new dataset that you can query. Of course, you can go to the data graph and understand how that dataset is derived from other datasets, and browse it like a Google Map. If I go back into Power BI and click on the get data tab, what I'll show you now is that I can also access this data from the other direction. I could of course use the Power BI button on this new dataset, but I just want to show you that I can also go in through Power BI itself and connect to that same environment. Again, it's important to select the DirectQuery mode, which is what I was selecting, so that we're not importing the data into Power BI, and then pick trips in KM.
Let's see, webinar. What did I do? Trips in KM here. I just have to select it, and again, the load button doesn't really mean loading the data; it just means loading the metadata, so I can start interacting with that dataset. What you'll see here is that we now have this new dataset called trips in KM, and the columns are underneath there. We'll have that new column that we just created, if I can find it, called trip distance KM. Now we have the trip distance in kilometers, which is something we did not have in the original dataset. You can see that if I drag that here, the query speed is just as fast as what we were seeing before. You see just over 4.5 billion kilometers were traveled.
With that, I'm going to end the demo here. Hopefully that gives you a sense of what you can now do with, in this case, Power BI, but more generally what you can do with dashboards and BI directly on your cloud data lake. We can move to Q and A.
All right, Tomer, thank you so much, and Tom as well. We have some questions here that we would like to address, and I would like to clarify that because of time constraints, we may not get to all the questions that we have received. Tomer or Tom, this is a question I believe is for Tomer, since it's related to what you were showing just now. The question is in relation, I believe, to the Power BI button: can we do the same UI integration with other BI tools?
Yeah, we've done that with Power BI and with Tableau. Those are probably the most popular tools that we see, but on our roadmap we can consider other tools as well. I do want to clarify that we have very good integrations with a variety of different BI tools, ranging from the ones I just mentioned to things like Looker and Superset and others. It's actually not very difficult: you go into any of those BI tools, they have native Dremio connectors, and you basically make a connection and start interacting with the data.
Excellent. Thank you. Another question that came in says: when running queries on Dremio, where the files are hosted in a database, does Dremio dump a copy from the database into the Dremio system first? The person says they're just thinking about big files versus network transfers.
Dremio is not a database per se. Dremio enables you to query the data that you have on data lake storage, on systems like S3 and ADLS, and also to join those datasets with the smaller datasets inside of relational databases. There is no need to copy the data. In fact, that's one of the key advantages of this cloud data lake approach versus the traditional approach of loading all your data and making a copy of it in a data warehouse, which of course very quickly results in significant cost and complexity. The idea here is you can query the data where it is. You don't have to create a copy, and you can start working with it.
Something to add to that, for anyone thinking about network transfers and who has concerns there: last year Dremio introduced something called cloud columnar cache technology, or C3, where we keep local copies of commonly accessed data within the Dremio execution clusters themselves. The first time you read a dataset from S3, yes, there is going to be a transfer of data, but we will intelligently, behind the scenes, and completely transparently to the user, keep copies of highly accessed data within Dremio itself to greatly reduce the amount of network traffic to the external cloud data lake stores.
This both accelerates queries and is very efficient from a cost perspective, because it means you might read an S3 dataset once, and it could then be served through tens of thousands of queries to customers within the system.
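To illustrate the idea, here's a toy sketch of a "keep hot chunks local" cache with least-recently-used eviction. This is purely conceptual and is not how Dremio's C3 is actually implemented; the class and function names are invented for the illustration.

```python
from collections import OrderedDict

class ColumnarCacheSketch:
    """Toy LRU cache: keep hot chunks of cloud-storage data locally so
    repeat reads skip the network. Conceptual sketch only, not C3's code."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache: OrderedDict = OrderedDict()
        self.remote_reads = 0  # how often we had to go out to "S3"

    def read(self, chunk_key: str, fetch_remote) -> bytes:
        if chunk_key in self._cache:
            self._cache.move_to_end(chunk_key)  # mark as recently used
            return self._cache[chunk_key]
        self.remote_reads += 1
        data = fetch_remote(chunk_key)          # simulated network fetch
        self._cache[chunk_key] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)     # evict least-recently used
        return data

cache = ColumnarCacheSketch(capacity=2)
fetch = lambda key: f"bytes-of-{key}".encode()
for key in ["a", "b", "a", "a", "b"]:
    cache.read(key, fetch)
print(cache.remote_reads)  # → 2: only the first read of each chunk hit the network
```

The cost argument in the answer above falls out directly: once a hot chunk is cached, thousands of subsequent queries can be served without touching the object store again.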
Great. Thank you. There is another question: what would be the recommended infrastructure when using Kubernetes? The person is asking for details about data sizes and frequency as well.
For Kubernetes, we have a wide variety of deployments that customers use. We have standard Helm charts that are available to deploy Dremio on Kubernetes and on popular managed Kubernetes systems as well. Whether you're using AKS or EKS, and I'd say we have a large percentage of customers in the cloud utilizing AKS or EKS, or your own local on-prem Kubernetes deployments, we support a very wide range of infrastructure, and we've seen a lot of flexibility in how people are able to deploy through Kubernetes.
With Kubernetes, we also enable dynamic autoscaling. When you think about the right way to size your infrastructure for different data sizes or query frequencies, we've made it easy in Kubernetes to add capacity and scale the execution cluster up and down. If you want to scale for different times throughout a year or throughout a day, some of those patterns are actually very simple within our Helm charts.
Sounds great. All right, so let's see. Is there anything else that we would like to add before we close for the day, Tomer or Tom?
I think there was a question here about the Dremio roadmap and Iceberg table formats, so I can talk about that. Yeah, you should look forward to seeing full Iceberg and Delta Lake support natively inside of Dremio in the very near future, something we're very excited about. In fact, I think one of the most interesting things that will happen over the coming months and quarters in the industry generally is that data lakes will have the ability to do all the things we've come to expect from cloud data warehouses, without having to move the data. We're increasingly living in a world where there's no longer a need to move data into data warehouses and rack up all that cost and complexity.
That's something we're very excited about, and actually at the end of January, on January 27th, we'll have our next Subsurface event, which will feature speakers from many of the creators of these various open source technologies, as well as companies that are using them in production. If you're interested in learning, that would be a great event.
Excellent. There's another question, and it says: what does the future look like for support for mainlining Apache Arrow Flight?
For some background for people on the call: we're introducing a feature called Apache Arrow Flight, which you can think of as almost a successor to the ODBC and JDBC standards, which have been in the industry for multiple decades. As a result, they're showing their age; in many ways, the serialization and deserialization that happen within ODBC and JDBC are a bottleneck when transferring data between systems. They were designed 30 years ago, when processors were single core, networks were much slower, et cetera. We have a feature that's been in the works called Arrow Flight, which is essentially a successor to those standards that utilizes the Apache Arrow format behind the scenes to very efficiently transfer data between systems.
That could be Dremio reading data from a SQL server, for example, or a client tool, whether it's Power BI or a custom client tool, reading data from Dremio. We've had a preview feature of Arrow Flight out for some number of releases, and we've seen customers get over 50x performance improvement by utilizing that preview. What's actually coming in the next release is a GA version of a server endpoint for Arrow Flight within Dremio, and we're actively working with the community on client tools that will be mainlined into the Apache Arrow project, including client libraries for the C++, Java, and Python languages, plus a SQL client driver, to enable a more SQL-driver-like experience when utilizing Arrow Flight from client tools.
This really enables very efficient transfer of data from Dremio to a client tool. Again, we're seeing over an order-of-magnitude improvement in bandwidth with it. We're going to have a server endpoint in the coming release or so, and we expect over the next few months to mainline these client tools for the C++, Java, and Python languages, plus a SQL client driver as well.
Great. Thank you, and I think we have time for a couple more questions. One question that we have here is about licensing, and it says: what is the impact on flexibility for elastic engines regarding licensing?
Yeah. Especially in the public cloud, where infrastructure is dynamic and can be rented by the second from the cloud providers, it makes sense to have features that allow you to expand and contract the cluster automatically. What we've done, for example, with Dremio's AWS edition, which you can get from the Amazon marketplace in both the free version and the enterprise edition, is introduce a feature called engines, which allows you to create different engines: an extra-large engine for your marketing group, a medium-sized engine for your data scientists, and so forth. Those engines will automatically stop when the workload is not there, right? If nobody's running any queries at night, the engine will hibernate.
In order to make that work, of course you're paying for the Amazon infrastructure in that case by the second, but also with Dremio, if you're using the enterprise edition, you can also pay for what you're using. Either through the marketplace or even when working with us, you can actually license it through something we call Dremio compute units or DCUs.
All right, and another question that we have here is about performance. The question is: is the greater-than-100x performance improvement for data warehousing workloads also true for on-prem architectures?
Mm-hmm (affirmative). Yeah. The enhancement we talked about here around performance for data warehousing workloads, specifically the runtime filtering, applies both to cloud and to on-prem deployments. It's really a feature that takes advantage of dynamic, adaptive learning about the data as the query is executing, and leverages that to more aggressively filter what actually has to be read. By reducing the amount of data that has to be read to satisfy the query, in one of these star schema situations, it's going to speed up queries regardless of what infrastructure you're running on.
Excellent, and let's see: does Dremio support cluster auto-scaling, scaling down to zero?
Yeah, the engines capability that we talked about, specifically on Amazon with the AWS edition, will scale down to zero. The engines will automatically start and stop based on the workload. You can have many different engines, and they're all isolated from each other, so you don't have to worry about one team's queries impacting other teams' queries. And yeah, if there's nothing running, it goes down to zero, so you don't have to pay.
All right. I think that is it for the questions that we had. Tomer, thank you so much for such a great presentation. Tom, thank you so much as well for being here with us. To the rest of the audience, I want to remind you that the webinar presentation will be available on our website at Dremio.com/library. I want to thank you for your time and your attention, and I hope you have a wonderful rest of the week. Stay safe and stay healthy, and we will talk to you later. Thank you, bye-bye.