Gnarly Data Waves


Episode 47 | March 5, 2024

Learn how to reduce your Snowflake cost by 50%+ with a lakehouse

Join Alex Merced, Developer Advocate at Dremio, to explore the future of data management and discover how Dremio can revolutionize your analytics TCO, enabling you to do more with less.

Data leaders are navigating the challenging landscape of enabling data-driven customer experiences and enhancing operational efficiency through analytics insights, all while meticulously managing budgets. Organizations leveraging cloud data warehouses, like Snowflake, often grapple with the complexities of unifying data analytics across diverse cloud and on-premise applications. The process involves significant costs, resources, and time to extract, rebuild, and integrate data for consumability.

Enter the data lakehouse – offering the potential to drastically reduce the total cost of ownership (TCO) associated with analytics.

In this video, you will gain insights into:

  1. Key distinctions between traditional data warehouses and the innovative data lakehouse model.
  2. How Dremio empowers organizations to slash analytics TCO by over 50%.
  3. Hidden costs associated with data ingestion, storage, compute, business intelligence, and labor.
  4. How Dremio’s unified lakehouse platform simplifies self-service analytics.

Watch Alex Merced, Developer Advocate at Dremio, to explore the future of data management and discover how Dremio can revolutionize your analytics TCO, enabling you to do more with less.


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Opening

Alex Merced:

Hey, everybody! This is Alex Merced, and welcome to another episode of Gnarly Data Waves, presented by Dremio. In this episode, we're going to cover an exciting topic: learning how to reduce your Snowflake costs by 50% with a lakehouse. So the idea here is, one: what are the cost drivers when we're using those cloud data warehouses? And two: how can we eliminate a lot of those drivers when we move more of our workloads to a data lakehouse architecture?

Apache Iceberg: The Definitive Guide

But before we get going, I always like to remind you that if you haven't gotten yourself an early-release copy of Apache Iceberg: The Definitive Guide, go get one now! I think we're just about at the point where you can get pretty much the full manuscript from the early release, so go do that. It’s coming out in a few more months, which is pretty exciting. You can also pre-order a physical copy on Amazon, so that's a thing you can do now.

Survey Report

Also, make sure you download the State of the Data Lakehouse survey report. If you're interested in learning more about different trends, like how people are adopting data mesh, which table formats people are adopting or planning to adopt, how many people are implementing a lakehouse, and why they're implementing a data lakehouse, you'll find those stats in this survey of 500 industry participants. By scanning this QR code, you can get that report.

Subsurface LIVE

Also, make sure to register for our conference coming May 2 and 3: Subsurface, the Data Lakehouse Conference, presented to you by Dremio. There are going to be a lot of people there if you attend in person, and you can also watch it online. There are generally 40-plus sessions, and every time we do this, it's always a really good time, with a lot of interesting topics about the data lakehouse, table formats, query optimization, governance, all sorts of really good stuff. So make sure to register.

Learn How to Reduce Your Snowflake Cost by 50% with a Lakehouse

Without further ado, we'll begin our presentation on how to reduce your Snowflake costs by 50% with a lakehouse. And presenting it, well, it's me, Alex Merced, Dremio developer advocate. Let's kick this off. We're going to be learning how to reduce your Snowflake cost by 50% with a lakehouse, and we will do that by using Dremio’s unified lakehouse platform for self-service analytics, which we'll learn more about throughout this presentation. But it's hard to talk about a solution if we haven't talked about the problem yet, so let's do that first. For everything we talk about in this presentation, you can find additional reading in our white paper, “Data Warehousing at Less than Half the Cost of Snowflake,” where you'll find a lot of the statistics, numbers, and case studies that we'll cover here.

You can also find a more high-level overview over there in our blog: “Using Dremio to Reduce your Snowflake Data Warehouse Cost,” so these QR codes can get you to both of these resources. If you don't have a chance to scan these QR codes now, this will be posted to YouTube within 24 to 40 hours after we present this. So you'll be able to go catch it there. 

Back to the problem. The bottom line is, being in the data lakehouse space, we talk to a lot of people who are trying to improve their data platforms and to identify the problems they have with their existing platforms.

What Snowflake Customers Tell Us

Some of the things we hear from Snowflake customers are things like this: there's a lot of data lock-in, and they can't access data efficiently if it isn't in Snowflake; it becomes hard to access the data. You're starting to see some movement away from that, with Iceberg tables and whatnot, but it's still very much the case that you have to have the data in Snowflake to take full advantage of the Snowflake platform.

Not necessarily ideal for BI and self-service: while there are a lot of ease-of-use features in Snowflake, at the end of the day you still end up falling back on old patterns like building BI extracts and cubes to improve BI performance. And when people take a look at what Snowflake is costing, the costs of those additional workloads to create BI cubes and extracts aren't factored in, so it ends up being much more expensive than you realize once you count all the work you do around Snowflake to optimize Snowflake.

Expensive to maintain: data teams spend a lot of time and resources maintaining expensive queries and optimizing materialized views. On top of that, there's the cost of the storage, because you don't realize that you're tracking and storing all the historical versions of your data, generally for up to 90 days, and paying for that. So that can add up. And then you have the egress costs from ETL, which brings us to our last point.

Expensive ETL: when you're ingesting data, you're moving data out of your data lake and hitting egress charges that add cost just for moving data from the data lake to the data warehouse. So you have all the costs you generate operating within Snowflake, plus all the cost of getting your data to Snowflake and the cost of optimizing Snowflake, not just your workloads alone. And in those initial cost calculations, when you say, hey, we're going to adopt this platform and it's going to cost us X, Y, and Z, you're focused on just what your workloads cost, and you're not factoring in all those other additional costs that make Snowflake bills grow so large.

Current Approaches to Data Management

And again, this all comes down to the traditional approach to data management and asking, hey, is there a better way? Whether it is Snowflake or some other cloud data warehouse, the pattern is still the same, and you still run into the same problems. That is, you start with all your sources, which could be your application databases, different files generated throughout your business, SaaS applications where you're downloading data from Salesforce or some other third-party app, third-party data sets, all sorts of different stuff, and it typically ends up in your data lake. Why? Because the data lake is a place where you can store structured and unstructured data, and not all of that data is going to be able to go directly into the data warehouse, so you generally land all of it in the data lake.

And in there, on the data lake, people will do what's called a medallion architecture, where they land all of their raw data in a Bronze zone. Then they'll clean it and standardize it and move it into what's called a Silver zone, so there'll be a Silver version of the data, and then they'll have the one that's ready for consumption, the Gold layer of data. So right there you've created three copies, because you took the raw data and transformed it to Silver, and you transformed that to Gold, and each time it's a physical copy in this traditional pattern.
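To make the pattern concrete, here is a minimal PySpark sketch of a Bronze-to-Silver-to-Gold pipeline. The paths, column names, and cleaning rules are hypothetical; the point is that each hop writes another physical copy and is more code to write, test, and maintain.

```python
# Minimal sketch of a medallion (Bronze/Silver/Gold) pipeline in PySpark.
# Paths, column names, and cleaning rules are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-example").getOrCreate()

# Bronze: land the raw data as-is (first physical copy).
raw = spark.read.json("s3://lake/bronze/orders/")

# Silver: clean and standardize (second physical copy).
silver = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_ts", F.to_timestamp("order_ts"))
       .filter(F.col("amount") > 0)
)
silver.write.mode("overwrite").parquet("s3://lake/silver/orders/")

# Gold: aggregate into a consumption-ready table (third physical copy).
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("lifetime_value"))
gold.write.mode("overwrite").parquet("s3://lake/gold/customer_value/")
```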

And so you're generating costs there. Then you take the bits of the Gold data that you want in the data warehouse, and you ETL that into the data warehouse, where there's going to be more curation, so you have a curated zone that's just another copy of your Gold data. You're going to generate summary tables to help accelerate BI dashboards and reporting, and you're going to create materialized views to help accelerate raw queries, which you then have to worry about maintaining, keeping in sync, and keeping fresh.

And then you're going to generate departmental data marts, which are going to be made up of a combination of some logical views, but a lot of the time you're creating additional copies, so that people within those data marts can make their changes and work with their own data in their own way. So there are more physical copies, and again, each of these movements means code that has to be written, code that has to be tested, code that has to be deployed, and code that has to be managed to make sure all these things happen. So it gets more and more complex, which means it takes longer, increasing your time to insight.

And two, it increases your costs, because you're spending more on compute and increasing your storage footprint. You can start seeing how this gets expensive really quickly, and I'm sure you've seen your cloud bills and your data warehouse bills, to the point where you can sympathize with what I'm saying here. But then, on top of that, you're generating all these extracts and cubes, maybe external to all of this, to further optimize BI dashboards, and creating these separate collections of pre-aggregated data gets expensive and complicated. At the end of the day, we have these complex ETL processes, which make analytics really expensive.

And the thing is, the more complicated it gets, the less self-serve it gets, so we start getting away from the goal of giving our data analysts and data scientists more direct access to data they can provision for themselves. We start having data copies everywhere, because there are now so many different versions of the data, and all of that data has to be governed if you want to comply with regulations. It gets really hard to govern all that data across so many different systems.

And again, you have data lock-in once your data is in that data warehouse, because generally that data warehouse uses an internal format. If you want to use some other tool, guess what? You're going to have to do another move into that platform, into its proprietary format, which again means more ETL, more cost, more storage, the whole deal. So all of this gets complicated and expensive, and it slows down how soon you get those insights, increasing your time to insight. Ideally, you want that time at a minimum; you want instant time to insight, wouldn't you? So how do we make that better?

Well, the idea is that we're going to want to shift left. Think of it this way: we're here on the right, in the data warehouse, and what we're going to do is move things over to the left, more to the data lake, and that means we want to treat our data lake more like a data warehouse. That's the pattern called the data lakehouse, and that data lakehouse is made up of a lot of different pieces. Your data lake is your storage layer, so that'd be your object storage, like S3, that we see right here. And then you want the files that have landed in that data lake, your Parquet files, ORC files, Avro files, to be able to be treated as tables.

Ideal Enterprise-Grade Lakehouse

And that's where open table formats come into play: they allow you to identify groups of files as tables, across all your tools. But you don't have just one big table; at least, not all the time. Sometimes you have lots of tables, often many, many tables, so you need a way to track all those tables, and that's where a catalog comes in, allowing you to discover those tables across multiple tools. But what's the point of being able to track your tables if you're not going to do any analytics with them? So you need a query engine that can run queries and transformations and do the kinds of processing work that you want to do on that data. And at the end of the day, for your end users, you want to deliver the data in a way that is easy for them to understand, find, and discover, and that is well governed. That's where you need a semantic layer: a place where people can go get a unified view of where all their data is, discover that data, and then reach for it and bring it into their different use cases, whether it's data science, dashboards, or building data applications, using typical interfaces to grab that data such as ODBC, JDBC, REST APIs, and Apache Arrow Flight. But there's one more thing in this picture: ideally, most of the data in your data lake is treated as tables through these table formats, but not all of your data is ever going to be in the data lake. You're going to have data that might be stored purely in object storage because you got it through AWS's data-sharing marketplace. You might have data sitting in Snowflake because you're using Snowflake’s marketplace. You might have other data that's just in a database and isn't worth moving to the data lake, and you'd rather just work with it directly from the database.
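As a small illustration of that last-mile delivery, here is a hedged sketch of pulling a result set from a lakehouse engine that exposes an Apache Arrow Flight endpoint, using pyarrow. The endpoint, credentials, and table names are placeholders, not anything specific from this episode.

```python
# Minimal sketch of fetching a dataset over Apache Arrow Flight with pyarrow.
# The endpoint, credentials, and table name are placeholders.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://lakehouse.example.com:32010")

# Basic authentication that returns a bearer-token header pair.
token = client.authenticate_basic_token("analyst", "password")
options = flight.FlightCallOptions(headers=[token])

# Ask the engine to plan a query and tell us where to fetch the results.
query = "SELECT customer_id, lifetime_value FROM gold.customer_value LIMIT 100"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Stream the Arrow record batches back and materialize them as a table.
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.num_rows, "rows fetched")
```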

So in this data lakehouse platform, ideally for an enterprise-grade data lakehouse, you're also going to want some virtualization, to be able to virtualize a long tail of additional data. Think of it as an 80/20 rule: you want 80% of your data on the data lake, but you're going to have 20% of your data coming from other places, so you need a tool that can give you access to that additional 20%, along with giving you a platform that unifies all these pieces that make up a data lakehouse and makes it all usable. And that's essentially what Dremio provides.

The Unified Lakehouse Platform for Self-Service Analytics

Dremio is the data lakehouse platform: a unified lakehouse platform for self-service analytics. You can connect to your object storage, whether it's S3, ADLS, or Google Cloud Storage, and to on-prem sources, so you can have a hybrid lakehouse. It can connect to that long tail of additional sources like Snowflake, NoSQL databases like MongoDB, relational databases like MySQL, PostgreSQL, and SQL Server, and all sorts of other sources, and it provides you with all the things you need to tie them together. Unified analytics gives you that semantic layer, so you have that nice view where your end users can easily discover their data, and that data can be documented and governed. The SQL query engine provides best-in-class price-performance with all sorts of features for acceleration. And it provides lakehouse management features, giving you a catalog with unique git-for-data features and automatic lakehouse table optimization.

That way, using the data lakehouse feels like using a cloud data warehouse. It has that unified, one-platform feel, but you still have that open nature, and you're still shifting left, moving those workloads to your data lake. You have a platform that feels like a cloud data warehouse, with that ease-of-use factor, and then you can pass that data on to your data science tools, dashboards, and applications.

How Dremio Enables Enterprise-Grade Lakehouse

So let's dig a little bit deeper into this. Dremio is enabling a lot of this through its unified access layer, so you get that easy view where you can see all your data, nice and organized in one place, and the SQL query engine allows you to query that data easily, but also quickly. And the thing is, that speed also means less cost. One, if you don't have to ETL the data into the data warehouse, you save that money. If your queries run faster, you save money on the compute. If you're using cheaper compute, you save money on that end as well, and you're using less storage. You start seeing where a lot of these cost savings come from.

And then you have lakehouse management, so those lakehouse tables are always kept optimized, with compaction and cleanup, so you're not spending more than you need to on storage and your queries are never slower than they need to be because your data sets are optimized. Again, faster queries mean money saved.
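For a sense of what that maintenance involves when you run it yourself, here is a hedged sketch using Apache Iceberg's Spark maintenance procedures. The catalog name ("lake") and table names are placeholders, and it assumes a Spark session already configured for Iceberg; the point of the automatic lakehouse management described above is that the equivalent work doesn't have to be scheduled by hand.

```python
# Manual Iceberg table maintenance via Spark procedures: compaction and
# snapshot expiration. Catalog ("lake") and table names are placeholders,
# and the Spark session is assumed to be configured with Iceberg extensions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-maintenance").getOrCreate()

# Compact many small data files into fewer, larger ones so scans stay fast.
spark.sql("CALL lake.system.rewrite_data_files(table => 'sales.orders')")

# Expire snapshots older than the retention window so you stop paying to
# store historical versions nobody queries anymore.
spark.sql(
    "CALL lake.system.expire_snapshots("
    "table => 'sales.orders', "
    "older_than => TIMESTAMP '2024-02-01 00:00:00')"
)
```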

The Dremio Difference

More efficient queries mean shorter-running compute. So, at the end of the day, the Dremio difference is that it brings you best-in-class TCO, the fastest time to insight, and ease of use through self-service data, all through its flexible and open architecture.

Warehouse and Lakehouse - Better Together

So how would you bring that together at the end of the day? We're not saying: go pull out of Snowflake and just get rid of it. There are a lot of reasons for you to still use Snowflake, one being its data marketplace. But the idea is to start shifting left and bring more of those workloads over to Dremio, over to the data lakehouse, so that we start seeing a lot of those cost reductions: a reduction in your storage costs on Snowflake, a reduction in the egress costs you incur moving data to Snowflake, a reduction in the cost of the ETL it takes to do that additional movement and generate all those data marts. Instead, you can virtually model that data in the lakehouse and not make a bajillion copies of it, and then be able to govern all that data, including that long tail of additional data, all from one place.

So essentially, you would have all your batch sources of data, and the vast majority of that data you'll land in your data lake as Iceberg tables, tracked by your Iceberg catalog. Dremio has an integrated catalog that provides you those git-like features, which means you can not only use Iceberg but also practice cutting-edge DataOps capabilities, such as branching, merging, tagging, and other git-for-data type capabilities. Dremio then becomes the interface for managing and working with that data lake, providing you with that nice UI. But Dremio can also connect to Snowflake, so it'll be able to see those shared data sets that you've purchased, and you can still use those data sets to enrich your lakehouse. And then it delivers that data to data science notebooks, applications, BI tools, and so on. Again, you're minimizing storage, you're minimizing egress, you're minimizing compute.
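For a rough picture of the "land it as Iceberg and let a catalog track it" step, here is a hedged sketch using pyiceberg against a REST-style catalog. The catalog URI, namespace, and file names are placeholders, the exact pyiceberg calls vary by version, and the branching, merging, and tagging capabilities mentioned above would be driven through the catalog's own commands rather than what is shown here.

```python
# Minimal sketch: read a batch extract with pyarrow and register/append it as
# an Apache Iceberg table tracked by a catalog. Names and URIs are placeholders,
# and the "sales" namespace is assumed to exist already.
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# Connect to a REST-style Iceberg catalog (endpoint is hypothetical).
catalog = load_catalog(
    "lakehouse",
    **{"type": "rest", "uri": "https://catalog.example.com/api/v1"},
)

# A batch extract from one of the source systems.
batch = pq.read_table("exports/orders_2024_03_05.parquet")

# Create the table once from the Arrow schema, then append the batch.
table = catalog.create_table("sales.orders", schema=batch.schema)
table.append(batch)
```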

Then the Dremio engine itself does a lot of other things under the hood to further reduce your costs. There are several kinds of caching that reduce a lot of your data access and egress costs, including what's referred to as the columnar cloud cache. You have Reflections, which speed up queries even further and give you the benefits of things like materialized views and extracts without the headaches, because you can just turn on Reflections and Dremio does all the management. Reflections are also much more reusable, so you don't have to create as many, and your analysts and scientists don't even need to know that they exist, because Dremio will intelligently use these Reflections to speed up queries on the relevant data sets. So it becomes easier for the data engineer to optimize queries, and it also becomes easier for the analyst to take advantage of that optimization. They don't need to be aware of all these extra copies and whatnot; they can just focus on, hey, this is the data set I need to run this dashboard on, this is the query I need to run against it, and get those insights in a quick, efficient way, again reducing that time to insight, so you have an instant, or much closer to instant, time to insight. So let me show you some examples.

Analytics on Dremio is Less than Half the Cost of Snowflake

Here, we have a company taking a look at three-year analytics total cost of ownership. When you take a look at end-to-end analytics TCO, you're basically analyzing everything from end to end, from the point where you're pulling the data from the source to serving that dashboard or that report. So it's going to count all those different steps, all those ETL costs and egress costs, to take a look at the complete picture. Here we're comparing a Dremio large cluster, which is eight nodes, versus a Snowflake large warehouse. That's the comparison, and again, you can download the whitepaper I shared earlier to see the dollar-for-dollar breakdown.

But as you can see here, with Dremio you saw a cost over three years of $1.4 million versus Snowflake’s $2.9 million when you break it all down, and that's more than 50%. And while this is just one example with Snowflake, some customers have seen savings of 50% or more in different situations. We recently had an episode here on Gnarly Data Waves where SMP talked about how they saved 50% of their costs by using Dremio for more and more of their workloads. So it's a real thing, and it's a transformative thing, because you can think about what you can do with that money: what data projects currently aren't funded that you could fund to expand, so you not only get insights faster but increase the types of insights you're getting. Those savings go a long way. There's a lot of value there beyond just the dollars and cents.

Fortune 10 Customer - 75% TCO Savings, $3M Savings in Just One Department

Here's another Fortune 10 customer who had a 75% TCO savings, a $3 million savings in just one department of the company, so imagine if they had adopted this pattern in all of their departments. The original story is that they were using Snowflake. As you can see here, they had the data in S3, then they would move it into Snowflake, then they would generate all their data marts in Snowflake, and then they would generate all their extracts, about 700 million records extracted, to power all these dashboards. So you can see there are a lot of steps there, and the dashboards were taking three to four minutes. Every time someone turned a knob on the dashboard to see the data a little differently, they had to wait three or four minutes for the data to update: click, go get a cup of coffee, come back, see the dashboard updated. Again, that's not the time to insight you're looking for.

But with Dremio, it got a lot simpler. They landed their data in the data lake, so that's where the data lives, just one copy of the data. Dremio delivered the data directly to the Tableau dashboards; in this case it was just live queries, with no need for any further acceleration, and they were able to deliver that data within five to fifteen seconds per click. And that's without using things like Reflections, which can get that down to sub-second, so pretty sweet. So again, that's a reduced time to insight on top of the 75% TCO savings. They're saving money, but they're also getting value to the business and getting the insights they need to make important business decisions and grow their business. At the end of the day, what matters is not just what you spend on your data platform but what you're getting out of it. In a world where decisions are made in a split second, you need that nimbleness, that flexibility, to make those decisions that quickly, and you need a platform that allows you to do that.

91% Lower TCO with Dremio Compared to Snowflake

Here's another example. Here we see a 91% lower TCO with Dremio compared to Snowflake. This company was a global leader in the manufacturing of commercial vehicles.

Now, what happened here is that when they were using Snowflake, it cost them $47.20 per hour to run certain self-service and BI workloads on Azure and AWS when they broke it down. To run those same workloads, you can see the specific compute used here: with Dremio, we're talking about five nodes of m5d.2xlarge instances, while on Snowflake the comparable analog was the equivalent medium warehouse. With Snowflake it cost $47.20, while Dremio cost just $4. That is quite a difference in savings for doing the same work, and again, this didn't require any data movement, no having to copy the data, because we're capturing the full picture here: all those other costs you don't realize you're paying beyond just the pure workload cost. You see here that there's a huge reduction, a huge savings, and these are real savings from real companies using Dremio over what they were doing before.

Dremio is Over 3 Times Faster than Snowflake

So now, in this example, what we see here is Dremio working with a large pharmaceutical company, and they are working with workloads on Azure. As a comparison, we ran those same workloads on Dremio using the Parquet files on ADLS with a 10-node, 120 GB, 16 CPU cluster; the analogous cluster on Snowflake would be the extra-large warehouse, so it's as close to apples to apples as possible, and again, all of this is detailed in that whitepaper we shared at the beginning. What happened here is that when we ran those workloads on Snowflake, they took about 30 seconds, but those same workloads through Dremio took seven seconds. So that's quite a bit faster, over three times faster. And think about it: if you can get your data insights faster, you can make those nimble decisions you need to make to transform your business faster. What a difference that makes. Imagine that!

Dremio vs. Snowflake

So let's sum it up: Dremio versus Snowflake. When we take a look at the data ingest side, that one is pretty straightforward. You don't have to ingest data into Dremio; Dremio just connects to your data lake directly, so the ingestion you've already done into your data lake is enough. With Snowflake, you do require additional ingestion from your data lake into your data warehouse. Granted, there are newer external table features and things like that, but to get the full performance of Snowflake, you still end up ETLing your data into Snowflake.

And that means you're going to have those ingestion costs, those egress costs, and those additional storage costs. Then, when it comes to last-mile ETL, with Dremio you don't have to make more copies of the data just to add a column or remodel some data. A lot of that can be done virtually, and it can be done virtually with performance; that's Dremio’s secret sauce, which makes creating that virtual data mart practical. With Snowflake, you still create these things through lots of physical copies of your data, and again, more copies means more governance, more work, more data to manage, plus the storage costs. Now, as far as the long-haul transformations go, meaning the major transformations, like when you're curating that Bronze, Silver, Gold layer, that's not a core Dremio capability yet; at least for the time being, you'll probably still be using Spark for that. You might do some of that in Snowflake, but Snowflake compute tends to be more expensive than doing it directly from the data lake with a Spark cluster, so yeah, it's more expensive.

Now, as far as user experience goes, and this is even more important for making it easy for people to work with the data, Dremio's semantic layer is built in as core functionality; it's part of the fabric of what Dremio is, that user interface where you can easily discover, organize, curate, and document data. With Snowflake, there isn't a semantic layer; you use third-party services, things like AtScale or Cube, which are used with Snowflake and Databricks and others, but that's another service, more things to configure, more things to make sure integrate. And you're still managing all these BI cubes and extracts externally a lot of the time. So it's just not quite as complete a story. Then there's acceleration.

Dremio has things like C3 caching, which caches data that gets queried very often so it doesn't have to keep being fetched from your S3, reducing your S3 network costs and your compute costs.

Reflections, that functionality in Dremio, can just do away with materialized views and cubes because it supplies the same thing in a much better form. With Snowflake, you can use materialized tables or materialized aggregates, but again, there's maintenance work and configuration work, they generally map one-to-one with a single table, and with materialized views you're often creating a whole new namespace, so analysts have to know to query that namespace, not the original table. That's not an issue with Dremio, because Dremio will intelligently substitute the reflection whenever you query the original table.

Data curation and federation with Dremio: you can do a virtual join. You might have data from the Snowflake marketplace, and you can virtually join that with the data from your data lakehouse, so you don't have to have all your data in Snowflake to benefit from the Snowflake marketplace. You can also create your own custom materializations with external Reflections; that's something you can do in Dremio, while with Snowflake you still have to do a lot of ETL and generally work within a sandbox environment.
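To make the virtual-join idea concrete, here is a hedged sketch that defines the join as a view and queries it over Arrow Flight, reusing the same pattern as the earlier sketch. The source names, schemas, endpoint, and credentials are all placeholders; the point is that the join is virtual, so neither side gets copied.

```python
# Sketch of a federated, virtual join: marketplace data from a Snowflake
# source joined with lakehouse tables, defined as a view so no data is copied.
# Source names, schemas, endpoint, and credentials are placeholders.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://lakehouse.example.com:32010")
options = flight.FlightCallOptions(
    headers=[client.authenticate_basic_token("analyst", "password")]
)

def run_sql(sql: str):
    """Submit a SQL statement over Arrow Flight and return the result table."""
    info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
    return client.do_get(info.endpoints[0].ticket, options).read_all()

# A virtual dataset joining a Snowflake marketplace table with an Iceberg table.
run_sql("""
    CREATE OR REPLACE VIEW marts.enriched_orders AS
    SELECT o.order_id, o.amount, d.segment
    FROM lakehouse.sales.orders o
    JOIN snowflake_marketplace.demographics.segments d
      ON o.customer_id = d.customer_id
""")

# Analysts query the view like any other dataset.
print(run_sql(
    "SELECT segment, SUM(amount) AS revenue FROM marts.enriched_orders GROUP BY segment"
))
```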

And then, as far as query redirection goes, Dremio will automatically rewrite your queries. That's where the Reflections feature comes in: if you're creating Reflections to help speed up raw queries or aggregate queries, the Reflections are aware of which data sets they're related to, so when a query comes in for that original data set, Dremio will intelligently swap in the optimized representation. The end user doesn't need to worry about that, and neither does the data engineer have to build ways to rewrite and redirect queries.

In Snowflake, you get rewrites only for materialized views without joins; otherwise, the user must direct the query to the materialized view or the base table to get those results. So you see, it's not just about cost, it's also about ease of use. Dremio’s whole purpose is to make things more efficient and easier, and to do it directly from your data lake storage, so that you don't have to move your data over and over again and you get the benefits of that.

Download the Whitepaper on Data Warehousing

But with that, that wraps up this presentation. I welcome you to download the whitepaper on data warehousing at less than half the cost of Snowflake and read our blog, “Using Dremio to Reduce your Snowflake Data Warehouse Cost.” You can scan these QR codes to access them. Also, I invite you to inquire about getting a free architectural workshop with us at Dremio, where we can identify your needs and your goals, help you determine a plan to best get there, and see whether Dremio is a fit for helping you achieve those goals and fix the problems you may have.

Q&A

 But with that, let's head over to Q&A. My name is Alex Merced, a developer advocate here at Dremio, and hopefully, now you understand a little bit better about how you can reduce your Snowflake costs by 50%.

Hello! This is a question from after the episode that I just want to make sure I cover. We got a question just as the session was ending asking: can you talk about Reflections, whether there's any data that needs to be generated and stored, and where that data is stored? So, just to expand on data Reflections: the way Reflections work is that when I flip that little switch and say, hey, I want Reflections on this table or this view, what Dremio essentially does is generate an Apache Iceberg table that backs the object you're creating a reflection on, the table or the view. That will be stored on your data lake. Generally, when you configure Dremio, you configure some sort of data lake, whether it's AWS, ADLS, Hadoop, or GCP, so some sort of data lake will be connected to your Dremio account, and that's generally where it's going to store the Reflections. That's different from the Columnar Cloud Cache, which caches data on repeated requests and stores it on the individual nodes in their NVMe storage. I'll leave it at that, but essentially what happens is that Reflections create an alternative representation on the data lake that can be used to accelerate the query. It accelerates the query for a variety of reasons, depending on how you use it; for example, it might accelerate the query because it's on a view that covers a subset of the data.

So, for example, if I had a table with a million records and I created a view with a thousand records, I could turn on Reflections on the thousand-record view, and that's going to create a thousand-record representation. Just by the fact that it's a narrower representation of the data, I'm going to get performance. But on top of that, you get a couple of other benefits because of the way Dremio does it: one, because it's an Apache Iceberg table, you have the Apache Iceberg metadata providing you with lots of statistics that allow you to scan the files that make up that table much more efficiently, and two, because the data lake is separate from Dremio’s cluster, you can get practically infinite scaling.
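As a hypothetical illustration of that million-to-thousand-record pattern, here is a hedged sketch that creates a narrow view and enables a raw reflection on it, reusing the hypothetical run_sql helper from the federation sketch above. The names are placeholders and the reflection DDL syntax varies by Dremio version, so treat it as a sketch rather than copy-paste DDL.

```python
# Illustration only: a narrow view over a large table, accelerated by a raw
# reflection. Reuses the hypothetical run_sql helper defined earlier; names
# are placeholders and reflection DDL syntax varies by Dremio version.

# A view that trims a ~1M-row table down to the slice analysts actually use.
run_sql("""
    CREATE OR REPLACE VIEW marts.recent_orders AS
    SELECT order_id, customer_id, amount, order_ts
    FROM lakehouse.sales.orders
    WHERE order_ts >= DATE '2024-02-01'
""")

# Enable a raw reflection on the view. Dremio materializes it as an Iceberg
# table on the configured data lake and refreshes it on the schedule you set.
run_sql("""
    ALTER DATASET marts.recent_orders
    CREATE RAW REFLECTION recent_orders_raw
    USING DISPLAY (order_id, customer_id, amount, order_ts)
""")

# Analysts keep querying the view; the planner substitutes the reflection
# automatically, so nobody has to target a separate namespace.
print(run_sql("SELECT COUNT(*) AS row_count FROM marts.recent_orders"))
```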

For example, another really nice strategy with Reflections is to create Reflections on database tables. You might have a SQL Server, and the problem is that once the SQL Server's capacity is met, it's met; the scaling proposition is just different when it comes to horizontal and vertical scaling. But with Dremio, if you have a Reflection on that particular table, then Dremio can scale up additional compute, whether that's new clusters or a bigger cluster, to run queries without running out of resources. And when Dremio runs the computation against the reflection, it isn't affecting other concurrent operational queries going directly to your SQL Server, because it's no longer pushing the query down to your SQL Server; it's using the reflection. Periodically that reflection will refresh, and at that point it will query the SQL Server to get the updated version, but otherwise all your analytical queries are hitting the reflection. So you're not splitting your database's capacity between operational and analytical queries. There are a lot of really good ways you can use Reflections.

Closing

Hopefully, you found this interesting and useful, and I’ll see you all soon!

Ready to Get Started? Here Are Some Resources to Help

Infographic: Quick Guide to the Apache Iceberg Lakehouse

Analyst Report: It’s Time to Consider a Hybrid Lakehouse Strategy

Case Study: Navigating the Data Mesh Journey: Lessons from Scania’s Implementation