May 2, 2024

Freedom, Creativity, and Results – The Promise of Open and Autonomous Data Platforms

Sendur Sellakumar, CEO of Dremio, alongside industry leaders, will illuminate the critical role of open architectures and user self-service in fostering innovation. The keynote will focus on dismantling barriers to better insights and applications by making open technologies more accessible and cost-effective.

Through a blend of technology and real-world applications shared by customers, the session will demonstrate how the combination of open platforms, self-service capabilities, and autonomous innovations is driving significant results at some of the top organizations in the world.

This keynote promises to offer a vision where the freedom provided by open architectures, combined with innovative technology, equips analysts, data scientists, and engineers with the tools they need to explore, innovate, and deliver.

Topics Covered

AI & Data Science
Dremio Use Cases


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Providing Value to Customers

Sendur Sellakumar:

Thank you, Colleen. I’m so excited to be here today. A lot of content to go over. Great faces in the audience, customers and prospects. You know, I’m certainly here to talk about Dremio, but I’m also here to talk about customers. The ethos of this company, the DNA, is around customers. What do we do that delivers customer outcomes? Yes, the technology is important, but in that customer-back thinking, what customer outcomes are we delivering? Today, with our partners and with Dremio, we’re hoping to deliver a lot more value to customers. 

Now, since I’ve joined the organization, I’ve met with hundreds of customers and prospects, from mid-market to Global 2000. And a common trait across all of them is that they want to be a data-driven organization. They want every individual at that company to be able to leverage data in their operations, whatever the function might be. They also want to leverage AI. But as we all know, the foundation of AI is data. We’ve heard that multiple times. Everyone wants to be empowered. And that means more queries, more analytics, and empowering them to do the things they need to do. They want to move quickly, at the speed of their questions. They want to have data confidence, so that their data is accurate, complete, and fresh. But then reality strikes. And what reality shows, even with the modern data stack, is that these common challenges are the same across industries. As I said, they want to move at the speed of their questions. They want insights quickly. But too often, when that question changes, they’ve either got to create that big table to get at their data, or they’ve got to go back to the central team, go through ETL processes, go through changes in their Python flows, their Spark flows, et cetera. But that also separates the domain expert from that central team. We want to break that. We want that central team to do what they love doing, which is building greater data technologies to empower the organization, not handling individual Jira tickets on data changes. We want that data analyst to be able to do what they need to do and not have to go back to the central team. So how do we empower folks? Because in truth, when you look at it all together, customers are ultimately stuck in neutral. 

Empowering our Customers

We see three common themes across our customer base and the prospects we talk to. They’re dealing with proprietary systems, whether legacy enterprise data warehouses, cloud data warehouses, or other systems. And they’re seeing that their access to data has to go through these systems. That’s the first problem. The minute I go through said system, I’m paying that tax. And if I want to move away from that vendor, that’s a challenge. The vendors have often prioritized themselves over customer choice. Dremio is about customer choice. The second thing we see is operational bottlenecks. Certainly, the advent of modern BI technologies, whether it’s Apache Superset, Tableau, Power BI, et cetera, has enabled that data analyst to be more flexible in analyzing their data. But when that question changes, they’ve got to go back to the raw data. They’ve got to look at the lineage. They have to iterate. And they’re relying on those central teams to change up Spark pipelines or what have you to make things happen. I call this the data do-loop. We need to prevent this data do-loop, because it slows down everybody, both central teams and end users. And of course, high costs. As I mentioned, you want everyone in the organization to leverage data, right? And if you took that and said, well, I’m going to extrapolate my query cost significantly over the course of the next few years, you’d find your budget’s going to go out of whack. Customer choice means you can run analytics with the engine you want, because ultimately, it’s your data. You don’t have to go through a vendor to get at your data. And we want to make sure that our data analysts aren’t held hostage to a query tax every time they run a query. 

How Did We Get Here?

How do we address this? How did we get here? Well, certainly, we had enterprise data warehouses for many years. And we started this advent of data lakes probably a little over a decade ago. And data lakes were meant to be open and flexible. But they were complex. And they didn’t ultimately solve a lot of the SQL workloads that our customers were running in enterprise data warehouses, which is ultimately why cloud data warehouses came out. Well, they solved some of the hardware challenges of an on-prem enterprise data warehouse. But they were closed and proprietary. Some were easier to use. But again, they were closed and proprietary. You still had to ETL data into these cloud data warehouses. And often, when I talk to customers, because of the cost of that, they’re only sending in a fraction of their data. So now, if that data analyst needs more information, you’ve got to create a new ETL flow back into a central data warehouse. So all these challenges come up: flexibility, openness, cost. Which is why we’re so interested and excited about this open lakehouse concept. I’ll come back to the word “open” in a moment. We love the lakehouse concept because it is open and flexible, certainly. But it also brings the full data warehousing capabilities that traditional data lakes were not able to provide almost a decade ago. And today, we’re seeing this massive increase in Iceberg-centric lakehouses, because Iceberg is becoming not only the standard table format that enables data lakehouses, but also a pattern that allows freedom and choice for the customer. Because ultimately, it’s your data, your metadata, and your storage. We want to make it flexible and easy at a cost that makes sense for customers. Now, there are plenty of lakehouses out there. It’s becoming more accepted. 

What Does It Mean to Be “Open”?

But I want to be clear about what “open” means. For Dremio, open is not just about open source. Yes, we can all be open source. But what open really means is that you have a vibrant ecosystem. That ecosystem prevents vendor lock-in. No single vendor controls the ecosystem. And it allows flexibility and ultimately choice for customers. This is why we’re so excited about Iceberg. So we view open differently. Open is not just about the source being open. It’s about openness that allows flexibility, choice, and freedom for the customer. And ecosystems ultimately accrue value to the customer. And so when you think about where we are, we’re in the Iceberg world. We’re so excited about this, not just because our logo is up there, but because there are logos of other options, lots of vendors, lots of solutions, across different areas of the data space: AI, streaming, data observability, data quality, business intelligence. Iceberg provides an open framework to do this. It provides interoperability, so customers get choice. So if tomorrow they wanted to move from one solution to another, there isn’t this vendor lock-in. Because ultimately, it’s the customer’s data. It’s not the vendor’s data. That’s what’s most important for us as a company. That’s what drives us in everything we do: customer choice and flexibility. 

Now, we’ve obviously chosen Iceberg as the core of Dremio. But we should also share the momentum we’ve seen on Iceberg. The screen here shows you how the number of Iceberg commits has increased over the last several years. And you can see the momentum continues to grow. These are commits from hundreds of committers, and over 100 different organizations are contributing. That’s clear momentum. There isn’t one organization that’s committing 90% of the code. Why is that important? Because that’s what open means. That’s what flexible means. You can leverage this Iceberg community as you wish, central to your strategy, and not worry that in two or three years’ time, some ecosystem change will happen. We love that. And we’re big contributors to Iceberg. And we are hugely thankful for that community and what it enables for customers. And we believe this momentum will simply accelerate. Now, open is in our DNA. We were actually the original creators of Apache Arrow, and Calcite and Parquet are additional projects that we contribute to. We’re major contributors. We’re PMC members in some cases. And we’re a top-five Iceberg contributor. Why is that important? It means there are lots of people contributing. And it’s central to our strategy. So open is truly in our DNA. 

Now, speaking of open, what we love is that we’ve actually written the book on Iceberg. One key point here: we recently released the O’Reilly book on Iceberg. There’s a QR code on the screen here. You can scan that and actually get a PDF copy. And as you can see on the screen (I love it, the phones are coming out), there are additional data points here on what we’re doing around Iceberg. In fact, we’re co-hosting the first official Apache Iceberg Summit, May 15 through 16. So again, it’s not just about Dremio. It’s about having this open community around Iceberg, which we believe is going to be central for data lakehouses going forward. 

So we talked about being open, talked about the momentum on Iceberg. But we also have to play our part, which is why I’m very excited to show you that Dremio is built to handle the true definition of what a lakehouse is. The original vision of a lakehouse was what? That you could bring any engine to the data lakehouse. Any engine to your data, to do whatever made the most sense for your business and your users. So that is an ethos of Dremio. We see that certainly in the table format. But we believe we’re already starting to see it in the metadata catalog area as well. Too many vendors are saying, you must go through my catalog to access your data, which we find a little contrary to the core mission of a lakehouse. So with Dremio, we’re able, as a SQL engine, to support multiple catalogs, and, as an overall lakehouse platform, to provide a catalog as well. So again, this is about choice for the customer. You pick your engine, you pick your catalog, and it works seamlessly. 

Support for the Iceberg REST Catalog Specification

And speaking of that, I’m excited to announce that Dremio will formally support the Iceberg REST catalog specification. Very excited about this. Now, what is this all about? It’s about the customer. It’s about that ecosystem and that freedom. You get to choose Dremio for what it’s great at. If you want to use Spark or other technologies, have at it. Why? Because they all support the REST specification. And so we are hopeful that other vendors will continue to gain momentum here and provide support for the REST specification. Why is that important? Because it’s all about the customer. It accrues value to the customer, who gains flexibility in what they’re doing in the system. And it gives a common foundation for interoperability as these technologies evolve. We’re very excited about this. Now, when we think about Iceberg, that’s great. It’s open. But open is not enough. For you to go into production in these environments, you need to make open enterprise-grade. That means you have no trade-offs when you choose open technologies. And that’s where Dremio comes in. So open is critically important, but enterprise-grade is equally important, which is why we’re also excited to announce the merging of our metadata catalog into our enterprise software product, Dremio Software. 
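As an illustration of what REST-spec support buys the customer: pointing another engine, such as Apache Spark, at any catalog that implements the Iceberg REST specification is purely a configuration exercise. A minimal sketch, assuming a placeholder catalog name, URI, and warehouse path (none of these is a real Dremio endpoint):

```properties
# Sketch of spark-defaults.conf entries for an Iceberg REST catalog.
# The catalog name "lakehouse", the URI, and the warehouse path below
# are placeholders for illustration only.
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.type=rest
spark.sql.catalog.lakehouse.uri=https://catalog.example.com/api/iceberg
spark.sql.catalog.lakehouse.warehouse=s3://example-bucket/warehouse
```

Because every REST-compliant engine reads the same specification, swapping the engine behind this configuration does not require rebuilding the catalog.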

Now, what that means is that we already have an open source project in Nessie. But now we’re merging that into Dremio Software. So when you deploy Dremio Software or Dremio Cloud, you get a fully capable Iceberg catalog along with it. And because we support the REST specification, you can use it, or you can choose to use a different catalog as you wish. And we’ll maintain our commitment to open by continuing to support and invest in our open source Nessie project. So you get the combination of open, through open source Nessie, and a full, robust Iceberg catalog in both Dremio Software and Dremio Cloud. Flexibility for the customer. The second piece of enterprise-grade is enterprise scalability and consistent performance. 

I’m excited to announce automatic scaling for Dremio Software. Now, Dremio Software has scaling, certainly, and a lot of technologies do. But the difference with both Dremio Software and Dremio Cloud is that we’re taking our scaling capabilities and merging them with our intelligent workload management. So what does that mean? Does that mean simply adding nodes? Yes, certainly adding nodes and compute performance to adapt and be relevant for your queries. But equally importantly, it means taking that workload management and aligning it to your budgets and your SLAs. As you all know (I certainly don’t have to repeat this for the data-centric audience here), you have certain users that you need to deliver sub-second performance to. And there are certain users, and I was an intern for many years so I’ll say interns, where you may not want to give all your budget to the new intern in the audience. So how do we make that seamless and easy? That combination of automatic scaling and intelligent workload management allows customers to get the SLA they want for the workloads that matter, separate workloads as appropriate, and make it seamlessly affordable and easy to use. That’s our goal with enterprise-grade scale and performance.
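The combination of scaling and workload management described above can be sketched in miniature: queries are routed to separate compute pools by SLA tier, and each pool scales within its own budget cap. This is an illustrative toy, not Dremio’s implementation; the tier names, pool limits, and scaling rule are all invented for the example:

```python
# Illustrative sketch (not Dremio's implementation) of SLA-aware workload
# management: interactive users get a fast pool, ad-hoc users share a
# budget-capped pool, and each pool autoscales within its own limits.
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    min_nodes: int
    max_nodes: int

# Hypothetical tier-to-pool mapping; real workload rules would be richer.
POOLS = {
    "interactive": Pool("interactive", min_nodes=4, max_nodes=32),
    "adhoc": Pool("adhoc", min_nodes=0, max_nodes=4),
}

def route(user_tier: str) -> Pool:
    """Pick a pool for a query; unknown tiers fall back to the ad-hoc pool."""
    return POOLS.get(user_tier, POOLS["adhoc"])

def scale(pool: Pool, queued_queries: int, per_node_capacity: int = 2) -> int:
    """Naive autoscaling: enough nodes for the queue, clamped to pool limits."""
    needed = -(-queued_queries // per_node_capacity)  # ceiling division
    return max(pool.min_nodes, min(pool.max_nodes, needed))

print(route("interactive").name)        # interactive
print(scale(POOLS["interactive"], 10))  # 5
print(scale(POOLS["adhoc"], 100))       # 4 (capped by the pool's budget)
```

The point of the clamp is that a flood of ad-hoc queries can never consume the interactive pool’s capacity or exceed its own budget, which is the separation of workloads the talk describes.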

Stability, Durability, Performance

Now that we have scale, what about durability? We’ve been spending a lot of time within Dremio continuing to invest in our engine. And that engine has a lot of guts and details. And there are some words on the screen about some of the engineering projects we’ve been working on to make it more and more efficient, but also more durable. And we have a phrase within Dremio: no query fails. That is our basic approach. And so in cloud, where we get data from the millions of queries that come from our customers, we see our query failure rates typically sub-0.5%, and on a good day, 0.1% or 0.2%. That’s pretty powerful. And we have a maniacal focus on continuing to decrease that as much as possible. I’d be naive to believe we’ll get it to zero. Maybe we will. But our goal is to work continually towards that. It’s important because not only do you want consistent performance, you want durability when that query comes in. The last thing you want is a call from a user saying, hey, my query didn’t work. How do we make this seamless and easy? And because our cloud environment shares a lot of its code base with our self-managed environment, you get the same power and query resiliency in our self-managed product. That’s our continued mission: stability, durability, performance. 

Enterprise-Grade: Lakehouse Management

Now, we’ve talked a lot about the engine. What about the full lakehouse? We’ve already added automatic table optimization and cleanup. Because while adopting Iceberg is great, you also need to actively manage it in the context of an enterprise deployment. So now with Dremio, you fully get the benefit of automatic table optimization and cleanup. Small-file compaction, garbage collection: again, this is about that autonomous data platform where I started the conversation today. It makes it more seamless for the customer to leverage and adopt Dremio. We’ve talked about scalability, performance, durability. We’ve talked about management. Now let’s talk about security. If you deploy in an enterprise scenario, you also have to integrate with the major secrets manager systems. So I’m happy to announce integration with HashiCorp Vault, AWS Secrets Manager, and Azure Key Vault as part of our integration with enterprise systems. 
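To make the small-file problem concrete, here is a toy compaction planner in the spirit of what automatic table optimization does: it greedily bins files below a size threshold into rewrite groups near a target file size. This is an illustrative sketch only, with invented thresholds, not Dremio’s or Iceberg’s actual algorithm:

```python
# Toy sketch of small-file compaction planning (assumption: not the real
# algorithm). Files below a threshold are greedily binned into rewrite
# groups of roughly the target file size; large files are left alone.
def plan_compaction(file_sizes_mb, target_mb=256, small_threshold_mb=64):
    small = [s for s in file_sizes_mb if s < small_threshold_mb]
    groups, current, total = [], [], 0
    for size in sorted(small, reverse=True):
        if total + size > target_mb and current:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    # Only groups with 2+ files are worth rewriting.
    return [g for g in groups if len(g) > 1]

plan = plan_compaction([300, 50, 40, 30, 20, 10, 5])
print(plan)  # [[50, 40, 30, 20, 10, 5]] -> one rewrite merging six small files
```

Merging many small files into fewer right-sized ones is what keeps query planning and scan performance from degrading as tables accumulate writes.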

This is critical for enterprise rollouts. So as you can see in this series of announcements, we’ve got performance and automatic scaling across Dremio Software and Dremio Cloud. We’ve got security around secrets managers and so forth. We’ve got durability. And you have scale and performance. All coming together. And this is across both Dremio Software and Dremio Cloud. Speaking of flexibility, we are now formally ready to announce Dremio Cloud on Azure. We all know data has gravity. So how do we make it easy for our customers and prospects to leverage data where it sits? It’s about flexibility and choice for our customers. And that is critically important. 

Now, one of the unique things about Dremio’s cloud architecture is that we actually separate the compute plane from the control plane. So you keep your data and your compute in your account, in your AWS or Azure account. And we actually run our control plane separately. Why is that important? It’s important because as there are advances in compute, as there are advances in performance, you get an automatic TCO reduction. We’re hiding what compute runs underneath the covers. We’ll talk a little bit about that later today and some of the performance advantages we’re able to provide in the latest versions of our software. This provides much more flexibility and cost optimization for the customer. 

Now, I like partners that are so customer-centric, they drive great outcomes. Speaking of a partner that has like-minded solutions, I’d love to introduce a good friend of mine, Colleen Tartow, from VAST Data. Colleen, come on stage.

VAST’s Partnership With Dremio

Colleen Tartow:

Hello, Subsurface. I am very excited to be here to tell you all about our reinvigorated VAST and Dremio partnership. Now, if you don’t know VAST Data, we are an exabyte-scale data platform based on an all-flash data lake for AI, BI, high-performance computing, and more. And VAST focuses a little bit sooner in the pipeline than Dremio. We focus on sourcing, ingestion, processing, and curation of data, whether that data is structured, unstructured, or semi-structured. One of the key features of the VAST data platform is the VAST database. We like to call it the “VASTabase” sometimes; it rolls off the tongue. So the VAST database breaks trade-offs. It’s both transactional and analytical, which is very exciting. And we really focus on the three pillars of performance, scalability out to exabyte scale, and cost efficiency. And part of our partnership with Dremio that I’m announcing today is our new VAST database connector for Dremio. And this unlocks a lot of use cases for us. And it’s very exciting. 

Zero Trust Data Lakehouse Platform

But one key use case I’m going to talk about today is Zero Trust, which may be something you’ve heard about recently. It’s been in the news a lot. If you don’t know about Zero Trust, the one key takeaway you need is that it’s not a technology. It’s actually a set of architectural principles. And this is designed to safeguard your data and ensure robust cybersecurity in your environment. And you’ll see here the core tenets of the Zero Trust mandate, which come from an executive order from the Biden administration. And essentially, what this is doing is it’s compelling organizations to design their security stance to never trust and always verify. So verify everything, trust no one. And compliance with this is actually going to be required for US federal agencies by the end of 2025. And as you might imagine, this is a very significant endeavor for most organizations. And therefore, working with the right partners is going to be critical to success on your path for Zero Trust compliance. 

And so that’s why I’m happy to announce the Zero Trust Data Lakehouse platform featuring VAST and Dremio. Together, these technologies create an end-to-end data pipeline for security incident and event management data. So if you’ve heard of SIEM data, that’s what we’re talking about here. And this has traditionally been handled very well at small scales by tools like Splunk. But with these new mandates that have come out, there are new scales that we now need to address. The need to store significantly larger volumes of data with the requirement to be able to run highly selective and performant queries very efficiently is exactly what VAST and Dremio are built for. 

So for example, a traditional solution like Splunk might handle storing, monitoring, and analyzing data within one platform. And it might keep roughly 30 days of data. But typically, this would be some minimal subset of data that wouldn’t provide a comprehensive, long-term, system-wide view of SIEM analytics. And that is the new mandate that’s coming down. The federal government has said that organizations now need to include 18-plus months of data, 18 months minimum. And there’s a significantly different volume of data now. Instead of talking about gigabytes and terabytes, we’re talking about petabytes and exabytes. And like I said, this is the perfect use case for the VAST database, which is purpose-built for that scale, running highly selective queries in a cost-efficient and performant manner at exabyte scale. I’m very excited to tell you more, but I am going to hand back to Sendur. If you would like to hear more, we have a demo at the VAST booth out there. And I’m also giving a deep dive into the VAST database connector for Dremio tomorrow at 2 PM during my talk. Thank you very much.
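The jump from a 30-day window to an 18-month mandate can be made concrete with simple arithmetic. The daily ingest figure below is invented purely for illustration; only the retention periods come from the talk:

```python
# Back-of-the-envelope sketch: how retention mandates multiply SIEM
# storage requirements. The 2 TB/day ingest rate is an assumed example.
def retention_storage_tb(daily_ingest_tb: float, retention_days: int) -> float:
    """Total raw storage needed to retain `retention_days` of ingest."""
    return daily_ingest_tb * retention_days

old = retention_storage_tb(daily_ingest_tb=2.0, retention_days=30)   # ~30-day window
new = retention_storage_tb(daily_ingest_tb=2.0, retention_days=548)  # ~18 months
print(old, new)   # 60.0 1096.0 -> from tens of TB toward PB territory
print(new / old)  # roughly an 18x increase in data to store and query
```

At higher ingest rates the same multiplier pushes totals into the petabyte and exabyte range the speaker describes, which is why retention alone reshapes the storage and query architecture.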

Sendur Sellakumar:

Thank you so much, Colleen. That’s awesome. Solving customer problems. Leveraging great technology in VAST and Dremio, but solving customer problems. That’s powerful. Meeting customers where they are, because they’ve got data sitting on-premises and in the cloud. Now, speaking of meeting customers where they are, I want to introduce our next guest to the stage. Our guest comes from StackIt, which is part of the Schwarz Group. The Schwarz Group is just the number one retailer in Europe. So it’s a small organization. Now, I’d like to invite Benjamin Schweizer from StackIt.

Benjamin Schweizer:

My name is Benjamin Schweizer. I’m a senior manager at StackIt. StackIt is a cloud platform built by the Schwarz Group, Europe’s largest retailer. I joined StackIt five years ago, when we were in the early stages. Now we are close to 400 people. And I’m responsible for the development and operation of multiple products, one of them being in the field of data and AI; you’ll learn about that later. 

The Schwarz Group: Diversity is our Strength

So the Schwarz Group, as said, is Europe’s biggest retailer. We have the branches Lidl, Kaufland, Schwarz Produktion, PreZero, and Schwarz Digits. Lidl is also in the US, so you might know it. The Schwarz Group has 575,000 employees. We are making over $150 billion in sales volume and operate more than 30,000 stores. That’s really big. But we are not only about shops and selling. In recent years, we looked into the circular economy. Now, what does that mean? With the circular economy, we integrate production, retail, and recycling. A lot of the goods we bring to our stores are produced on our own. And after that, we recycle that material to produce granulate to create new water bottles. And with that, the Schwarz Group produces more water bottles in Germany than Coca-Cola does. And we are also the biggest recycling company in Europe. So it’s all about scale and size. And for all of that, we need digitization. And with that, we moved all the IT and digitization endeavors into our branch, Schwarz Digits. That’s more than 7,000 people. In the picture, you see our new headquarters, located in the Neckarsulm region of Germany. And we are working on business applications, online shops, mobile apps, data and AI, and basically a lot of the stuff we need internally. And with StackIt, we are bringing this to the outside world and opening up our services as a platform. Aside from that, we are also pushing on the AI topic. And this picture shows the new Innovation Park AI, also built in Germany, which is a big investment by the Schwarz Group and other German companies to build the foundation for more independence in Europe and to drive development in that field. 

And with all of that, we are building our own cloud platform, which is StackIt. Now, with StackIt, we want to push for an independent, digitally leading Europe. Together with our customers and partners, we are laying the foundation for a digital ecosystem. And in this way, we are creating the conditions for independence, growth, and future viability in Europe, offering infrastructure services like compute, network, and object storage, but also platform services like managed Kubernetes, managed databases, message queues, and all the parts that you need to run business applications on top. And this is being built by the Schwarz Group to ensure political and economic stability. And yeah, so that’s StackIt. 

What to Do With Our Data?

We asked the question: what to do with our data? And we see the data and AI landscape moving fast. We see the transition from data lakes to data lakehouses. We have the demand for open standards to integrate different tools, not one-size-fits-all, but a composition of different components. We want to have self-service for our own users, and so do our customers. And we see a rapid growth in data, especially with AI use cases. And after all, sovereignty is very important for us. As you saw with the integration of production and so on, this is also about sovereignty and independence. 

STACKIT Data and AI Platform

Now, what does Dremio offer us? Dremio is committed to open data formats, which is a great match. It brings a cost reduction of up to 80%. And in retail, cost is always important. It brings 5 to 10 times faster analytics and AI projects. And there is the Iceberg format, which brings the openness we just heard about, and the Iceberg data catalog. So today, we are announcing the StackIt data and AI platform. [APPLAUSE] With the StackIt data and AI platform, we have a modular solution to ingest, process, and access large amounts of data. We can connect business applications, enterprise systems, IoT devices, and sensors. We can realize various use cases: business intelligence and interactive dashboards. We can build data-driven decisions and business models, integrate AI and machine learning, and low-code and no-code applications. So when you open up the box, what’s inside the StackIt data and AI platform? There is ingestion, based around our own product named Intake, which is compatible with Apache Kafka. We have the access layer, built around Dremio, which brings the SQL interface, self-service, and options for more integrations. We do visualization with Apache Superset. But due to the openness, we can connect any visualization on top of that. We are starting with a private preview. So you can sign up, and we can start with the first customers there to learn about their specific demands and adjust and adapt the platform for future growth. We have a dedicated data and AI consulting team for that. And this is starting today. And with that, we want to support the StackIt vision of an independent, digitally leading Europe. Thank you very much. And now, back to Sendur.

Sendur Sellakumar:

Thank you, Ben. Super powerful. I love that. VAST and StackIt, great partners providing choice and flexibility for customers. We’re going to see this trend continue. We’re excited to collaborate with our partners and excited for what we’re going to bring to customers. So we’ve talked a lot about proprietary systems and how being open and flexible means we can work with great partners, such as StackIt and VAST, to provide value to customers. That’s what open means. We embrace open standards. We’re not held hostage by budgets or individual vendors; we have choice. And that choice means you get to leverage the full ecosystem. Now, we talked about proprietary systems. 

Operational Bottlenecks

Now, let’s hit on operational bottlenecks. It’s, of course, a very common problem among our customers, one we’ve attempted to solve and have solved for many of them. Now, this exists even in the modern data stack. When you look at these organizations, everyone, as I said, wants to be a data citizen. Every department wants to leverage AI in what they’re doing. But it all has to work together. Critically important. And when you look at what’s happening today, most of these organizations are leveraging modern BI solutions, certainly, but they’re going through central teams. The central teams are busy dealing with the multitude of requests that come in. And frankly, if we want everyone to be a data citizen, those requests are just going to increase exponentially. In fact, one retail customer I spoke with spends anywhere from 30 to 90 days making changes to their ETL pipelines. So imagine asking a question and waiting 90 days. Now, maybe the CEO gets it in three hours, but I’m pretty sure that retail analyst doesn’t get it in three hours. How do we make this easier for them? This customer went from that kind of timeline to a matter of days with Dremio. And we believe, in the course of time, we’ll get to a matter of hours. So that data analyst can move at the velocity of the questions and insights they need for their business. Cloud services and CDWs are great, but they don’t really cut the wait time that is the reality for the data analyst in a modern enterprise. 

But why is that the case? Well, look at the actual access pattern. Frankly, if I had presented this slide 25 years ago, it would be very similar. There might be some different technologies. You might have used Informatica versus Spark. You might have used network-attached storage versus object storage. But the same pattern existed. And while the individual technologies have gotten better and more robust, we still have this pattern, where the business user and the business unit are separated from the central team. The central team has to deal with multiple requests. So how do we break this logjam and make both parties more efficient and more effective? 

Shifting Left

So at Dremio, we believe in this concept of shifting left. What does that mean? It means enabling that data analyst to move earlier in their data pipeline and be able to self-serve their needs on their own. This, we believe, is the foundation to democratize data analytics. I know it’s a buzzword we hear a lot, but this is what it really means. It means getting that data analyst to be able to do a lot more than simply visualization-level changes. What does this require? It requires what Dremio delivers. Number one, unified analytics: the ability to query data on the lake, but also relational data stores. An intuitive self-service user experience that we are continuing to build upon with our generative AI capabilities, which we’ll talk about later today. And of course, virtual data products. Too often, people think of virtual data products as simply, oh, it’s a view on top of my data. Sure, that’s true. But what we really want is a virtual data product that can reach into the pipeline. Because then that data analyst knows not only the pipeline, but how they got to their product. That end-to-end view is really what customers are seeking. Because I’ll tell you, the minute I see data, my question is going to be: how did you get to that value? You always ask that question. Well, now the data analyst can answer it and not have to rely on a central team. We democratize that velocity. That’s about empowerment for the domain experts in these organizations. 

How do we make this possible? Of course, we can query data in data lakes. We’ve talked a lot about data lakehouses and Iceberg. But we also can query into databases. As you probably all realize, and live with daily, you don’t have any single database. You have lots of databases. And I suspect if I threw out a bunch of names, you’d just keep nodding your head. That’s the reality of the world we live in. Why? Because these technologies are great for specific things. And people have learned, and developers, and internal teams, and vendors have started to use these technologies. Postgres, SQL Server, Sybase, pick your list. They all exist out there. And there’s no end state where all that lives in one place. We’re not naive enough to believe that. We believe it will need to be distributed in some fashion. So Dremio has built a semantic layer that sits on top of all that. And this unique combination of the semantic layer and the engine is what allows us to deliver velocity and scale. It allows clients like BI solutions, custom applications, and data science tools to not have to know where that data lives. They can simply leverage that semantic layer and do what they need to do at the velocity of their business. Users can discover and self-serve their needs. Now, self-service can also be aided by generative AI. Of course, that’s all the buzz and rage nowadays. And generative AI truly delivers some great advances in productivity for that data analyst. We’ve already released our text-to-SQL capabilities. We’ve also added automatic wiki generation and automatic data labeling, inclusive of PII information you might have in your data. That’s something that we’re going to talk about later today and in tomorrow’s keynote as well. 
Self-service, federation, and a semantic layer all come together and are enabled even better with the generative AI capabilities to drive greater productivity for our data analysts. 
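To make the federation idea concrete, here is a toy sketch in Python using SQLite. It is illustrative only, not Dremio’s actual engine: two separate SQLite databases stand in for two physical sources (say, a Postgres instance and a lake table), a federated query joins them, and a small wrapper plays the role of the semantic layer, so clients ask for a logical name and never a physical location. All table and column names are invented for the example.

```python
import sqlite3

# Two independent stores standing in for two physical sources.
db = sqlite3.connect(":memory:")
db.execute("ATTACH DATABASE ':memory:' AS crm")  # a second, separate database

db.execute("CREATE TABLE main.orders (customer_id INT, amount REAL)")
db.execute("CREATE TABLE crm.customers (customer_id INT, full_name TEXT)")
db.executemany("INSERT INTO main.orders VALUES (?, ?)",
               [(1, 120.0), (2, 75.5), (1, 30.0)])
db.executemany("INSERT INTO crm.customers VALUES (?, ?)",
               [(1, "Ada"), (2, "Grace")])

# The "semantic layer": logical names mapped to federated queries that
# join across both physical stores.
SEMANTIC_LAYER = {
    "customer_spend": """
        SELECT c.full_name, SUM(o.amount) AS total_spent
        FROM crm.customers c JOIN main.orders o ON c.customer_id = o.customer_id
        GROUP BY c.full_name ORDER BY total_spent DESC
    """
}

def query(logical_name: str):
    # A BI tool or notebook only ever supplies the logical name; it never
    # knows which database each table physically lives in.
    return db.execute(SEMANTIC_LAYER[logical_name]).fetchall()

print(query("customer_spend"))  # → [('Ada', 150.0), ('Grace', 75.5)]
```

The point of the sketch is the indirection: if the orders table moved to a different store tomorrow, only the mapping would change, not the client.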

We’ve got all those capabilities. And remember, it’s not just about creating that view. It’s that end-to-end pipeline that we want to enable. When I talk to customers, 90% of their Spark or ETL work is pretty simple stuff that can be done in SQL. So how do we empower that data analyst to leverage SQL, do what they need to do, and not rely on specialized resources? I’m excited to have Alex Merced share some of this in Tomer’s keynote later today. So hang tight for that. That’ll be some very interesting and cool data operations capabilities via self-service. 

Reflection Management Today

Now, what does this empower? It empowers much more flexibility. As this visual shows, one of the benefits of our technology is that with self-service, you don’t need to do as many Tableau extracts or Power BI extracts. But often, that flexibility can come at a price. You give all that self-service to the user, and guess what? They’re running massive queries. Your bills are going crazy. Performance slows down. We get it: self-service alone is not nirvana. Self-service has to be combined with a platform that can enable it within the guardrails of your organization, certainly security and governance, but also performance. One of the ways Dremio delivers performance is obviously with our core engine, but also with a capability we call reflections. Now, a reflection is not a materialized view. A lot of people make that analogy, but in truth, it’s very different. Yes, there are summaries of data, but there are also individual column summaries. We take different tiers of summarization, which, yes, bear some resemblance to a materialized view. But the real difference is that we intelligently rewrite your query at query time to leverage whichever summarization makes sense. That’s only possible because we have that semantic information. So imagine: with a traditional materialized view, you’re hard-coupling the user query and the data. With Dremio, you get to break that apart. Why is that important? Well, in a large organization, the last thing you want to do when you see performance issues is go tell 1,000 users, hey, go use this view over here. Good luck with that change management. It’s hard. With Dremio, you don’t need to worry about that. Reflections allow you to seamlessly accelerate these queries and decouple the two. That’s the query intelligence of Dremio. 
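The rewrite idea can be sketched in a few lines of Python with SQLite. This is a toy model, not Dremio internals: a maintained summary table plays the reflection, and a crude "planner" step transparently redirects a matching aggregate query to it. The user’s SQL never changes, which is exactly the decoupling described above.

```python
import sqlite3

# Raw data: the "hundreds of billions of rows" stand-in.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (region TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("east", 10.0), ("east", 20.0), ("west", 5.0)])

# The "reflection": a pre-aggregated summary keyed by the dimension it covers.
db.execute("""CREATE TABLE refl_orders_by_region AS
              SELECT region, SUM(amount) AS amount
              FROM orders GROUP BY region""")

def run(sql: str):
    # Crude stand-in for query rewrite: an additive aggregate grouped by a
    # covered dimension can be answered from the summary instead of raw rows.
    if "SUM(amount)" in sql and "GROUP BY region" in sql:
        sql = sql.replace("FROM orders", "FROM refl_orders_by_region")
    return db.execute(sql).fetchall()

# Same user query as always, silently served from the summary.
print(run("SELECT region, SUM(amount) FROM orders GROUP BY region"))
# → [('east', 30.0), ('west', 5.0)]
```

In a real planner the matching is done on the query plan with semantic information, not string matching, but the shape is the same: the materialization is an internal detail, so nobody has to tell 1,000 users to point at a different view.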

And so this is an area where we’re continuing to advance. Historically, when you want a reflection, you have to identify the slow queries, use the Reflections UI to create the reflection, manually maintain it, and, of course, work through the cost-benefit analysis. Does it make sense to summarize this data? Should I summarize it? What’s the benefit for my users? So it’s manual work today. That’s why we’re excited to announce Reflections Recommendations and Reflections Observability. [APPLAUSE] What is Reflections Recommendations? It analyzes your query patterns and automatically tells you which accelerations to trigger. That’s self-service capability: the system says, hey, analyst, you’re doing a lot of things in this area. Here’s a recommendation to accelerate your queries based on the query patterns being run against the system. Reflections Observability tells you which reflections are actively being used. Now you know, cost-benefit-wise, which reflections to keep and which ones you can turn off. Those are just two of today’s announcements. Coming soon will be a capability called Live Reflections, where as data comes in, it will automatically update these summaries, so there’s no disconnect between a summary and the raw data. Don’t take my word for it, though. I’m excited to bring on stage Isha Sharma, product manager here at Dremio. Isha.

Self-Service Reflections

Isha Sharma:

Thank you, Sendur. As someone at Dremio who creates her own critical dashboards, I’m super excited to talk to you today about our self-service reflections. So let’s get to it. In this scenario, we’re retail company X. And we’ve got a variety of products in our inventory at different locations across several different categories. Our goal today is to create a dashboard of our top 10 customers. The challenge here is that the data sets containing information about products, orders, and our customers run to hundreds of billions of records, which means calculations and aggregations on top of them are going to be slow. But this is a critical dashboard. I need sub-second performance every time. So to address this, we’ll go through three main steps today: creating a data product, creating a dashboard, and then meeting our SLAs using Dremio’s acceleration technology. Let’s start with creating a data product. And to do that, we’ll first find, query, and validate our data. 

So we’ve got our catalog with use case-based data products. Going into our customer retention data product, we’ve got a virtual data pipeline with raw, curated, and production layers that you can think of as bronze, silver, and gold. Production is what our dashboards and reports run off of. Our customers, orders, and products views are the dimensions of customer retention. There’s also a customer retention report, which looks interesting, but I don’t know a ton about it. I don’t want to ingest new data for this case, so I’m hoping to use what’s here. So to find out more about our customer retention report, let’s go ahead and query it with our intuitive SQL runner. We’ll just get a sample of what’s in here real quick so we can determine if this is the right thing to use. By the columns that are coming back, we see that we have customer information. But do we have order information, which is what we’ll need? We do. We have order and customer information, which will give us our top 10 customers. So let’s switch over to this pre-written query that I have to take the customer retention report and derive our top 10 customers from it. 

We’ll go ahead and save this as a view in our production layer, which is what our dashboard has access to. And we’ll call this top 10. That’s how easy it is to add a view to your data product. And now, before we move on to the next step, I want to make sure that my team has a good understanding of this view when they come across it. So we’ll do a quick labeling and wiki generation on this. With Dremio’s LLM-powered capabilities, you can quickly generate consistent labels across the data sets in your semantic layer. And let’s do the same for the description of the view, as well as the column descriptions. And there you have it. We’ve documented our view. 
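The save-a-view step above can be sketched with SQLite standing in for Dremio’s SQL runner. The table and column names here are illustrative, not the actual schema from the demo; the point is that the metric is defined once as a view and every client tool reuses that single definition.

```python
import sqlite3

# Stand-in for the customer retention report; 20 small rows instead of
# hundreds of billions, purely for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer_retention_report (full_name TEXT, order_total REAL)")
db.executemany("INSERT INTO customer_retention_report VALUES (?, ?)",
               [(f"customer_{i:02d}", float(i)) for i in range(1, 21)])

# The view saved into the production layer: business logic lives here,
# not in each dashboard.
db.execute("""
    CREATE VIEW top10 AS
    SELECT full_name, SUM(order_total) AS total_spent
    FROM customer_retention_report
    GROUP BY full_name
    ORDER BY total_spent DESC
    LIMIT 10
""")

rows = db.execute("SELECT full_name, total_spent FROM top10").fetchall()
print(len(rows))  # → 10
```

Any dashboard that selects from the view gets the same ten customers, which is what keeps the metric consistent across analysts and tools.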

Creating a Visualization in Your Dashboard

So that’s how easy it is to create a view in your data product using Dremio. Now, let’s move on to creating a visualization in our dashboard. And let me show you how to do that. So here, we’ll add our production layer. And then we’ll add our top 10 view that we just created. Had we wanted to do the business logic in the top 10 view in the client tool itself, it would have been slow. And everybody who wanted to recreate that same metric in their client tool would have to recreate that logic. Here, with the top 10 view, we’ve done it once. And it’s available to all analysts and data scientists via the semantic layer. So now that we have that, let’s go ahead and create this visualization. And we’ll call this Top 10 Customers. And we’ll add the full name of our customers. As you can see, this has taken a while. Hundreds of billions of records, so it’s expected. So let’s switch over to a visualization that I’ve pre-created. And here, we have our top 10 customers and their total spent. 

All right, so we have a visualization. That’s great. But 25 seconds to load it isn’t going to cut it. So let’s move on to the next step of accelerating the performance on this query by using Dremio’s acceleration technology called Reflections. And back in the Dremio console, I’ll go ahead and check out the reflection recommendations that have been generated based on the queries that we’ve run. I’ve got three of them. And not only am I getting cost-based recommendations, but I’m also getting data points to inform how impactful the recommendations are. Now, these recommendations are for raw reflections, which store row-level data in an optimized form for scans. And based on the query speedup time, what we’ll do here is accept the customer retention report recommendation as well as the customers view reflection. While these raw reflections are being created, I’m going to come back to our top 10 view and create an aggregation reflection so that when the top 10 view is queried, Dremio will rewrite the query with pre-aggregated statistics that the reflection has collected based on the dimensions and the measures that are here. 

The recommender hasn’t given us this reflection yet because we just haven’t run enough queries. So for this aggregation reflection, I’ll go ahead and make it specific to the two columns that we’re using in our visualization so that it remains super optimized. And we’ll go ahead and do that and hit Save. And then coming back to our Reflections page, you’ll see that the aggregation reflection is also being created now. From an observability standpoint, we’ve got some great metrics here. You can find out how often the reflection is matched, how often it’s being used in an acceleration, as well as the current footprint to make informed decisions about the set of reflections that are effective and optimized. So now, as soon as this completes– there I go– it’s completed creation. And here we can see the visualization is coming back in an instant. 
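A recommender like the one used in this demo might, at its simplest, tally repeated query shapes and suggest a reflection once a shape shows up often enough to pay for itself. The following is a hypothetical sketch of that idea, not Dremio’s actual recommendation logic; the query-log format and threshold are invented for the example.

```python
from collections import Counter

# A toy query log: each entry records which table was aggregated and
# which columns it was grouped by.
query_log = [
    ("orders", ("region",)),
    ("orders", ("region",)),
    ("orders", ("region",)),
    ("customers", ("country",)),
]

def recommend(log, threshold=3):
    # Count identical (table, group-by dimensions) shapes; any shape seen
    # at least `threshold` times is a candidate aggregation reflection.
    shapes = Counter(log)
    return [{"table": table, "dimensions": dims, "hits": n}
            for (table, dims), n in shapes.items() if n >= threshold]

print(recommend(query_log))
# → [{'table': 'orders', 'dimensions': ('region',), 'hits': 3}]
```

A production recommender would weigh estimated speedup against maintenance cost, which is exactly the cost-benefit data the observability metrics above surface.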

So we went from 25 seconds to sub-second to process hundreds of billions of records. With Reflection Recommendations and Observability, we’ve trimmed down the number of steps that it takes to create and manage reflections significantly. And with that, just a few clicks, you can easily get consistent, accelerated performance on all the applications and tools that are leveraging data products in your semantic layer. And with that, back to Sendur.

Sendur Sellakumar:

Thank you, Isha. Notice, Isha did not have to go back to a central team. She was able to self-serve her needs with regard to queries, analytics, and performance. That’s the power of self-service, and why you need a platform that can enable that for your data analysts. We’ve been working with customers to realize this vision. And it means we can change the physics of analytics for an organization. Now, speaking of customers who are changing the physics of analytics for their organization, I want to welcome a customer onto the stage who has truly distributed data ownership and domain ownership of their data. Please welcome–

Becoming Data Driven

Karl Smolka:

So thank you, Sendur. And good morning, everyone. Before I get going, I just want to say a big thank you to Sendur and the wider Dremio team, not just for having us today, but for the wider partnership. But today, I want to talk about a journey that probably most of us in this room are on. And that’s getting to the point where our organizations are treating data as a strategic asset. And I think probably most of you agree, or maybe it’s just me, that getting there is actually quite difficult. So today, I’m going to talk a little bit about the journey that we’ve been on, how Dremio has been helping us, and just some insight into that. But a quick personal introduction: my name’s Karl Smolka. I’m an MD at TD Securities. And I look after the data platform and the technology side of things. 

OK, so hopefully, this big green machine logo here is fairly recognizable for most of you in the room. In fact, there’s one right out of that window just there. So we have a pretty strong presence down the eastern seaboard. And it’s continuing to get stronger. But for those of you who don’t know us, we’re the capital markets, or investment, arm of the Canadian retail bank from Toronto. We’re number two in Canada. And we’re known as a very safe pair of hands. That’s the organization. But what is our mission? Our mission is pretty simple. We want to be a top 10 investment bank with global reach. And to get there, we know, collectively as an organization, that if we’re going to stand any chance of making that top 10 spot, a modern and effective data strategy is not a nice-to-have. It’s an absolute necessity. And we can probably take that statement a little bit further and say that for any large organization, if you want to be around in 10 years’ time, you’re probably going to have to be data driven. 

OK, so you’re probably asking, well, is this new? And the answer is no, not really. Capital markets and investment banking, like most of your organizations, have always been about data. So there’s nothing new here. But a couple of things, I think, have changed. The first is that recent advancements in AI have obviously shined a light on how important data is. We know that data is the Achilles’ heel of AI. And if you want to be good at AI, you have to be good at data. Along with that, those pesky regulators want more and more information from us. And they want to know that our processes and our data can be trusted as well. So those are the two things that I think are really changing the landscape in which we operate. As an organization, we’ve reached this inflection point. And we understand that we need to get good at this, because the evidence is clear that organizations that are data driven are more successful. 

Data Strategy Components

OK, before we get geeky, we’re going to feed you a little bit of data strategy. Because one of the most common questions I get, even as a technologist, is, Karl, what is your data strategy? And whilst as a technologist I don’t have sole responsibility for the strategy, it’s a question where I think people often don’t really know what to expect in return. I’m not going to dwell on this today. But the point I’m making here is that data strategy, and becoming data driven, is not really just about unwrapping a piece of software. A good data strategy really starts with people and processes. You have to have your governance, your rules, your regs, and your guardrails in place. And that’s all part of the journey. And Dremio very much will help you with that, for any prospective users out there. But tech alone is not enough to liberate your data. 

But what is our strategy? It’s pretty difficult to get a whole data strategy into one slide, but here’s my attempt. First is accessibility. Within any organization, you’re going to have data scattered around your ecosystem. And often, getting hold of it can be difficult. It requires numerous conversations. So I need to get people access. But once that’s solved, we need to make sure that the data can be trusted. There’s a lot of data out there. There’s nothing worse than putting a report in front of a sales team or a trader, only for them to say immediately, that’s not right. It just undermines all credibility of the data. And lastly, we need to give users across the organization the tools they need to actually make data-driven decisions with minimal tech involvement. 

Self-Service Data Experience with Dremio

So let’s talk a little bit about Dremio and how they’re helping us along this journey. And to do that, let’s walk through a hypothetical data citizen within TD Securities who’s perhaps new to the organization. Let’s talk about the steps they need to go through to get to a decision. The first step is data sourcing. Often, users have to go on what I would call data fishing expeditions. They have to call around and ask tens of people: what data is there? Where is it? Is it reliable? Who looks after it? And can I trust it? That’s a lengthy process. It requires numerous painful conversations. The second phase is analysis. Once you do eventually get hold of that data, I can tell you I’ve lost count of the number of times that people have come to me and said, Karl, can I borrow a developer for a couple of hours? I’ve got some data. It’s just too damn big. And data, we know, is only getting bigger. That really speaks to the third point, efficiency. I think we know that if any organization wants to scale this practice, we can’t have IT on the critical path for every single data decision. Obviously, Sendur talked a lot about that earlier today. And ultimately, in the investigation phase, we need to provide people with the right tools that they can actually use. And they need to be literate in those tools so they’re able to do this themselves and ultimately derive some of that value. 

So how is Dremio helping? So in the sourcing layer, we use Dremio very much as a data fabric, that sort of unified data access layer across our disparate sources. And that’s a natural place for us to leverage their search, discovery, and data management capabilities. And what that allows is that allows users to really quickly ascertain, with minimal tech involvement, what data is out there. In the analysis phase, for the first time having this data fabric, users are now able to run SQL workloads in the cloud at scale without fear of blowing up the underlying operational system. That’s a big step forward. The workload management and the reflection features are a big enabler for us. 

And that really speaks to the efficiency. Users are now able to get to this point via self-service. IT is not the critical path here. Self-service, as you’ve seen today, is very much in the DNA of Dremio. Which then leads us on to the investigation phase. SQL is highly pervasive. And what we’re seeing across the organization is that it’s no longer just the language of developers. Traders, everyone in the organization, at least knows a little bit of SQL, or should. And ultimately, that’s getting us to value in days rather than weeks or months. 

OK, so we are not all-in on Dremio. And that’s OK. We use other technologies. And we have a large technology estate. And that’s fine, because central to Dremio’s working mandate, as you’ve heard, is their adoption of open standards, which allows us to pick and choose where we use Dremio at the right time. We’re very much using Dremio for the data acquisition features. And the virtualization capabilities are a huge enabler for us, providing access to those data silos across our organization. We’re also leveraging the SQL workloads in the cloud and some of the data management capabilities. But as a tech leader, what this gives me is reassurance that I’m not going to get locked into a closed ecosystem that I’m unable to pivot away from in the future, because we know the tech landscape is still evolving. 

Lessons Learned

So maybe to finish up, just a few lessons learned from someone in tech at a large organization who’s been on this journey for probably a couple of years now. And the first thing that took me by surprise is data literacy. Data literacy is hard. A few years ago, when we presented our pitch for a data strategy to our chief operating officer, the first thing he said to me was, this is a bit esoteric. And he was right. Explaining data products to a trader can sometimes take a lot of repetition. And that’s something you have to consider on your journey. The other thing is that, as leaders, we have to quantify the value of a data strategy. We all know we need it, but how do you quantify it? Sometimes it’s not that easy. It’s a little bit abstract at times. And to that extent, that’s why you need top-level executive buy-in. And I’m not talking just about technology executives. Everybody in the organization needs to understand how much of a priority this is. I also want to say, look, the technology landscape is still evolving. Huge things are happening in the industry right now. It’s not going to be the same place in two years’ time. And maybe to finish off, getting on that journey to data as a strategic asset is something that’s very easy to agree we need to do. But in reality, it’s actually quite hard to execute on. I just want to say, again, thank you to Dremio for helping us make the inroads that we are making today. Thank you.

Sendur Sellakumar:

Thank you, Karl. That’s a great example of a customer leveraging Dremio to break that data logjam, enabling users to shift left and move at the pace of their business. The last thing a trader wants to do is go back to a central team for data access. They want to move quickly to make that next trade as fast as possible. As I always say, time is money, right? At the end of the day. So unified analytics, virtual data products, and generative AI all enable open, flexible, and self-service access to your data. 

Cost of Data

We’ve talked about open and flexible. We’ve talked about self-service. But now let’s talk about cost. That’s a big thing. As Karl alluded to, there are going to be more and more data citizens at TD. They’re going to run more queries. Those queries are going to generate more costs for them. So how do they make it efficient and easy? The last thing you want is your budget for analytics growing at the same rate as your query volume. That’s a little crazy. And today, because systems are not open and flexible, you’re sort of held hostage, having to go through one system to access your data. We want to eliminate that query tax. Now, when you look at budgets, they’re growing anywhere from 4% to 8% across our customer base and prospects. But data volumes are growing at double digits. And that will continue for years to come. Those data volumes mean that even if query volumes don’t grow, you’re now scanning a lot more data than you did before. So at the end of the day, no matter how you slice it, the computational need, the query load, will just continue to increase. 

Now, how do we make this work? How do we close this disconnect that’s occurring in organizations? It starts with people, process, and certainly software. Because you have an open system, because you can eliminate some of the operational bottlenecks, you can deliver great savings automatically. You can get access to data much faster. You can eliminate copies of data through this concept of virtual data products. But what about price performance? That’s also important. Dremio is maniacally focused on price performance. We want to be world class at this. We continue to put effort into this area. And if you look at the data, being a data company, you can see the performance improvements we’ve made over the last several years. Nearly 2 and 1/2 times better performance. Why is this important? Again, your budgets aren’t growing like crazy. Your query volumes are. So how do we make it easy for you, so that analyst doesn’t have to think twice about, the minute I hit that query, is Karl going to ask me how much budget did I just spend? How do we make it easy? How do we reduce cost? How do we reduce time to insight? This naturally translates into less hardware for you all to run your query volumes. 

Dremio on Graviton

But I’m especially excited to announce support for AWS Graviton. [APPLAUSE] Now, why is this important? Back to the price performance story. This graphic shows a move from one EC2 instance type to another. And what you can see is an out-of-the-box 30% improvement in performance. It’s pretty magical. Literally, tomorrow, point your engines at a different node type. You’ll probably spend two or three minutes doing that, and you’re getting a 30% price performance improvement. That’s pretty powerful. Imagine going to your finance partner and saying, I just cut my hardware budget by 30%. We’re excited to partner with Amazon on this. It’s just an example of the type of improvement we can deliver with Dremio. And the beauty of it is that you don’t need to wait for a vendor to change their underlying hardware. You just get that benefit automatically. I also want to share some benchmarks. Now, I’ll be the first to admit, benchmarks are not everything. But they’re a data point, a way to keep score to a degree. This is a one-terabyte TPC-DS benchmark. And when you look at the data, you can see that with Dremio, you’re getting a 70% performance improvement relative to Snowflake, and over 100% relative to Trino. Why is this important? Well, it means hardware savings. It means a fraction of the infrastructure you need. Now, benchmarks aren’t everything, as I said when I started this conversation. Those benchmarks have to tie to what customers are telling us. And this is what customers tell us. These are real savings from customers. We’ve left the names off to protect the innocent. But what I can tell you is they tell us, I was working with this vendor, and I came to you, and I delivered seven-figure savings. Seven-figure savings. Greater than 50% of their budget. I can tell you, this gets me up in the morning. When we see these sorts of savings for customers, we know we can deliver a ton of value. 
This is powerful. So yes, the benchmarks are nice. But customers telling you that we delivered outcomes for them, that’s where I started the conversation today. Customer outcomes. This is a customer outcome.
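The way these gains compound into hardware savings can be shown with some back-of-the-envelope arithmetic. The workload and throughput numbers below are made up for the example; only the ~2.5x engine improvement and ~30% Graviton gain come from the talk.

```python
import math

# Illustrative sizing only: a fixed query workload served by identical nodes.
workload_units = 1000        # total query workload, arbitrary units
base_node_throughput = 10    # units per node, before any improvements

def nodes_needed(throughput_per_node):
    # Round up: you can't provision a fraction of a node.
    return math.ceil(workload_units / throughput_per_node)

baseline = nodes_needed(base_node_throughput)                    # no gains
with_engine = nodes_needed(base_node_throughput * 2.5)           # ~2.5x engine
with_graviton = nodes_needed(base_node_throughput * 2.5 * 1.3)   # plus ~30%

print(baseline, with_engine, with_graviton)  # → 100 40 31
```

The point is that the gains multiply: under these assumed numbers, the same workload goes from 100 nodes to 31, which is the kind of arithmetic behind the seven-figure savings quoted above.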

Now, truth be told, are we the end-all, be-all for all analytics at your organization? No. But if you rely on open, flexible formats, and you rely on something like Dremio to give you cost and price performance improvements, in addition to all the acceleration and self-service technologies, you can pick which solution and platform you need for which outcome. And certainly for SQL, we intend to be that platform. Here, analytics does not hold your budget hostage. High costs are clearly preventing data adoption. But with Dremio, we’re lowering costs day in and day out, increasing velocity for your users, lowering the need for specialized experts, and passing those savings through to you. As I said, we’re not hiding. In our cloud offering, we don’t hide what infrastructure we run under the covers. You get the benefit of that.