March 1, 2023

11:00 am - 12:30 pm EST

The Year of the Data Lakehouse

Data and Analytics organizations have worked to balance improving access and self-service for the business with achieving security and governance at scale. Data lakehouses are ideally suited to help organizations provide both agility and governance. Deepika Duggirala, SVP Global Technology Platforms, at TransUnion and Tamas Kerekjarto, Head of Engineering, Renewables and Energy Solutions at Shell, will share their journeys to deliver governed self-service.

Apache Iceberg development and adoption accelerated significantly this year, enabling modern data lakes to deliver data warehouse functionality and become data lakehouses. With Apache Iceberg, the industry has consolidated around a vendor-agnostic table format, and innovation from tech companies (Apple, Netflix, etc.) and service providers (AWS, Dremio, GCP, Snowflake, etc.) is creating a world in which data and compute are independent. In this new world, companies can enjoy advancements in data processing, thanks to engine freedom, and data management, thanks to new paradigms such as Data-as-Code. Tomer Shiran, CPO and Founder of Dremio, will deliver Subsurface’s keynote address.

Topics Covered

Enterprises
Iceberg
Keynotes
Open Source
Real-world implementation

Sign up to watch all Subsurface 2023 sessions

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Tomer Shiran:

All right. It's exciting to be here. Exciting to have this first Subsurface in person in three different locations. I think we brought good weather to all three from what I checked this morning, and thousands joining us through the streaming interface as well. So welcome everybody to Subsurface this year. I'm going to talk about lakehouses and data meshes. And I want to start just by sharing that I've been on the road a lot in the last two months. In fact, I just came back from a trip to the UK and France and had the opportunity to meet lots of different companies, lots of different Dremio customers, and others. Folks that are dealing with a lot of data. And what I've seen repeatedly through all these customer meetings and all these interactions is that we're all living in the same world now.

Most companies are living in a world where they have these two opposing forces that are conflicting with each other. On the one hand, the need of the business to have faster access to data, to have speed and agility and really wanting to get things done faster. And then on the other side, you have the need for data governance and security. And that’s typically driven by the central IT teams. And of course, you need both of these things to happen in order to be successful and not take crazy risks. These two forces, of course, are in conflict with each other and it’s very difficult for many companies to deal with this. Now, fortunately over the last few years we’ve seen the rise of this new idea. This new data architecture is called a data mesh.

Data Mesh

And many of you have probably heard of this. Really, there are four principles involved in this architecture. One, managing data as a product. Two, having a self-service approach to accessing that data. Three, having domain ownership, where different groups own their own data and manage it and test it and so forth. And then finally, having federated data governance, where different groups are responsible for their own governance. And that's federated, as opposed to having one team that does everything. So, I'm not going to get into many more details about the idea and the principles of a data mesh. Fortunately, we have some amazing talks here at Subsurface over the next two days. In fact, we have the authors of both of these data mesh books. Tomorrow at 8:00 AM, we have a panel about data mesh with Zhamak participating on that panel.

And then we have the author of Data Mesh in Action tomorrow at 10:10 AM Pacific time. So do the math and I’m excited to have these folks presenting and other talks about data mesh implementations as well. And so, if you’re interested in this concept and how this can help, definitely tune into some of these talks. I want to focus today on what you need from a data platform in order to implement a data mesh. So, a little bit more about the technology, talk about the requirements for that. Some of the things we’ve done at Dremio specifically around enabling data mesh. And then also, invite to the stage a few people to share real world experience and how they’ve implemented data mesh in their own companies and really driving changes in the world using data.

Platform Requirements For Data Mesh

All right. So when we think about the data mesh and the three requirements for data mesh, really we think about three things. One, self-service, two, performance, and three, data governance. I’m going to start by talking about self-service. So, self-service really requires three different things. It requires, one, the ability of course, to be able to use any tool and to connect to any data. And that’s really important because we have many different tools within organizations and lots of different data. The second thing is being able to have an abstraction, right? The ability to have a semantic layer that’s governed and makes this data consistent and accessible to everybody. And then finally, of course, it has to be easy. We have to have a very simple interface for people that are technical and people that are less technical. So let’s talk about these points.

One of the things we did at Dremio, we worked very hard to integrate with a variety of different client applications and a variety of different data sources. And we think about data sources. Of course, today, many organizations put a large portion of their data in data lakes, right? And things like S3 and Azure storage and GCS on Google, right? But organizations have lots of data in other places. They have data in databases and data warehouses, sometimes hybrid environments where they have data on-prem, right? And maybe S3 compatible storage. And organizations, especially large organizations, have all of these things. They don’t just have one data source. When you think about client applications, BI tools, data science applications, things like that. Again, a large organization has all of these. When I ask companies that I talk to, “Okay, which BI tool do you use?”

The most common answer is all of them, right? That’s typically what I hear. And so being able to provide self-service means enabling all the different constituents in the company to access data, whatever their favorite tool is. And of course, being able to access a large amount of the data that the company has. But of course, that isn’t so simple because we know that the data in these sources is stored in different ways, different schemas, it’s not consistent. And you have to organize that data and make it more approachable to a business user or an analyst. And so having a semantic layer allows you to do that. It allows you to basically define the metrics, the dimensions, the KPIs that people are going to consume in a consistent way that will work across all the different data sources.
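
For illustration, here is a minimal sketch of what one shared semantic-layer definition might look like in SQL. The source, view, and column names (lake.orders, warehouse.customers, sales.monthly_revenue) are hypothetical; the point is that the metric is defined once and every tool queries the same view.

-- Hypothetical semantic-layer view: define the revenue metric once so that
-- Tableau, Power BI, and notebooks all see the same numbers.
CREATE VIEW sales.monthly_revenue AS
SELECT
  DATE_TRUNC('MONTH', o.order_date) AS order_month,
  c.region,
  SUM(o.amount) AS gross_revenue,
  COUNT(DISTINCT o.customer_id) AS active_customers
FROM lake.orders o
JOIN warehouse.customers c ON c.customer_id = o.customer_id
GROUP BY DATE_TRUNC('MONTH', o.order_date), c.region;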

But not only that, it’s really important for the semantic layer to not be part of the BI tool, because historically that’s where the semantic layer was. It was part of a specific BI tool. And when it’s part of a specific BI tool, that means that the person in the other room that’s using a different tool can’t use the same definitions, the same metrics, the same dimensions, and they get inconsistent results. And so that doesn’t work. And what’s also really important when you think about data mesh is that you can organize the semantic layer not as just one monolithic thing, but in different domains, right? You can support different departments or different applications in different use cases. In this example sales, marketing, supply chain, etc, right? But that’s not enough, because if you just create a semantic layer like this, then you have performance problems, right?

We all know that the speed that you can get from some of these data sources isn’t enough, right? Especially if you’re talking about tables that have billions of records or maybe many millions of records, the results are just not going to come back fast enough. And the more logic you add to the semantic layer, the slower things might be. So this is great in theory, you can connect to any data, any tool, define perfectly clean semantics in the middle, but without being able to accelerate those queries, that’s not going to work. And so, one of the things that we created at Dremio is a technology called Data Reflections, which basically materializes data automatically in different shapes and forms. So aggregations, sorts, different partitioning schemes, and uses those reflections automatically when queries come in from these different tools to make sure things are fast.
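
As a rough sketch of how a Reflection might be declared on a dataset (the dataset and column names are made up, and the exact Dremio DDL can vary by version):

-- Ask the engine to maintain a pre-aggregated materialization; queries against
-- the semantic-layer views are rewritten to use it automatically, so users
-- never reference the Reflection itself.
ALTER DATASET sales.orders_enriched
  CREATE AGGREGATE REFLECTION per_month_region
  USING DIMENSIONS (order_month, region)
        MEASURES (amount (SUM, COUNT));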

And that allows you to get sub second response times and it allows you to get the performance that you would get if you really physically optimize the data, right? And so you have this clean environment, users see the abstraction layer. Things are fast, right? That is really what enables this kind of governed domain based semantic layer to work. And like I said, that’s not all. You need to have a simple interface for working with data products. And so we’ve created an interface that makes it very easy to create data products, whether you’re a SQL user, somebody writes SQL by hand and we’ve added things like autocomplete recently and built in documentation to see exactly what functions exist, all sorts of new enhancements to the interface here. And we’ve also made it very easy if you’re a non SQL user to point and click and do various transformations like you would on a spreadsheet, right?

Regardless of how big the data is. And of course, for most companies, it's not one person, unless it's a small startup. It's not one person that's working on these things, it's many people. And so it's important to have the ability to collaborate and to share one of these data products that you've created, whether it's a view or maybe a collection of queries, to share that with somebody else and to control exactly what that other person or that other group of users is able to do. And then finally, we talked about having to support all these different tools and provide really nice integration. We've worked very hard with a variety of different BI technologies and data science applications to provide things like Single Sign-On and very clean authentication. And even within the Dremio user interface, you can click a button and bring up something like Tableau, or bring up Power BI with a live connection already pre-created. So we've made it very easy to use any of these BI tools regardless of where you're starting in your process.

Performance

So, like I said, self-service is very important, but without performance, there is no self-service. We've all seen this before. We've created these architectures, we've deployed systems. Things are too slow and the users don't want to use it, right? So we know that self-service just doesn't exist unless things are fast. And so performance is really critical, and we've done a lot to make performance fast. Three key things I want to talk about: first, a project called Apache Arrow, which we created several years ago to provide raw query performance; second, query acceleration, to eliminate the need for Tableau extracts and Power BI imports, cubes and all those kinds of things; and then an architecture that scales infinitely to allow you to have as much concurrency as you need.

So let's talk about this for a second. Apache Arrow is an open source project that we created several years ago and the idea was to provide a columnar in-memory format. So we looked at Dremio's in-memory format, and we said, "Well, wouldn't it be nice if we could have a standard in the industry, just like we have columnar formats on disk, to have a columnar format in memory that can be used?" And so we open sourced our in-memory technology. And fast forward to today: in the last year, Apache Arrow has been downloaded over 700 million times. Almost every data scientist today uses Apache Arrow. There's a project called PyArrow. You import it, and you can start using Arrow.

Many new systems are being built on Apache Arrow, and the reason for that is that it's very fast. It allows you to really take advantage of modern Intel CPUs, AMD CPUs, ARM processors and so forth in a very efficient way. Dremio is the only engine that's built entirely on Arrow. Every operator in our system is actually based on Arrow. And we can also return results really fast to a data scientist who is using something like the PyArrow library. So this is about raw speed, but at the end of the day, sometimes the data is so big that it doesn't matter how fast you're going to be, you're not going to be able to process it cost efficiently if you have to scan every record every time, right?

And that's the reason we created Data Reflections. We saw a little bit of that in the context of the semantic layer, but if you think about the legacy, the typical data architecture based on warehouses, what ends up happening is you end up having your tables in the data warehouse, and then every new use case results in more copies of data, right? Somebody creates a new application or dashboard, and they say, "You know what? For the performance, for this to work, I need these three tables pre-joined, and for the performance of this other use case to work, somebody else needs that table aggregated and summarized in different ways, right?" And so you end up with all these disconnected copies of data, which is expensive, expensive to create, expensive to store, and dangerous because permissions don't travel with data. And you have to manage these things and you end up with a hundred thousand tables in your data warehouse or more. And you don't know who created them. Do they still work at the company? How do I manage that?

And that’s not the end of the story, because after that you have the BI users creating their own extracts and imports, right? We’re all familiar with that. We connect Tableau, we don’t get the performance we want. We start creating Tableau extracts. And the same problem happens. Lots and lots of copies of data, those get refreshed every night. I recently met a company. They were telling me that they had 10,000 of these Tableau extracts. They were refreshing every day and they thought that most of them were created by people that no longer work at the company. Okay? So then you end up with these extracts and imports and then finally you have visualizations that maybe are fast enough. But by the time you’ve done all this, there is no self-service. It’s too complicated and it’s very expensive and hard to manage.

And so by having Data Reflections, where we can store different materializations of the data, aggregations, sorts, partitioning schemes of the data, directly in something like S3 or Azure storage or GCS, in an open source columnar format like Parquet, right? We can then use those reflections in our query optimizer automatically, without the user having to ever point to one of these materializations. They just work with the tables and the views in the system, and they get sub-second response times. And that's the reason we created Reflections: to eliminate BI extracts and imports, and to eliminate all these additional disconnected copies within the warehouse.

And then finally, concurrency, right? We talked about the speed of the query and how we make queries fast, both raw execution and ultimately needing to scan less data for each query. But there’s also a question of how many queries do we have? And sometimes you have a few data scientists working over the weekend to do some analysis that they have to get done by Monday morning, and it’s just some Ad Hoc Queries and maybe 10 at a time. And then sometimes it’s Monday morning and everybody’s logging into all these business dashboards, and you have thousands of queries running at the same time. And so you want a system that can scale and automatically handle any kind of demand, any kind of concurrency.

And that's what we've done with Dremio Cloud. We've built this architecture that enables infinite concurrency, both at the query planning level, when you connect to the control plane, and also in terms of the query execution, where any of the engines that you've created, whether it's a large engine or a medium engine, will automatically create as many replicas as needed on demand based on the workload at that point in time. And of course, reduce that when the workload goes down. So that's what we've done for performance. And so now with this kind of performance and scalability, you don't have to worry about how big the data is or how many users you have that are accessing the data, and that is really important for self-service.

RenaissanceRe Query Performance with Dremio and Amazon S3

One real-world example is a company called Renaissance Reinsurance, RenaissanceRe. They're a leading global provider of insurance and reinsurance, and they partnered with Dremio a couple of years ago. They started with a use case where query response times took 4,200 seconds on a data warehouse-based architecture. And at the time, just because of Apache Arrow and the columnar in-memory execution, and also because of the scale-out nature of the system, they were able to reduce that to 33 seconds. So it went from 4,200 to 33, and then they introduced Reflections, right?

We just talked about Data Reflections. They introduced that and got another 10x or 11x performance improvement with Reflections. Now, we haven't stopped working on performance. We've really pushed the boundaries in terms of performance for TPC-DS and many different workloads. And one of the results is that two years later, without any changes on their end, they're now at a point where they're seeing sub-second response times for this workload that originally took 4,200 seconds. And you can imagine what that does, not just in terms of performance, and cost of course is a lot lower when you're running for less time, but what that does in terms of the user experience and getting adoption of the platform, because lots of people then want to use the system. So super excited about the partnership with RenaissanceRe and the results that they were able to see here.

Security and Governance

The third pillar in terms of the data mesh platform is security and governance. Because without security and governance, you really can't have these other things. You can't have self-service, right? Nobody will allow people to have self-service unless the data's secure, and unless there's good data governance, especially in an enterprise. And in a second, you're going to hear more about that from some companies that have actually done this in practice. So, security and governance. What do I mean by that? Well, many different things. Probably dozens of different capabilities and features, but I'll talk specifically about three things today: fine-grained access control, end-to-end authentication and identity, and compliance. Let's start with fine-grained access control. It used to be that just having role-based access control and table-level permissions was enough. You would go and you would define which users and groups have access to a specific table, who could select from that table, who could insert into the table, things like that.

But that's not enough anymore because we're living in a much more complex world with a lot of sensitive data and lots of different kinds of people in different countries and different regulations. And so with fine-grained access control now available in the platform, you can create a simple SQL UDF that masks data, for example, right? It takes a value and produces a masked version, let's say a social security number, and it only leaves the last four digits exposed. You can then take that UDF and, with one SQL command, apply it to different columns in different tables. So you might apply it to a column in this table that has social security numbers, and to a column in a different table that also has social security numbers. And you can also use other columns to make that decision. So it makes it very easy to do data masking.
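
As a hedged sketch of the masking pattern just described (the function, group, table, and column names are hypothetical, and the exact syntax may differ by Dremio version):

-- UDF that exposes only the last four digits of a social security number,
-- except for members of a privileged group.
CREATE FUNCTION protect_ssn (val VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE
  WHEN is_member('finance_admins') THEN val
  ELSE CONCAT('XXX-XX-', SUBSTR(val, 8, 4))
END;

-- Apply the same policy to SSN columns in different tables, one command each.
ALTER TABLE crm.customers MODIFY COLUMN ssn SET MASKING POLICY protect_ssn (ssn);
ALTER TABLE hr.employees MODIFY COLUMN ssn SET MASKING POLICY protect_ssn (ssn);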

And similarly, row access policies, because you might have a situation where different users should see different records within that table, right? You might have, for data locality reasons, the employees in France can only see records belonging to people that are in the EU, and the employees in San Francisco can only see records belonging to the users in the US. That's very easy to do. You create a function, it returns true or false based on location, and the system then automatically uses that function to determine what records a person will see when they query this table. All of this, by the way, works really well with Reflections, automatically handled, so people can see the right set of data that they're supposed to see, and also get the query acceleration that we talked about earlier.
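
And a similarly hedged sketch of a row access policy for the EU/US example (group, table, and column names are hypothetical; syntax may vary by version):

-- Return true only for the rows a given group of users is allowed to see.
CREATE FUNCTION region_access (region VARCHAR)
RETURNS BOOLEAN
RETURN SELECT CASE
  WHEN is_member('analysts_eu') AND region = 'EU' THEN TRUE
  WHEN is_member('analysts_us') AND region = 'US' THEN TRUE
  ELSE FALSE
END;

-- Attach the policy; every query on the table is filtered automatically.
ALTER TABLE crm.customers ADD ROW ACCESS POLICY region_access (region);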

So now that we have these fine grained access controls, and you can control things at a table level, at a column level, at a row level, it’s really important to know who the user is, right? And that even historically has been hard, right? And so what we’ve done is we’ve actually worked very closely with both Microsoft and with Tableau. And we’ve introduced Single Sign-On all the way from the BI Tool all the way to the back end of the system. So if you’re already logged into Power BI you get a Microsoft website, right? You log in, you don’t have to log in ever again. You’re logged in, we know what your identity is, it’s all based on tokens and the identity flows through the whole system. And then the policies that you’ve defined that we just saw, whether you’re using REST or UI or SQL commands to define those policies, all that gets enforced based on who you are. Same thing with Tableau.

And then these days, there are also dedicated software products and services to define security, security platforms like Privacera, like PlainID. Leaders in this space that we've partnered with. And you can use those solutions as well to create and apply these fine-grained policies. So if you're using something like that, it also works with Dremio, and you can very easily apply those policies through Privacera, through PlainID and so forth. And then finally, compliance is important. You have to meet the requirements and meet the compliance obligations. And we've worked hard at Dremio to make sure that Dremio Cloud is compliant. It's ISO certified, it's SOC 2 compliant, it's HIPAA compliant. And this just helps folks that are using it to know that they're compliant and secure.

So those are the three pillars, I think, of a data mesh platform. And you really have to make sure that you have all three of these things in order to implement data meshes. Now, like I said, I've worked with hundreds of companies that have implemented data mesh, and I'm super excited today to have Tamas Kerekjarto with us. He's the engineering lead and senior architect at Shell, and he's going to talk about Shell's journey to data mesh, and also tell us about how Shell is really bringing in a new world of clean energy. Thank you, Tamas.

Context of Shell and Dremio Partnership

Tamas Kerekjarto:

Thank you very much. Thank you folks. Thanks. Well, fantastic. It's so great to be here and I'm really excited to share our story today. But before we dive into the nitty-gritty details of our data journey, I need to talk about a few things, okay? To give you a little bit of context. So there are three things here: one, the energy transition; two, the company and how it relates to it and tries to take a leadership role in that transition; and a little bit about digitalization, how machine learning, AI, and software are really tying this all together. So the energy transition is basically the world trying to move towards clean energy, while at the same time trying to decarbonize and drive towards net-zero emissions. And this seems to be quite a bit of a challenge because the world still needs more and more energy, right?

The demand is growing and this is really where our company comes into the picture at Shell. We'd like to take a leadership role in this transition and help with that challenge. But in order to do that, our company really realized that we needed to transform the business, right? And this transformation is what we are referring to as Powering Progress. What you see on the screen are the four pillars of Powering Progress. First and foremost, we'd really like to generate value for our shareholders and do this in a profitable manner. We'd like to partner up with our customers, businesses, and governments across various sectors in order to help them drive towards net-zero emissions, while at the same time still respecting nature by reducing waste and contributing to biodiversity. And we are powering lives and livelihoods, and really trying to ensure that this transition happens successfully and profitably.

Energy Transitioning

So, the energy transition. There are two major megatrends that we see that are going to really shape our lives in the next 10 years. Besides the energy systems becoming really decentralized, digitalization is also a key factor, as machine learning, AI and software are now not just having a large impact on all these different business models that are being created, but at the same time defining new business models, which is quite an interesting phenomenon. But let's go back to the power value chain and this decentralization concept a little bit. So a couple of years ago, right, the power value chain was quite simple. Simple in the sense that we had a few large power plants that were basically generating electricity. That electricity got transmitted and distributed across the wire. And of course, because electricity wasn't something you could store indefinitely, it needed to be balanced in real time, right?

Supply and demand always needed to be in balance. So that was one of the difficulties. But in a sense, it was fairly simple because these large power plants were just generating the electricity and consumers at the end were just tapping into the grid and consuming it. So what changed? What really changed was the emergence of these different renewable energy sources, right? Solar, wind, batteries, electric vehicles. And so now what really happens is that generation is being completely distributed across the overall grid, and millions and millions of these smaller generators are starting to contribute to the mix. And this really makes things complex. If you think about it, people can put solar panels on their rooftops and they can generate electricity and contribute back to the grid. Then the batteries come into the picture.

People can actually do this thing called load shedding or load shaping, which basically means that you use your battery when electricity prices are really high, and then you charge it when they're really low. So this really introduced a lot of new behaviors, a lot of new challenges and all kinds of facets. And that's what really brings us to the story today. Because one of the things across this complexity that really makes a company, or anyone, stand out is the ability to forecast the consumption of electricity, right? If we know how much the demand and the consumption are going to be, then we can really prepare for it better. So this is where one of our internal organizations, the Power Retail organization, really decided to expand on their existing forecasting capability and build their own.

The Data Problem

So hopefully now you are able to see that with all these transformations, what has always been a compute problem, because time series forecasting has always been around, is now really transforming into a data problem, okay? And this is the data problem that I wanted to talk to you about today. Not necessarily about the forecasting, which has its own intricacies in terms of what techniques and approaches are being used, but really about the data challenges we faced and needed to solve. Okay? So it all really came down to speed, right? And as Tomer was talking about, self-service and performance. So what we were dealing with is the classical situation of having a large volume of data residing in different data sources.

And we really needed to be able to tap into these data sources without building complex ETLs, which would take a lot of time, allowing our data analysts, data engineers, and product people to really contribute to this and get it to the data scientists so they can start doing their magic, build all those magical algorithms, and then eventually put them into production and operationalize them. So there were two key challenges I'd really like to home in on. One of these was around the data volume, and again, enabling these people, without writing a whole lot of code, to serve the data and move these data products along this chain up to the data scientist. And then just one interesting example: we were supplying data to the data scientists, and one day they came back and said, "Hey, Jupyter Notebook is crashing, what's happening?"

"So, well, let's take a look. No wonder it's crashing, because you're trying to load 80 gigs worth of data. In memory, that's not going to work. And by the way, you have about 15 steps to join in here, and oh, let's just see, one of the unique IDs might not be so unique," right? So what do you do? You basically refactor, push it down the stack and try to enable them and move forward as quickly as possible. The other challenge was around when we tried to put these inference models into production, and it turned out that we needed to run about a hundred of them concurrently, and it was about six or eight billion records that needed to be retrieved within about a couple of minutes, right? So that whole timeline really needed to be condensed to be able to allow people to go through this and achieve this.

Data Platform Architecture with Data Lakehouse

Okay? So here's the architecture. Dremio was a natural choice and really gave us a huge uplift as a compute engine to be able to address this, because we were able to, again, tap into these data sources and have really quick iterations before we unleashed the data engineers to write ETLs and so forth. So VDSs were flying, and then PDSs were flying, and then we all got to the Reflections, right? And we were joking about, like, "How is your Reflection doing?"

And when they started getting some love and some hugs, then those got hugged a little too much and got a little bit choked. So we needed to allocate a little bit more juice behind them and isolate them. But overall, I can say that it's been a pretty positive journey. I can't say that we are completely done, right, with high-accuracy forecasts and everything hunky-dory, but where we really got to is a stage where we can now push this high volume of data through with relative ease. We have the process and all the collaboration, all those people can safely publish their data sets, and it's all working.

Distributed Data Mesh

So here's another view of what we've created. And what we really realized was that this actually became like a mini data mesh, right? The data has been residing in different sources, mostly distributed. And this unified access layer really provided a great abstraction from all that complexity, allowing the data engineers and data analysts to contribute, provision different spaces, and then, in a visible and very dynamic fashion, allow the data products to evolve and really reach the various customer levels.

And in this case, we had customers, the data scientists, and we also had the end customers who were consuming these forecasts. And along the way, we also realized that we have actually generated a lot of very valuable data sets that they actually really like consuming. So all these learnings really gave us the impression that now we have most of the, if not all the characteristics of a data mesh. And we like to bank on these learnings, in the future.

Lessons Learned

So some of the additional things, just to mention: the iterative data model was actually a pain in the butt at first, but we realized that we really needed to refactor this constantly. And just going back to that scenario with the 15-plus joins and the not-so-unique ID, right? We were able to really jump on it because of the visibility that the lineage provided us. But it was something we really had to deal with.

On the other hand, the Dremio compute engine is a sophisticated beast, right? So when you use these Reflections, you really want to be careful as to how you treat them and allocate enough memory and isolate them and so forth, because once they become successful and famous, then they really get famous, right, and take off. And then finally, the fine-grained access control was absolutely key, because the way we can provision these spaces and safely, securely assign them to Azure Active Directory groups really kept our IRM comrades and friends at bay. And they remained our friends. So that's a really good thing. So overall, a really great experience. If you are interested in more details, please go to our breakout session. Thank you very much. Really appreciate being here and you listening to our story. Thank you guys, and enjoy the conference.

Tomer Shiran:

Thank you. Thank you, Tamas. Grab that. All right. That was super interesting. And just thank you for sharing your experience, your learnings. At Dremio we're of course happy to play a tiny role in that transition to clean energy. But nevertheless, very exciting for us as well. Next I want to welcome Deepika Duggirala. Deepika is the Senior Vice President of Global Technology Platforms at TransUnion. For those that don't know TransUnion, maybe you're joining internationally from a country where you're not familiar with them: they are a global credit reporting agency and they manage data on over 1 billion people, right? And so you can imagine this is some of the most sensitive data, of course, operating in a very highly regulated environment. And so Deepika is going to talk about self-service in a regulatory environment. Thank you, Deepika.

What is TransUnion?

Deepika Duggirala:

Thanks Tomer. Hi everyone. It’s great to be here and talk to you about how TransUnion is using Dremio for our self-service capabilities and analytics in a regulated environment. I thought I would start by talking to you about TransUnion because I think most of you think of it as a credit reporting agency, but we’re more than that. We operate in 30 countries across five continents and we pride ourselves on being an information and insights company. What we do is make trust possible by ensuring that every individual, every one of us is reliably represented in the marketplace. I run architecture and tech strategy and I’m responsible for the buildout of shared capabilities and platforms at TransUnion to enable us to do this in a secure, compliant, and consistent way across the globe.

Make Trust Possible

So I'll start with what we say: "Make trust possible." What does that mean, right? It's about providing powerful consumer insights. Insights about the core identity, a multi-layered, contextualized understanding of a person that is accurate across their online and offline identity fragments, if you will, as they go about their lives. As well as relevant information, so it's recent observable events that feed this. TransUnion stewards this data with our expertise, but more importantly, in accordance with the local regulations that exist around the world in terms of protecting that identity.

And this true picture, this identity, is the core of the products and solutions that we offer globally. It enables not only credit, which is what we're all familiar with, but it enables fraud, risk, marketing, as well as other advanced analytics capabilities all around that. And that's what our world is. And as companies use that information and our solutions to transact with confidence and build confidence with their consumers, consumers use TransUnion products and services to access, maintain, and protect their identities. So together, this creates not just really great experiences, but personal empowerment and economic opportunity. And at TransUnion, we call that information for good, because that's what we want this data to be used for.

Analytics Platform at TransUnion

So when we talk about an analytics platform at TransUnion, it’s about enabling those powerful opportunities and creating new ways of using information for good. At the same time, we have to be cognizant of the fact that we’re dealing with highly confidential personal data and we place a premium on that trust. The trust of the individuals around the globe whose data we manage. The trust of the businesses who use this information to execute on their transactions and businesses. At the same time, we want to enable innovation.

We want to enable experimentation. We want to allow this diverse group of users of TransUnion data, when you think about our data scientists and data analysts internally and our external customers, to be able to collaborate and unlock new opportunities. So it's paramount for us that we manage data in a secure and compliant way, but still build an analytics platform that's easy to use, that's performant, and that allows them to continue to build new solutions. And as we've moved into our digital transformation, our intent is for this analytics platform to be available globally, so it's a consistent way in which we're accelerating innovation across the globe.

Hybrid Multi-Cloud Strategy

Our digital transformation, similar to many companies that are going through it right now, is really about embracing the public cloud. Finding a way to accelerate how we innovate and bring products to market. So while the public cloud gives us the access to technology and to scale and availability that comes with it, we realize that there might be certain use cases and situations where we don’t want to adopt one or the other environment.

So we built what we call a Hybrid Multi-Cloud. We don’t see ourselves in one environment ever again. We have this varied environment across public clouds, as well as our on-prem data centers, which creates our control plane at the bottom. And then our analytics capabilities and other common capabilities built on top of it, really to look across this. So as you can see, the complexity, the diversity of our data is growing as we go through this. And we hold the responsibility of that secure access and manage governance. But in general, what we’re looking to do is allow our data to live where it lives, allow access only to users who are allowed to use it, but allow that access in an easy to use way so that innovation happens across the board. Piece of cake, right?

Dremio and TransUnion

That's where Dremio comes in for us at TransUnion. It has allowed us to do exactly that, and it's been part of our architecture for a long time. We were early adopters of Dremio at TransUnion because it allows us to bring together that state-of-the-art tooling alongside self-service data access in a governed way. So, I talked about our geographically distributed data. When you think about the different types of data sources that we deal with, we tend to have structured and unstructured data. We have data that's proprietary to us that needs to be analyzed alongside some public data sources, for example. And the reason Dremio was such a great fit is it created a data mesh for us across all of these. And it allowed our customers, our users, data scientists, data engineers, and data analysts to use SQL to access and explore data across the enterprise.

Now, I talked about our transformation. I talked about this hybrid multi-cloud. What we've done is we've expanded the complexity of that. We're working with Dremio because our goal is that in this expanded data mesh, Dremio becomes the single consistent query engine. So it's not just about SQL-based querying, it's the BI tools and the point-and-click interfaces, the different ways in which a user may want to access the data. We still support it in a common way behind the scenes of all of this. And I talk about the users a lot, because at the end of the day it's the experience that matters, right? What we do behind the scenes is manage the complexity of the governance around it. And the fine-grained access controls work perfectly for us on that front. So every user that has access to TransUnion data has a well-defined process they have to go through.

We have policies around who can access what information at a role level. Specific columns that should never be visible, users and groups of individuals that can access certain data sets, certain folders, certain tables, all of that authorization control, all of the entitlements are managed through Dremio for the users that are interacting with it. So this seamless, easy to use data environment, what it really enables our associates to do, our data scientists to do is focus on that information for good and focus on creating innovation, creating financial inclusion.

Credit Inclusion Examples

And what I thought I'd do is end with a couple of examples of what this really means. So in the United States, traditional credit scoring is done based on data at a point in time. Through the data analysis and looking at the different types of data points that are available on consumers, our data scientists determined that we could use trended credit data rather than a point in time, data that looks at payment history and the amount of money borrowed over time, mortgages and everything else, alongside the day-to-day activities that we do as consumers, right?

How are we doing with our accounts? How often do we move? What’s our address stability? How are we doing with the little microfinance loans that we take or our rental payments? Really starting to bring those together increases the number of people that actually can participate in the credit ecosystem. So there are people that are not represented in the credit ecosystem today for a variety of reasons and about 60 million additional consumers right here in the United States were able to gain access to credit because of this combination of critical data points and information.

Another example is in India. India, as many of you might know, is primarily an agricultural economy. About 55% of the workforce in India operates in the agricultural sector. But getting loans for farmers is really difficult because banks and lenders deal with this complexity of trying to figure out what's the credit risk of the individual I'm giving a loan to, but also what's the production risk on the land on which the crop is being grown. Is that going to be productive? And this really made access to credit for farmers difficult. What TransUnion CIBIL, which is our division in India, did is partner with a company called Satur that has geospatial data, and they combined traditional bureau information with agricultural information about the land and what's growing and what's happening there. And together they created a report that really helps lenders make those decisions quickly. A staggering 89 million farmers in India have access to easier credit because of this. My friends, that's information for good and that's the power of data. Thank you all.

Tomer Shiran:

All right, thank you. Thank you, Deepika. That's really amazing. You think about what it's like to build a data infrastructure company. These are the types of things that get us most excited. It's the impact that our customers, our partners are having on the world. And unlocking credit for 60 million people in the United States, or 89 million farmers in India. That's just information for good, as you said. So congratulations on that. And Tamas, on all the work on the transition to clean energy, just unbelievable. I'm going to turn our attention now to focus a little bit more on some of the newer technologies, some of the recent innovation that's happening in Dremio and in the community related to lakehouses, and how that's making data meshes even easier and even more capable going forward.

Data Warehouse to Data Lakehouse

So let's dive in. One of these technologies is Apache Iceberg, and I'm sure everybody here has probably heard about Apache Iceberg. Really a meteoric rise over the last year. And so we're going to talk first of all about that, and let's take a step back and think about the history of data analytics infrastructure. So of course, for several decades we had enterprise data warehouses. Every company deployed an enterprise data warehouse using technologies such as Oracle, Teradata and so forth. And then we had the rise of the data lake, right? I think back to around 2009, 2010, the Hadoop-based technology stack. Around 2015, public cloud became dominant even within enterprises, and cloud data warehouses, things like Redshift, Snowflake, Azure SQL Data Warehouse, then Synapse, all of those came to be.

And now of course, we're all talking about data lakehouses, right? Well, what are the things that have enabled these data lakes and data lakehouses? If you think about the data lake, originally called the Hadoop stack, right, it was really driven by the fact that we had common shared open source file formats. You could have data in a common format like Parquet and then you could have different engines reading and writing that data, right? That's what enabled these data lakes. And so we had a common file format. What's changing now with the lakehouse is that it's not just a common file format, it's a common table format, right? And so we have technologies like Iceberg and Delta Lake that are enabling us to have a table format that different engines can read and write. And a table format is much more capable than a file format, because you can start doing things like inserting, updating, and deleting records and performing transactions and automatically optimizing the data. All sorts of things that you couldn't do before that really enable this lakehouse now to do all the things that you could do in a warehouse. And of course more than that, with open data, right? With multiple engines accessing the same data.

Apache Iceberg

Iceberg specifically, as an Apache project originally created by Netflix, with Apple and many other tech companies and ourselves also contributing, has really taken off. If you just look at the last year, the number of downloads of Apache Iceberg on Maven alone is growing at a tremendous rate, almost 40 million downloads now, right? And there are different open table formats out there, but what's happened in the last year is that Apache Iceberg has emerged as the de facto choice, the de facto standard, for the different cloud providers and for many of the data platforms out there. And while we started evangelizing Iceberg very early, if you look at it now, it's supported by Amazon and Snowflake and Google and Cloudera and Tabular, the contributors to the project.

And so really, the ecosystem has come together around Apache Iceberg as that common open table format. And again, if it's not a common, agreed-on, broadly supported format, that of course is not going to work long term. That's not what companies want. And so that's great. But it's not just that; we also need the development community to be healthy and very diverse. And Iceberg is an Apache project with contributors from many different organizations. You can see examples here: Tabular and Apple and Netflix and Dremio and AWS and so on. Many different companies, web companies, vendors and so forth, all contributing to the project. And that means that the rate of innovation, and how long this project is going to last into the future, is all guaranteed by having this diverse community. And you're going to hear from many of these folks. You're going to hear from Tabular, you're going to hear from Apple today. You're going to hear from all these different companies at Subsurface both today and tomorrow.

Also excited to share with you that we've started working on Apache Iceberg: The Definitive Guide book from O'Reilly. And today we're actually making the first chapter available for preview. So you can go to this QR code, use your phone to scan that, or you can probably Google it, or ask ChatGPT. I'm sure you'll find a link to this chapter. But soon you'll have the entire book. We're going to make all of that available. So I'm very excited about that. Just continuing to drive that adoption and innovation with Apache Iceberg.

Dremio Makes Apache Iceberg Easy

At Dremio, in terms of our product and our technology, we've worked very hard to make Apache Iceberg easy. And when I say easy, I mean as easy as using MySQL: creating a table, updating records, and just working with tables, right? You don't really need to understand any of the underlying technology. That starts with being able to create a table. So a CREATE TABLE statement, specify the columns, and you have an Iceberg table, right? ALTER TABLE, add a column, very easy. Same thing as any relational database. We've made it easy to get data into Iceberg. So there's a SQL command called COPY INTO. All you have to do is specify the source of the data and you can copy data from a bunch of dirty CSV files, or some database tables, and get that into an Iceberg table with one SQL command. And we make it fast.
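
For illustration, a minimal sketch of those first steps (the table, source, and path names are hypothetical, and exact options may vary by Dremio version):

-- Create an Iceberg table with plain DDL.
CREATE TABLE sales.orders (
  order_id BIGINT,
  customer_id BIGINT,
  amount DOUBLE,
  sale_date DATE
);

-- Evolve the schema like any relational table.
ALTER TABLE sales.orders ADD COLUMNS (channel VARCHAR);

-- Load a folder of CSV files into the Iceberg table with one command.
COPY INTO sales.orders
  FROM '@s3_raw/landing/orders/'
  FILE_FORMAT 'csv';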

DML operations, inserts, updates, deletes, merge, all the things that you can do with a database, you can do. You don't need to know anything about Iceberg behind the scenes. That is what is getting created and updated in your S3 bucket, in your Azure storage account and so forth, right? It's just simple SQL; in this case, it's merging two tables into one table. Performance is another thing. One of the things that Iceberg enables is higher performance, by having built-in statistics and partitioning and so forth, and automatically understanding how to use these partitions. So again, it's as simple as a CREATE TABLE statement: you specify what you're partitioning by, in this case the month of the sales date, and what you might sort it by. Users don't need to know any of that.
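
A hedged sketch of what that looks like (table and column names are made up; the partition transform syntax may vary by version):

-- Merge a staging table into the target Iceberg table.
MERGE INTO sales.orders t
USING staging.order_updates s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount, sale_date = s.sale_date
WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.customer_id, s.amount, s.sale_date, s.channel);

-- Partition by the month of the sale date and keep files locally sorted;
-- readers just query the table and the engine prunes partitions for them.
CREATE TABLE sales.orders_by_month (
  order_id BIGINT,
  amount DOUBLE,
  sale_date DATE
)
PARTITION BY (MONTH(sale_date))
LOCALSORT BY (sale_date);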

They just select, query the table, and automatically the system will take advantage of these partitions and the sort order and so forth, and the statistics. Optimizing a table can be done through simple SQL commands. You run the OPTIMIZE command, the table gets optimized, the right file sizes are created, the right partitioning scheme is created if necessary, the table is sorted in the right way and so forth. And this can also be automated; you actually don't have to run the SQL command. We'll talk about that in just a second. Vacuuming Iceberg tables: the VACUUM command allows us to clean up old data, whether it's data that you no longer need in terms of time travel, or just metadata that's no longer needed in the system. Again, one SQL command. Super easy, really focused on making Iceberg simple.
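
And the maintenance commands just described, again as a hedged sketch (the table name is hypothetical; OPTIMIZE and VACUUM options differ across versions):

-- Rewrite small files into right-sized ones according to the table layout.
OPTIMIZE TABLE sales.orders;

-- Expire old snapshots (time-travel history) and clean up unreferenced files.
VACUUM TABLE sales.orders
  EXPIRE SNAPSHOTS older_than '2023-02-01 00:00:00.000';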

Dremio is Open and Works With a Range of Catalogs

We place a lot of value on providing an open platform, one that works with different sources and different tools, like I said earlier, and also different Iceberg catalogs. So one of the great things about Iceberg is that it has a pluggable catalog model. With Dremio, you can query data regardless of what catalog is being used. And so that could be the AWS Glue catalog, the Hive Metastore, or maybe you're just using S3 or Azure Storage or GCS as the catalog and just creating tables in a file system. All of that works. You don't have to worry about what the catalog is. You can just work with Iceberg tables as if it's a simple database.

Now you're here at Subsurface, and one of the great things about Subsurface is we have lots of talks for you to learn from. And you can see here, these are just examples of the talks that we have here related to Iceberg. And so you can see it's a combination of companies like Apple and Shopify and Insider and Pinterest talking about how they're using Iceberg to really power their data lakehouses. We also have lots of contributors to the project, committers, companies like Snowflake, Tabular, and AWS talking about it. And so, if you're interested in this, please just attend these talks. They're here both today and tomorrow, and you can find the agenda online. So that's what's happening today. We've integrated Iceberg so it's as easy to use as a database, but that's not the end of the story.

Dremio Arctic

We think there's an opportunity here to create the next generation of the lakehouse. One that's even easier to use, more designed for the data mesh, and just makes it easier and easier for enterprises to become data driven. What do I mean by that? I mean, it supports multiple domains, it supports new concepts such as data as code, GitHub for data if you will, and it's self-optimizing. And I want to talk a little bit about that. And so we've created a project and a service called Dremio Arctic, and Dremio Arctic is a lakehouse management service really designed for this data mesh era. So what do I mean by lakehouse management service? Arctic includes two main components: a lakehouse catalog, right? This is instead of something like a Hive Metastore, really designed for the cloud and built for Iceberg from the ground up, much faster, much lighter weight. And then a data optimization service that automatically optimizes your tables and garbage collects things and so forth.

The catalog includes data as code. So the principles of Git applied to data, and then of course governance and security. And I’ll talk about each of these things. So first of all, Arctic is Iceberg native. It’s really designed to take advantage of Iceberg. It’s a catalog that’s built into the Iceberg project itself, the Apache project. And you can just get started using it from all of the popular engines. Second thing is it allows you to manage data in multiple domains. So you can have multiple domains or catalogs that are isolated from each other. So different groups can have their own domains. You can decide to share across domains and enable compute on the different domains, but you have that very simple isolation across domains. And that doesn’t exist in any other catalog. Of course, data access control and governance.

We just heard from TransUnion, from Shell, the importance of these types of things. So Arctic provides fine-grained access control, and provides data governance where you can see exactly who changed what in what table and when, down to the level of, "Okay, what SQL command changed this table in the last week, or was there a Spark job? And what is the ID that changed that data?" You can see all the history of every single table in the system. Automatic data optimization: so automatically being able to optimize the table without having to run an OPTIMIZE command, or automatically garbage collecting things without having to run a VACUUM command. Dremio Arctic takes care of all of that in the background, uses elastic compute, brings it up, shuts it down, takes care of all of that.

And then finally, the thing that I'm most excited about is data as code, and this is a new idea for enabling you to use data the same way you manage source code. Because we talk about data products, right? And managing data as a product, right? Well, how do we build software products? How do we build services? We use GitHub or GitLab, right? We have version control, we have branches, things like that. And so we're bringing those principles to the world of data through Dremio Arctic. And that includes isolation, being able to create branches, and version control, where I can tag and go back to a state in time in the past, not for a single table but for my entire lakehouse, right? So let's talk about five very simple use cases for data as code. The first use case is ensuring data quality with ETL branches. So rather than ingesting data and testing it in the live production environment, where you might negatively impact other people, you can now create a branch for the data ingestion that night.

Do all the work in that branch, ingest the data, test the data, maybe bring up your dashboard, see that the numbers look okay and there’s nothing crazy there. And then when you’re done with all your testing and the ingestion and all the integration, you just merge that into the main branch. And with one atomic operation, everybody sees that. All the new dashboards are updated, and the broader audience can start working with the new data. Another example is experimentation. Most companies have lots of analysts, lots of data scientists, and they need to experiment and work with data. And so rather than doing that in the production environment, which often leads to thousands and thousands of tables lying all over the place, every user can create a branch and then they can do their own work in a separate branch.

The same way when we use GitHub, we create a branch and maybe we work on a new feature in a branch or fix a bug in a branch. We can do that with data. We create a branch, we create a table in that branch, we update tables in that branch. We can do all our work there. And when we’re done, we either throw it away or we merge it back into main, if that’s what we want, right? So it’s very easy to do experimentation. The third use case is the ability to reproduce models or analyses or dashboards, right? To go back in time and say, I want to see what this dashboard looked like on the 1st of January last year. Or I want to, in this case, recreate a logistic regression model using Spark based on data from a previous point in time, right? In this case, we’re looking at a specific tag that somebody made called Model A and we’re going back in time. So now, instead of creating another copy of the data every time you update your model, every time you train it, just so you can go back and reproduce it, you can simply run one command to create a tag, and at any point in time you can refer to the data as it was at that tagged point in time.
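To make the experimentation and reproducibility ideas concrete, here is a rough sketch of what a per-user branch and a reproducibility tag might look like in SQL. The catalog, branch, table, and tag names are hypothetical, and the branch/tag syntax is an approximation of Dremio-style SQL, not a definitive reference.

```sql
-- Experimentation: work in a personal branch, isolated from main.
CREATE BRANCH alice_experiment IN arctic;
USE BRANCH alice_experiment IN arctic;

-- Create and modify tables freely inside the branch.
CREATE TABLE arctic.sandbox.emea_sales AS
SELECT * FROM arctic.sales WHERE region = 'EMEA';
-- ...when done, either merge the branch back into main or simply drop it.

-- Reproducibility: tag the state of the whole catalog when a model is trained...
CREATE TAG model_a IN arctic;

-- ...and later read the data exactly as it was at that tag.
SELECT * FROM arctic.sales AT TAG model_a;
```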

The fourth use case is recovering from mistakes. And this is a very simple one. It’s probably happened to many of you, where you go in and accidentally delete data, or you mess up a bunch of table schemas, or you ingest the wrong data that hasn’t been validated yet. And the next thing you know, the system’s in a bad state. And this typically happens on a Friday afternoon, right? And then you’ve got to work the whole weekend and try to figure out what’s going on. And it’s a nightmare. That doesn’t happen with source code, right? We are never worried about messing up the source code, because we can just go back in time with one command in Git or GitHub, and we’re back to how it was, right? We’re never worried about being in a bad state like that. And so the same thing is true now with data.
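As described next, recovering is just a matter of pointing the head of the branch back at an earlier commit. A minimal sketch, assuming Dremio-style branch SQL and a placeholder commit hash:

```sql
-- Sketch only: reset the main branch to a known-good commit.
-- The commit hash below is a placeholder, not a real value.
ALTER BRANCH main ASSIGN COMMIT 'c0ffee1234abcd' IN arctic;
```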

You can recover from mistakes. If you’ve messed up a bunch, you’ve done a bunch of work for the last hour, and then you realize, “Oh no! I integrated some really bad data and propagated that to a hundred other tables. Oh, what a mess! How am I going to go back in time? Do I have to rewind all these different tables?” With one command, you just move the head of the branch back from where it is now to a previous commit, and that’s it. One SQL command. Do it with Spark, do it with REST, or do it with the UI. It’s very easy to recover. You no longer have to worry about making mistakes, unless you don’t catch them. And then finally, troubleshooting. So this is related to data governance: the ability to see what changed. So something seems wrong with this dashboard; it looks like our average sales amount went down 20%.

What does everybody see when that happens? “The data has to be wrong. Something’s wrong. Something happened here.” But finding out what happened, that’s the hard part. That’s no longer the case. Now you can basically say, “Let me look at that sales table and see what changed.” And then I can see that somebody here, Daniel, made some changes on this table three days ago. That’s probably the reason. Let me go investigate what those changes were, what happened, and if I need to, I can rewind that to a previous point in time. But to make this all more real, we have a demo here of Dremio Arctic that we wanted to share. And I want to invite to the stage Anushka, who’s the product manager for Dremio Arctic. She’ll show you the product.
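Before the demo, here is a rough sketch of what that kind of troubleshooting could look like in SQL. The table and tag names are hypothetical, and the table_history/table_snapshots table functions and the AT TAG clause are assumptions about Dremio’s Iceberg support rather than confirmed syntax.

```sql
-- Sketch: inspect how the sales table changed over time.
SELECT * FROM TABLE(table_history('arctic.sales'));
SELECT * FROM TABLE(table_snapshots('arctic.sales'));

-- Compare current numbers against an earlier, tagged state of the catalog.
SELECT SUM(amount) FROM arctic.sales AT TAG end_of_last_week;
```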

Dremio Arctic Demo

Anushka Anand:

All right, thank you Tomer. Okay, hello everyone. So let’s see how Dremio Arctic streamlines data management. Okay, in this demo, our company analyzes the top performing products by sales every quarter. So we’re looking at a Tableau dashboard of this data since January of this year. And this is summarizing over 21 million transactions. Now, let’s say that we get daily sales transactions that are ingested from multiple different sources and this production data has to be updated. Now, usually this data management process can be cumbersome and potentially risky as some partially updated data can leak and that can lead to incorrect decisions and possibly a loss of trust in your data infrastructure. Well, the key to having high quality data and being able to react to new data quickly is being able to isolate changes. So let’s see how Dremio Arctic can enable this.

Here we are in a Sonar project in Dremio Cloud, and you can see our Arctic catalog with production data added as a source. So it’s supply chain data with information about customers, orders, line items, and products, and there’s also a view that joins a bunch of these tables to produce the view that supports the dashboard we were looking at. We’ve also got an S3 bucket that serves as a landing zone for new data that gets dropped in. So here we can see there are two files that we’d like to add to update our production data. Well, with Dremio Arctic, we have a simple way to create a development environment, and we can go to SQL and create a branch from our production environment. I’m going to call this branch “load orders” because we’re going to load in the new orders information. And when I run this, it isn’t making a copy of your data; it’s simply creating a snapshot of your data lake.
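For reference, creating that branch in SQL might look roughly like the sketch below. The catalog name is a placeholder, and the exact CREATE BRANCH / USE BRANCH syntax is an approximation of Dremio-style SQL.

```sql
-- Sketch: create an isolated development branch off production (main).
CREATE BRANCH load_orders IN arctic;

-- Point this session at the new branch so subsequent changes land there.
USE BRANCH load_orders IN arctic;
```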

And this is possible because of the things that Tomer was talking about in terms of open table formats like Iceberg. They have the notion of snapshots, and that allows us to simply copy pointers to all of those table snapshots when you create a branch; as you saw, it happens pretty much instantly. And so now any inserts, updates, or deletes that we do involve moving those pointers forward. And that’s all isolated to this development branch, not touching our production environment. So I’m going to change the context to this branch so I can load in the new data. I’m copying into the orders table the new file that landed as JSON from a third-party feed. I want to show you that the Tableau dashboard remains untouched, because this is critical. I’ve only updated one of the tables behind this view. So it’s critical that all of the consumers of this data not see a partial view of the data or any ETL that’s in progress.

So what we expect is that the number of transactions stays the same; we should still see about 21,573,000 transactions. And there, it’s updated, and the number of transactions has not changed. And this is critical for data integrity. So now we can go back and update the other table. We’ve added in a million orders; let’s add in the corresponding line items. And so what you’re seeing here is that Dremio Sonar and Arctic work together to seamlessly manage data isolation. Now that we’ve updated the two tables, we want the view that joins this data to be updated all at once, or atomically, because all of the consumers of this data want to get a consistent view of the latest data.

So what we can do is simply merge that data in. We’re going to merge the branch with all of the new information into our main branch, which is the production data.
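Taken together, the ingestion and merge steps just described might look roughly like this. The file paths, table names, and the COPY INTO / MERGE BRANCH syntax are approximations for illustration, not exact Dremio commands.

```sql
-- Sketch: load the new files into the branch (the session context is the branch).
COPY INTO arctic.orders
  FROM '@landing_zone/new_orders.json'
  FILE_FORMAT 'json';

COPY INTO arctic.lineitems
  FROM '@landing_zone/new_lineitems.json'
  FILE_FORMAT 'json';

-- One atomic merge: consumers on main see both tables update together, or not at all.
MERGE BRANCH load_orders INTO main IN arctic;
```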

And so when we do this, it’s taking all of the transactions that we did in that branch, squashing them, and committing them in one shot. And so let’s confirm this. We’re going to go to the Tableau dashboard and refresh it, and we expect to see a new number of transactions. I’ve done that in Dremio Sonar using SQL; I could have done it in the UI in Dremio Arctic or using another query engine like Spark SQL. Okay, you see we’ve picked up a new, updated number of transactions. Dremio Arctic provides a safe way to ingest new data and an easy way to update multiple tables atomically. Now, besides ensuring data integrity, your data lakehouse management service should ensure fast data access. With repeated ingestion of new data into your Iceberg tables, they can get fragmented, and Iceberg recommends routine maintenance operations to compact these tables.

Basically, rewrite them to enable faster queries. Well, Dremio Arctic automates this necessary but tedious task. So let’s simply go to our Arctic catalog, which is here. And here you’re seeing the list of all of the tables and views that we saw before. This is our production data, so this is our main branch. And in Arctic you can see more than just the list of your tables. You can browse all of the commits, or the transactions, done on your data. You can look at all of the branches and any tags that you have marking states of your data. And for tables that you know are frequently updated, you can go and set up when and how frequently you want those tables optimized. It’s that simple. No more wasting time managing multiple scripts. Dremio Arctic is the data lakehouse management service that lets you automatically optimize your tables and lets your data teams manage their data as code. Thank you. Back to you, Tomer.

Real World Example of Dremio Arctic

Tomer Shiran:

All right, thank you, Anushka. That was a powerful demo. Thank you so much for sharing that. All right, so I want to talk now about a real-world example of how a company is using Dremio Arctic. And this company is Merlin. If you haven’t heard of Merlin, 15% of the songs on Spotify and Apple Music and all these networks that you listen to music on would not be available if it weren’t for Merlin. They work with independent artists and independent producers to make it possible for their music to be on these platforms, make it possible to track the listeners on these platforms, provide the compensation to these artists, and so forth.

So, a little bit of history here about Merlin. The first thing that they did when they engaged with Dremio was to move from a data warehouse-based architecture to Dremio. They had data in S3, and they were moving it and replicating it into a data warehouse, creating all these copies of data there. And then from there, like we talked about earlier, they were creating BI extracts for the purpose of their visualizations, right? So the first thing they did was move to a much simpler architecture. Basically a lakehouse architecture: they had Dremio on S3 and then the various client applications, a combination of interactive use cases and also reports that they generate for these artists and in collaboration with these partners.

Even with that, they had some challenges around data quality, because as you can imagine, they’re getting data dumps from all these different partners, whether it’s Apple Music or Spotify or Twitch or any of these different partners. They get these daily ingestions of data, right? And sometimes this data is not clean. Sometimes it has problems in it and they have to test all this data. There’s a complicated QA process that has to happen. But up until now, that QA process had to happen in the main system, right? They ingest the data for the last day, test it out, clean it up, and there’s a lot of complexity around that process, right? Because the development, the QA, was being done in the live environment alongside everything else, right?

What they’re able to do with Dremio Arctic is simplify that situation. To simplify that architecture and take advantage of branches, right? The same way we all use branches when we use GitHub, right? We create a branch and do the work in a separate branch so that we’re not risking making any inconsistent or bad changes to the production code, right? It’s the same thing here with data: rather than ingesting data and doing all the QA and the data quality work in the main environment, they can do that in separate QA branches. So as you can see here, they ingest the data into these branches and do all the data quality checks in the branch. If there’s a need to fix the data or get a new data dump from one of these partners, they can do that in the branch. And only when it’s clean and ready and they have the correct data for that day do they merge the branch into the main branch.
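As a rough sketch of that pattern (the table, branch, and quality check are hypothetical, and the syntax approximates the Dremio-style branch SQL shown earlier), the nightly flow boils down to validating inside the QA branch and merging only when the checks pass:

```sql
-- Sketch: the day's partner dump has already been ingested into a QA branch.
-- Validate there; main stays untouched while the checks run.
SELECT COUNT(*) AS rows_missing_track_id
FROM arctic.daily_streams AT BRANCH qa_2023_03_01
WHERE track_id IS NULL;

-- Only once the data is clean is the branch merged, publishing it in one shot.
MERGE BRANCH qa_2023_03_01 INTO main IN arctic;
```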

And that significantly simplifies these workflows that they have to deal with on a daily basis. So that’s a pretty cool use case. 15% of the music on these platforms is enabled because of Merlin. All right, so we’ve talked about a few different examples here. A few different companies that have created data meshes. We’ve had the privilege at Dremio, of course, of working with hundreds of these companies that have created data meshes across every industry, right? Tech companies, financial services, insurance, retail, transportation. And the most fun part of the job is really understanding and helping companies take advantage of data and become more data driven.

We’ve also seen how things are changing and how there’s a lot of new technology being created these days. Both things like Iceberg and things like Dremio Arctic to make the data mesh significantly easier and significantly more powerful, so that it’s easier and easier for companies in all industries to become more data driven. So we’ve got a great conference here over the next two days. Lots of different talks about these topics: about data mesh, about data lakehouses, about Iceberg, about all these different topics. And so thank you for joining me today. Whether you woke up early in Hawaii to watch this or you’re up late in Israel watching this, thank you so much for joining, and please join all the different talks that we have over the next two days. I’m sure there’s a lot you’ll learn. Thank you.
