May 2, 2024

A Pragmatic Paradigm Approach to Data Mesh

A Data Mesh is an explosively popular and effective architecture for data sharing across an organization. It’s also a key driver for data quality. Join this session to learn why data mesh architectures are being used by data-first organizations, along with common pros and cons. We’ll talk about how you can design, build, and operationalize a data mesh with Dremio, and how data mesh architectures can drive better data quality.

Sign up to watch all Subsurface 2024 sessions

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Yassine Faihe:

Good morning, good afternoon and perhaps good evening for all of our attendees. Happy Friday. So yeah, let’s get started. So, first things first, you may be asking yourself a couple of questions about the title. So this is why the first slide is really to clarify what it is about. And as mentioned by Mike, my name is Yassine Faihe, I’m based in the French Alps in Grenoble, France, and I’m looking after the solution architecture for Dremio. And during all of the conversations I’m having with customers, right, the topic of data mesh and more specifically how to implement data mesh is always at the top of the conversations that we’re having. And I thought it would be interesting to share the feedback and the experiences and the best practices that I was gathering from all of those customers during the past two years where the data mesh approach started to pop up.

Pragmatic Paradigm Approach

So when it comes to the approach, basically it’s to understand how we can do it in a very pragmatic manner, because again, during my discussions, I found either customers working really in a very rudimentary manner trying to share data across the board with their users, or customers maybe taking a very theoretical approach, right, led by the CDO office, that is resulting in a bunch of processes and a bunch of ways of doing things and constraints and everything else that finally would remain very theoretical and are not necessarily implemented, specifically because of a lack of platform and lack of adoption and lack of the change management that goes with it. So this is why one thing that I started to notice that is really successful in gaining traction is to run it in a pragmatic manner, bottom up, run it little by little in an incremental manner, and do it in an experimental manner as well. So design, test, deploy, codify, and then operate the appropriate change management to drive adoption. So basically this is the pragmatic paradigm approach, not necessarily to data mesh, but to any potential activity.

So again, to set up the context, some people prefer to talk about friction or clash between business users and IT, each one of them claiming something that the others cannot deliver. So mainly on the one hand you have the business users who would like to have more and more access to data, would like to be more agile and bring data to the market in a faster manner, and on the other hand you have IT that is really strict on governance and security. And actually it’s not a matter of clash or friction, but more finding the appropriate balance, right? So how can we provide access to users, to business users, in a controlled manner to high quality data without sacrificing, I would say, the governance and the security matters.

About Data Mesh

So this is why data mesh has been very popular, it’s because it attempts to answer that specific question. You’re all familiar with the core principles of data mesh, so I’m going to share them with you one more time, but in a visual manner. So at the center of data mesh is a data product, so we can talk later on about data product or data as a product, but this is basically what’s going to be exposed and consumed by applications and use cases. And as you can see in the diagram, a data product can be built from raw data or can be built also from other data products, so perhaps you can use the term composite data product. And those data products are grouped logically into data domains that are owned and governed by data stewards. And finally, the way data products are being created and shared and the way data domains are being also owned is governed really centrally and through a central team providing a bunch of rules, but implemented locally and in a distributed manner. So this is why we talk about federated data governance. And of course, to make all of that work, we need the appropriate data platform that enables a self-service approach or way to do everything that I just mentioned. 
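To make the diagram concrete, here is a minimal Python sketch of the relationships just described: data products built from raw data or from other data products, grouped into domains that are owned by stewards. The class names, fields, and dataset names are hypothetical and chosen only for illustration; they do not come from the talk or from any specific platform.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """A data product built from raw data, other data products, or both."""
    name: str
    domain: str
    raw_sources: tuple = ()  # names of raw datasets (hypothetical)
    upstream: tuple = ()     # other DataProduct objects this one builds on

    def is_composite(self) -> bool:
        # A composite data product has at least one upstream data product.
        return len(self.upstream) > 0

    def lineage(self) -> set:
        # Collect every raw source reachable through upstream products.
        sources = set(self.raw_sources)
        for product in self.upstream:
            sources |= product.lineage()
        return sources


@dataclass
class Domain:
    """A logical grouping of data products owned by a data steward."""
    name: str
    steward: str
    products: list = field(default_factory=list)

    def add(self, product: DataProduct) -> None:
        if product.domain != self.name:
            raise ValueError(f"{product.name} belongs to domain {product.domain}")
        self.products.append(product)


# Hypothetical example data.
customers = DataProduct("customers", "sales", raw_sources=("crm.accounts",))
orders = DataProduct("orders", "sales", raw_sources=("erp.orders",))
customer_360 = DataProduct(
    "customer_360",
    "marketing",
    raw_sources=("web.clicks",),
    upstream=(customers, orders),
)

sales = Domain("sales", steward="alice")
sales.add(customers)
sales.add(orders)
```

In this sketch, `customer_360` is a composite data product, and `customer_360.lineage()` walks back through `customers` and `orders` to the raw sources they were built from.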

So this is kind of the data mesh 101 in a graphical manner. However, to implement data mesh, it’s not something that is going to be done in one shot, right? And there is a notion of maturity. If you talk with many customers, each one of them would have a different maturity level, and maturity according to multiple dimensions. So in this kind of radar diagram, you can see that each profile would represent a specific customer, and each one of them has made progress in one or more dimensions. But so far, very few have reached the maximum score according to those four dimensions, which correspond to the four core principles of data mesh. So it’s definitely a journey, and you need to start somewhere, need to make progress along all of those dimensions, but you don’t necessarily need to be at a maximum score in each and every dimension.

We Were Doing Data Mesh Before Data Mesh

So something else that is popping up during the conversations that I’m having with customers is that when they start to understand what data mesh is all about, I’m very often hearing, “Oh, yeah, we were doing data mesh before data mesh, or we were doing data mesh without knowing that we were doing data mesh.” But my key question back to those customers is, “How were you doing data mesh?” And basically, there are three ways, okay, today. And I’m using the skiing analogy because, as I mentioned before, I live in the French Alps. So the first one is freestyle: basically no governance, no rules, or basically in a rudimentary manner. And here we find customers who are still using perhaps spreadsheets, okay? I mean, at the end of the day, maybe it is data mesh because they are going to collect data from different systems, different databases or data warehouses, and combine the results in a spreadsheet and say, “Hey, this is my data product.” Maybe others will do everything but within a single data platform.

Others would attempt to do it in a data lake, but without the appropriate governance, okay? So that would be one way to perhaps do it, but again, it’s the beginning. The second one is through guardrails, and here you already need the kind of data platform that would enable you to provide the capabilities for your customers to create and share data products, to organize them into domains, to have some level of governance, et cetera. And finally, the last one is when you start to enforce all of the best practices. Because guardrails are establishing the way you need to do it, but they are not enforcing it. Okay, so this is when you hear about computational governance or computational security or data products. It’s basically that if you create a data product, you cannot publish a data product unless you provide the associated documentation that goes with it, okay? So that is one way to run the enforcement. And what we advocate here is that guardrails, the second approach, is very appropriate to start the experimentation, because you need to have the appropriate platform to offer all of the capabilities that I mentioned before to your users, but educate them and guide them, okay? Yeah, let them publish maybe some data products, even if they are not fully documented, but use it as an experimental platform to drive adoption and to ingrain the necessary best practices. And once you reach that level and you are fully assured that this is the way to do it, then you can look at how to enforce the processes that you have defined and that have already been adopted by your business users.
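The difference between guardrails and enforcement can be sketched in a few lines of Python. This is an illustrative sketch only: the `Mode`, `Draft`, and `publish` names are hypothetical, and the ownership check is an added assumption (the talk only gives documentation as the example rule). With guardrails, an undocumented data product is published with warnings; with enforcement, publication is blocked.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    GUARDRAILS = "guardrails"  # guide users, but do not block them
    ENFORCED = "enforced"      # block publication when a rule is violated


@dataclass
class Draft:
    """A data product waiting to be published (hypothetical structure)."""
    name: str
    owner: str = ""
    documentation: str = ""


@dataclass
class PublishResult:
    published: bool
    issues: list = field(default_factory=list)


def check(draft: Draft) -> list:
    """Return the list of best-practice violations for a draft."""
    issues = []
    if not draft.documentation.strip():
        issues.append("missing documentation")
    if not draft.owner.strip():
        # Assumption for illustration: an owner is also required.
        issues.append("missing owner")
    return issues


def publish(draft: Draft, mode: Mode) -> PublishResult:
    issues = check(draft)
    if issues and mode is Mode.ENFORCED:
        # Enforcement: the same rules now block publication.
        return PublishResult(published=False, issues=issues)
    # Guardrails: publish anyway and surface the issues as guidance.
    return PublishResult(published=True, issues=issues)


draft = Draft(name="customer_360", owner="alice")
guided = publish(draft, Mode.GUARDRAILS)
enforced = publish(draft, Mode.ENFORCED)
```

The same `check` function serves both modes, which mirrors the progression described above: the rules are first adopted as guidance, and only the reaction to a violation changes once they are enforced.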

Take Serial Steps

So therefore, it is important to take an incremental approach, okay? So first, you need the right data platform, okay, to incrementally deploy use cases. And while you deploy those use cases, you ensure that your team or your business users are exposed to the key concepts that you would like to implement and that you have in place the appropriate guardrails to guide them through that process. Start with key data domains, perhaps maybe high-impact, low-complexity, okay, just to experiment and kick the tires. Then understand who amongst your business users is going to provide the appropriate stewardship to own a given domain and to make sure that what is happening in that domain is under control, and provide them with cover from executive sponsors. Then move to the next level of data products, right? Identify key data product owners and teach them how to manage data products, you know, as a product. So this is where we are going to make the switch from data product to data as a product, when you introduce the notion of life cycle management and the notion of full documentation of a data product.

And at the same time, you foster the appropriate culture to drive the adoption across the board by promoting data quality, sharing of data products where appropriate, of course. And more importantly, the reflex to reuse and to scout for, to search for existing data products prior to creating new ones. So enhance and foster the reuse of existing data products. And finally, provide the appropriate oversight to monitor all of the activities that are happening and adjust the rollout of additional domains and additional data products accordingly.

A Data Platform With a Powerful Semantic Layer

So the key to really be successful in implementing this approach in an experimental fashion and from the bottom to the top is really to first deploy the appropriate data platform with a powerful semantic layer. So first things first is really to be able to access all of your data. So as of today, companies are doing very well when it comes to authentication. You can have a single place to authenticate across the board. But authorization would be pretty different from one platform to another. And when you have multiple platforms across the board in your organization, it is difficult to understand who can access what. Therefore, the first thing is really having a platform that would allow you to access all of your data and build on top of this data a common semantic model, both from a content perspective and from a governance perspective as well.

Then enable self-service for users to create, to document, to publish, and share specific data products. So this is also a key requirement for such a platform. And have also in place the appropriate flexibility to organize the said data products into domains and ensure that the domains can be owned by stewards, by super users that are going to manage them. Have also the appropriate capabilities in place to search and consume data products, again to maximize reuse. And of course, the ability to define a governance model centrally and have the local teams or the domain owners or the product owners govern their domains and their products while respecting that set of rules that were defined centrally.

And we believe that Dremio is the appropriate platform to enable such an approach, right? And a pragmatic paradigm, bottom-up approach, experimental, and by experimental I mean experiencing, experimenting with data mesh concepts. So in a nutshell, and as you may have heard many times during this conference, Dremio really offers access to multiple data sources and allows you to establish a common semantic model across the board through the unified analytics capabilities that include self-service and governance and security. You have a SQL engine that would allow you to federate queries across multiple sources and to run queries or analytics on created data products with the appropriate performance and scale. And of course, it offers you all of the lakehouse management that you need in terms of cataloging data, optimizing data, and having the next generation of data catalog as well.

Putting it All Together

So let’s see how we can put everything together. And by that I mean having the concepts of data mesh that we discussed before together with Dremio. So here we have an example where we have multiple sources, and each of the sources has some raw data. This raw data is being accessed by multiple users, and some super users would create domains, as we can see here: marketing, sales, and product. And inside those domains, some product owners are going to create initial data products and share them across the board. As you can see here, there is users’ data that belongs to the three domains, whereas customer data and feature data belong only to a single domain. And from there, those data products are being shared to specific users. And Dremio also offers you the possibility to isolate the access to those data products at the engine level, just to make sure that a user or a domain is not cannibalizing the resources of another domain.
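The layout of that example can be sketched in plain Python. This is not Dremio's API or configuration format; the dictionaries, engine names, and the `route_query` function are hypothetical and only illustrate the idea of shared data products plus per-domain engine isolation. Which single domain owns the customer data and the feature data is also an assumption here (sales and product respectively), since the talk only says each belongs to one domain.

```python
# Data products visible in each domain. "users" is shared by all three
# domains; "customers" and "features" each live in a single domain
# (the specific domain assignments are assumptions).
DOMAIN_PRODUCTS = {
    "marketing": {"users"},
    "sales": {"users", "customers"},
    "product": {"users", "features"},
}

# One compute engine per domain, so that a heavy workload in one domain
# does not consume the resources of another (engine names are hypothetical).
DOMAIN_ENGINE = {
    "marketing": "engine_marketing",
    "sales": "engine_sales",
    "product": "engine_product",
}


def domains_for(product: str) -> list:
    """Return the sorted list of domains that expose a data product."""
    return sorted(d for d, products in DOMAIN_PRODUCTS.items() if product in products)


def route_query(user_domain: str, product: str) -> str:
    """Pick the engine for a query, rejecting products outside the domain."""
    if product not in DOMAIN_PRODUCTS.get(user_domain, set()):
        raise PermissionError(f"{product} is not shared with domain {user_domain}")
    return DOMAIN_ENGINE[user_domain]
```

For instance, `route_query("sales", "customers")` returns the sales engine, while `route_query("marketing", "features")` raises `PermissionError` because that data product is not shared with the marketing domain in this sketch.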

So in summary, Dremio is offering you the appropriate or the needed or required capabilities to establish from the bottoms up all of the foundations that are required to start deploying your enterprise data mesh. And as we do that, we establish the guardrails that we mentioned in the beginning, and we pave the way towards a full data mesh. So data ownership is one of those guardrails, therefore, to model domains and to govern the access centrally, a powerful semantic layer with self-service to enhance the user experience when it comes to building data products or creating those data products from multiple sources or from multiple data products and share them. The multi-engine aspect to make sure that you keep the separation and the usage separated at the domain level or at the user level, and finally, set you up for the future, in the sense that each domain may need some specific applications or specific tools, and Dremio will offer you the openness, allow you to connect whatever applications and tools that you may have while making sure also that the data is yours forever. 

This approach has been implemented by some of our customers, as you can see here. So some of them have presented this year, others have shared their user experience during the last conference, so I would encourage you to check how they did it, and all of them did it in a bottom-up manner. 

And finally, and one more time, data mesh is a journey, right? So the benefit is to speed up the time to market when it comes to analytics or analytics applications or analytics usage, and basically to increase your analytics and your data IQ across the board. It is to allow you to expose your data to users, basically in a scalable manner, maybe start with 10, 20, up to a couple of hundred, and do that in a seamless manner, and make sure that data is rationalized, right? So that you have a singular view on all of your data. If a data product is created and is a gold standard, then anyone who needs that data or the information that is in this data product would go to that specific data product and get it, and not go to other places where it might be. The key to be successful is to have a data platform with a powerful semantic layer in order to provide you with the guardrails needed to experiment and roll out your use cases little by little. And finally, also to allow you to establish and foster the appropriate data culture and pave the way towards computational governance or enforcement.
