May 2, 2024

Michelin’s Journey to Self-Service Analytics & Streamlined Data Access with Dremio

In this session, we’ll highlight the challenges faced in data management and outline the deployment of Dremio, focusing on the different steps of implementation and the benefits realized. Through improved accessibility and faster insights, Dremio empowers users to navigate data independently, enhancing productivity and decision-making. We’ll also share the lessons learned and best practices we’ve implemented for organizations considering similar solutions.

Topics Covered

Dremio Use Cases
Lakehouse Analytics



Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Karim Hsini:

Hello, everyone. So, let me introduce myself first of all. As Mark said, I’m an enterprise architect at Michelin for data and analytics. Within that role, we coordinate and support the business strategy around data, and we define our strategy and technical sourcing with regard to all the different trends the data market is full of. If you want to connect with me, just get in touch on LinkedIn. I’ll be happy to give you more information after this talk if you need it. 

The Michelin Group

So, just before going into our journey implementing Dremio, let me introduce the Michelin Group a little for those who don’t know Michelin. It’s a big enterprise: more than 100,000 people work at Michelin, with over 25 billion euros of sales, and we are present in almost every country. Another big number we can share: we sell more or less 200 million tires per year. So, maybe some of you have Michelin tires on your cars. But more and more, the Michelin Group is not only tires; it’s tire-related services and solutions, and we use data more and more to provide services and new features, new products that Michelin sells as a service. It’s also the mobility experience: you may know the Red Guide, but also Tablet Hotels and some other experiences we provide. And more and more we are developing the group’s high-tech materials expertise, where we design and produce new high-tech materials. So, it’s not only tires. 

Implementing Dremio

This being said, let’s now switch to the topic of our journey implementing Dremio. And I’ll start with the why: why did we choose Dremio? Why use Dremio? At Michelin, we started our data transformation four to five years ago, and this data transformation is organized around three pillars: platforms, governance, and products. Platforms are all the technology and ecosystems made available to our users, who can be business users or IT teams, for them to be able to build data products. And the governance pillar helps organize the standards, the right access, and all the roadmaps about what data is available at Michelin and what will be available in the future in terms of data products. 

So, this data transformation started with platforms. First of all, we put in place a Data Lake, because four to five years ago, building your platform around a Data Lake was, let’s say, the key trend. And very soon, we faced difficulties with this Data Lake. Of course, we were able to ingest and transform data, to store it, and then to consume the data stored in our Data Lake. But the only way to access this Data Lake was to connect directly to the files. There was no SQL engine available, and we were not able to manage access to the data at a fine-grained level: we were granting access to directories directly in the Data Lake, while our governance increasingly wanted more fine-grained access. And on the other side, we were storing more and more data in various formats: CSV, Delta, Iceberg, Parquet, whatever. There was no standard on the formats we were using in our Data Lake. 

So, when it came to consuming this kind of data, not all the tools were able to do so. So we chose to enhance our corporate Data Lake, our platform, into a lakehouse. And when we decided to provide this lakehouse capability, providing a SQL engine and the capacity to federate data across multiple storages, we looked at what could be the best tool to provide access to the data on top of our Data Lake. And we chose Dremio because it was not only a SQL engine. In fact, on top of the SQL engine and SQL capabilities Dremio provides, at the time we chose it, it was also one of the only tools that provided federation of different storages, but also a UX and a portal where we can publish information about the data sets we expose in Dremio, better manage the security and access rights of our data through row-level security, and many other things. 

So, that’s why we chose Dremio at the time. We wanted to provide our users a simpler user experience than what we had before in the Data Lake, where they were accessing files whose naming was anything but obvious. We wanted to organize the data sets we provide to our users, to better manage access, and to provide, with one portal, one way to consume and manipulate data: to do the last mile and consume data to build reports or analytics capabilities. So how did we start this implementation journey, and what challenges did we face once we had chosen the technology and said, okay, let’s start using Dremio? 

Journey of Implementation

How did we start our implementation journey? Our first question was: who do we want to address? Many people will use this kind of technology, and we separated our users into two categories. First of all, the producers, the data producers: the teams, mainly IT teams, that expose data into the Data Lake and want to provide access to their data, and other data teams that consolidate data from many source systems to produce data sets with business meaning. To give you an example, we have multiple ERPs in the group, so we have multiple teams handling those ERPs, and each team is accountable for exposing its data in the Data Lake. 

We want each of these teams to be able to expose their data into Dremio, and then we have another team that takes all the data coming from all the ERPs to provide a consolidated view of sales, of stock, or of much other data you can find in an ERP. For that kind of user, we wanted them to use Dremio in an industrialized way, and by industrialized we mean with software engineering best practices. We don’t want these people to use Dremio through the user interface and create their physical or virtual data sets directly through it. We wanted this population to provide an industrialized business semantic layer inside Dremio, and we will see in the next topics how we used the Dremio API and how we implemented that. 

Data Consumers Application

On the other side, we have the data consumers, and we wanted them to be able to use the Dremio interface to consume data, but also to do the last mile for their applications: to merge data coming from different data sets, to join, filter, and aggregate them, and then to produce reporting based on the new data sets they have created. So we split these two categories of persona and worked with each of them differently. 

The first question to handle when implementing Dremio is how to organize your spaces repository, and we organized ours based on the personas we wanted to address. As I said, we have people handling IT applications, and we want them to expose the data of those applications. So we built a first repository inside the spaces area: a directory named source applications, with a sub-directory for each of them where they can expose the data of their IT application. Then we put in place 16 different directories that represent the different business data domains we have within Michelin; I will go into detail just after. Here we wanted the data to be organized by business logic and governed in the same way the governance was organized into domains, so that data governance could easily find the spaces they govern and really see the business meaning of what you will find in that kind of directory. 

Then we put in place a last folder, for applications, inside which you will find a sub-directory for each application that asks to be created on top of Dremio. We did it using the structure Dremio promotes as a best practice: a private area where each of these personas can prepare their data. It’s really not shared; it’s an area reserved for the owner of the data set to do the first step of data exposition: creating the physical data set, the mapping between the files and the data, renaming columns, and other small adjustments. All the data from this private area is then used and organized into a business semantic layer, and then an application reporting layer in the fit-for-purpose area. 

If we look at how it looks inside our Dremio, it may be a little small, but in the spaces area you will find everything organized as I said. The first part is really IT-oriented: which source system does the data come from? Then you have all the domains that can implement their data products: the first one is around products and services, then customers and contacts, then a domain around finance, another around purchasing, marketing and sales, supply chain, and so on. 
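To make the layout above concrete, here is a minimal sketch of how a spaces hierarchy like this could be created through Dremio’s REST catalog API. The space and folder names, the domain list, and the `DREMIO_URL`/`TOKEN` placeholders are all illustrative assumptions, not Michelin’s actual configuration.

```python
# Sketch: building the spaces layout via Dremio's REST catalog API.
# Space/folder names are illustrative; DREMIO_URL and TOKEN are placeholders.
import json
from urllib import request

DREMIO_URL = "http://localhost:9047"   # placeholder
TOKEN = "your-auth-token"              # placeholder

def space_payload(name):
    """Body for POST /api/v3/catalog creating a top-level space."""
    return {"entityType": "space", "name": name}

def folder_payload(space, *folders):
    """Body for POST /api/v3/catalog creating a folder inside a space."""
    return {"entityType": "folder", "path": [space, *folders]}

def post_catalog(payload):
    """Send one catalog-creation request (network call, not executed here)."""
    req = request.Request(
        f"{DREMIO_URL}/api/v3/catalog",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"_dremio{TOKEN}",
                 "Content-Type": "application/json"},
    )
    return request.urlopen(req)

# One area per audience: an IT-oriented source area, one folder per
# business data domain, and a fit-for-purpose area for applications.
DOMAINS = ["product_and_services", "customer_and_contact", "finance",
           "purchasing", "marketing_and_sales", "supply_chain"]
layout = [space_payload("source_applications"),
          space_payload("business_domains"),
          space_payload("applications")]
layout += [folder_payload("business_domains", d) for d in DOMAINS]
```

Scripting the layout this way keeps the structure reproducible, which matters given how much alignment effort the organization itself required.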

So this first step was not very difficult technically, but it was a huge journey to align everyone on how we wanted to organize the data, and that was our first challenge. Aligning everyone on how you want to see the data and how to make it available to everyone takes a couple of weeks: everyone has to reach a common understanding of how things will be organized and how we will provide the capacity to consume the data. 

As I said, for many of these teams we want them not to use the Dremio user interface, but to use software engineering best practices to provide these data products inside Dremio. What does that mean? It simply means we want to do it as code. So we used a tool called dbt, which stands for data build tool; we use the open source version. dbt is a Python library for which you can find a plugin to connect to Dremio, and with dbt we are able to simply put all the SQL you would find in virtual data sets into files. Then, in Git, we put in place a kind of CI/CD with which we deploy these dbt models inside Dremio as virtual data sets. We had to tweak dbt a little to handle the first step you have to do to make data available in Dremio: the physical data set. If you look at the top, where I have put a screenshot of our YAML file describing a physical data set, we just have to put the link to the file’s location in our data lake and the format of the underlying file, and a Python script then deploys the physical data set into Dremio. Once this is done, dbt takes over and deploys all the different models we developed as SQL code, and they become available inside Dremio. So we put this kind of framework in place around dbt to make sure every IT team that wants to expose data can do it as code. 
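The “physical data set first” step described above could look roughly like this: a small descriptor (standing in for the YAML file kept in Git) is turned into the payload Dremio’s catalog API expects when promoting a lake file or folder to a physical data set. The descriptor keys, paths, and dataset id are illustrative conventions, not the actual Michelin scripts.

```python
# Sketch: deploying a physical data set from a Git-managed descriptor.
# The descriptor would normally be loaded from YAML; paths are illustrative.
import json

descriptor = {
    "path": ["datalake", "erp", "sales", "orders"],  # location in the lake
    "format": "Parquet",                             # underlying file format
}

def promote_payload(dataset_id, descriptor):
    """Body for POST /api/v3/catalog/{id} promoting a file/folder to a PDS."""
    return {
        "entityType": "dataset",
        "id": dataset_id,
        "path": descriptor["path"],
        "type": "PHYSICAL_DATASET",
        "format": {"type": descriptor["format"]},
    }

payload = promote_payload("dremio:/datalake/erp/sales/orders", descriptor)
body = json.dumps(payload)  # what the deployment script would POST
```

Once the physical data sets exist, `dbt run` can deploy the SQL models on top of them as virtual data sets, which is where the standard dbt-Dremio plugin takes over.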

Row Level Security to Implement Data Governance

Looking a little deeper, we wanted to better manage the access rights of our data products, so inside a data model you can find clauses that implement row-level security. We were able to do it in our SQL statements just by using the IS_MEMBER function, which looks at the groups a user belongs to and filters data based on the group each user is linked to. That first achievement was also a big challenge, because you can end up with as many groups as users if your data governance is not well aligned and if you do not work well with them. So here again, even if the technical side is quite simple, you have to look at it more widely, more globally, to make sure you have a kind of strategy for how you put your different groups in place and how you structure the groups for your data products. 
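As a rough sketch of such a clause, here is how a dbt model’s SQL with an `IS_MEMBER`-based row filter might be generated. The table, column, and group-naming convention (a `sales_` prefix plus region code) are illustrative assumptions; only the use of Dremio’s `IS_MEMBER` function comes from the talk.

```python
# Sketch: rendering a row-level-security SELECT for a dbt model.
# IS_MEMBER tests the current user's group membership in Dremio;
# names below are illustrative.
def rls_model(base_table, region_column, group_prefix):
    """Return rows only where the caller belongs to the matching group,
    e.g. members of group 'sales_emea' see rows with region 'emea'."""
    return (
        f"SELECT *\n"
        f"FROM {base_table}\n"
        f"WHERE IS_MEMBER(CONCAT('{group_prefix}', {region_column}))"
    )

sql = rls_model("business_domains.sales.orders", "region", "sales_")
```

Deriving the group name from a data column like this is one way to keep the number of groups proportional to business dimensions rather than to users, which is the governance alignment problem the talk warns about.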

Link to Enterprise Data Catalog

One other thing we did with Dremio was to connect it with Collibra, and this was for the benefit of our consumers. We have seen how data producers were able to expose data inside Dremio; one thing our consumers were looking for was a simple interface, without going from one tool to another. So we linked Dremio with Collibra: we are able to scan the Dremio spaces repository in Collibra, so all the Dremio metadata is available in Collibra. The data owner organization, which is in charge of describing all the tables and attributes, can do it in its own tool; they are not consuming data, they are describing it, so they do their job in Collibra. But the consumer will almost never go to Collibra: because we extract the data from Collibra and inject it into Dremio through the API, they have all the descriptions of the tables and attributes directly available in one place, inside Dremio. And as you can see here, the wiki is automatically filled with the Collibra information. 
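The Collibra-to-Dremio sync could be sketched like this: descriptions fetched from the catalog are formatted as markdown and pushed to the dataset’s wiki through Dremio’s collaboration API. The `fetch_collibra_description` helper, its return shape, and the markdown layout are illustrative assumptions, not Collibra’s actual API.

```python
# Sketch: filling a Dremio dataset wiki from catalog metadata.
# fetch_collibra_description is a stand-in for a real Collibra REST call.
def fetch_collibra_description(asset_name):
    """Placeholder returning table and column documentation for an asset."""
    return {"table": "Consolidated sales orders across all ERPs",
            "columns": {"order_id": "Unique order identifier",
                        "region": "Sales region code"}}

def wiki_markdown(desc):
    """Format the metadata as markdown for the Dremio wiki page."""
    lines = [desc["table"], "", "| Column | Description |", "|---|---|"]
    lines += [f"| {c} | {d} |" for c, d in desc["columns"].items()]
    return "\n".join(lines)

def wiki_payload(desc, version=0):
    """Body for POST /api/v3/catalog/{id}/collaboration/wiki."""
    return {"text": wiki_markdown(desc), "version": version}

payload = wiki_payload(fetch_collibra_description("sales.orders"))
```

Run on a schedule, a job like this keeps the wiki in sync so consumers never need to leave Dremio to understand a data set.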

Using Dremio Reflection

The last thing we did to enhance performance was to use reflections, which are the caching capability of Dremio. We can do this at the business or semantic layer stage, as code, using the dbt Dremio plugin: in fact you just have to use the materialized view option in your model and it does it by default. Or, if you are an application team, a fit-for-purpose team, you can do it using the interface, and you are able to enhance your performance.
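As a sketch of the as-code variant, here is a small helper that drops a dbt model file using a materialization config. The exact option that triggers a reflection depends on the dbt-dremio plugin version; `materialized='table'` below is a stand-in, and the model name and SQL are illustrative.

```python
# Sketch: a dbt model whose materialization config stands in for the
# reflection-triggering option of the dbt-dremio plugin (version-dependent).
from pathlib import Path

MODEL_SQL = """\
{{ config(materialized='table') }}

SELECT region, SUM(amount) AS total_sales
FROM {{ ref('orders') }}
GROUP BY region
"""

def write_model(models_dir, name, sql=MODEL_SQL):
    """Drop the model file where `dbt run` will pick it up."""
    path = Path(models_dir) / f"{name}.sql"
    path.write_text(sql)
    return path
```

Keeping the cache configuration in the model file means performance tuning goes through the same Git/CI/CD review path as the rest of the semantic layer.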

Let me conclude. This was a real quick overview of how we implemented Dremio, but here are three takeaways to understand what our challenges really were in deploying and implementing Dremio at Michelin. First of all, really start by understanding who will use your Dremio implementation and how you want to serve them, because the organization and rights management in the spaces area is key. It’s very difficult to make changes once people have adopted the new tool, so you cannot change every day the way it is organized or the way rights management is done. Really spend time on organization and rights management, and make clear where you want to go and how it will look at the end.

The second takeaway I will push is: yes, the user interface and the portal of Dremio are great, and you can do many things in them, but you have to manage your staging and business layers as code, because if you don’t, you are not able to re-expose, redeploy, and manage your code properly. It was a big investment for us to make sure everything was done as code for all the data producers, but it’s an investment with great value. 

And the last takeaway: keep it simple. You should not put complex logic or data transformations into Dremio. You should prepare the data as much as possible before exposing it in Dremio, and let Dremio do only the last mile for the people who want to build their applications and reporting with some simple joins, simple filters, and so on. That’s how we think an implementation like ours provides its full value: by making it simple to use and simple to implement.