March 1, 2023
8:00 am - 9:30 am PST
Shell’s Journey to Data Mesh
Data and Analytics organizations have worked to balance improving access and self-service for the business with achieving security and governance at scale. Data lakehouses are ideally suited to help organizations provide both agility and governance. Deepika Duggirala, SVP Global Technology Platforms, at TransUnion and Tamas Kerekjarto, Head of Engineering, Renewables and Energy Solutions at Shell, will share their journeys to deliver governed self-service.
Apache Iceberg development and adoption accelerated significantly this year, enabling modern data lakes to deliver data warehouse functionality and become data lakehouses. With Apache Iceberg, the industry has consolidated around a vendor-agnostic table format, and innovation from tech companies (Apple, Netflix, etc.) and service providers (AWS, Dremio, GCP, Snowflake, etc.) is creating a world in which data and compute are independent. In this new world, companies can enjoy advancements in data processing, thanks to engine freedom, and data management, thanks to new paradigms such as Data-as-Code. Tomer Shiran, CPO and Founder of Dremio, will deliver Subsurface’s keynote address.
Note: This transcript was created using speech recognition software. It may contain errors.
Like I said, we've now worked with hundreds of companies that have implemented data mesh, and I'm super excited today to have Tamas Kerekjarto with us. He's the engineering lead and senior architect at Shell, and he's going to talk about Shell's journey to data mesh and also tell us how Shell is really bringing in a new world of clean energy. Thank you, Tamas.
Thank you very much. Thank you, folks. Well, fantastic. It's so great to be here, and I'm really excited to share our story today. Before we dive into the nitty-gritty details of our data journey, I need to cover a few things to give you a little bit of context. There are three of them: the energy transition; our company, how it relates to that transition, and how it's trying to take a leadership role in it; and a little bit about digitalization, and how machine learning, AI, data, and software are really tying this all together.
So the energy transition is basically the world trying to move towards clean energy while decarbonizing and driving towards net-zero emissions. And this is quite a challenge, because the world still needs more and more energy, right?
The demand is growing, and this is really where our company, Shell, comes into the picture. We like to take a leadership role in this transition and help with that challenge. But to do that, our company realized it needed to transform its business, right? And this transformation is what we refer to as Powering Progress. What you see on the screen are the four pillars of Powering Progress. First and foremost, we want to generate value for our shareholders, and do so in a profitable manner. We want to partner with our customers, businesses, and governments across various sectors to help them drive towards net-zero emissions, while still respecting nature by reducing waste and contributing to biodiversity. And we are powering lives and livelihoods, really trying to ensure that this transition happens successfully and profitably.
So, the energy transition. There are two major megatrends that we see really shaping our lives over the next 10 years. Besides energy systems becoming decentralized, digitalization is a key factor: machine learning, AI, and software are now not just having a large impact on all the business models being created, but at the same time defining new business models, which is quite an interesting phenomenon.
Power Value Chain and Load Forecasting
But let's go back to the power value chain and this decentralization concept a little bit. A couple of years ago, the power value chain was quite simple, in the sense that we had a few large power plants that generated electricity. That electricity got transmitted and distributed across the wires. And of course, because electricity wasn't something you could store indefinitely, it needed to be balanced in real time, right?
Supply and demand always needed to be in balance. So that was one of the difficulties. But in a sense, it was fairly simple, because these large power plants were just generating the electricity, and consumers at the end were just tapping into the grid and consuming it. So what changed? What really changed was the emergence of these different renewable energy sources, right? Solar, wind, batteries, electric vehicles. So now generation is being completely distributed across the overall grid, and millions and millions of smaller generation sources are starting to contribute to the mix. And this really makes things complex. If you think about it, people can put solar panels on their rooftops, generate electricity, and contribute back to the grid. Then batteries come into the picture.
People can actually do this thing called load shedding, or load shaping, which basically means you use your battery when electricity prices are really high and charge it when they're really low. So this introduced a lot of new behaviors, a lot of new challenges, all kinds of facets. And that's what really brings us to today's story. Because one of the things that, amid all this complexity, really makes a company stand out is the ability to forecast the consumption of electricity, right? If we know what the demand and the consumption are going to be, we can prepare for them much better. So this is where one of our internal organizations, a power retail organization, decided to expand on their existing forecasting capability and build their own.
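The load-shaping behavior described here (charge the battery when prices are low, discharge when they are high) can be sketched in a few lines. This is a minimal illustration with made-up thresholds and capacity values, not anything from Shell's actual system:

```python
# Minimal sketch of load shaping with a home battery: charge when the
# hourly electricity price is low, discharge when it is high.
# Capacity, rate, and price thresholds are illustrative assumptions.

def shape_load(prices, capacity_kwh=10.0, rate_kw=2.0, low=0.10, high=0.30):
    """Return the battery charge level (kWh) after each hour."""
    charge = 0.0
    levels = []
    for price in prices:
        if price <= low:
            charge = min(capacity_kwh, charge + rate_kw)   # cheap hour: charge
        elif price >= high:
            charge = max(0.0, charge - rate_kw)            # expensive hour: discharge
        levels.append(charge)
    return levels

hourly_prices = [0.08, 0.09, 0.12, 0.35, 0.40, 0.09]
print(shape_load(hourly_prices))  # [2.0, 4.0, 4.0, 2.0, 0.0, 2.0]
```

Even this toy rule hints at why forecasting matters: millions of batteries reacting to prices make the grid's net demand much harder to predict.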
So hopefully now you can see that, with all these transformations, what has always been a compute problem (time series forecasting has been around for a long time) is now transforming into a data problem. Okay? And this is the data problem I wanted to talk to you about today. Not the forecasting itself, which has its own intricacies in terms of the techniques and approaches being used, but really the data challenges it brought and that we needed to solve.
So it all really came down to speed, right? And as Tomer was saying about self-service and performance, what we were dealing with is the classic situation of having large volumes of data residing in different data sources.
And we really needed to be able to tap into these data sources without building complex ETLs, which would take a lot of time, allowing our data analysts, data engineers, and product people to contribute and get the data to the data scientists so they can start doing their magic, build all those magical algorithms, and eventually put them into production and operationalize them. So the two key challenges I'd really like to hone in on were, first, the data volume, and second, enabling these people, without writing a whole lot of code, to serve the data and move these data products along the chain up to the data scientists. And then just one interesting example: we were supplying data to the data scientists, and one day they came back and said, "Hey, the Jupyter Notebook is crashing, what's happening?"
So, well, let's take a look. No wonder it's crashing: you're trying to load 80 gigs' worth of data in memory, and that's not going to work. And by the way, you have about a 15-step join in here, and oh, one of the "unique" IDs might not be so unique, right? So what do you do? You basically refactor, push it down the stack, and try to enable them to move forward as quickly as possible. The other challenge was around putting these inference models into production: it turned out we needed to run about a hundred of them concurrently, and about 6 to 8 billion records needed to be retrieved within a couple of minutes, right? So that whole timeline really needed to be condensed to allow people to go through this and achieve it.
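The "not so unique" ID problem is worth seeing concretely: when a join key is duplicated on one side, each duplicate multiplies the matching rows on the other side, which is one way a notebook ends up holding far more data than expected. A tiny hypothetical pandas example (the table and column names are invented for illustration; the actual fix described in the talk was to refactor and push the work down the stack):

```python
# Duplicate join keys multiply rows: a hypothetical meters/reads example.
import pandas as pd

# meter_id 2 appears three times on the dimension side by mistake.
meters = pd.DataFrame({"meter_id": [1, 2, 2, 2], "region": ["N", "S", "S", "S"]})
reads = pd.DataFrame({"meter_id": [1, 2, 2], "kwh": [5.0, 3.0, 4.0]})

joined = meters.merge(reads, on="meter_id")
print(len(joined))  # 7 rows: 1*1 + 3*2, not the 3 reads you might expect

# Fix: deduplicate the key side, and let pandas verify the assumption.
meters_unique = meters.drop_duplicates("meter_id")
fixed = meters_unique.merge(reads, on="meter_id", validate="one_to_many")
print(len(fixed))  # 3 rows, one per read
```

With a 15-step join, a single duplicated key like this compounds at every step, which is how 80 GB of data ends up in memory.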
Data Platform Architecture with Data Lakehouse
So here's the architecture. Dremio was a natural choice and really gave us a huge uplift as a compute engine to address this, because we were able to tap into these data sources and have really quick iterations before we unleashed the data engineers to write ETLs and so forth. So VDSs were flying, and then PDSs were flying, and then we all got to the reflections, right? And we were joking, like, "How is your reflection doing?" And when they started getting some love and some hugs, they got hugged a little too hard and got a bit choked. So we needed to allocate a little more juice behind them and isolate them. But overall, I can say it's been a pretty positive journey, though I can't say we're completely over it, right?
Not that we have those high-accuracy forecasts and everything is hunky-dory yet, but where we've really gotten to is a stage where we can now push these high volumes of data through with relative ease. We have the process and all the collaboration; all those people can safely publish their data sets, and it's all working.
Distributed Data Mesh
So here's another view of what we've created. And what we really realized was that this actually became like a mini data mesh, right? The data resides in different sources, mostly distributed, and this unified access layer provided a great abstraction from all that complexity, allowing the data engineers and data analysts to contribute, provision different spaces, and then, in a very visible and dynamic fashion, allow the data products to evolve and really reach the various customer levels.
In this case, our customers were the data scientists, and then we also had the end customers who were consuming these forecasts. And along the way, we realized that we had actually generated a lot of very valuable data sets that they really like consuming. So all these learnings gave us the impression that we now have most, if not all, of the characteristics of a data mesh, and we'd really like to bank on these learnings in the future. Some additional things just to mention: the iterative data model was actually a pain in the butt at first. But we realized that we really needed to refactor it constantly, going back to that scenario with the 15-plus joins and the not-so-unique ID, right?
We were able to really jump on it because of the visibility that the lineage provided us. But it was something we had to deal with. On the other hand, the Dremio compute engine is a sophisticated kind of beast, right? So when you have these reflections, you really want to be careful about how you treat them, allocate enough memory, and isolate them, because once they become successful and famous, they really get famous, right? And take off. And finally, the fine-grained access control was absolutely key, because the way we can provision these spaces and safely, securely assign them to Azure Active Directory groups really kept our IRM comrades and friends at bay, and they remained our friends. So that's a really good thing. So overall, a really great experience. If you are interested in more details, please go to our breakout session. Thank you very much. I really appreciate being here and you listening to our story. Thank you, guys, and enjoy the conference.