March 1, 2023

9:35 am - 10:05 am PST

Putting Data Engineering & Data Science into the Next Gear Through a Modern Data Lakehouse at Shell

In this talk we will present how Dremio Data Lakehouse came to the rescue when a use case from the electricity retail domain with a large scale machine learning problem around power consumption forecasting imposed some significant challenges on the data engineering and data science teams in Shell.

We will describe how we 1.) addressed the large variety of data sources that needed to be ingested, processed and served to the Data Science team 2.) streamlined some of the Data Science workflows around data exploration, feature engineering and model testing 3.) operationalized and scaled ML training and inferencing

Session Id: BO110

Topics Covered

Lakehouse Architecture

Real-world implementation

Sign up to watch all Subsurface 2023 sessions

Speakers

Kalidoss (Nata) Natarajan

Data Science and ML Engineer for Power Digital Core

Raja Perumalsamy

Architect and Technical Manager

Transcript

Note: This transcript was created using speech recognition software. It may contain errors.

Kalidoss (Nata) Natarajan:

Hi, my name is Naja. I am a leading data data science and animal engineer team for . My colleague Raja, he is leading the data engineer team for power. Today, we will present modern data lakehouse at Shell, and we will walk through one of our project with the data Lakehouse.

Yeah, so Raja and I going to talk about one of the project which we are working for Shell Energy. It’s a called load forecast. So shell energy offer all the electricity need for the customer home, and one of the requirement is to predict long-term and short-term electricity demand. And but the customer electricity usage will vary based on the weather conditions or the social events or holidays, weekdays, weekend, and sometimes weather. It’s unpredictable and the weather condition is not same for all the customers. So it’s very difficult to predict the feature load for cost with high accuracy. In this project, we are trying to predict individual customer demand with high accuracy. We will walk through the project details.

In this project, we are using classical mission learning algorithm with clustering approach. So initially we tried to create model based on the load profile, but load profile is static over the time, and actually it’s not given the expected result. So we have created our own cluster using our clustering algorithm, and we have created separate model for each cluster and applied the cluster model for every customers. And we try to monitor and we actually improve the performance of this H cluster. And we deploy all this model in the aws and we are running daily forecast with hundred plus model.

So this is a very high level architecture. So we hosted everything in the AWS cloud, and we are getting data from multiple data sources. We are getting the usage data from Shell Energy, and we are also getting data from other data sources. And we are getting customer and contract data from Shell Energy and whether data we are getting from external service provider. And we are getting a, whether actual and whether forecast data. All this data we are storing in the S3 bucket and we are accessing through the U. So we are using vds reflection and multilayer lineage to access the data from U. So our, we are actually using Dream U to our train, our model or run the inference, and we are writing the result back to S3 bucket. And our customer is reading the result through that review.

In this project, we have faced a few challenge. So when we try to collect the data, historical data from external source to our cloud, it was very slow because of the large volume of data. And the other challenge is when we tried to read the large volume of data from our Jupyter Lab. So it was very slow. We tried with multiple options. We tried to connect through odbc with ao, we tried directly connect through ES three, so we tried multiple option, but it was not enough for us. And we need to train hundred plus model and our expectation is trying in the short time period. And we also running daily inference with a hundred plus model and it’ll fire hundreds of query simultaneously. And our system should cap bluff, handle all those queries with minimum infrastructure, and we are running with complex processing and we also need multi-layer lineage to handle the complexity. So I will hand over my hand over the space to my colleague Raja. He will walk through how we address all these challenges.

Raja Perumalsamy:

Okay, thank you nata. Thank you for giving a overview of the project and moving to the next one. So basically NATA is is heading our data science team. And when they are working on this project, they run into these challenges and we have to make sure it is efficient, cost effective, and we are able to address it so that like we are able to deliver this inference and model trainings on time. And this is our data platform architecture with data Lakehouse. you may have Thomas just presented it in the keynote. but to start with, we have, we collect data from various data sources on the left side. And we have, there is this data orchestration layer. We use Spark, Python, and I have put in review also, maybe I’ll explain that part later when I go through fu future slides and for the data storage, our data storage is in Yes, three.

So we are mainly aws shop. So all of our pro products are deployed in aws. So we use yes, three with Delta Lake format too. And uhu is the access layer and the lakehouse engine and even as a metadata serving layer for us. And all the users, like on the right side, if you look at it like any users can go to ramo. It, it’s like one, one interface where people can go and look at all, all this data and then do ADHA queries. Unless there are things like, so product analyst or product managers or business analyst that an new data science team, they don’t have to keep hunting on ment data, which data source has it. and it, they don’t have to go and keep looking for it. Rather it’s all available in one place.

So is the one interface through which all of these uses come and interact. And for the data science use case, most of our data science team uses Jupyter Lab and they connect through remu and also for our model training and inference. they also go through right now for all this daily daily inference jobs and like monthly model training on the right. We also have this power bi power bi have, we, we have daily reports that run on Power BI that also goes through odbc odbc through, through the, to get, get this data.

And this is again another view of the platform architecture from Theo perspective. So where, how we have set up in and how it is being used by different teams. So here, there, there are these two teams. one is Shell Energy team, who is actually an energy provider. And we are Power Digital Core team actually doing the forecast. We are helping the team to, by providing this forecast resource to them. And when this project started, like, so we are totally in a separate domain, like, so aws they have their separate AWS account. We have our separate AWS account. And they have this data coming from different sources on the left bottom of Shell Energy stuff. If you see, they have this data stored in s3. They have multiple SQL Server, R Bs service and also MongoDB and all this data they have like our put it in remu.

And then they use all this data here and they can share this data also with us. So in some cases, like we directly get this data through them. if you have used remu, you know, like Damo provides data sets, virtual data sets, so they create virtual data sets and share it with us, and we use that and from our team, like, so let’s take a look at the usage data. cell Energy team, they have some internal usage data, but we also go to external vendors to get usage data. We get this data and then we use it. And similarly, like after we do the forecast results we share this results back to cell energy team. We store the forecast results in our s3, and then it is shared back to Cell Energy so that there is a, a clean isolation between the environment.

we, we store our data in our s3, they, they have it in this, and then we can share across using. Okay. And moving to the next part. Okay. our is again, deployed on awc. We are not it in cloud. Maybe we’ll move to cloud in future, but right now it’s all deployed in aws. So it’s, they’re using the, we both teams are using the same environment, but still, like the computations are totally different. So they have their separate renew elastic engines. We have our own elastic engines, so that jobs run by them and by us. They, they don’t interfere with each other or they, they don’t impact each other’s. So it destroy it’s, it’s a clean isolation and we don’t have challenges with that. And on the top, if you look at it, like, so it’s be being used by various users and we have like really fine grain access control. So we have access control set up at space level, folder level. And sometimes we we don’t do it at data set, data set level frequently, but some cases we do that too. And the user, the access control is set up so that like cell energy team users will have only access to their data and they cannot see common see P D C data unless it is shared with them. So similarly, same for the PDC team too. Our users can only see the data that is accessible by us.

And one more thing I wanted to point out here. Like, so we get weather data. So we, we have like a lot a large historical weather data that we store. And tomorrow, in future, if there is a new energy portfolio company that let’s say, wants to come and use the weather data, we can easily share it. We don’t have to get the data again or copy it or move it to them, like we can simply share it through review for them. And this is a typical process flow which is very common. Like, so we have the data in this layer data transformation, and we do data processing for the machine learning use cases, and we do model training and inference. I’ll talk about it in more detail in the next slides.

Yeah. So I want to expand more on the data ingestion layer that I was talking data ingestion that I was talking about. So to start with as not pointed out, like we one of the challenges like getting the data from like various weather, so different sources, data sources, right? So initially when data science team was doing this job, like they were trying to put all this data in s3, like in CSV format, this format, different different format, tried to load it in Jupyter Lab notebook and use it, which is like cumbersome, painful time taking, it slows down everything. So now we made sure, like we have all these things retrieved and stored properly in S3 as Spark A And for this, like we have different type of interest and jobs. we use Spark and Python, they’re all orchestrated in airflow.

our airflow is deployed on AWS Kubernetes cluster AKs cluster. And our Spark ran on Kubernetes as well. Like we use Spark on Kubernetes. So all of these jobs are containerized done. they’re, they’re run daily as of now. We have all these use cases running daily. And but we use par for like different re couple of use cases here. One historical data. So our historical data is large, and in some cases we go back up to five. We, we go for five years of historical data. And in some other use cases, we go up to 30 years of historical data to get this large amount of historical data and process it. spark was the app, the candidate. So whereas we, we were able to do a distributor processing of this and then store it and also to parti them effectively and and store it in year three.

So after doing this, like this really helped the data science team earlier they had challenges loading this historical DA data. So that was one of the things that we help them with. And it’s not only about loading, after loading even to retrieve storing it effectively helps them to retrieve this data faster. Okay. And other thing to mention here is like even when we store, we do different types of partitioning. I, I’ll get back into the partitioning little later, but some partitioning we do when we load the data itself, like straight, simple, straightforward partitioning, like doing it on the load date, things like that, but complicated partitioning, we leverage renew also to take care of that. And and the bottom left, if you take a look at it we use the cell energy data.

So cell energy team they have their own retail jobs, and also they have this RD B M service that are put in review. And then they do they do provide their data. So, and for us, we don’t really go and hit their data source or copy the data, like, which is really replication of data. We don’t want to do it. We directly observe the data that they provide through review. And that’s why I put in the in layer though it’s not really in this, but it, it at least helps in avoiding this additional et l jobs that we were that that has to be done. Okay.

So the next part of the thing is data transmission. Again, I’m not going to talk about the typical et l transmissions here. what I’m focusing on is mainly the transmissions that we do in Dr. U. again as I told we use like this transmissions like so when, when there are different RDBMS tables cell energy team, they join this they cleans do some like basic Lansing of the data. they change the format and in some cases they do some joints. And all this data is available for us. and in some cases it’s, there are really complex joints like we do, like customer for, for example, like, so even to get a customer data and contract data that cell energy team has to go to three or four data sources, like get, get this data, join them, and then make sure which is the proper data.

And then they provide us the end data to us. So we really don’t need to worry about that stuff because they are the domain experts. They know, which is the right contract data. Suppose like contract data may be coming in from different place, but we need to make sure we are using the exact right contract data that we need to use. And even after we get that data from Cell Energy, we have our own usage data, weather data, and we do all these complex joints, right? So by it, it goes like multiple levels, and it’s really helpful for us to see it in review and do it in do this joints, because most of our data set is sql, and it’s sometimes it’s very easy for even people like product analyst or product manager who knows this think to, to go directly to Dron, write this joints and give it to us.

Okay? And other thing that we do is data standardization. Like as I told, like, we get a vendor data usage data from internal, from Shell Energy and from external vendors. So when we join this data, we have to make sure this data is same across, like, in terms like column names could be different from when we get from different vendors. So it simple things like changing the column names, making sure the metrics that we get is in the same standard, right? So we don’t want kilowatts in one place and what’s in other place. So just standardizing you those kind of simple, simple standardization. We, we sometimes we do in two, okay? And all this logic and then the changes that needs to be done, lights with the respective team. If cell energy team does it, like data ownership of the data. Similarly, for our, and also because of all this, too many data sources and too much layering, we, we have easily lineage, typical lineage goes up to easily more than five levels.

And this is one sample lineage screenshot, sorry, we had to blur, blur the table names and stuff, because I want to keep it real. So it’s from our existing environment production in ment. So the middle is the actual one of the virtual data set. And if you take a look at it there, it has like five five parents. And similarly on the right, if you take a look at it, there are four childrens, and on the top children, if you look at it, it has 20 further children. And maybe if you drill down, it’ll go down another two levels or three levels. So the easily it goes to multiple levels. And where it really helps us is to really this data is there what happened, right? So we have this data now says something doesn’t look right with the data, so who, who are the parents, right? Or where, who is the actual source of this data, right? So we can drill down and keep going to track, track it to whatever level we want. So it really helps us to track this lineage and make sure we can have the right data to further forecasting.

So the again, the next part, like, so tell now what I have spoken is the regular data engineering stuff where we load data from different sources transform it and save, save the curator data. And then this particular data pre processing step, it is for the data science use case. So again, this is another thing where NAS team, like, so they were writing some of this pre-processing as part of the data science jobs for the training and inference, and we thought, Hey this is taking more time in doing this, so how can we reduce the time? So then what we did, like, we moved all this pre-processing logic also to as airflow pipeline. So we, we have a Python jobs and against spar jobs, depending on the use case. So we do all this pre pressing and standard pre pressing, like removing outlays like if there is no setting it up, data cleansing kind of typical data science cleansing, whatever has to be done, we do it here. And we store this data again in s3, and they can queried as feature set tables. And the model training and inference jobs mainly hit the data pre pressing layer. So right now we don’t have a separate feature store assets, but we are using the itself like a feature store here.

So mo moving to the next step we have this data all done, like curator now pre-processed ready for data science, then how are we going to enable it for model training and inference? So for model training we, we run this model training like once a month. again, like during development times like maybe far two frequently, maybe once a week. Or if performance is not good, we want to create a new model, again, retrain it. So we do it depending on the need. And when we do this training, because training typically queries like lots set of data, then we wanna make sure it doesn’t impact other uses. So one thing, we went with this, like we had separate renew Elastic engine to cater to this need. So we wanted to make sure it does not impact other renew user or other users who are using at that time.

And data partitioning. So this, I referenced earlier the, this was one of the challenges that data science team data scientists were facing. So when there is like huge set of data, when they try to load the center dataset into Python notebook or anything, it can be challenging. And it’s, it’s not efficient. So what we did is we partitioned the data. So there are two types of partition. We did one, when we loaded, we did time-based partitioning. And for training use cases, we had to sit down with data science team, understand what type of queries they do, right? So we apply additional ing using reflections too. So RAMO has this reflections, which is similar to a materialized view in if you, if you talk in database terminology, but we use reflections to, so that this can be faster.

for example, like, so they query this data as a if they use cluster always to query the data, then we make sure it is partitioned on a cluster so that it only queries that set of data. And we also use the sorting feature of Dr. U, where if there are, where high cardinality columns we partition them, like account numbers, it, it’s really high cardinality, we cannot partition them, but we still sort them. That also adds a lot to the performance. So when this training happens, so they are able to retrieve this subset of data for that is needed for the training and much much faster and efficiently. And typical query can read read this data. Like right now it takes couple of minutes to read the data that is needed. But when we initially started we used ODBC to read this data.

And when we used to odbc, like it used to timeout after like 10 minutes or something, like even to read few million records. But now, like we used arrow flight and we are able to read this data like billions of records in like in around five minutes or so. So around, like, so again, it depends on the data set size, but I’m just giving a general idea here. And our flight, as you know, like it’s used for bulk data transmission because it doesn’t need serialization, ization uses network properly, I mean network efficiency with G P C and dp. So it’s, it really had helped desk like, so. and right now almost all of our data science use cases, we use AFL for this.

And this comes to the last part here we run like this inference jobs every day. It’s a daily inference, jobs. Maybe in future we’ll be moving up to hourly inference jobs and for, even for inference also we set up a separate on demand inference engine. So early morning when it transit, it’ll kick off this engine and it’ll run. and for this, we, we had to make sure that it is tuned efficiently. So that is separate queue, separate engine, separate user for this. And it is tuned properly because when the inference runs, there will be hundred inference jobs running parallelly, and hundreds of queries or thousands of queries can be triggered can run. So we had to make sure the queuing is proper and also to make sure it not cause memory issues and other things.

but we were able, so it, it was a little bit of an odd to ma to fine tune this engine, but we were able to do that. And right now our inference jobs run without any failure, and typically, most of these queries run in a, in few seconds. Actually the next screen, I have a sample screenshot of our inference jobs. This is, this is from our actual inference job daily. And if you look at it like most of the queries on the the duration column, if you look at it, they, they’re all running less than a second. And maybe if it goes sub, it may, maybe it takes like five to 10 seconds max. And which is really efficient for inference, like, so we, we, we have tuned it for con concurrency here. And this how we were able to help our data science team to, for this load forecasting project. thank you everyone. And we, we will open the stage for custom now. Yeah, sure.

Audience Member:

Sorry. Hi. so can you move to the previous slides a little bit earlier where you separate the different team to control exercise?

Raja Perumalsamy:

Yes. This one. Yeah. So

Audience Member:

First question is that are you using dreamer Cloud or your host?

Raja Perumalsamy:

yeah. So actually we are not using drer Cloud. We are using uhu AWS engine. So we, we have it deployed on our AWS account, but our plan is to moveo cloud in future. And and right now we want, right now, one of the thing is with with the existing , like we don’t have like dynamic scaling of the engine. So we, we have on demand engines, which will come up. So let, let’s say I have four nodes or eight nodes, it’ll come up. All the eight nodes have to be up always. So it, it cannot scale dynamically. we don’t have that in future, if we go to cloud, then we can easily scale up and down as needed. Okay.

Audience Member:

Yeah. So when I see the three verticals, so those space, are we saying the different dream meal cluster, or is virtual space?

Raja Perumalsamy:

No, it, it, it is one we are using only one DR environment. Okay. but spaces can be different. So here for, just for simplicity, I put it as one space. Say for cell energy, they can have multiple spaces, right? So in some case they have, let’s say their, their sales team could be one, and their risk management team can be a different space, things like that. And kind of like in the same ment, the, these spaces are separate and different set of users can access that depending on the need. And and we have, in some case, we have made sure the engines for the k kind of retire, like, so a simple use case for a shell cell energy, like any querie is run by cell energy users, right? So it, it’ll always go to the Shell Energy engine. So I just try to put everything in one place. Like, so separate Ramo engine, the coordinator will be the same, but Ramo engines will be separate for each of them.

Audience Member:

I see. Yeah. So when you define the data sources we are also separated by different concern,

Raja Perumalsamy:

Correct? Correct. So, so so let’s say cell energy team has access to three or four SQL service that they use. And among odb, we don’t have any of those credentials. We, we don’t even know what, what it’s needed. What we need is what is the data we need to use? So they make sure they add those data sources, they create the videos or whatever changes they have to do, right? Cleansing, they have to do the joints, they have to do and fi they give the final vi virtual data set to us. So for our team, we just make sure, okay, we use this videos.

Audience Member:

So given you have one controller for entire class for solving for three potential different teams is that all the interactively query goes through the same controller? Do you experience any you know, the bottleneck?

Raja Perumalsamy:

Yeah. that, that can be a bottleneck. So where the only option we have is vertically scaling now, so maybe, yeah, we did scale vertically once, right? So, but hopefully we may be moving to cloud soon, you know, so,

Audience Member:

So I was saying that moving to cloud, so that won’t be an issue.

Raja Perumalsamy:

Yeah, so when we move to cloud, like REO provides a control plane, so we don’t have to worry about the coordinator. And other benefit we’ll have is like, we’ll have different projects. So SEN can be a separate project and they can run that completely, and PDC can be a separate project in we’ll won separately. The environment itself will be separate. when we, when we move again, it’s too early for us for us to talk. We are able to do POC on cloud to move that. So

Audience Member:

Yeah. The other question I have is that how do you manage users?

Raja Perumalsamy:

Users? we have we have it integrated with active directory. So we have active directory groups tied up to the spaces. so, and let’s say, let’s say for sale energy, right? So I don’t have to manage the users for that. So we we, what we do is like the admin work here, what they do, like go associate an 80, 80 group with a, let’s say, for a particular space, right? Then whoever is the owner of the ad group, they can add users who can use that. And for fine grain, like we try to set different type of uses, right? So we have like only uses, which means they’re just only going to query you. And then like rewrite uses, they’re all set up as separate groups. Rewrite means they can create VDS and stuff. And there is another set of users we have like admin users who can actually manage the reflections and other things too. So we wanna be very careful about who enables the reflection and if they understand how this reflections are working in Romeo. So only power uses, like, kind of like I see.

Audience Member:

Thank you.

Raja Perumalsamy:

Yeah, thank you.