AWS Glue Amazon S3 Subsurface LIVE Sessions

Session Abstract

The JPMorgan Chase (JPMC) Corporate Technology (CT) Architecture and Engineering team’s mission is to provide technology consulting services to the CT organization to help influence, design and implement technology strategies and solutions that facilitate the modernization of CT’s application portfolio, solve key challenges, mitigate risks and add business value to the organization and the broader JPMC firm. During this session, James Reid, Managing Director and Head of the CT Engineering and Architecture team will discuss JPMC’s data lake via data mesh architecture and Nisha Yerrawar, Chief Architect for Wholesale Credit Risk Technology within CT, will discuss the wholesale credit risk use case for data lake via data mesh.

Webinar Transcript

Lucero: Before we begin, there are a couple of things that I want to run by the audience. At the end of the presentation, we are going to have a live Q and A. So if you want to participate and make it interactive, I will cue you in. For you to be able to ask a question, you will need your camera and your microphone on, so you can talk to our speakers today. Also, if you have any questions, [00:00:30] and you decide not to do it live, go ahead and put your questions in the chat window, which you will see at the right-hand side of the screen.

And also at the end, please go ahead and visit the Slido tab that you see on your screen to provide feedback on this session and also the rest of the Subsurface conference as well. This presentation will be available in the following days on the Subsurface website, so just stay tuned for [00:01:00] announcements there. And without further ado, I want to introduce you to James Reid, managing director and distinguished engineer at JPMC, and also Nisha Yerrawar, [00:01:14] executive director of wholesale credit risk technology at JPMC as well. Without further ado, the time is yours.

James Reid: Thanks, Lucero. [00:01:22] Appreciate that. Welcome, everyone, to Data Lake via Data Mesh at JPMC. Before we get started, I want to just set some context. [00:01:30] As we know, JPMC is a very large bank. You’re talking petabytes of data. As the global head of engineering and architecture for Corporate Technology, which spans the entire bank, what we’re about to show you is how we’re approaching really establishing a data lake, leveraging the data mesh architecture, within the public cloud. So it’s definitely a journey, but hopefully we can share some good tidbits with you and have a good Q and A discussion at the end. So, let’s get started. [00:02:00] So [inaudible 00:02:01] Data Mesh architecture.
So like with any other company, any other corporation, you’re really looking to unlock the opportunity and the business value that’s locked in your data, which includes democratizing that data, so all your stakeholders have the liberty to be able to leverage and use that data to really drive business value.

And we certainly want to take advantage of big data in the cloud, so we’re maximizing our availability and accessibility to analytics. So having that [00:02:30] separation of compute and storage and the ability to scale up, and having the compute that you need in order to do the calculations against large amounts of data, which we certainly experience at JPMC. And then definitely there’s this balance between business and technology outcomes. So this is not just a technology exercise, or having the most technically elegant solution; it’s really about how we can drive the value for the business. So the way we’re really [00:03:00] looking to approach that is around these three major principles. Typically, when you start talking about a data lake in the public cloud, usually everyone just focuses on that first one, which is the cost savings.

We’re taking a slightly different approach, where we want to look at all three of these principles, or look at it from these three lenses, which is really important in order for you to democratize your data. So one, definitely cost savings: looking for the opportunity to reduce infrastructure costs while still providing that business value. [00:03:30] Unlocking the business value: removing those pain points so we can create new opportunities in how we leverage and use our data, especially for machine learning and AI. And then the last one, which we’re going to talk about a little bit further in our presentation, is around data reuse. And if you’re familiar with the data mesh architecture, definitely it looks great on paper. Definitely it makes a lot of sense, but it’s not something that’s easy to do.
So we’ve actually started that journey, and we’ll share some of the learnings from when we started [00:04:00] that journey, how we’ve approached it, and some best practices that we’ve taken from that journey.

So, with those three major principles and really trying to unlock the opportunity and the business value of your data, we really have to look at how we modernize our data platform. And that’s really around four key principles as well that we want to look at when we’re modernizing a platform. So, typically when a data lake is built, it’s built as a monolithic data lake. I’ve built data lakes in the past, both on premise and in the cloud. [00:04:30] Here at JPMC, we want to shift away from building a monolithic data lake. So the second bullet point is really taking a microservices architecture approach for your data. So just like you have loosely coupled architecture and loosely coupled services in a microservices architecture, we want to take the same approach for our data lake.

So we want to have a loosely coupled architecture for our data. The benefits you get from a loosely coupled architecture for microservices are the same benefits that you want to gain [00:05:00] for your data. So being able to move at scale, being able to have that resiliency, having that separation of compute and storage, and also reducing your attack surface area from a security perspective. And then the third one, which is the hardest piece, I believe. The technology is pretty straightforward. A lot of the cloud providers have this capability, and there’s a lot of open source out there where you can start really building your own data lake and even a data mesh. But the hard part is the aggregate, fit-for-purpose data products: really taking a domain- [00:05:30] driven design approach to how you think about your data as products. And then lastly, the self-service capability around these data products being owned by the data engineers.
So this is where you want to have the distributed pipelines, the you-build-it, you-own-it construct, when it comes to really landing the data that’s in the cloud.

So, let’s go to the next slide and talk in a little more detail about how we’re actually doing that. So this is a high-level view of how we’re looking to modernize our data platform [00:06:00] on the public cloud, using AWS as an example, and [inaudible 00:06:04] this journey, we are actually looking to put this in AWS and we’re headed on that path. So as you can see to the left, it’s your typical source data: relational databases, Hadoop clusters, files, APIs. And then to your right, you see the consumption layer for the many different types of users that you’re looking to serve with this platform.

What’s key here is that we wanted to leverage and use the [00:06:30] data infrastructure as a platform, which is a key construct within data mesh, and the way we’re achieving that is through Lake Formation. Now, if you’ve built data lakes before in the cloud, there’s a lot of work around the entitlements, fine-grained and coarse-grained, the governance, and the ability to connect the higher-level managed services on top of those entitlements using IAM roles and Glue. It’s a lot of work. This is why Lake Formation in AWS has actually given us the ability to speed up, because it’s really [00:07:00] behaving as that data infrastructure as a platform. So first we want to be able to ingest that data from these sources. This is where we’re using Airflow to land the data from on premise into the cloud, leveraging some of the other managed services as part of that Airflow orchestration.

Metadata management: using other technology, and some open source technology, to be able to put the proper management around our data. Data [00:07:30] catalog: this is where Glue comes in.
Glue is giving us the ability to catalog that data and put a logical model on it that really represents what the model looks like from a logical construct, and how the business will actually use that data. And of course, telemetry from CloudWatch and other types of tools, for us to be able to monitor our ingestion pipelines end to end, as well as, once the data’s in use, really putting SLOs and SLIs around the data. And that’s a learning for us. So we’re going through that journey to really understand what it means from an SLO and SLI [00:08:00] perspective when you start thinking about your data as products. And of course, security: baking in controls, with fit-for-purpose security controls for highly confidential data, using the higher-level managed services to properly encrypt the data at rest as well as while it’s in transit. And then this concept of data federation.

So you’re probably like, “Well, what is that?” So this is where, leveraging the higher-level managed services within AWS, we’re really looking to build out a hybrid cloud. So that’s where you see to the right there [00:08:30] the data lake engine, using Dremio for the data lake that we have on premise. So we do have a data lake on premise, and as we’re going on this journey, we can’t necessarily move all workloads into the cloud. So as we’re building our data products on AWS, out in the cloud, we still need to give the workloads sitting on premise access to that data. We don’t want to have a copy of that data. We want to be able to give them access and make that as seamless as possible. So this is where we’re using Dremio and its ability to connect to a Redshift [00:09:00] cluster, to Redshift Spectrum, which is a SQL API that integrates with Lake Formation, honoring all the entitlements and the fine-grained control, working with Glue and the data that’s sitting on S3.

So as you see at the bottom there, what’s most important is these data ponds.
So this is where you start getting into the product thinking around your data, and where we identify our data domains. And what we really want to do is stop the proliferation of data puddles that are on premise. So what is a data puddle? A data puddle is something that’s very specific [00:09:30] to a project, and we have a lot of those data puddles on premise. And what we want to do is shift to the data pond. So think of a data pond as a bounded context; it’s a combination of one or more data domains.

So this is where we want to move from puddles to ponds and ponds to lakes. And that allows us then to build out our lake based on data as products. And as we begin to shift to that product thinking around data as products, it really then shifts the culture and the way we think about our data, and we really start to [00:10:00] incorporate end-to-end ownership of those products, which allows us to begin to follow, just like you do with applications: you build it, you own it, you run it. That actually improves the quality of the data, and improves the focus around how that data gets ingested and landed, ensuring it’s following all the appropriate governance and controls.

So that’s a shift for us from a JPMC perspective. That’s going to definitely take time as we shift into that model: understanding who’s going to own those products and manage those products, working with the source [00:10:30] systems to land that data into our data lake. And then lastly, another key principle is around data democratization. So we really want to liberate this data. So having this architecture, leveraging Lake Formation with Athena, with Redshift clusters, with Redshift Spectrum, with EMR, allows us really to build a domain-driven distributed architecture. By nature, in the AWS cloud you have loosely coupled AWS accounts; as you can see, our data [00:11:00] ponds are separate AWS accounts, with cross-account access that’s built into Lake Formation.
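
To make the cross-account idea concrete, here is a minimal Python sketch of how one data pond account might describe a read-only grant to another account. The helper `build_cross_account_grant`, the account IDs, and the `party`/`clients` database and table names are hypothetical and not taken from the talk. The dictionary keys are intended to follow the request shape of Lake Formation's `GrantPermissions` operation, but nothing here calls AWS, and this is not presented as JPMC's actual setup.

```python
from typing import Any, Dict, List, Optional


def build_cross_account_grant(
    producer_account_id: str,
    consumer_account_id: str,
    database: str,
    table: str,
    columns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments describing a SELECT grant on a Glue catalog table.

    Hypothetical helper: it only assembles a plain dictionary and makes no AWS call.
    """
    table_ref = {
        "CatalogId": producer_account_id,  # account that owns the data pond
        "DatabaseName": database,
        "Name": table,
    }
    if columns:
        # Column-level (fine-grained) grant: only the listed columns are shared.
        resource = {"TableWithColumns": {**table_ref, "ColumnNames": list(columns)}}
    else:
        # Table-level (coarse-grained) grant.
        resource = {"Table": table_ref}
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": resource,
        "Permissions": ["SELECT"],
    }


# Example: a hypothetical party pond shares two columns with a credit risk pond.
request = build_cross_account_grant(
    producer_account_id="111111111111",
    consumer_account_id="222222222222",
    database="party",
    table="clients",
    columns=["client_id", "legal_entity"],
)
```

With boto3 installed and credentials for the producing account, a dictionary like this could be passed as `boto3.client("lakeformation").grant_permissions(**request)`. Keeping the request construction separate from the call makes the grant easy to review and test before anything touches a real account.
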
We take advantage of the decoupled compute and storage, and we take advantage of the elasticity, which then allows us to really build out a domain-driven distributed architecture.

And of course, leveraging the higher-level managed services gives us the ability to have a self-service infrastructure as a platform. And then ultimately our ecosystem from a governance perspective is really important. So Glue giving us the data catalog, having [00:11:30] a metadata-driven approach, leveraging some other open source technology for capturing our logical models and our physical models. Of course, data lineage and time travel, as we’re looking at Delta Lake and Hudi to be able to bridge that capability as Lake Formation begins to build out even more of its capabilities, some of which were announced at AWS re:Invent in 2020. And then of course, because we’re corporate technology, because of regulatory constraints, adjustments and reconciliation [00:12:00] become extremely important, so we need to make sure we have that capability as well. So let’s dive a little bit more into the data products, and then I’ll hand it over to Nisha, who’s actually going to walk through a real example of how we’re approaching it within the credit space, and she’ll touch as well on the hybrid approach that we’re taking as we’re going on this journey.

So, the domain-driven design for data as products. To me, this is the hardest piece of it all. We started this [00:12:30] journey last year, and the first thing we recognized, which was quite amazing, is that even though we were always using the same terms, we were really speaking differently about these terms. So a lot of times when we were in meetings, thinking we were aligning, we weren’t aligned, and we quickly realized we really needed to establish and align on a common taxonomy and terminology. Like, what is a data pond? What’s a data lake? What are data domains? What is a data offering?
What constitutes [00:13:00] a data pond?

So we really needed to align on that, and that really helped to steer the conversation. Then we applied the domain-driven design technique. It’s the same domain-driven design technique that you would apply to microservices, to really establish the bounded contexts for our data products. So that’s what you see there to the right. Those are some examples of the data products that we began to formulate based on the workloads that we’re moving to the cloud. So wholesale credit risk, which [00:13:30] Nisha is going to touch on more, general reference data, party: those are the bounded contexts. Now what’s key here is really finding that balance between coarse-grained and fine-grained bounded contexts, very similar to microservices. So you don’t want a service to be too fat, too big, or too fine-grained.

So it’s all about having the roughly right data products. That’s the approach that we’re taking. And because of the approach that we’re taking, having these AWS accounts with one or more data domains, which represent our [00:14:00] data products, it gives us the flexibility to evolve a data product over time. So if we have a data product that’s too fine-grained, we can pull it up into another data product. If we have a data product that’s too coarse, we can decompose it into more fine-grained products. So we have that flexibility to learn as we’re going on this journey. And then one of the other key things that we did: we really aligned on the consumption use cases. We looked at the workloads that we’re looking to land in the cloud and what data they were using, to really begin [00:14:30] to build out the data products that are needed, so we can build a plan and execute the delivery of the data products on the data lake.

So this is where I will use an analogy.
This is where, from my years of experience working with the cloud, you’ve really got to think like a city planner. So just like a city planner looks at all of its resources to understand how best to use those resources to build the most efficient city, it’s the same sort of approach you have to take with building your [00:15:00] city out in the cloud. So that’s why we took that approach, really looking at our workloads and understanding what data domains were associated with those workloads and the business value that we were trying to deliver, to really help us shape the bounded contexts that we wanted to put around the data, which is really from a consumption perspective and a business perspective.

Now, that allows us to identify the opportunities for reuse, by looking at the various workloads that are landing in the cloud and seeing what data domains they’re using. So then as we begin to build out and move those [00:15:30] workloads, we’re also simultaneously building out our data products within our data lake. Back to moving puddles to ponds and ponds to lakes. And then lastly, we don’t want this to be an academic exercise. So this is not about being academic around data mesh. We know we’re going to have to make some performance considerations, and we may have to make some trade-offs to loosen those bounded contexts. So there are some cases where we may have to copy specific data across these bounded contexts, because we don’t want to generate that on the fly. So those [00:16:00] are the sort of use cases we’ll look at very closely, and we’ll pay attention to the performance numbers to see if we have to make some trade-offs as we begin to build out our data products in the cloud.

And then lastly, it’s all about learning by doing. So getting started on this journey, actually going through understanding what a data product is, the ownership around the data product, the bounded context, how we’re going to actually formulate that in the cloud, what workloads we’re using.
Just by getting started, we’re moving, [00:16:30] and we’re moving forward, and that’s what’s most important. So as you get started on the journey, it’s all about learning by doing, and then you evolve from there. So with that context and that background, I’m going to hand it over to Nisha, who’s going to walk through a real example of how we’re actually applying this to an actual, real use case that’s going to deliver some business value for the firm. Nisha.

Nisha: Thank you, JR. So as JR said, really, what I’m going to do is walk through what data mesh means and how it is applied to a business use case, which is credit risk. [00:17:00] JR, do you want to move to the next slide?

James Reid: Yes.

Nisha: Now in general, to clarify what risk is: it’s the uncertainty of a planned outcome. And as a firm, as JPMC, we have risk in every activity we do, whether it be trading activity, lending activity; any kind of activity that the firm is involved in does generate risk. This function is about being able to quantify that [00:17:30] risk, and then being able to manage that risk, so that the risk is within reasonable bounds. Now, specifically that [inaudible 00:17:39] risk, specifically credit risk, is about understanding the exposure that we have that arises due to credit activities. Again, these can be because of underwriting, lending, or even just generally operating services like cash management or clearing activities. In [00:18:00] general, our guiding principles are really key to creating a culture of responsibility and ownership, keeping our people first, and essentially helping us unlock the value of data at scale. Now to the theme of the talk, and especially to what JR was talking about.

Some of the key data characteristics that we have within credit risk technology: [00:18:30] the first one is we carry MNPI, which is material non-public information.
What that entails is, we do need very good entitlements, and entitlements that are not just coarse-grained at the data set level per se, but really at the row and column level. So really fine-grained entitlements. The second key thing is, we process data for legal entities which belong to both [00:19:00] what we call restricted countries and non-restricted countries. Now, essentially, data that is for restricted countries cannot go to public cloud; the non-restricted countries, and some key [inaudible] after we get the regulatory approval, can move to the public cloud. Now that is driving our hybrid cloud approach. Not just that, but also our ability to handle operational risk: any one public cloud [00:19:30] carrier at least carries a risk.

So the hybrid approach really helps us handle the operational risks. Now, again, as JR mentioned, we have a lot of strict audit and financial regulatory commitments. So every transformation that happens on the data does need to be stored. We need to have an ability to help understand how we arrived at the data. This is where [00:20:00] data explainability becomes key. Along with it, what we need is versioned data. That is, we need to be able to run, let’s say, our exposure when the desk closes, that is, an end-of-day snapshot for a trading view, for example.

Along with that, it’s not just an end of day for one product; it really is end of day across data domains. And that is key. That is, I have a consistent version across all my dependencies. So [00:20:30] I can ensure that I can have time travel to go back and explain how I arrived at this number sometime later, when an audit comes in. So those are the key criteria that our data has within credit risk. Now, going on to the next slide.

Really it’s about our current and the future state. So the modernization that JR was talking about is [00:21:00] really to help us address the challenges that we have in our current state.
The first one is we have a large monolithic flow. And that’s where the domain-driven microservices really play a huge role. So these help us… We are following a blueprint that the firm has certified and approved for microservices, which helps us develop and deploy at scale. Really, we’re looking at blueprint as code, and systematic generation of blueprints for microservices. Now the next key [00:21:30] thing is really about data. So unless we have the right data and availability of data across our staff, thinking about data as products, the next level, us being able to unlock the value in the data, is really difficult. So hence our journey on the federated data lake via data mesh. That’s the journey that we have embarked on this year using the [00:22:00] vision and the strategy that JR has just laid out. Now onto the next slide.

This is really where the vision that JR was talking about gets some examples: what does that mean? Now, in data as products, we have data domains, what we are calling data ponds. Now, along with the taxonomy of making sure everybody understands what a data pond is, what a data domain or sub-product is, [00:22:30] it’s really important for us to understand the business functions also. We really need to have a common taxonomy of what it means to be wholesale credit risk as a product. Also, what it means when we say there’s party data, or there’s client data; that’s another one where we work closely with the business to define it. But we also work closely with information architecture to define what those concepts mean. Now, once we have those concepts, those are the ones that become [00:23:00] these data ponds.

So wholesale credit risk is a data pond. Now wholesale credit risk cannot run on its own. It needs data from other owners, for example party, which is all the clients that the firm has.
Now, the moment that domain gets enabled, yes, you are enabling credit risk as well, because we need that data for us to understand what exposure we have against these clients that JPMC has. Along with [00:23:30] that, other reference data is needed, so those become other domains that are loosely coupled with credit risk, so that credit risk can actually do its job. Now, the key thing about it is, as data ponds come on board, you are unlocking a large value, not just for credit risk, but for every other data product that is in the data lake. For example, the moment I have something like demand deposits there, my analytics is unlocked, because I can now [00:24:00] use deposit information along with exposure information to gain some insights for handling risk for the firm.

So with that, I think the fact is, as we use data infrastructure as a platform, all of the data domain owners or data pond owners get some of these common services by default. So we get consistency in using the service, but with federated ownership; [00:24:30] that is, we have a specific business owner and a tech owner for each of the data ponds. Now this really is helping us, or it will help us, since it’s just the start of the journey, unlock the value in the data for credit, but it also gives us just one view of the data for all the processing, reducing the need for reconciliations, and also making it really easy for us to explain the data and do it in a [00:25:00] very accurate way. I think the punch line is really that we are making data, and explainability on the data, easily available to all of our use cases, be it reporting, business analytics, or really unlocking the value of the data for our ML purposes. With that, I would want to open up for questions and answers.

Lucero: [00:25:30] Great. Great presentation, Nisha and James. Thank you so much for sharing all the information. So I have a couple of members of the audience who are geared up for live questions.
I’m going to try [Justin Gulas 00:25:43]. Let me go ahead and I’m going to cue you in, if you have your camera and mic. Nope. This person changed their mind. Let’s just go ahead and try [Phy 00:25:52]. And nope. How about Mark? No, Mark is not around. All right, so let’s go ahead, and we [00:26:00] have a question in the chat here. It comes from [Elaine Soc 00:26:03]: “How do you handle cross domain connections/interactions?”

James Reid: Sure. I’ll take that one, Nisha, and feel free to jump in and add any color commentary to it. So this goes back to when I talked about the data as products, and thinking about performance considerations. So remember the data’s landing on S3. Once it lands on S3, it’s registered as Glue catalog [00:26:30] tables. Once those are registered as Glue catalog tables, leveraging the managed services in AWS, Athena and Redshift Spectrum, and then of course understanding the actual logical model of those tables that are landing within Glue, you can actually then connect data across those domains and across accounts. So this is where Lake Formation provides the ability to do cross-account access. So we’re leveraging [00:27:00] a lot of the capability that’s already built into Lake Formation, built into the managed services, but what’s key is that logical model and the connecting of that logical model, understanding what your primary keys and your foreign keys are, which is where having a metadata catalog comes into play. And that’s not just… Glue is more of your technical catalog; we are using other technologies like Collibra, and some additional things around Collibra that we built, to give us that metadata catalog.

Lucero: [00:27:30] Great. There’s another question. This one comes from Hamid.
This person is asking, “Is SQL the main way you federate your data, for example through Dremio?”

James Reid: So Nisha, I’ll share a little bit, and I think it’d be great for you to share a little bit around the credit use case, because I know that’s certainly what you’re going to be doing. So, yeah, SQL is the main [00:28:00] way. Dremio is going to give us that capability, especially for our hybrid, for on premise working with our data that’s in the cloud. In the cloud, when the data’s landing in S3, since it’s all sitting on S3, there’s not really much of a federation. We’re not pulling data from RDS or another operational store. We’re actually accessing that data that’s sitting on S3, and we know that the managed services within AWS are going to give us that capability to federate through SQL and through queries. [00:28:30] Nisha, if you want to touch on how you’re doing it on premise using Dremio to do the federation, I think that would be great.

Nisha: Absolutely. So far [inaudible 00:28:38] we definitely use a Dremio cluster. So it gives us a SQL capability on top of the data of the odd [inaudible 00:28:44] in Elasticsearch. So that’s the key enabler that we have, but our consumption needs to know [inaudible 00:28:53]. We do have what’s called a data [inaudible 00:28:57] blueprint. So [00:29:00] the other thing is offering a direct S3 API when necessary to be able to retrieve it. But we also use Kafka as an example, where data is available for streaming processing. So all of those are available to our consumers.

Lucero: Great. Thanks. And we have on the live Q and A [Jellison Gulaz 00:29:22]. Jellison, do you have a question for the team? I think you’re muted, by the way.

Speaker 4: [00:29:30] Excuse me. Can you hear me now?

Lucero: Yes. Loud and clear.

Speaker 4: I’d like to ask about data stewardship. Do you have data stewards and data governance procedures applied?

Nisha: Yes. Yes, we do. I think that’s the key. We have… Our information architecture plays a major role.
We have information architects across each of [00:30:00] the [inaudible] who define what the data model should be, what the data product should be. But within a model too, it is: what are the attributes? What is the definition of each attribute? But also retention policies for that data, along with any linkages that an attribute has to something else. So yes, there is very comprehensive governance in place.

Speaker 4: Okay. Thank you.

Lucero: [00:30:30] Lucia. I think we lost you Lucia.

James Reid: I’m back. Sorry. I gave me a myself.

Lucero: I think we have time for only one more question from the chat. And let’s see, there was a question here from [Shashin 00:30:44]. He’s asking, “Which cloud product is being used to store data in the data ponds that were represented in the architecture diagram?”

James Reid: Yeah. So the data’s landing on S3 using Airflow, and then AWS Lake [00:31:00] Formation is a managed service in AWS that provides an abstraction on top of S3 to simplify, or make it easier, to build data lakes. Prior to Lake Formation, it was your responsibility to build all the IAM roles to get the fine-grained and coarse-grained entitlements around that data. Now Lake Formation, which sits on top of S3, provides that abstraction and that capability. So Lake Formation, [00:31:30] Glue and S3 are the major services providing us the ability to build the data ponds.

Lucero: Excellent. Thank you. So I think that is all the time that we have for today. Nisha, James, thank you so much for such a wonderful presentation. To the rest of the audience, thank you so much for being here today. I hope that you enjoyed this session and the conference in general. We hope to get all the rest of the follow-up questions that you may have on the Slack channel. And other than that, I hope everybody [00:32:00] has a wonderful day, a wonderful evening, and we will talk to you later. Thank you. And bye-bye.

James Reid: Thank you.
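
The fine-grained entitlements mentioned in the session (row and column level, rather than only at the data set level) can be pictured with a small Python sketch. Everything below is illustrative: the `Entitlement` class, the `apply_entitlement` function, the `legal_entity` column, and the sample rows are hypothetical and only show the general idea of filtering rows and columns per consumer. In the architecture described, a managed service such as Lake Formation would enforce this, not application code.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Entitlement:
    """Hypothetical description of what one consumer may see."""

    allowed_columns: FrozenSet[str]
    allowed_legal_entities: FrozenSet[str]


def apply_entitlement(
    rows: Iterable[Dict[str, Any]],
    entitlement: Entitlement,
    entity_column: str = "legal_entity",
) -> List[Dict[str, Any]]:
    """Return only the rows and columns the entitlement allows."""
    visible = []
    for row in rows:
        # Row-level filter: drop rows for legal entities the consumer cannot see.
        if row.get(entity_column) not in entitlement.allowed_legal_entities:
            continue
        # Column-level filter: keep only the permitted attributes.
        visible.append(
            {name: value for name, value in row.items() if name in entitlement.allowed_columns}
        )
    return visible


# Made-up sample data for illustration only.
rows = [
    {"legal_entity": "LE-1", "client_id": "C1", "exposure": 100.0, "internal_rating": "A"},
    {"legal_entity": "LE-2", "client_id": "C2", "exposure": 250.0, "internal_rating": "B"},
]
analyst = Entitlement(
    allowed_columns=frozenset({"legal_entity", "client_id", "exposure"}),
    allowed_legal_entities=frozenset({"LE-1"}),
)
visible = apply_entitlement(rows, analyst)
```

Here the analyst sees one row (for `LE-1`) without the `internal_rating` column, which is the kind of row and column restriction the speakers describe for data such as MNPI.
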