March 1, 2023
12:55 pm - 1:25 pm PST
(Meta)data for developers: building a data API
Marsh McLennan’s Enterprise Architecture team is focused on increasing technology use and reducing development time. Providing engineering teams with a single way to surface data and metadata through common patterns (such as REST) seemed like a great way to achieve that. See what they achieved, how they leverage technologies such as Dremio to achieve consistency, what lessons they learned, and what they plan to do next.
Note: This transcript was created using speech recognition software. It may contain errors.
All right. Hi there. We have a session today on (meta)data for developers. This is about how we've built out our whole ecosystem using Dremio and fueled it with some, let's say, interesting connections between our technologies. Today we're going to spend a little time introducing ourselves and giving you some context on the organization we work for, which is Marsh McLennan. We'll lay out some of our technology strategy and our data platform, the challenges we have as a large and sprawling organization in dealing with them, and how we've been using Dremio to solve some of those problems. Then we want to explain the details of our data API, the problems it was intended to solve, and walk through that implementation, and Ed will be giving us those wonderful gory details. And then we'll also talk a little bit about what we're working on adding to the platform, because we see this as a foundational capability as we move forward.
So, just brief introductions. My name's Ian Blem. I'm the head of data strategy here at Marsh McLennan. I've been in this position for about half a year now. I spent the past 15 years working with data, so I used to be hands on as the kind of guy who was developing the code and doing some of the analytics. Then I did a detour in consulting for a little bit, which is where I first met Ed, and there I ended up heading up our data engineering and data science practice. So I've really been steeped in data for a long time, and dealing with some of these challenges is a passion that's near and dear to my heart. I live up in Portland, Oregon with my two kids and my wife.
Hi, so I'm Ed Olson-Morgan. I'm the core API and innovation lead at Marsh McLennan. I have been with Marsh McLennan on and off for about 15 years. I began on the consulting side of our business and then helped stand up Oliver Wyman Labs, which is one of our digital teams. I went across to one of our competitors and did the same sort of thing for a little while, and now I focus on enterprise-wide APIs and API strategy for the organization. I live just up the road in Sacramento with my wife and three kids.
All right, thanks, Ed. So, a little bit about Marsh McLennan and our technology strategy. We're actually one of those big companies you've probably never heard of. We've been around for about 150 years; we just had our anniversary last year. We are a consistent presence in the Fortune 500, and we've flirted with the Fortune 100 over the years. We're in a couple of related businesses around professional services: insurance, reinsurance broking, human resources and benefits consulting. You might have heard of some of our businesses. Marsh is our insurance brokerage. We deal with both smaller-scale, consumer-facing insurance products as well as those large, complex insurance cases, helping larger companies to implement their own self-insured ecosystems as well.
Mercer does a lot of HR and benefits consulting, a lot of analytics on top of that, as well as compensation data and benchmarking. Guy Carpenter does reinsurance broking, which is something I wasn't even aware of before I joined Marsh McLennan: insurance for insurers. And then we have Oliver Wyman, which is where Ed and I cut our teeth, which is a management consultancy. Within Marsh McLennan, we've actually been evolving what our organization looks like from a technology perspective very rapidly. Historically, we were very decentralized and operated independently. Then three years ago we had a change in direction, and that's when we established the central capability where both Ed and I work, which is MMC Tech. The goal here is to accelerate and standardize the adoption of technologies. So the answer to "are we using a technology?" is not just yes, it's whether it's the right one, and certainly we've seen that Dremio is one of those right ones. When we look at the overall technology strategy, we have some simple slides here: we want to win with that digital experience, we want to make it fast, and we want to be flexible in terms of supporting our businesses. We want to support cost efficiency and colleague productivity. How do we actually get things done? How do we get them done cheaply, so we can get them into production and get them used by our frontline staff?
My goal with data strategy is enabling that differentiation through data and analytics: building up those central analytical capabilities, building up those data products, reducing the unit cost of the technology that we're actually delivering, and assuring data privacy along the way. Part of building this big data ecosystem, in an organization that has legacy products spanning back decades, is making sure we're actually treating privacy as a first-class citizen. So we really need to focus on that effort, and that's something we've been actively trying to integrate into the system. As for the data, we're trying to really democratize it. Everybody wants to have access to it, so we need to give them a single point of access. I don't want to have to point a developer to one system for this data and another system for that data, with separately managed permissions, separately managed APIs to deal with, and the inability to pull that data from those source systems into production because of the complexity of our network topology. So the question is: how can we actually provide one single point of access?
I bet you might know one of the technologies that we use. On top of that, we also have to catalog and integrate what those data assets are: building out that enterprise data catalog into a collective view and, again, proactively identifying and protecting our data. Within this, one of the big places we get traction in our organization is just being able to use the same thing twice. A lot of what we do is build it once, then build it again, and then build it again because we have a slightly different use case. Our team is really focused on building these centralized capabilities that we can reuse, and that's really going to help reduce our costs overall. We've already seen that just in terms of speed of development, where we've already put Dremio into production use cases.
So with the data technology platform, we have, again, challenges, because our infrastructure is largely on premise. Today we have data centers that we own in Texas, in Europe, and elsewhere throughout the globe; we're a global industry. The data we have is stored in numerous databases, again across four different operating companies as well as a central corporate operating environment, and each one has made its own individual technology choices. So we have data on top of Azure, we have data on top of AWS, we have data sitting in the Oracle cloud. I'm sure we have data under Ed's desk. Maybe not anymore. But we need to be able to access it. We need to be able to pull it into one place, and what we've been using Dremio for is to serve as that bridge.
So we've been able to stand it up in these areas where we have sufficient data gravity. It's really focusing on questions like: what does that Americas cluster look like? How do we make sure that all of the servers and databases we have in that data center are accessible via this one Dremio endpoint? How do we embed Dremio into a standalone network appliance that we can put into some of those smaller-market countries, where we don't necessarily have the infrastructure to stand up an entire data center but do need to have two or three servers there? How can we put that into a containerized environment? And that's where we see Dremio as really being this very powerful bridge that connects those legacy systems with the modern analytics that we need to be doing, adding the insight we need to actually deliver our products.
As we look at the overall ecosystem here, we've got Dremio handling the data virtualization, sitting across this myriad of data sources: everything from CSVs and other on-premise data, data that we're using one-off, as well as enterprise data that is appropriately cataloged and classified. And we need to tie that in with our data catalog, because we do have an enterprise data catalog that we're actively building out at the same time, and we're using BigID to proactively secure that data: identify where our personal data is, tag it, and make sure that data is pushed through into our data catalog and then ultimately made available as part of our data interface. So how do we do that? This is where Ed starts to take over: we make an API.
Great. So our objective, what we're trying to do, fits onto a slide, which is always nice. We wanted to make a single request to a REST endpoint, get data from every one of our systems that has been plugged into Dremio so far, and also be able to pull back the metadata that developers and data scientists would need to use and understand it. Before we launch into what we did, a quick question: Dremio has APIs, so why didn't we just use those? Well, we did start there, but why didn't we end there? We did a lot of experimentation with the native Dremio APIs. If you ask Ian's and my mutual manager, we did too much experimentation, but we couldn't quite get them to meet our needs. One of the biggest challenges we had was around authentication and authorization. We have an organization-wide strategy, a principle that everywhere throughout the stack we want to be referencing data and referencing services in the context of the user who requested them, and we want to drop into service-account, application-wide mode as late as possible.
And it's not that we didn't try to figure out how to do that. I didn't bring the diagram of the design that maintained 80,000 Dremio user PATs in a database somewhere, but that was one thing we drew up, and even that doesn't work. So being able to make things available to a user in their context within Dremio drove a lot of the decisions we then made for the rest of the API. So how did we implement this? I'm not going to try to take you through the eye chart. This is more for people who are looking at the presentation after the fact.
It only took you about, what, three hours to take me through it the first time round.
Yeah, and you're about to get the cleaner version of it now. The thing that launches us in is that there are basically two flows that sit behind this API. The first is an asynchronous one, which is basically combining all of the metadata together and bringing it into Dremio. The second is a synchronous flow, which is where the API is actually making calls and sending data and metadata back. So let's talk about the metadata to start with. As Ian mentioned, Informatica EDC is our source of record for business metadata. What we've done is actually bring that metadata into Dremio as data, and we do that daily through a Node.js adapter on a cron job. That does a couple of things. One is to translate between EDC's and Dremio's taxonomies for data sources.
The underlying objects in both systems are actually pretty similar; it just varies a little bit depending on the actual nature of the source system. EDC tends to normalize things a little more; Dremio tends to keep a little more of the idiosyncrasies of the base system. So we use that to match up source systems, and then to match up fields. We then have just a little SQL Server database that holds all of that metadata at both the source level and the field level, and that's actually a Dremio data source in its own right. We don't then go back around the loop and make that its own metadata source, because that would get a bit circular, but effectively we expose it within Dremio for any user to consume, as long as they have permissions.
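A rough sketch of what that taxonomy translation and source matching might look like. The record shapes and field names here are invented for illustration; the real adapter works against EDC's and Dremio's actual object models:

```javascript
// Normalize an EDC resource record and a Dremio source entry down to a
// shared key, then join on that key. EDC normalizes source types more;
// Dremio keeps more of the base system's idiosyncrasies, so both sides
// are lowercased and reduced to (type, host, schema) before matching.

// Hypothetical EDC record: { type: "Oracle", host: "db01", schema: "Claims" }
function edcKey(resource) {
  return [resource.type, resource.host, resource.schema]
    .map((s) => s.toLowerCase())
    .join("|");
}

// Hypothetical Dremio source: { type: "ORACLE", config: { hostname, database } }
function dremioKey(source) {
  return [source.type, source.config.hostname, source.config.database]
    .map((s) => s.toLowerCase())
    .join("|");
}

// The matched pairs are what would be persisted to the SQL Server
// metadata store at the source level (fields are matched similarly).
function matchSources(edcResources, dremioSources) {
  const byKey = new Map(dremioSources.map((s) => [dremioKey(s), s]));
  return edcResources
    .filter((r) => byKey.has(edcKey(r)))
    .map((r) => ({ edc: r, dremio: byKey.get(edcKey(r)) }));
}
```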
The other part of the process is that as we build out that metadata and bring it together from EDC into Dremio, we also translate it into markdown. If you're familiar with the wiki within Dremio and the catalog capability, what we actually do is write pages to the wiki that contain all of the information in those metadata tables. That means our consumers who aren't API users, data scientists and business analysts, can see all of the same information, see whether it's current or not, and have everything pulled together for them at the same time that our developers are able to consume it through the API.
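A minimal sketch of the markdown rendering step, assuming hypothetical metadata row shapes (the actual write to the Dremio wiki would happen via Dremio's catalog API and isn't shown):

```javascript
// Render a markdown wiki page from a source record and its field rows.
// The field names (name, description, syncedAt, type) are illustrative.
function renderWikiPage(source, fields) {
  const lines = [
    `# ${source.name}`,
    "",
    source.description || "_No description recorded in EDC._",
    "",
    `Last synced: ${source.syncedAt}`,
    "",
    "| Field | Type | Description |",
    "| --- | --- | --- |",
  ];
  for (const f of fields) {
    lines.push(`| ${f.name} | ${f.type} | ${f.description || ""} |`);
  }
  return lines.join("\n");
}
```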
And this is also where we're able to call out those data sets that contain PII and flag that data within our ecosystem, so that we're aware of it as we're proactively exploring.
It will be able to. It's not done yet, but that's this sprint or next sprint.
I jumped the gun just a little bit.
So, actually making the API calls themselves. We start off by fronting everything through an API gateway. Again, we talk about this idea of technology unification, so we use Google Apigee in this case, Apigee hybrid. We've built another common service that allows us to generate OAuth tokens both for individual users and for service accounts that may need to act on behalf of individual users, for things such as testing and other automation. Once we've validated the token, which we can do inside of Apigee, we attach the user we want to make the request for as a header on the downstream request, and then we build an HMAC signature for the request. The reason we do that is basically to lock down communication between the gateway, where we've done this authentication, and Dremio downstream, so that we can't have a bad actor come in the middle and take advantage of the other things we've built to impersonate users.
The only people who are allowed to impersonate users are us, and we validate that you're impersonating the right user before we let you do it. The reason that's important is the next piece on the end: how do we get user context into Dremio? At this point, the only way we had available when we put it together was ODBC. So we've got this sort of separation of concerns. We have a service account in Dremio that's approved for ODBC impersonation of any principal. The only access that service account has in its own right is to metadata, that SQL Server data source where the metadata is assembled. So when we request data, we use ODBC impersonation of the user who we validated in the gateway, and that makes the call to get the actual data that was requested. We then request the metadata directly using the service account, and assemble the two together in an adapter within Node.js before we send things back to the API consumer.
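A sketch of how the delegated connection might be built. The `DelegationUID` property follows the Dremio/Drill ODBC driver convention for impersonation; the driver name and port here are assumptions, so treat this as illustrative rather than a working configuration:

```javascript
// Build an ODBC connection string where the service account authenticates
// but queries execute in the delegated user's context.
function buildConnectionString({ host, port, serviceUser, servicePassword, delegatedUser }) {
  return [
    "DRIVER=Dremio Connector",        // assumed driver name
    `HOST=${host}`,
    `PORT=${port}`,
    `UID=${serviceUser}`,             // service account approved for impersonation
    `PWD=${servicePassword}`,
    `DelegationUID=${delegatedUser}`, // queries run as this user
  ].join(";");
}

// With the `odbc` npm package this would be used roughly as:
//   const conn = await odbc.connect(buildConnectionString({ ... }));
//   const rows = await conn.query(sql);
```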
And just a little bit more on the metadata side of things. Metadata, especially when you get into a pure production flow, becomes maybe less useful, right? If you're a developer who already knows what you want, you may want to start slimming down your requests and your payloads. So we've got query parameters in our API requests that we can use to handle the metadata provision and say: do you want metadata or not? When we receive a request, what we actually do is take the SQL query that's going into Dremio and parse it into an abstract syntax tree that pops out what all the individual data sets are, and then we can go query our metadata records for either the source-level data, the field-level data, or both. Okay. So what comes next?
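A deliberately simplified stand-in for that extraction step. The production flow uses a real SQL parser that builds an abstract syntax tree; the regex below only handles plain `FROM`/`JOIN` clauses and would miss subqueries, CTEs, and quoting edge cases:

```javascript
// Pull the table references out of a query so their metadata records can
// be looked up at the source and field level. Simplified: a real AST
// parser is the right tool for anything beyond basic SELECTs.
function extractDatasets(sql) {
  const pattern = /\b(?:FROM|JOIN)\s+([\w."]+)/gi;
  const datasets = new Set();
  let match;
  while ((match = pattern.exec(sql)) !== null) {
    datasets.add(match[1].replace(/"/g, ""));
  }
  return [...datasets];
}
```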
Other than the PII that I already jumped the gun on?
We'll even have a little bit more on that. So, again, as we think about discoverability and flexibility for developers, one of the things that really jumps out at us is GraphQL. It gives you very similar schema introspection capabilities, and in terms of allowing developers a lot of flexibility to build and test their queries, I feel it gives you a friendlier and more automatable interface than SQL queries within the body of a REST request. So we've got a proof of concept going on right now; these are some screenshots from a demo I saw on Friday, where we're using a similar method to build GraphQL schemas from all of the metadata we have loaded up inside of Dremio. We're also using things that are already present inside of Dremio, the information schema views and things along those lines, to build GraphQL schemas and then expose an endpoint that has the GraphQL schema for the whole platform.
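In the spirit of that proof of concept, here is a minimal sketch of deriving GraphQL SDL from dataset metadata. The type mapping table and the metadata shape are assumptions for illustration:

```javascript
// Rough mapping from SQL column types to GraphQL scalars (illustrative).
const SQL_TO_GRAPHQL = {
  VARCHAR: "String", INTEGER: "Int", BIGINT: "Int",
  DOUBLE: "Float", BOOLEAN: "Boolean", DATE: "String",
};

// Turn one dataset's field metadata into a GraphQL type definition.
// Dots and other illegal characters in the dataset name are replaced
// so the result is a valid GraphQL type name.
function datasetToSdl(dataset) {
  const typeName = dataset.name.replace(/[^A-Za-z0-9_]/g, "_");
  const fields = dataset.fields
    .map((f) => `  ${f.name}: ${SQL_TO_GRAPHQL[f.type] || "String"}`)
    .join("\n");
  return `type ${typeName} {\n${fields}\n}`;
}
```

Concatenating these per-dataset types (plus a root `Query` type) would yield a schema an endpoint could expose for introspection.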
The second piece, and we talked about expanding metadata provision, is that BigID is the next system on tap for that. The goal there, if you go back to that wiki view we had, is to be able to provide information both at the source level and at the field level that marks data sources as containing PII anywhere, and then tells you where the PII is and what it is underneath the hood. That's something where we're actually vacuuming all of this together. So I believe data is scanned by BigID, BigID puts information into EDC, and then we're able to use the EDC-Dremio interconnect that we've built to vacuum that information from BigID the whole way into Dremio.
Yeah, that's exactly right. Informatica is kind of serving as that broker there, and we've partnered with BigID to build that first integration into EDC. We've done a little bit of experimentation there on what's the right format and how we actually materialize this, and the next step is how we make that visible to our developers and our users. And of course there's added functionality I'd love to add in there too, around auto-masking and other tools that are already available within Dremio.
Yeah, the other thing we're seeing, and it's a little bit off topic on the Dremio piece, but as we put more information into EDC: Informatica exposes APIs, and my remit is core APIs. So we're also looking at how we front Informatica's APIs to expose some of this metadata in a standalone API that doesn't require you to request data from Dremio. We've got some use cases there, such as developer portals both for internal use and for client use, where it's actually useful for us to be able to expose that level of data and information. So we're sort of building an API ecosystem across the top of what we're doing here with Dremio.
And then the last couple of more detailed things: we're looking at when we can bring in the Arrow Flight-based driver, since there are some performance upgrades available for us there. And we're looking at API design. Right now our API design is pretty, I'm about to say something that's going to get me in trouble here, I'm going to say the API design is pretty crude. And the next thing I'm going to say is that it's exactly like the existing Dremio one, which is not how I meant that to sound. But right now the Dremio data consumption API is very much around posting queries. It's SQL driven; it requires you to have some knowledge of SQL, and it requires you to have even more knowledge of how to escape characters in HTTP requests.
What we'd like to move to, for your really simple use cases, "give me data from one table that does this," is a resource-based API. We fundamentally have all the nuance under the hood to do that. So being able to return records from a single data set, potentially filter records on either rows or columns, and do all of that within user context, is on our list. And then there are some things coming in more recent versions of Dremio. As Ian mentioned, we're currently exclusively a Dremio Software user, so all of this is on premise. We're looking at whether we can take advantage of some of the token-based authentication methods, and that may alleviate the need for some of the things we've got in the middle right now, like HMAC signatures and other validations; we may just be able to provide the user information directly to Dremio through that. Okay. So with that, and a couple of minutes left, anything else you want to add before we wrap up and look for questions?
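To make the resource-based idea concrete, here is one possible translation from a resource-style request to a parameterized query. The path and parameter conventions are illustrative, not a shipped design:

```javascript
// Translate something like
//   GET /datasets/claims.policies?fields=id,premium&country=US
// into a parameterized SQL statement, instead of asking the caller to
// post raw SQL. Any query parameter other than `fields` is treated as
// an equality filter (a simplification for the sketch).
function resourceToSql(dataset, query) {
  const { fields, ...filters } = query;
  const cols = fields
    ? fields.split(",").map((c) => `"${c.trim()}"`).join(", ")
    : "*";
  const table = dataset.split(".").map((p) => `"${p}"`).join(".");
  const where = Object.keys(filters).map((k) => `"${k}" = ?`).join(" AND ");
  return {
    sql: `SELECT ${cols} FROM ${table}` + (where ? ` WHERE ${where}` : ""),
    params: Object.values(filters),
  };
}
```

Using placeholders rather than interpolating values also sidesteps the character-escaping problems the current SQL-in-HTTP approach suffers from.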
I don't think there's anything specific. I just want to emphasize the power that Dremio gives us as a bridge, but also what Ed's team has done, because it enables us not just to give that access to the end user, but to do this with these client-facing applications that we have. That really is a world of difference from where we once were, just in terms of the ability to combine data sources and bring that all the way through. It helps us move forward as an organization; it's really upping our capability. So many thanks to you and your entire team, Ed.
Yeah, definitely. The credit goes to all of the engineers, both within Marsh McLennan and some of our partner organizations as well.
Awesome. Well, thanks a lot, guys. Let's give them a round of applause. Cool. Any questions from anybody in the room?
Hi. You mentioned the adapter and the service account. How do you manage or translate the user's permissions, as applied to what the service account can actually provide? Who does that? Is it the adapter?
So Dremio actually does it for us, because of the way we're using impersonation. Once we give the user context to Dremio, Dremio is then able to take that user, and if you follow the logs and the traces from the point the request comes out of the ODBC adapter, everything is being performed in Dremio not as if you were the service account, but as if you were the user. The only exception, which is a little fun, is if you ever try to find the job that was created. It doesn't show up as a job created by that user; it's just sort of an ephemeral query answer that appears from nowhere. But it does do the thing we need it to do, which is get us data and send data back. That's probably more an artifact of how ODBC impersonation is implemented under the hood.
So in that case, how do you do auditing and compliance if it's just an ephemeral job? Who does that?
That's a good question. Ian, this actually predates you joining the organization. This one was Yvonne, who's one of our teammates. We're actually logging every query that has been sent, who the impersonated user and the impersonating account were, as well as any rejections. We're logging that to Datadog, which is our enterprise logging platform, so if you need to go back and search, it's all there. The other thing is that generally anything that was a little off-kilter would probably be getting rejected at the application layer first, but again, that all ends up in Datadog. But yes, from a client compliance perspective, we're basically collecting logs throughout the stack, both what succeeded and what failed.
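A sketch of the kind of audit record that per-request logging implies (the actual fields shipped to Datadog aren't specified in the talk, so these names are assumptions):

```javascript
// Build one audit record per API request: who was impersonated, which
// service account did the impersonating, the query, and the outcome.
function auditRecord({ user, serviceAccount, sql, outcome, reason }) {
  return {
    timestamp: new Date().toISOString(),
    impersonatedUser: user,
    impersonatingAccount: serviceAccount,
    query: sql,
    outcome,                 // e.g. "success" | "rejected" | "error"
    reason: reason || null,  // e.g. why the request was rejected
  };
}
```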
Great. Thanks. Any other questions? Cool, there's a question here online; if anyone else in the room comes up with one, feel free to raise your hand. From Thomas: just to clarify, do you need an Informatica API to connect to Dremio in your architecture? You can't just bring it in as a flat file or whatever?
So the way it's working is that we're connecting to the Informatica APIs and then transforming the data inside the adapter. We never convert it to a flat file. If that metadata were available as a flat file, you could adapt the adapter to use it as a second source. Effectively, everything we've got is already in EDC, but yeah, that would be possible as an extension if we wanted to.
Makes sense. Cool. Thanks. Any other questions here in the room? Cool. All right, well, thanks. Give it up for Ed and Ian here. Thank you.