Gnarly Data Waves


Episode 44 | January 30, 2024

How S&P Global is Building an Azure Data Lakehouse with Dremio

Join us in this webinar and learn how S&P Global built an Azure data lakehouse with Dremio Cloud for FinOps analysis. If you are looking for ways to eliminate expensive data extracts and BI cubes, this will be a great episode to check out.

S&P Global is a leading global financial services company headquartered in New York. It provides credit ratings, benchmarks, analytics, and workflow solutions in the global capital, commodity, and automotive markets. Data is an essential asset across all of S&P Global’s solution offerings. Watch Tian de Klerk, Director of Business Intelligence, as he shares how they built a data lakehouse for FinOps analysis with Dremio Cloud on Microsoft Azure.

Tian will cover:

  1. The hidden costs of extracting operational data into BI cubes
  2. Simplifying traditional data engineering processes with Dremio’s zero-ETL lakehouse
  3. How Dremio’s semantic layer and query acceleration make self-service analytics easy for end users


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Opening

Alex Merced: 

Hey, everybody! This is Alex Merced, and welcome to another episode of Gnarly Data Waves. Without further ado, let's get on with our feature presentation. We'll be talking about how S&P Global is building an Azure data lakehouse with Dremio, and with us today we have Tian de Klerk, Director of Business Intelligence at S&P Global, and Tony Truong, Senior Product Marketing Manager at Dremio. Tian, Tony, the stage is yours.

Tony Truong: 

Sounds good, thank you, Alex. Hey everyone, thank you for joining, welcome back to Gnarly Data Waves, and greetings if you're new. So on today's episode, like Alex said, we're joined by Tian from S&P Global. He is the head of the IT BI team at S&P. He'll be covering his data lakehouse journey, and how he's been able to remove BI extracts and cubes from his data architecture while reducing the total cost of analytics. Tian, I'm going to go ahead and share my screen now, and I'll let you go ahead and take care of the rest here.

Tian de Klerk:

Thank you very much. Hi everyone, just waiting for the slides to come up…

Tony Truong: 

There you go. 

About Me - Tian De Klerk

Tian de Klerk:

Cool. Yeah, hi everyone. My name's Tian de Klerk, I think the introduction was covered. I run the IT business intelligence team at S&P Global, so we're in a corporate function. Just a bit about me: you might hear from the accent, I'm from South Africa, and I currently live in the Netherlands. I've worked on a ton of reseller things, but I come from a cloud analytics background, and through reporting there I learned Power BI, and that accelerated me to where I am today, working for S&P Global, running the IT business intelligence team. 

IT Business Intelligence Team

Tian de Klerk:

The IT business intelligence team––our very pretty logo there––currently, our responsibility is to do internal reporting. It's very important to understand how your business delivers products and how much it costs to run those products, and that is basically what corporate does. So we try and host data internally regarding service management data, cloud financials, asset inventory, and more. Anything cloud-related or IT-related, we try and go get that data, host it, and then report on it based on customer requirements, with customers technically being divisions. I run a few developers whose main jobs are to understand REST APIs and the various cloud sources they could potentially pull data from, and pull that into a central spot. That used to be our Azure data lake and Cosmos DB, which I'll hop to in a second. And then the primary reporting function is Power BI; we're presenting everything through Power BI, and in the background, we also try and deliver these data sources. Some teams just want to see the data and build their own things, so we try and deliver that. So foundationally, we started quite quickly. 

The Business

Tian de Klerk:

As I said, my background came from cloud financials. The main purpose of the team was reporting cloud financials, but it quickly accelerated and grew underneath us: we were growing inside of an Azure data lake, we started pulling in the service management data, the CMDB data, and we outgrew ourselves. So we ended up at a spot where we needed to improve what we were doing, because we were doing a bunch of our data manipulations and cubing inside of Power BI, which is fine––Power BI can do it, but it's not where it belongs––and especially as the data started growing, it got out of hand. It also became costly to host the service and the data the way we were doing it. We were putting it into Cosmos DB, just because of legacy things. As I said, we were moving a bit too quickly. 

And the problem, as you can hear from our initial statement, is that we're an internal function. Therefore, I need to be able to get the best bang for our buck, and initially, that was just reporting straight from the data lake. But Cosmos DB was needed to enhance that with table functionality––so we'd have a table we could query. 

Data Lake Challenges

Tian de Klerk:

But with that, besides combining data inside of Power BI, we ended up combining data outside of Power BI too. We had our poor data engineers trying to build these cubes that were then dropped separately, duplicating data inside of the data lake, and it started to grow out of control, as you can hear from what I'm saying. I heard about Dremio, looked it up, and understood that it could maybe solve some of our problems, with the siloed data being separated from everything else and those sources being duplicated throughout our entire environment. 

Challenges with Existing Architecture

Tian de Klerk:

With regards to the architecture, this is our layout. The key, pivotal point here was ADLS. We were heavily using Azure storage accounts with the data lake switch flicked on, which Power BI can consume from quite performantly, up to a certain size. We were pulling that data into Cosmos DB to reflect the ServiceNow data, then joining it back together inside of Power BI, and sometimes joining it externally and then picking it up. This ended up taking a lot of time and a lot of effort to maintain, and it was also difficult to share access to, because it's not a secure method––or rather, it's very difficult to set up an easy way for a standard end user to consume from the data lake in Power BI. And so that's where we ended up looking at Dremio. Like I said, it just popped up––so we looked at a few solutions. 

Why Dremio?

Tian de Klerk:

What stood out for us was, first off, just the ease of use; that was a big thing. It functioned the same way we were used to with Power BI, where we could simply pick a source and import it; Dremio just has different terminology for that, formatting it rather than importing it. And then, once all the data is on the lakehouse, we can join it to build views, which doesn't mean we're copying any data, it just means there's a view created that Power BI can use. So that was already more performant for Power BI, just to reference that. It also had better data security integrated right into it. We were using Okta, for example, and we were able to set up Okta quite easily and onboard users through a straightforward onboarding process. The connection to Power BI was also seamless, with single sign-on using the company's credentials, so that solved that quite quickly; we were sold from there and started integrating more and more of our workflows, building on top of it. 

Architecture After Dremio

Tian de Klerk:

So this is our architecture afterward. It's growing––this is our story so far, so hopefully there might be a follow-up where we can explain a bit more of what we ended up doing. But we could reuse a ton of the effort we had already put in. That part I mentioned, where we have our ETL engineer tapping into various APIs to try and get the data––we could reuse that. We did need to clean up how we were doing it previously, because it had grown a bit wild. But as long as your data is in good order, which it should be because you're running a data lake, this lakehouse, overlaid on top of it, worked seamlessly. And then from there, we worked with our stakeholders, and with ourselves, because we're our own primary stakeholder currently, to try and replicate the views we had in Power BI, and succeeded. So we were able to move that knowledge that was trapped inside of Power BI one step backward, into Dremio, and that enabled us to potentially reshare it too. If someone wants to go query a particular model, which is a term we used to use for something built inside of Power BI, they can go look at Dremio, because they can log into it simply. And yeah, so that alleviated that pain. 
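For reference, here is a minimal sketch (not from the webinar) of what "moving a Power BI model one step back into Dremio" can look like: a view that performs a join previously done in Power BI, submitted from Python over ODBC. The space, dataset, and column names and the "Dremio" DSN are hypothetical assumptions for illustration only.

```python
# Hedged sketch: recreate a join that previously lived in Power BI as a
# Dremio view (virtual dataset). All paths, columns, and the DSN are
# placeholders; older Dremio versions may use CREATE OR REPLACE VDS instead.
import pyodbc

CREATE_VIEW_SQL = """
CREATE OR REPLACE VIEW it_bi.servers_with_monthly_cost AS
SELECT
    cmdb.ci_name,
    cmdb.environment,
    bill.billing_month,
    bill.cost_usd
FROM lake.cmdb.servers AS cmdb
LEFT JOIN lake.cloud_billing.monthly_costs AS bill
    ON cmdb.ci_name = bill.resource_name
"""

# Assumes a configured Dremio ODBC / Arrow Flight SQL driver DSN named "Dremio".
conn = pyodbc.connect("DSN=Dremio;UID=analyst_user;PWD=analyst_password", autocommit=True)
conn.cursor().execute(CREATE_VIEW_SQL)   # no data is copied; only the view definition is stored
```

Because the view holds no data, BI tools and other consumers query the same definition rather than their own extracts.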

And also, Dremio is quite accessible. I know it lists Power BI and Tableau there, but it has a very, very powerful ODBC connector as well, so anything that can utilize ODBC can connect to this. In addition, the API calls were incredible. The Arrow Flight connector inside of Python just worked awesomely, so our ETL guys could do things in the background; if they ever needed to reuse the data for some reason, they could tap into that and help us deliver it to our end users. 
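As a companion to the Arrow Flight point above, this is a minimal sketch of reading a Dremio dataset from Python with pyarrow's Flight client. The host, port, credentials, and view path are assumptions; Dremio Cloud deployments additionally use TLS and personal access tokens, so connection details will differ.

```python
# Hedged sketch: read a Dremio view into an Arrow table over Arrow Flight.
# Host, port, credentials, and the view path are placeholders, not values
# from the webinar.
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tcp://dremio.example.internal:32010")

# Basic auth returns a bearer-token header to attach to subsequent calls.
bearer = client.authenticate_basic_token("analyst_user", "analyst_password")
options = flight.FlightCallOptions(headers=[bearer])

sql = "SELECT vendor, SUM(cost_usd) AS total_cost FROM it_bi.cloud_spend GROUP BY vendor"

# Ask Dremio for the query's flight info, then fetch the result stream.
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()            # pyarrow.Table; table.to_pandas() also works
print(table.num_rows, "rows fetched")
```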

Business Outcomes

Tian de Klerk:

So what does this mean? Cosmos DB, which to be fair was the wrong tool, we ended up being able to remove completely––we got some more value out of it, but we were able to remove it entirely, cutting our Azure costs roughly in half. So our normal running costs as an internal team have been halved, and we can do more things now because our budgets are freed up. Direct access, yeah––the time to access the data lake was accelerated. As I said, if we can pull in the data, I can get one of my data engineers to look at the data and work on views basically as soon as the ETL engineers are done. And the query time improvement is vast. I don't know how much everyone knows about Power BI, but you rely on the mercy of the processing Microsoft makes available in the service, or you have to pay quite a lot of money for premium features that give you some acceleration. Being able to move that processing a step back, to where we have a bit more control––yes, you're paying for the processing, but it's way more efficient, and it's less than you would pay inside of Power BI. So we were seeing about a 30% improvement, and that's with the manipulations already happening inside of Dremio. 

What’s Next?

Tian de Klerk:

And I hinted towards these: our plan has always been to centralize more and more of the IT data, as much as humanly possible, because if the data sits next to each other here in Dremio, we can combine it, and that's one of the biggest features we see in it. Our product has changed from potentially a report to more of a dataset, and we can run a dataset as a product using the Iceberg feature. I see Iceberg listed there, but Arctic has recently been enabled on Azure, and it works awesomely. It means we can version our data sets, and we can run that as a product. 

We have robust access controls––we're looking at how we implement this in a seamless way, where we can replicate the access end users would have had in the source systems. For instance, in ServiceNow, we want to give people the ability to query ServiceNow data quite easily, but there are some data sets that they're not supposed to see. Well, if we can pull their permissions from ServiceNow, Dremio has quite impressive row-level security features that we'll be able to roll out, basically looking up what permissions the user should have and then implementing that.
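To make the row-level-security idea concrete, here is one hypothetical sketch of the pattern described: an entitlements table synced out of ServiceNow, and a view that filters rows on Dremio's query_user() context function. All dataset and column names are invented; Dremio's native row-access policies are another way to achieve the same result.

```python
# Hedged sketch: row-level security via a view that joins a permissions
# table (synced from ServiceNow) and filters on the querying user.
# All dataset and column names are hypothetical.
import pyodbc

SECURE_VIEW_SQL = """
CREATE OR REPLACE VIEW it_bi.servicenow_incidents_secure AS
SELECT inc.*
FROM lake.servicenow.incidents AS inc
JOIN lake.servicenow.group_entitlements AS ent
    ON inc.assignment_group = ent.group_name
WHERE ent.user_email = query_user()   -- Dremio context function: the current user
"""

conn = pyodbc.connect("DSN=Dremio", autocommit=True)  # assumes a configured Dremio DSN
conn.cursor().execute(SECURE_VIEW_SQL)
```

End users then query the secure view instead of the raw incidents dataset, and each user sees only the groups they are entitled to.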

And the final bit: as I said, we serve divisions. At the point where we start cross-charging, or at least charging for the service we provide to them, we can map how much they cost to run by routing them to particular engines. This is a feature inside of Dremio where you can see what group, or who, is querying something, and redirect them to certain compute. That gives us a few things we're leveraging: we can ensure quality of service––we're not stepping on each other's toes––but we also get to allocate that cost, and we now know precisely how much it costs to host that person inside of Dremio. So if they were to run large queries, we would know. 

Closing

Tian de Klerk:

And that is it, everyone. Our cloud journey is ongoing, and we're very excited about what's next in terms of what I've just shown. I'm hoping to have a follow-up where I can tell everyone about the awesome things we're going to do with Arctic, and how we got around all our access limitations. But Dremio has been key to us saving a large amount of money while increasing what my team can output. I tell my people we're a team of a certain size, and we should build to be able to serve a product with a team of that size. And this has enabled us to do so. Yeah. That's me, everyone.

Q&A

Alex Merced: 

Thank you very much for that presentation, Tian. That was phenomenal. It's cool to see how you guys have overcome the challenges that you had, and what those challenges were. Now we do have a question––and again, if anyone wants to ask a question, please post it in the question and answer box, and I will relay it to Tian. So the first question we got was from Gourav: why not Delta instead of Dremio? What were the criteria for choosing one or the other? Or, reframed: why Iceberg over Delta, since Delta is a table format? 

Tian de Klerk: 

Yeah, so that's a fantastic question and fundamental to our team. Our core functions started with pulling in data, and we picked the worst format possible––everyone's going to cringe hard––but we picked CSVs. Why we picked CSVs was the ability to share those files with people: not everyone using the data necessarily knows what Parquet is, so they ended up using CSVs and opening them up in Excel. So our current infrastructure, our current data lake, is mostly built on CSVs, and one of the things we're leveraging––precisely why Iceberg was quoted there––is taking those CSVs and transforming them into Iceberg. And you're right, we are getting files as deltas––but CSV deltas––and Iceberg allows us to upsert into them, which gives us another cool feature: the ability to query over time. That's a thing you can do in Iceberg––you can actually snapshot and go backward in time, which is part of the versioning. I hope I answered that.
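As an illustration of the time-travel capability mentioned here, this is a minimal sketch of querying an Iceberg table in Dremio as of an earlier point in time. The table path and timestamp are hypothetical; Dremio also supports AT SNAPSHOT '<id>' when you know the specific snapshot you want.

```python
# Hedged sketch: query an Iceberg table's state as of a past timestamp.
# Table path and timestamp are placeholders.
import pyodbc

TIME_TRAVEL_SQL = """
SELECT ci_name, environment, status
FROM arctic.servicenow.server_inventory
AT TIMESTAMP '2024-01-15 00:00:00.000'
"""

conn = pyodbc.connect("DSN=Dremio", autocommit=True)  # assumes a configured Dremio DSN
for row in conn.cursor().execute(TIME_TRAVEL_SQL):
    print(row)
```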

Alex Merced: 

Yeah, very cool. And then yes, on top of that, if you want to use features like Arctic, you have to be using Iceberg. On that note, you talked about using Arctic, creating versions, and productizing them. Can you tell us a little bit more about how you're approaching that and how you've been using it so far? 

Tian de Klerk: 

Sure. So most of our data, we realized, is updating every day. That's our normal cadence, and we want to have data at that level. So what we ended up doing was building an orchestrator outside of Dremio. It's not a big thing––all it's doing is poking at Dremio, using the APIs that you've documented, to take a new file as it drops and upsert it into Iceberg. We are looking at branching that, so running it like you would with code: branching it out, doing the update, merging it back in. That way, if something was wrong with that file, we can just roll that day back. And we're looking at doing the same for the next semantic layer, so we're doing Arctic at different layers. In the next semantic layer, we're going to build views, and when we build the views, we can now version those views too. So if someone says, yeah, but I wanted that column included in this view and you excluded it, we can include it as a version. So that's basically what I meant by that product.
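A minimal sketch of the branch, update, merge loop described here, using Dremio's catalog-versioning SQL (CREATE BRANCH, USE BRANCH, MERGE BRANCH) together with MERGE INTO for the upsert. Every name (the "arctic" catalog, table, landing file, columns, and branch) is an invented placeholder, and the exact SQL may vary by Dremio version, so treat this as the shape of the workflow rather than a recipe.

```python
# Hedged sketch of the daily branch -> upsert -> merge loop described above.
# Catalog, table, staging path, columns, and branch name are placeholders;
# statements are submitted through an assumed Dremio ODBC DSN.
import pyodbc

BRANCH = "etl_2024_01_30"

STATEMENTS = [
    # 1. Create an isolated branch of the versioned catalog for today's load.
    f"CREATE BRANCH {BRANCH} IN arctic",

    # 2. Point this session at the branch so the upsert lands there.
    f"USE BRANCH {BRANCH} IN arctic",

    # 3. Upsert the day's dropped CSV delta into the Iceberg table.
    """
    MERGE INTO arctic.servicenow.server_inventory AS tgt
    USING (SELECT * FROM lake.landing."server_inventory_2024_01_30.csv") AS src
        ON tgt.ci_name = src.ci_name
    WHEN MATCHED THEN UPDATE SET status = src.status, environment = src.environment
    WHEN NOT MATCHED THEN INSERT (ci_name, status, environment)
        VALUES (src.ci_name, src.status, src.environment)
    """,

    # 4. Publish by merging the branch back into main; a bad file can simply
    #    be left unmerged, or that day's change rolled back, instead.
    f"MERGE BRANCH {BRANCH} INTO main IN arctic",
]

conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cur = conn.cursor()
for stmt in STATEMENTS:
    cur.execute(stmt)
```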

Alex Merced: 

Awesome. The next question is: the Dremio semantic layer––is there a performance difference compared to, let's say, using a SQL server?

Tian de Klerk: 

Good question again. To be honest, we never even considered SQL Server. Why? Because the base architecture we ended up having is just those tons of files. It was easier to get the files in order and then place compute next to them. To be fair, we were looking at something like Synapse, which is Microsoft's data warehouse and is more complex. But to get data into it, we needed to add so many checks to make sure the data fit nicely into Synapse. I don't know why, but we ended up struggling with that, and one of the key selling points of Dremio, for us at least, was that it was just plug and play. Yes, there is potentially a bit of moving files that look the same under the same folder––and yes, I just said that some people didn't do that, so my poor developers had files of different things inside the same folder, and we couldn't just format an entire folder. So yeah. Hope that answers that one.

Alex Merced: 

The next question we have is: did you have to implement any physical ETLs inside of ADLS, besides creating virtual views and the semantic layer in Dremio?

Tian de Klerk: 

That's a very good question. So, no––for most of our use cases, no. That's precisely why we have this: all of the data that we're picking up into Dremio is just those data sets. We have run into completely fair limitations. For instance, super-nested JSON––you can read it in Dremio, but it's a pain, so we've had some additional ETL loops outside of it, just opening up that JSON to flatten it out and then putting it back for Dremio. Dremio can handle one or two layers of nesting, but this one was really wonky. So things like that. I can't think of another example off the top of my head, but they're exceptions, not the rule. In most cases, we just pick the data straight up from the data lake.
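For the nested-JSON exception mentioned here, a small pre-processing step outside Dremio might look like the following sketch: flatten the structure with pandas and write a flat file back for Dremio to pick up. The file names, record layout, and field names are invented for illustration.

```python
# Hedged sketch: flatten deeply nested JSON before landing it in the lake,
# so Dremio only has to read a flat file. Paths and field names are made up.
import json
import pandas as pd

with open("service_events.json") as f:
    records = json.load(f)          # e.g. a list of deeply nested event objects

# json_normalize expands nested objects into prefixed columns and can explode
# a nested list via record_path; here we flatten one hypothetical list of updates.
flat = pd.json_normalize(
    records,
    record_path=["updates"],                 # hypothetical nested array
    meta=["event_id", ["source", "system"]], # hypothetical parent fields to carry along
    sep="_",
)

# Write the flattened result back out (in practice, back to ADLS) for Dremio.
flat.to_csv("service_events_flat.csv", index=False)
```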

Alex Merced: 

Got it. Okay, so for a lot of it, you're just basically curating your semantic layer right there on the data. And that's awesome. Cool. I think that answers the question. I'll give you a moment to see if anyone has any additional questions. 

Tian de Klerk: 

But Alex, on that point, just on curating right on the raw data––you'll notice that because CSVs are not necessarily the most performant files, and Iceberg is the better format, what we're trying to do to improve this is use that Arctic catalog that's essentially created. I'll use the ServiceNow data as an example. If we have an inventory of servers, we would create a base foundation of that––this is the inventory of servers, in Iceberg––and then daily, as the delta drops, you would upsert those into it and version as you go along. That means from this layer, the Iceberg layer, we now have a more performant layer to query. We noticed immediately, when we started picking this up with Dremio, that it was just night and day compared to querying CSVs. And the capability is there to reformat your entire data lake if you want, and it's potentially going to be cheaper in the long run, because instead of querying an entire folder of CSVs, we now just have to query that Iceberg table and update it every day.
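And for the "base foundation" step itself, a minimal hypothetical sketch: a one-time CTAS that materializes the promoted CSV folder as an Iceberg table in the versioned catalog, after which the daily deltas are merged in as sketched earlier. All names are placeholders.

```python
# Hedged sketch: one-time conversion of a promoted CSV folder into an
# Iceberg table; daily CSV deltas are then MERGEd into it (see the earlier
# branch/upsert sketch). All names are placeholders.
import pyodbc

CTAS_SQL = """
CREATE TABLE arctic.servicenow.server_inventory AS
SELECT ci_name, status, environment
FROM lake.servicenow."server_inventory_csv"   -- folder of CSVs formatted as one dataset
"""

conn = pyodbc.connect("DSN=Dremio", autocommit=True)  # assumes a configured Dremio DSN
conn.cursor().execute(CTAS_SQL)
```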

Alex Merced: 

Very sweet, I love that. Cool––and then any last thoughts or recommendations? If someone is thinking now, hey, I want to explore Dremio, what would be your recommendation as far as the first step they should take?

Tian de Klerk: 

Interestingly enough, I had someone ask me this a few days ago––spin up the community edition. Just spin it up, connect it, and see what I'm saying. Go drop a few files in S3 or in Azure storage, pick them up with Dremio, and mess around a bit. Honestly, it was eye-opening how simple it could be, and as far into the journey as we are now––we're about a year in––I can see that there are more complexities to it than we [knew]. Everyone kept telling us, reformat to Iceberg! Reformat to Iceberg! [And we said], nah, it's fine, we can read the CSVs, why do that? And then, when we saw the performance, it just made a lot of sense. I would highly recommend spinning up Community and having a look.

Alex Merced: 

Awesome, thank you very much. And again, for those who do want to try the Community edition, just head over to dremio.com/get-started, and there you can get directed to either the Docker container, if you just want to try it on your laptop, or the Kubernetes Helm chart if you want to spin it up in EKS or something like that. But yeah, again, thank you very much, Tian. It was amazing to hear the story, hear the challenges, and hear how those challenges were overcome, and we appreciate having you on the show this week.

Tian de Klerk: 

No problem. Thank you very much for having me. Cheers. 

Alex Merced: 

Yes, thank you. And then, everyone, we'll see you next time––have a great one. Again, make sure to check out Subsurface over at dremio.com/subsurface, and head over to dremio.com, where you can also find great stuff like our State of the Data Lakehouse report. Also, just so you know, if you need a copy of the deck, any resources paired with this presentation will be available when we post it on dremio.com/gnarly-data-waves. There you can find all the presentations from previous episodes, along with any resources attached to those episodes. So this should be up there within the next 24 to 48 hours. Thank you very much. Have a great week. I'll see you all later.
