May 2, 2024

How S&P Global is Building an Azure Data Lakehouse with Dremio

S&P Global is a leading global financial services company headquartered in New York. As a company, data is an essential asset across all of S&P Global’s solution offerings. Learn how S&P built a data lakehouse for FinOps analysis with Dremio Cloud on Microsoft Azure.

We’ll discuss S&P’s cost savings associated with a lakehouse architecture, how the lakehouse simplifies traditional data engineering, and how S&P is enabling self-service analytics for users across their organization.

Topics Covered

Dremio Use Cases
Governance and Management
Lakehouse Analytics
Performance and Cost Optimization

Sign up to watch all Subsurface 2024 sessions


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Tian de Klerk:

My name is Tian de Klerk. I’m originally from South Africa, and I’m currently living in the Netherlands. I have a background in cloud reporting. Originally, I started in BI, and that’s about seven years’ experience. And yeah, I’ve been working towards more of an ETL role, and I basically lead that in the IT business intelligence team, which I now run within S&P Global.

What our team does: we are located in the corporate division of S&P Global, and so we’re responsible for highlighting IT costs, vulnerabilities, assets, and everything around that for visibility for our stakeholders, both in corporate and in the divisions. The team’s responsible for gathering data from various service management, cloud financial, and asset inventory sources, and so much more, to be fair. We essentially have developers that do ETL work through REST APIs or other automations to pull that data into a lakehouse, and we deliver either reporting or data.

The Business

So at our conception, we had a very simple design. At this point, we needed to improve how we were using our data lake. It was unoptimized. We were essentially connecting Power BI directly to it, so we were losing a lot of knowledge within Power BI. We also had the goal, as I was highlighting, of surfacing financial operations to leadership, and we needed to be more optimized to deliver that. We also had difficulty decreasing the total cost of ownership of our infrastructure. We chose poorly when we initially started up and used a product called Cosmos DB, which is a NoSQL database and not meant for what we were doing.

And so, as I was saying, we had rising costs on Cosmos DB, we were essentially building up more and more tech debt, and we lacked self-service analytics. We were reliant on Power BI exports: people were going onto Power BI reports, simply pulling up a table and downloading data from there. We also had a siloed data architecture, meaning that people were struggling to get at certain parts of the data, which meant that people came to us constantly to combine data sets for them.

Challenges With Existing Architecture

So this was our old architecture. As you can see there, we essentially had ADLS and Data Factory ingesting into Cosmos DB, specifically for our CMDB data. We also had Power BI connected directly to ADLS. We were then creating snapshots of some of the data, so we had version 1 and another version of the data that we were then also connecting to Power BI. The reason we did that was essentially to save costs on queries going all the way back to Cosmos DB. This became very inefficient over time as we were doing more and more manipulations inside of Power BI, because the data was, for instance, from AWS, but we had to match it with the CMDB to make it more business relevant.

Architecture After Dremio

So we started investigating Dremio, and we essentially picked it based on these four things: the semantic layer and the ease of use, since setting up Dremio was really simple to get going; very fast performance querying CSVs on a data lake; the ability to actually govern our data properly at this layer already; and no more data copies, no three versions of data.

So this is the architecture we ended up on. Later in the keynote, I’ll touch on the more advanced architecture we now have, but this is what we started out on with Dremio. We essentially laid Dremio straight onto our Azure data lake and created views from there. This performed really well, and we basically connected Power BI directly to that. This meant there was only one copy of the data, and then potentially another copy, which is essentially the reflections inside of Dremio, which meant that we were getting faster performance inside of Power BI. The added benefit was that the data set became more usable, because now I can also ingest it into Tableau, or someone can query it directly, or we can do an export for them, or we can connect it through Excel and an ODBC connector. And so this created a single source of truth where previously we did not have that.
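Conceptually, a Dremio reflection behaves like a materialized cache of a view: the raw data stays in the lake as a single copy, and repeated BI queries hit precomputed results instead of re-crunching the files. A toy Python sketch of that idea, with entirely hypothetical FinOps data (not S&P’s actual schema):

```python
from functools import lru_cache

# One copy of raw data in the "lake" (hypothetical cost rows)
LAKE = [
    {"service": "EC2", "division": "Ratings", "cost": 120.0},
    {"service": "S3",  "division": "Ratings", "cost": 30.0},
    {"service": "EC2", "division": "Indices", "cost": 80.0},
]

def cost_by_division_view():
    """A 'view': logic over the single copy of raw data, no snapshot kept."""
    totals = {}
    for row in LAKE:
        totals[row["division"]] = totals.get(row["division"], 0.0) + row["cost"]
    return totals

@lru_cache(maxsize=1)
def cost_by_division_reflection():
    """A 'reflection': cached result of the view, so a BI refresh pulls
    precomputed numbers rather than re-scanning the lake each time."""
    return tuple(sorted(cost_by_division_view().items()))

print(dict(cost_by_division_reflection()))
```

The design point is the same one the talk makes: the consumer (Power BI, Tableau, Excel) always queries the view, and acceleration is an implementation detail behind it rather than a second, manually maintained copy of the data.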

So for us at least, this meant we reduced the Azure costs stemming from poor architecture choices by about 50% by getting rid of Cosmos DB. We had direct access to our Azure data lake, which unlocked so much: instead of someone having to spin up a Power BI report, connect to the data, and do manipulations there to answer a question, we could query our data live inside of the data lake. The faster query time meant our Power BI reports refreshed about 30% faster, just from moving that processing backwards into Dremio, meaning that the numbers are crunched, the tables formed, and we are essentially just pulling up that view.

What’s Next?

So what is next for us? Our goal is to centralize as much of this IT data as possible so we have a single platform that can answer everyone’s questions. With the advent of all these GenAI applications, the idea is to essentially have a place where a GenAI or LLM model can ask its questions. We also have the ability to route queries based on the engines, and this gave us the ability to allocate costs to divisions if required. So if we have certain data sets that we’re delivering to particular divisions and we need a way to cross-charge, this would enable us to do that, as the costs were bucketed based on the engines they were using. We also get the benefit of robust access control. If you’re not aware, Dremio supports user-defined functions, which can be leveraged to essentially establish access control and row-level security on data sets, as well as the ability to implement access control through AD or Okta. So access control is then offloaded from my team onto AD, so no one on my team has to permission it, and we can have different approvers for different data sets.
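In Dremio itself, row-level security is expressed as SQL user-defined functions attached to data sets as row-access policies, keyed to the caller’s identity or group. The same idea can be illustrated in plain Python: a rule function, driven by hypothetical directory group memberships, decides which rows a given user’s query returns (all names and data below are invented for illustration):

```python
# Hypothetical AD/Okta group memberships synced from the identity provider
GROUPS = {
    "alice": {"finops-ratings"},
    "bob":   {"finops-ratings", "finops-indices"},
}

# Hypothetical rows in a shared cost data set
ROWS = [
    {"division": "Ratings", "cost": 150.0},
    {"division": "Indices", "cost": 80.0},
]

def can_see(user, row):
    """Row-access rule: the user must belong to the division's group.
    In Dremio this logic would live in a SQL UDF applied as a
    row-access policy, so the engine filters rows before they leave."""
    return f"finops-{row['division'].lower()}" in GROUPS.get(user, set())

def query(user):
    """A query only ever returns the rows the policy allows."""
    return [row for row in ROWS if can_see(user, row)]

print([row["division"] for row in query("alice")])  # alice sees only Ratings
```

Because the rule reads group membership rather than a hard-coded user list, granting access becomes an AD/Okta approval flow instead of a ticket to the data team, which is exactly the offloading described above.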

Finally, we started leveraging Git for Data, so Arctic, by essentially spinning up our tables within those Arctic catalogs. I’ll touch on it a bit more in the keynote. But we started heavily using that and essentially built out a way for us to process changes to our data. Now, I blazed through my slides because I was hoping this would be more of a discussion. So thank you for that.
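Git for Data brings version-control semantics to tables: an ETL run lands its changes on an isolated branch, the results get checked, and only then are they merged into the main catalog that consumers see. A toy sketch of that workflow, where branches are just snapshots of catalog state (this is an illustration of the concept, not Dremio’s or Nessie’s actual API):

```python
import copy

class Catalog:
    """Toy versioned catalog: each branch is a named snapshot of table state."""
    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # Branching takes a copy of the source branch's current state
        self.branches[name] = copy.deepcopy(self.branches[source])

    def write(self, branch, table, rows):
        self.branches[branch][table] = rows

    def merge(self, source, target="main"):
        # Publishing: fold the branch's tables back into the target
        self.branches[target].update(self.branches[source])

cat = Catalog()
cat.write("main", "it_costs", [{"month": "2024-03", "cost": 100}])

# Process a data change on a branch; main stays untouched until merge
cat.create_branch("etl-run")
cat.write("etl-run", "it_costs", [{"month": "2024-04", "cost": 110}])
assert cat.branches["main"]["it_costs"][0]["month"] == "2024-03"

cat.merge("etl-run")  # checks passed, publish to consumers
print(cat.branches["main"]["it_costs"])
```

The payoff is the same as in code review: readers of `main` never see a half-finished load, and a bad ETL run can simply be abandoned with its branch.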