May 3, 2024
Supporting the Nutanix transformation journey with a Data-as-a-Service platform built on Dremio
Fast-tracking data provisioning to key Nutanix business verticals by building Data-as-a-Service with Dremio.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Sukumar Bairi:
Hey, folks. Good morning, everyone. I hope you're having a good day. Today I want to talk about our Data-as-a-Service platform. I'm Sukumar Bairi, and I've been with Nutanix for the last three years. We have built a platform called Data-as-a-Service, which we use to make data accessible across teams, and it serves as a one-stop shop for data democratization, with data coming in from many different data sources. Today I want to go deeper on this platform: the problem statement, why we had to build this kind of platform, why we needed a data lakehouse, the cluster architecture, and the use case details we want to discuss.
Data-as-a-Service Platform
Yes. So this is the Data-as-a-Service platform, which we use as an internal data management platform. We built it so that teams whose data lives across different data sources can bring it together in a single place and build analytics on top of it. The platform helps us with data provisioning: we can take data from a wide variety of sources and integrate it with different API platforms or with Python programs. We also use it for data management. The confidentiality of the data varies from level one, which is the most confidential, up to level five, which is regular public data, and the platform is integrated with our RBAC system, so access is controlled through role-based access. We usually create all the spaces, or schemas, on a project basis.
So whenever a user joins a project, we can grant that user access to the project's space, and they don't need to go anywhere else to get access to the individual data sources. The platform is also highly scalable: we built the system on top of Kubernetes, the Nutanix Kubernetes Engine, so we have a scale-up and scale-down model integrated with it. Coming to security, since we control data authorization using RBAC, we have LDAP and Okta integration in our system. And we have built a lot of Tableau dashboards on top of this platform, which give us analytics and insights into different data models that we use for data-driven decisions.
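As a rough sketch of that project-level access model, assuming a Dremio edition with SQL-based RBAC (the space, role, and user names below are hypothetical, and the exact GRANT syntax may differ in your edition):

```python
# Hypothetical illustration of space-per-project RBAC, assuming Dremio's
# SQL GRANT/ROLE support. None of these names come from the actual
# Nutanix environment.
PROJECT_GRANTS = [
    'CREATE ROLE "project_alpha_members"',
    # Everything a project needs lives in its space; access is granted once, at the space level.
    'GRANT SELECT ON SPACE "project_alpha" TO ROLE "project_alpha_members"',
    # When someone joins the project, membership in the role is all they need.
    'GRANT ROLE "project_alpha_members" TO USER "new.engineer@example.com"',
]
```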
Problem Statement
So what is the first thing that comes to mind when we talk about this platform, and why did we have to go with this kind of platform? As we grew from a small startup into the company we are today, we started as a hardware company and then became a software company. The data models keep changing, the data that flows into our systems keeps changing, the applications change, and there are a lot of schema changes, so we had to come up with a system that gives us automatic schema evolution, which Iceberg supports today. We also have data scattered across multiple data sources. As the company grew, multiple teams ended up with their own databases running in silos, and there is no connectivity between those databases, because one might be on MongoDB, say, and another on MySQL or PostgreSQL. So it's difficult for us to join data across those sources.
And as a growing company, we have to build dashboards that are low latency, and these are customer-facing applications. So we strive to build a system that has low latency and doesn't impact any of the cluster activities. The structure of the data also changes: whenever a new patch of the software is released, the data model or the data structure changes, and we have to cope with all those changes, possibly redefining the structure of the data or the tables and all the upstream and downstream applications that depend on that data source.
Need for Data Aggregation
So why do we need data aggregation across these data sources? We might have finance or accounts data in one system, and the logs or the pulse data of those systems in different data sources. That is the main reason we want data aggregation: to build data-driven analytics and decisions on top of it. It gives us improved decision making with better insights, and it increases the accuracy of our data-driven decisions so that we can plan our future activities accordingly. Here is a diagram of how data is funneled in from different data sources. On top of those sources in Dremio we create physical datasets, which we call PDSs, and we restrict access to the PDSs. Then we create virtual datasets, which we call VDSs, and on top of those we build the business logic, as in the sketch below.
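To make the PDS-to-VDS layering concrete, here is a hedged sketch with made-up dataset names: the physical dataset is promoted from a source and kept restricted, while the business logic lives in virtual datasets (views) inside a project space, using standard Dremio CREATE VIEW syntax.

```python
# Hypothetical PDS -> VDS layering. "s3_telemetry" is an imagined promoted
# physical dataset; the views are virtual datasets inside a project space.
PDS = '"s3_telemetry"."raw"."cluster_pulse"'   # physical dataset, access restricted

STAGING_VDS = f'''
CREATE OR REPLACE VIEW "project_alpha"."stg_cluster_pulse" AS
SELECT cluster_id, CAST(event_ts AS TIMESTAMP) AS event_ts, metric, metric_value
FROM {PDS}
'''

BUSINESS_VDS = '''
CREATE OR REPLACE VIEW "project_alpha"."daily_cluster_health" AS
SELECT cluster_id, DATE_TRUNC('DAY', event_ts) AS day, AVG(metric_value) AS avg_value
FROM "project_alpha"."stg_cluster_pulse"
WHERE metric = 'health_score'
GROUP BY cluster_id, DATE_TRUNC('DAY', event_ts)
'''
```

Users only ever see the project space; the underlying PDS stays locked down.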
So all the users, or all the entities, are organized into spaces at the project level, and whoever requires access to a particular data source is granted access to that space. We use this platform for the last mile of the data pipeline, so we have a system that is highly performant when we are talking about millisecond- or second-latency dashboards. We have integrated it with multiple business intelligence tools like Tableau and Power BI, and we also expose an endpoint that can be used from different Python applications.
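For the Python integration mentioned above, a minimal sketch of querying the platform over Dremio's Arrow Flight endpoint might look like this; the host, port, credentials, and dataset path are placeholders, not the actual environment.

```python
import pyarrow.flight as flight

# Connect to the Dremio coordinator's Arrow Flight port (32010 by default).
client = flight.FlightClient("grpc+tcp://dremio.example.internal:32010")

# Dremio's Flight endpoint exchanges basic auth for a bearer-token header.
token_header = client.authenticate_basic_token("svc_reporting", "***")
options = flight.FlightCallOptions(headers=[token_header])

query = 'SELECT * FROM "project_alpha"."daily_cluster_health" LIMIT 100'
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Results stream back as Arrow record batches; convert to pandas for convenience.
reader = client.do_get(info.endpoints[0].ticket, options)
df = reader.read_pandas()
print(df.head())
```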
Why a Data Lakehouse?
So again, when we talk about why we want a data lakehouse: we have RDBMS systems like MySQL or PostgreSQL, or AWS RDS instances, and we also have data on AWS S3 or Nutanix Objects. This might be application-specific data, application logs, or pulse data, which has to be integrated so that we can drive decisions out of it. We have to write logic that can go through all these data sources and come up with a target table. Because of these kinds of heterogeneous sources, we had to go with a platform that works as a data lakehouse. And the volume, variety, and velocity of data that we get on a daily basis are also supported by this platform.
Coming to data governance, since we manage all data access in one system, it is easy for us to manage access to the different data sources. This diagram gives an overview of how we built the system. On the left are the data sources; they are not limited to what's in the diagram, but as an overview, these are the sources connecting to our DaaS platform. We built the system on Kubernetes, the Nutanix Kubernetes Engine, and integrated it with Nutanix object storage. All the data that needs to be stored lands in Objects as Iceberg tables, so we have a pipeline where we can query these highly performant Apache Iceberg tables. We connect to the system from different APIs, we again use Nutanix object storage as the storage solution for different use cases, and we also connect from Tableau and other business intelligence tools.
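As an illustration of landing data in Objects as Iceberg (the source and table names are made up, and this assumes the object-storage source is configured so that Dremio's CTAS writes Iceberg tables):

```python
# Hedged sketch: materialize a curated dataset as an Apache Iceberg table on
# the Nutanix Objects (S3-compatible) source. "nutanix_objects" is an assumed
# source name configured in Dremio.
CTAS = '''
CREATE TABLE "nutanix_objects"."lakehouse"."telemetry_events" AS
SELECT cluster_id, event_ts, metric, metric_value
FROM "s3_telemetry"."raw"."cluster_pulse"
'''
```

Once the table lives in Objects as Iceberg, downstream VDSs and reflections can be built against it instead of hitting the original source.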
Architecture
And this is an overview of the architecture and how we did it. On the left are the kinds of use cases we deal with on a daily basis: we get telemetry data, accounts data, and support data; say the telemetry data is on S3, the accounts data is in PostgreSQL, and the support data is in MongoDB. With all these data sources in different places, we had to come up with a solution, which is the DaaS platform, built on top of the Kubernetes engine with Dremio, and we provide that platform to the users. On the right is the actual system we built: a five-node cluster with the Dremio coordinator and high availability set up. So we have all these input data sources feeding in, and on the right are the output data sources.
Use Case
So coming to the use case, I would like to talk about one use case in our environment that we solved with this platform: enterprise Jira data. This dataset contains data from the different projects across our company, covering multiple projects, sub-projects, and types of productivity activities, so that managers can plan the workload among their developers, and it is a single source of truth for Jira data so that management can check the progress of different projects. If something is not going through, this is the place to go and check what's happening, help the team, provide more resources, or re-plan accordingly.
And the problem statement here is that we have to query large datasets on the production database, which hinders the performance of the production database when we run a recursive query on a large dataset. So we had to move this load to a different system so that production applications are not impacted when we run that kind of recursive query. The complex query logic with hierarchical queries works fine when the load on the database is nominal, but at quarter-end or during peak hours of the day the query runs longer and impacts the dashboard refreshes. We also needed the option of refreshing the backend tables on demand, which is not really possible on a regular RDBMS.
So yeah, the last one is low-latency dashboard refreshes, which help us get the latest data visualized on the dashboards. And the solution that we came up with is to migrate the base table, which is used as the main table for that dataset, to Apache Iceberg format in Nutanix object storage. So what we did is create an Apache Iceberg table in the Nutanix object store for that large dataset, which acts as the main table. To give an overview of this table: it is a view, or a table, built on top of about 17 other tables. We are getting data from different data sources, and it can be a 20-plus-million-record table, which then has to be joined with another five or six tables, each with around six to eight million records. And we have to go with a recursive query so that we can pull all the records required by different algorithms. When we run this type of query on a database, it impacts the performance of the database, or the query even fails with memory issues.
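To show the shape of that recursive workload, here is a hedged sketch over a hypothetical Jira-style issue table, written in standard PostgreSQL recursive-CTE syntax; it is not the actual Nutanix query.

```python
# A hierarchical query of the kind that was hurting the production database:
# walking an epic -> story -> sub-task tree with a recursive CTE.
HIERARCHY_QUERY = '''
WITH RECURSIVE issue_tree AS (
    SELECT issue_id, parent_id, project_key, status, 1 AS depth
    FROM jira.issue
    WHERE parent_id IS NULL                 -- start from top-level issues
    UNION ALL
    SELECT c.issue_id, c.parent_id, c.project_key, c.status, t.depth + 1
    FROM jira.issue AS c
    JOIN issue_tree AS t ON c.parent_id = t.issue_id
)
SELECT * FROM issue_tree
'''
```

Run repeatedly against a 20-plus-million-row production table at peak hours, queries like this compete directly with the application workload.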
So we created a pipeline to incrementally update these tables on a scheduled frequency in Apache Iceberg, since it also supports concurrent operations. We built the business logic on these datasets as different VDSs and enabled reflections. When we enabled reflections, we configured our system to use a distributed data store on Nutanix Objects, which is on-prem. That was another requirement for us: we have to keep all our conventional data on-prem, so we had to find an option that gives us that kind of distributed storage on-prem instead of going to public cloud storage. So we integrated the system with Nutanix object storage, which is our on-prem solution for the distributed data store, and we enabled reflections on the business logic VDSs. We have also built some APIs that are scheduled on a frequency and are event driven: even though they run on a schedule, they check the backend tables for certain conditions, and only if those are met do we refresh. And all the downstream reflections that depend on the master table, or base table, are also refreshed. Then we built the Tableau dashboards that connect to these business logic VDSs, and since they are very low latency in transferring the data over the network, the refreshes are pretty fast.
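A hedged sketch of that scheduled incremental load into the Iceberg base table, assuming Dremio's MERGE INTO support for Iceberg tables; the table, column, and watermark values are illustrative only.

```python
# Incremental upsert of recently changed Jira rows into the Iceberg base table.
# A real pipeline would derive the watermark from the previous run instead of
# hard-coding it.
INCREMENTAL_MERGE = '''
MERGE INTO "nutanix_objects"."lakehouse"."jira_issue_base" AS tgt
USING (
    SELECT issue_id, project_key, status, updated_at
    FROM "postgres_prod"."public"."jira_issue"
    WHERE updated_at > TIMESTAMP '2024-05-01 00:00:00'
) AS src
ON tgt.issue_id = src.issue_id
WHEN MATCHED THEN UPDATE SET status = src.status, updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (issue_id, project_key, status, updated_at)
    VALUES (src.issue_id, src.project_key, src.status, src.updated_at)
'''
```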
So what did we achieve using this model? We are able to query these large datasets with hierarchical or recursive kinds of queries and get the data back in milliseconds. When we moved this large dataset to Apache Iceberg format, residing on the Nutanix object store, it became a local table to the system. So we are able to query it and get the results back, even for a join between two 20-million-plus-record tables, in milliseconds, or at most a second. And we have shifted all the high-compute workloads off our production DB onto the platform behind our dashboards.
So this way, we have not only made this use case a success, but we have also helped all the other applications running on the production database by freeing up a lot of resources on that system. And we have opened up access so that users can refresh the reflections on demand: whenever there is an ad hoc request or an ad hoc data load into the base tables or the dependent tables, a user can log in directly and refresh the reflection, or they can call the API, which is also integrated.
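The on-demand hook described above can be as small as a call to Dremio's REST API; this sketch assumes the v3 catalog refresh endpoint, which re-runs the reflections that depend on a physical dataset, and the URL, dataset id, and token handling are placeholders.

```python
import requests

DREMIO_URL = "https://dremio.example.internal:9047"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

def refresh_reflections(dataset_id: str) -> None:
    """Trigger a refresh of the reflections that depend on the given physical dataset."""
    resp = requests.post(f"{DREMIO_URL}/api/v3/catalog/{dataset_id}/refresh",
                         headers=HEADERS)
    resp.raise_for_status()

# Example: called by the event-driven job once its backend-table checks pass.
# refresh_reflections("1a2b3c4d-aaaa-bbbb-cccc-000000000000")
```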
And then the last one is data refreshes through reflection refresh. Whenever we call this scheduled API and it touches the base table, it also refreshes all the reflections that depend on that base table. With this use case, we were able to demonstrate the ability of Dremio as a DaaS service, and we were able to leverage the performance benefits of Apache Iceberg format tables using the Nutanix object store as the data source in the backend. Yep. So this is all that we wanted to present today.