Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Hey, everybody! This is Alex Merced, and welcome to another episode of Gnarly Data Waves where we’ll be talking about how Memorial Sloan Kettering accelerates cancer research with Dremio's Data Lake House.
But before we do that, let's talk about Dremio Test Drive. So make sure to get your hands on a Data Lake House at no cost, no obligation by trying out Dremio Test Drive. Just head over to dremio.com, click that test drive button and you're off to the races. Also make sure you pick up an early copy of Apache Iceberg: The Definitive Guide from O'Reilly, which is slated for release early next year, but you can get yourself an early copy now by scanning that QR code right there.
Also, we've got several great episodes upcoming on Gnarly Data Waves, such as “What's New in the Apache Iceberg World”; “Machine Learning, Experimentation, Reproducibility on the Lake House”; “Dremio and the Data Lake House Table Formats”; and “How Dremio and DuckDB Can Accelerate and Simplify Your Analytics”. A lot of great topics coming up.
Dremio will also be doing these events with AWS and Tableau in New York, Chicago, and Toronto. So if you're here on any of these days, make sure to check out those events. And we'll also be all around the world at different conferences, such as Big Data AI, Big Data London, Coalesce Data Festival, Big Data and AI World. So if you're at any of these events, make sure to check out the Dremio table and go pick up some Dremio swag.
And without further ado, let's talk about data mesh in practice: accelerating cancer research with Dremio's Data Lakehouse. And for this presentation, we have Arfath Pasha, senior software engineer at Memorial Sloan Kettering, and Tony Truong, senior product marketing manager here at Dremio. Arfath, Tony, the stage is yours.
Alright. Thank you, Alex, for the warm introductions, and, like Alex had mentioned today, we have a very special guest: his name is Arfath Pasha. He is a senior software engineer at MSK, and in today's episode he'll be going over how he uses Dremio for data mesh in order to accelerate cancer research. And before I start, if you have any questions any time throughout the session, go ahead and leave them in the Q&A section. With that being said, Arfath, the show is yours.
Memorial Sloan Kettering Cancer Center
Thanks, Tony, for this opportunity. We've had the pleasure of using Dremio now for 2 years, and I'd love to share the story. So I'm a software engineer at Memorial Sloan Kettering Cancer Center. It's one of the leading cancer care institutes in the US. MSK was founded in 1884, and not many people know this, but MSK has 2 wings: the Memorial Hospital, where patient care happens, and the Sloan Kettering Institute, where a lot of research is conducted. MSK treats more than 400 different cancer subtypes, and in 2021 there were a lot more outpatient visits than inpatient visits, as you can see here. This is deliberate, to reduce the amount of time patients spend in the hospital, so that their recovery and overall care are much better.
On the research side, there are over 1,800 research protocols being conducted at MSK, and it's very common to see physicians who take care of patients as members of the hospital also participating in research activities and being principal investigators on research grants in the Sloan Kettering Institute. The Sloan Kettering Institute itself hosts many teams that are staffed by data scientists, researchers, doctoral and postdoctoral students, and of course, engineers and other support staff that are needed for research. The Institute is very well known for discovering many cancer types and many different cancer genes, and for learning about cell signaling pathways and immune system response, because all of these are very important for treating cancer patients. Because of the large volume of patients that we have, we have very unique datasets across these various modalities: clinical, radiology, pathology and genomic. These are very large and rich data sets that are being used right now in research. And lately we started using them in conjunction, because we are finding that the signal we get by combining these data sets is much stronger and gives us a better confirmation when we are looking for new biomarkers or improved diagnostics, or even patient stratification, where patients are put into subgroups based on their disease types.
MSK’s Data Management Challenges
The big challenge with doing all of this is, of course, the data management challenge. Dealing with different modalities of data can be difficult. These are not simple datasets, and there is also a big regulatory concern. We want to make sure that the data is handled well and patient information is kept private, so data governance is a big challenge. And there are also many technical barriers when bringing this data from the data marts, where they exist, to the researchers, and making it easy for analysis and computations. So I belong to a team of engineers, and this is our team on the right-hand side, and our focus is to build a scientific data management and compute system for research. And we are sort of a multi-faceted team. We wear many hats. We build infrastructure, we also build data products, and we try to be as involved as [we] can with the analysis as well. And by doing this, we are able to get a really strong understanding of what the needs and the use cases are for the researchers. That has allowed us to build the right system, and to also build it system-wide. So we sort of brand ourselves as research software engineers, but we do a little bit of software engineering, a bit of data engineering and a bit of data science as well. So one of the things that happened over time is we started addressing the low-level data management challenges, like governance, sharing of data, making sure we can track copies of data when sending them, and so on. And as we were building out solutions to address these challenges, we realized over time that we were actually building a data mesh.
High level, the concept of a data mesh is really about moving away from centralized data repositories and teams, to a more decentralized approach to data where data is maintained in its decentralized form and maintained by teams that are also decentralized. So as you can see here, data mesh is a modern approach to data management that emphasizes distributed ownership and governance of data within domains, who then build, manage, and share data products across the organization. So this is, I think, the third generation of data management systems or data platforms. The first generation being enterprise data warehouse, second generation being data lakes. And now we have data mesh, which is all about decentralization.
So Zhamak Dehghani wrote this book about data mesh, which came out last year, and the 4 principles of data mesh as she laid them out are data-as-a-product, self-service data platform, domain ownership, and federated computational governance. Data-as-a-product really means data that is discoverable, well documented, understandable, and accessible with decreased lead time, and ultimately usable. So [we’re] thinking about data literally as a product, and thinking about how easy it is for consumers or customers of that data to use the data. The self-service data platform is about using the right sets of tools and technologies to remove the friction and technological complexities from the interaction between data producers and data consumers. And domain ownership is, of course, about trying to get down to the source, having consumers interact directly with the data producers and reducing the hops that the consumers have to go through to get down to the source of the data. Federated computational governance is about automating data governance policies without a centralized authority, like a super admin. We end up with a decentralized [governance] process.
So here are some of the high-level challenges that we face on the research side. Data is high dimensional. It comes from different modalities, predominantly clinical, genomic, radiology and pathology, but not limited to these. The research teams typically tend to have a diverse set of skill sets. It's very common to see a team with engineers, scientists, pathologists, physicians, radiologists, administrative staff, and so on. And all of them need a view into the data, and it's hard to build the right set of interfaces to make it easy for technical and non-technical users to access this data. Research by its very nature is evolutionary and requires a lot of exploration. And through this exploration, it is common to have different versions of result sets and datasets being created. For this reason, data versioning is very important to help manage this data and make it easy to understand and find later on. The data we deal with is both structured and unstructured: very large binary objects like pathology scans, pathology images and radiology scans, and genomic files, which can be large as well. And we deal with data sets of all kinds of sizes, from small data sets with maybe 100 rows to very large data sets that can have hundreds of millions or up to a billion rows, and from typical datasets that have about 10 columns to very wide data sets that can have a thousand or more columns. The data can be very messy. Oftentimes there are errors from data entry, or there's missing data. Data is not always standardized, so when you're trying to merge data coming from 2 different sources, it can be difficult, based on the values that are assigned to each of the data elements. And of course, as always, there is the issue of data that wasn't collected, or that somehow went missing in the process of getting to the consumers.
So there are numerous challenges related to data collection and curation, and data is always siloed in data marts. You can't do away with data marts: when you have a department of pathology, they will end up maintaining their own data mart, and you need to source the data from that department, along with other departments like radiology, and so on. Privacy is a massive concern. We care a lot about patients’ identifiable information and try to guard it with the best possible means that we can.
So this is what our data management process and workflow looked like before Dremio. We would source structured data (the tables) and unstructured data (the large binary objects) from the data marts and departments, most often directly, but sometimes we would get structured data from the data warehouse. And this was a tedious and time-consuming process. For each project we would have to make requests, gain access, and then run siloed, bespoke ETL processes, oftentimes overlapping our effort with other projects. And then sometimes we would have to go through the process of de-identifying, and after doing all of this, we would end up putting all of this data in yet another database or the file system. And given that our users are very diverse, not all of them were able to access this data through a database interface or the file system. So it wasn't uncommon to have people extract data from the database and then share it across the team using Excel files. And when that happens, it's almost impossible to track copies of that data as they are being circulated within the team. This was a massive data management challenge: huge latencies in getting the data to users, and also in tracking the data once it was available to users. So again, at a high level, the requirements that we wanted to satisfy at the architectural, data management, and [persona] levels were to have an on-premise deployment, because the data was on-prem, and to bring the data from the source as quickly as possible.
MSK’s Data Mesh Journey
And that's where the data mesh concept came in. It's something that we evolved into and chanced upon. We didn't know it from the start, but we realized that what we were doing was really building a data mesh by removing the intermediate steps needed to get the data from the source to the consumers. We also wanted a good and powerful query engine with a shareable interface through which we could govern the data, and really a no-copies architecture where we could reduce or completely eliminate the need for users to maintain their own copies of the data. On the data management side, as I had already mentioned, we wanted to completely eliminate the ETL processes, the siloed ETL pipelines that we were building for each project, and have data accessible at a reduced lead time. Data documentation is a big problem. It's very common not to see good documentation for data. It's expected in code these days, but for data it's still a huge missing piece, especially in the machine learning community. That culture or habit of documenting data sets is not around. So we took on this challenge after coming across a really interesting paper called “Datasheets for Datasets” by Timnit Gebru, which I shall get more into. We also wanted a really nice, simple, and mature governance model, because providing access to data has always been a challenge in the past. And for data democratization, having interfaces that make it easy for users to consume the data is a big challenge, so we wanted to make that one of the considerations in the data management system that we built.
Why MSK Chose Dremio for their Data Needs
So we evaluated a number of technologies, and came across data virtualization. And in the data virtualization space, we looked at the various vendors and providers. And we decided to go with Dremio for these reasons. The first reason was an important one. One of our engineers came up with this rule, ‘the one hour rule’ is what we call it. If we couldn't stand up a technology or tool within an hour and actually see it working on our systems, then we decided we weren't going to go forward with it, and Dremio satisfied that rule.
We also needed a tool that runs on-prem with an eventual path to the cloud, because our data is on-prem right now. But there is an interest within the organization to eventually migrate to the cloud. And Dremio also satisfied that.
Access control, as I mentioned, is an important need, and we needed one that was very simple to implement or to use. And Dremio’s unified semantic layer gave us a very simple way to go about access control in the semantic layer.
The user interface that Dremio has almost feels like a spreadsheet-like interface where you see the tabular view, and we have the autocomplete features with SQL, which make it easier for people who don't know SQL very well. Also, the UI widgets that allow the user to generate the SQL help some of our users who are completely unfamiliar with SQL, so it has helped democratize the data quite a bit. There is still some hesitance among non-technical users about using the interface, but we found that if we help them learn SQL and understand how the interface works, they start warming up to the UI, which is a very fun and interesting experience that we've been working through as we get more people onboarded with the system.
So the very nature of data virtualization means that we can make and share copies of the data without actually having copies on disk and that was immediately satisfied through Dremio’s semantic layer.
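To make the no-copies idea concrete, here is a minimal sketch of a virtual dataset as a saved query. It uses Python's built-in SQLite as a stand-in, not Dremio itself, and the table, view, and column names (`pathology_source`, `research_pathology`, `mrn`) are hypothetical. It shows only the general mechanism: the view stores a query rather than rows, so it can hide a column and still reflect source updates.

```python
import sqlite3

# SQLite stands in for the source system and the virtual layer (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pathology_source (patient_id TEXT, diagnosis TEXT, mrn TEXT);
    INSERT INTO pathology_source VALUES
        ('P1', 'lung adenocarcinoma', 'MRN-001'),
        ('P2', 'melanoma', 'MRN-002');
    -- The view stores only the query text; no rows are copied,
    -- and the identifying mrn column is left out.
    CREATE VIEW research_pathology AS
        SELECT patient_id, diagnosis FROM pathology_source;
""")


def view_rows(connection):
    """Read the shared view, ordered for stable comparison."""
    return connection.execute(
        "SELECT * FROM research_pathology ORDER BY patient_id"
    ).fetchall()


def view_columns(connection):
    """List the column names the view exposes."""
    cursor = connection.execute("SELECT * FROM research_pathology")
    return [column[0] for column in cursor.description]


before = view_rows(conn)
# A correction made at the source...
conn.execute(
    "UPDATE pathology_source SET diagnosis = 'cutaneous melanoma' "
    "WHERE patient_id = 'P2'"
)
# ...is visible through the view without re-exporting anything.
after = view_rows(conn)
```

Sharing the view rather than an extract is what lets consumers always see the producer's latest corrections.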
With low-code and no-code: in the past, we would write a lot of pandas code to do simple operations like data curation, data integration, or even just inspecting data. Right now, with Dremio, we've been able to do pretty involved data curation tasks, including things like pivoting a table or doing regex cleanup operations, with just a few lines of SQL and without having to write a whole lot of pandas code. We found we can do quite a bit with respect to inspection, curation, and integration. So this is also a huge time saver on the data engineering side.
We were interested in performance, and Dremio has definitely satisfied this need through its horizontal scalability. As concurrent load increases, we are able to just add nodes and distribute that load horizontally. And the Arrow Flight interface has also proved to be really fast and efficient through its columnar format and its ability to parallelize requests.
Datasheets for datasets, as I've mentioned, is a huge part of our solution, and the Dremio catalog wiki allows us to document or take notes on data sets as they are being developed, literally alongside or adjacent to the dataset. That is a really nice experience, because we don't have to go to a third-party tool to maintain documentation, where it's hard to link the documentation to the data sets. With Dremio, we don't need third-party documentation. It can be done within the tool, literally adjacent to the data set.
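As a rough illustration of a pivot plus regex cleanup in a few lines of SQL, here is a sketch under stated assumptions. It runs against Python's built-in SQLite rather than Dremio. SQLite has no built-in `regexp_replace`, so one is registered from Python here. The `lab_long` table and its values are made up for the example.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Register a regexp_replace(text, pattern, replacement) SQL function,
# since SQLite does not ship one (stand-in for an engine that does).
conn.create_function(
    "regexp_replace", 3,
    lambda text, pattern, replacement: re.sub(pattern, replacement, text),
)

conn.executescript("""
    CREATE TABLE lab_long (patient_id TEXT, test_name TEXT, value REAL);
    INSERT INTO lab_long VALUES
        ('P1', ' hgb ', 13.5),
        ('P1', 'WBC',    6.1),
        ('P2', 'HGB.',  11.0),
        ('P2', 'wbc',    7.4);
""")

PIVOT_SQL = """
    WITH cleaned AS (
        -- Regex cleanup: strip everything that is not a letter, then upper-case.
        SELECT patient_id,
               UPPER(regexp_replace(test_name, '[^A-Za-z]', '')) AS test,
               value
        FROM lab_long
    )
    -- Pivot: one row per patient, one column per test.
    SELECT patient_id,
           MAX(CASE WHEN test = 'HGB' THEN value END) AS hgb,
           MAX(CASE WHEN test = 'WBC' THEN value END) AS wbc
    FROM cleaned
    GROUP BY patient_id
    ORDER BY patient_id
"""

wide_rows = conn.execute(PIVOT_SQL).fetchall()
```

The same shape (a cleanup expression in a common table expression, followed by conditional aggregation) carries over to most SQL engines, although function names and regex dialects differ between them.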
And data versioning, as I've mentioned, is important. We're just beginning to explore Iceberg and Nessie, which Dremio has adopted, and we plan to start using it in our workflows and in our projects in the future.
MSK’s Workflow with Dremio
So after working with Dremio for about 2 years now, and also the [MinIO] object store and Tableau, we think we've built a self-service data platform through which many different types of users can access this data easily. And also, we've been able to completely eliminate the hops required for structured data, because the Dremio connectors enable us to connect the data sources directly to the users through the Dremio interface. The queries coming in from our researchers are isolated from the data sources through Dremio’s reflections capability, which is also a nice feature, because the data marts do not want to deal with analytical query loads, as they're very finely tuned for their operational needs.
In the binary large objects or unstructured data space, we do have to make at least one copy to bring the data close to compute. We do that in MinIO, but once we do, we're able to very easily make it available to other compute units within the organization or on the cloud through MinIO’s replica management, and we're able to track those replicas. By doing that, we eliminate the need for users to make their own copies of this data, which can be difficult to track, end up taking a lot of space, and take a lot of time to clean up. So overall, we've been able to reduce the time of delivery of data to the consumers from literally weeks to months, to now minutes to days, because of the increased connectivity and reduced [hops]. And also, the zero-copies architecture has been fantastic in eliminating user copies, and the need for having these user copies.
So the top 3 benefits, or the lessons learned, I guess, from building such an architecture for scientific data management: the first is the trust that we managed to build between data product owners and data consumers through the Dremio lineage, where users can see exactly where the data came from. As the underlying data gets updated, the semantic layer data also gets updated. So that's an assurance that they're getting the latest data with all the recent [corrections] applied to it, which gives the users assurance that they are working on data that is rich and clean, as opposed to getting a snapshot of the data, having that snapshot pass through many hands before it comes to the consumer, and not being able to talk to the producer of the data or have any interaction with the producer to know if there are any updates to the data. So it's a big trust factor that Dremio has brought to the fore for our data management.
Also, we've managed to reduce the delivery time quite a bit by eliminating a lot of ETL, especially on the structured data side. And through Dremio's interaction with MinIO, and MinIO is a beautiful on-prem object store, we also managed to reduce the amount of [ETL] that we need to do for the [unstructured] data.
User data copies are pretty much eliminated now, because everyone's happy working with the semantic layer and sharing data on the semantic layer. The user interface has been gaining more usage across the skill sets that we have. We've had to help users warm up to it and teach them how the interface works. But outside of that, I think users are pretty happy working with the interface. And it's also easier to share and track their data through the lineage.
“Datasheets for Datasets”
So “Datasheets for Datasets” was an idea and a paper written by Timnit Gebru; the link is below. It was a revealing paper for us to read, because we were dealing with these challenges day in, day out, and had a need for this. I think Timnit Gebru, in her paper, does full justice to highlighting this problem, and also provides a really elegant and nice solution. The nice thing about Dremio is that it provides us the tooling we need to make datasheets for datasets happen. As you can see, the data tab contains the actual data, and the catalog tab, which sits right adjacent to the data tab, can contain the documentation. As Timnit Gebru argues in her paper, it's really important for data producers to document the data set composition, to say what the data actually consists of. Does it contain any patient-identifiable information? Is there a label or target associated with each instance? Is any information missing? Are there any quality issues, noise, errors, and so on with the data?
And also, it’s very important for analysts to know what the collection process was: whether the data was directly observable from raw data or reported by subjects, or whether it came from some other source, and what the provenance of that source is. Whether the data was manually collected or came automatically through a software program is important to know, as is whether it's validated and verified, and so on. A lot of times, this information is exchanged between the producers and the consumers by word of mouth, over Zoom calls and emails. And it's very, very inefficient and time-consuming. So Dremio is now allowing us to bring in the culture of documentation, getting the producers to actually document the data in this format, and to make that documentation available right adjacent to the data sets that they present to their consumers.
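The datasheet questions above can be captured as a small structured record that renders to wiki-style text. This is only a sketch: the field names are a hypothetical subset inspired by the questions mentioned in the talk, not the paper's full question list or any Dremio API, and the example dataset is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Datasheet:
    """A hypothetical, minimal datasheet record for one dataset."""
    dataset: str
    owner: str
    composition: str
    contains_identifiable_info: bool
    has_labels: bool
    collection_process: str
    validated: bool
    known_quality_issues: List[str] = field(default_factory=list)

    def to_wiki_text(self) -> str:
        """Render the datasheet as plain wiki-style text."""
        def yes_no(flag: bool) -> str:
            return "yes" if flag else "no"

        issues = self.known_quality_issues or ["none recorded"]
        lines = [
            f"Datasheet: {self.dataset}",
            f"Owner: {self.owner}",
            f"Composition: {self.composition}",
            f"Patient-identifiable information: {yes_no(self.contains_identifiable_info)}",
            f"Label or target per instance: {yes_no(self.has_labels)}",
            f"Collection process: {self.collection_process}",
            f"Validated: {yes_no(self.validated)}",
            "Known quality issues:",
        ]
        lines.extend(f"- {issue}" for issue in issues)
        return "\n".join(lines)


sheet = Datasheet(
    dataset="research_pathology",
    owner="pathology data product team",
    composition="one row per de-identified pathology report",
    contains_identifiable_info=False,
    has_labels=True,
    collection_process="extracted automatically from the departmental system",
    validated=False,
    known_quality_issues=["free-text diagnosis field is not standardized"],
)
wiki_text = sheet.to_wiki_text()
```

Keeping the record structured means a producer cannot silently skip a question, and the rendered text can be pasted next to the dataset it describes.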
So now we have addressed a lot of the low-level data management challenges. We've chanced upon data mesh and sort of realized that what we are doing is really building a data mesh, enabling data marts to become data product owners and to connect directly with consumers. We're now trying to address one of the tenets of data mesh, which is the federated computational governance piece. We don't want the super admin to be the bottleneck in the process. We want to federate this access out to tenant admins, where each tenant admin belongs to a team that's made up of data producers and various users, and that tenant admin can govern access within his or her team as well as govern access to data for the data consumers. This is something that we've started [rolling] out with Dremio and MinIO, and so far it's been going pretty smoothly. We've developed a process and a workflow for the super admin to [appoint] tenant admins, and then for the tenant admins to provide access to users within the tenant admin's team and to the data consumers.
And also, we are interested in automating a lot of this process, including running audits on the logs. Dremio allows for all of this, so we have a better understanding of how data is being shared and who has access to what types of data. And I think that's about it. That's where we are with Dremio and MinIO right now.
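The delegation workflow described here can be sketched as a small access registry. This is an illustrative model, not how Dremio or MinIO implement it; the class, method names, and the `pathology` tenant are all hypothetical. The super admin only appoints tenant admins, each tenant admin grants read access for their own tenant, and every action lands in a log that can be audited.

```python
from typing import Dict, List, Set, Tuple


class GovernanceRegistry:
    """Hypothetical model of federated access delegation with an audit log."""

    def __init__(self, super_admin: str) -> None:
        self.super_admin = super_admin
        self.tenant_admins: Dict[str, str] = {}   # tenant -> admin user
        self.readers: Dict[str, Set[str]] = {}    # tenant -> users with read access
        self.audit_log: List[Tuple[str, str, str, str]] = []

    def appoint_tenant_admin(self, actor: str, tenant: str, admin: str) -> None:
        # Only the super admin may delegate administration of a tenant.
        if actor != self.super_admin:
            raise PermissionError(f"{actor} cannot appoint tenant admins")
        self.tenant_admins[tenant] = admin
        self.readers.setdefault(tenant, set()).add(admin)
        self.audit_log.append((actor, "appoint_admin", tenant, admin))

    def grant_read(self, actor: str, tenant: str, user: str) -> None:
        # Day-to-day grants are made by the tenant admin, not the super admin.
        if self.tenant_admins.get(tenant) != actor:
            raise PermissionError(f"{actor} does not administer {tenant}")
        self.readers[tenant].add(user)
        self.audit_log.append((actor, "grant_read", tenant, user))

    def can_read(self, user: str, tenant: str) -> bool:
        return user in self.readers.get(tenant, set())

    def actions_by(self, actor: str) -> List[Tuple[str, str, str, str]]:
        """Audit helper: everything a given actor has done."""
        return [entry for entry in self.audit_log if entry[0] == actor]


registry = GovernanceRegistry(super_admin="root")
registry.appoint_tenant_admin("root", "pathology", "pathology_admin")
registry.grant_read("pathology_admin", "pathology", "researcher_1")
```

In this sketch the super admin deliberately cannot grant read access directly, which is one way to keep that role from becoming the bottleneck.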
Thank you Arfath for walking us through your journey with Dremio and data mesh. Now we'll go ahead and jump straight into Q&A. If you have any questions, please leave them in the box below.
Hey, everybody, welcome back. Okay. So you guys know the drill. If you have any questions, put them in the Q&A box below. We've got a few questions coming in. But let's start off with, okay, Tony, there are questions for you:
I'm in a similar scenario here. But my data is in cloud object storage and some data warehouses. Could I use Dremio?
Yeah, that's a good question. So that's actually a pretty common use case that we see with customers that want to use data mesh, as you can see with Arfath here. FYI everyone, he's not able to join the Q&A, because he had to drop off for an emergency, a personal thing. But yeah, you can use Dremio if you have data in cloud object storage. And also, if you have data that needs to stay on-premise, either for compliance or because you don't want to create any net new migration work, then as long as we have a connector to it, we can support a federated query against that.
Awesome. Our next question is: How does the Dremio semantic layer work?
Yeah. So the Dremio semantic layer, basically how it works is that you can create domains within Dremio. It's kind of like a view of your data, and you can organize it into a hierarchy of folders. The cool thing about it is you don't have to get an external tool to build a semantic layer. You have our query engine, and then you can create spaces. Right? So, as a very simplified example, you can have sales spaces, marketing spaces, view spaces, and within those spaces themselves you can create individual data products. So for example, for marketing, you could have one for channel attribution, and then for sales, you could have one for customer sales, or, you know, sales rep bonuses, for example. I'm just giving one off the top of my head. So you can organize it and grant access to individual users or groups, however you want to set that policy.
And next question is: Is it purely data virtualization or some data can be copied? Arfath mentioned some unstructured data, one copy was created.
Yeah. So I think for him, in that context, what he meant was he had binary images from the original source file. And that's what he was calling unstructured data. He copied it into cloud object storage for data cleansing, and that's where Dremio consumed the data. So that's the part about the copying. But is it purely data virtualization? Yes and no. When we're querying the data lake, it's just like your lakehouse query engine. But if you need to connect to data sources outside of the lake, then yes, we do support those federated queries against those data sets.
Next––would be interested to know more about your security model within Dremio. For example, who has access to what data and along the organization of the semantic layer?
So I think that really just depends. I don't know if you're asking how we're doing it at Dremio, or with our customers, but I assume it's the latter. One way to do it is, you know, you can grant access at the group level. So, for example, if I'm creating a demo, I could create an organization called, like, marketing or product marketing. And within that product marketing domain, right, I can grant it a specific access level, so maybe only ‘view’, maybe only the ability to ‘create/edit reflections’. And then within those groups, you can add users. So let's just say we have Alex joining the marketing team. Then over time, I'll just add him to that group, and he automatically gets that access level, so we don't have to configure everyone at the individual level. I don't know if that answered your question, Abdul.
To add to that: along with the sort of role- and user-based [access control] that Tony mentioned, there are also row- and column-based rules. So you have really granular rules that you can apply with Dremio over all datasets. Cool, and I think there's one more question…
Oh, nice! They said they're doing that with Azure AD. Okay, cool.
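For readers who want to see the group-level and row/column-level ideas from this answer side by side, here is a small sketch. It is a plain Python model with hypothetical names (`GroupPolicy`, `rows_visible_to`, the `marketing` and `finance` groups), not Dremio's policy syntax: a user inherits a policy from their group, and the policy decides which rows and which columns come back.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Row = Dict[str, object]


@dataclass(frozen=True)
class GroupPolicy:
    """Hypothetical policy: which columns a group sees, and which rows."""
    visible_columns: FrozenSet[str]
    row_filter: Callable[[Row], bool]


# Policies are attached to groups, never to individual users.
POLICIES: Dict[str, GroupPolicy] = {
    "marketing": GroupPolicy(
        visible_columns=frozenset({"region", "channel", "spend"}),
        row_filter=lambda row: row["region"] == "US",
    ),
    "finance": GroupPolicy(
        visible_columns=frozenset({"region", "channel", "spend", "invoice_id"}),
        row_filter=lambda row: True,
    ),
}

# Adding a user to a group is the only per-user configuration.
USER_GROUP: Dict[str, str] = {"alex": "marketing", "tony": "finance"}


def rows_visible_to(user: str, rows: List[Row]) -> List[Row]:
    """Apply the user's group policy: filter rows, then drop hidden columns."""
    policy = POLICIES[USER_GROUP[user]]
    return [
        {column: value for column, value in row.items()
         if column in policy.visible_columns}
        for row in rows
        if policy.row_filter(row)
    ]


SALES: List[Row] = [
    {"region": "US", "channel": "email", "spend": 120, "invoice_id": "INV-1"},
    {"region": "CA", "channel": "ads", "spend": 300, "invoice_id": "INV-2"},
]
alex_view = rows_visible_to("alex", SALES)
tony_view = rows_visible_to("tony", SALES)
```

In a real deployment the group membership would typically come from an identity provider, and the filtering would be enforced by the query engine rather than in application code.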
Can you talk more about the Dremio searchable catalog and graph?
Yeah, so the searchable catalog, right? So if you're new to Dremio, this is a feature that we have where, let's just say, I’m a data consumer, right? And I need to go in and see, do we have New York taxi data? I'll go ahead and type in New York taxi, and then you can see in the view here that, hey, we have a space, or in the semantic layer we have a folder, called New York taxi data. You're able to search it because Dremio has tagging features. So when you're creating data products, or think of them as virtual views of the data, you can tag them. So I can put in taxi data, and then we also have a Wiki feature, which Arfath showed in the episode here. So basically, what you can do is add business context to it. Right? We've had customers who want to use the Wiki as source-to-target mapping documentation. So basically just translate the source data into business-friendly terms. They can add context, such as who the product owner is. That way, if you see this data and you need access to it, you can request it internally without having to create a ticket with IT to get access and curate data. This way, you know, you go straight to the source. So now, you're building trust between the producers and the consumers without having to wait weeks and months just to get access to your data. And then the third component to it is the data lineage capability. So basically, let's just say you get a curated data set, right? And it's joined between 2 or more tables across the enterprise. With this data lineage capability, you can see which tables are being joined, on what fields, what columns are being pulled in, and what queries are being used with that data set. So let's say, as a data producer, I'm joining data from S3, Snowflake, and a table in ADLS.
And it's in Delta or Iceberg, whichever format you want to use, and then you can see all the fields that are being joined together, and then you can see the queries. And then, as a data producer, I may see that, hey, a lot of folks from the sales team or the marketing team are accessing this data. And I actually want to make this perform faster, and add reflections on top of it, which will accelerate the query, because I know every day I have 20, 30+ users accessing this query. The old way of doing this was that you had to copy data into materialized views, which are very complex to manage on a data warehouse, or copy it into something like a BI extract or OLAP cubes, which brings its own set of complexity. So now you have built-in data reflections which accelerate the queries behind the scenes, and it's transparent to end users.
We have one more coming in. Arfath mentioned that a lot of ETL was reduced, especially for structured data. Is it because data copies are no longer required, data stays where it is, and the data is accessed via Dremio?
Yes, exactly. So, just based on what he shared, he could probably go into more detail on this one. But yeah, what they had was an internal IT/data engineering team. Before, they were managing the traditional ETL pipelines, creating siloed data sets, and now they're no longer creating data copies. They're querying the data where it lives with Dremio. And now they're basically using the data IT teams to support the data consumers and producers, making sure that the data is in a state where it can be consumed by the business. So the team is being repurposed to work on higher-value projects.
And I think that is all the questions we have for today. There were a lot of great questions, and those were some amazing answers. Thank you, Tony, for being here on Gnarly Data Waves, and again, please also pass on my gratitude to Arfath. What an amazing story they have over there at Memorial Sloan Kettering. And yeah, thank you very much.
And again, everyone, make sure to keep coming every week. We've got a lot of great episodes coming up. And again, this will be posted on YouTube and on Spotify within 48 hours, if you want to listen to it again. Thank you very much for coming and joining us every week.
Thank you, Alex. Thank you, Arfath. Thank you everyone for joining. Have a great day.