March 1, 2023

10:10 am - 10:40 am PST

How a Data Lakehouse Architecture and Dremio are Addressing Memorial Sloan Kettering’s Data Challenges in Medical Research

Memorial Sloan Kettering (MSK) is one of the nation’s top cancer care and research centers. As such, MSK faces many of the usual data challenges in the medical research space: maintaining patient privacy, breaking down siloed data, democratizing data across diverse skill sets, and reconciling overlapping data requirements from different projects that result in many copies of the data (e.g., multiple versions of tables, data sharing via emailed spreadsheets). In the past, MSK’s data infrastructure posed many challenges: delays in answering even simple questions such as what data was available in the system, difficulty tracking the versions of data copies, and difficulty adhering to governance requirements for the data.

Dremio and the lakehouse architecture have begun to help us address these challenges. Some of the benefits include easy sharing with access controls that eliminates the emailing of spreadsheets, data provenance, high-performance analytics, and, ultimately, effective data democratization.

In this talk, we’ll dive into the challenges we’ve faced and how Dremio and the lakehouse architecture have helped us address them. We will also highlight a use case: obtaining patient and demographic counts across many ongoing studies for IRB review, a task that previously took days and is now reduced to minutes.
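As a rough illustration of the kind of aggregation this use case involves (the table and column names below are hypothetical, not MSK’s actual schema), the per-study counts reduce to a single grouped query, sketched here in pandas with the equivalent lakehouse SQL shown in a comment:

```python
import pandas as pd

# Hypothetical enrollment table. In a lakehouse engine the same result
# is one SQL statement, e.g.:
#   SELECT study_id, sex, COUNT(DISTINCT patient_id) AS patients
#   FROM enrollments GROUP BY study_id, sex
enrollments = pd.DataFrame({
    "study_id":   ["S1", "S1", "S1", "S2"],
    "patient_id": [101, 102, 102, 103],
    "sex":        ["F", "M", "M", "F"],
})

counts = (
    enrollments
    .groupby(["study_id", "sex"])["patient_id"]
    .nunique()                      # distinct patients per group
    .reset_index(name="patients")
)
print(counts)
```

Once the enrollment data for all studies is queryable in one place, a review-ready count like this runs in seconds rather than requiring a manual per-study pull.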

Topics Covered

Customer Use Cases
Real-world implementation

Sign up to watch all Subsurface 2023 sessions


Note: This transcript was created using speech recognition software. It may contain errors.

Arfath Pasha:

Hi everyone. Thanks for having me here. I’ll be talking about our experience at MSK with Dremio, as have many others who have come before me at these conferences. I guess the difference here is that we are using Dremio in a research/academic setting, and that seems to pose its own set of challenges. How do I switch slides? Oh,

Slight technical issue here. Alright. So, a little bit about Memorial Sloan Kettering. It’s a cancer hospital in New York City, with a lot of locations in and around the city. Not many people know that it’s a research hospital; there are two sides to MSK. It was founded in 1884, so it has a long history in this city. There’s Memorial Hospital, where the patient care happens, and there’s also the Sloan Kettering Institute, where a lot of cancer research goes on to support the patient care on the hospital side. Today we treat more than 400 different cancer subtypes. In 2021, Memorial Hospital had over 20,000 inpatient visits and over 700,000 outpatient visits, where the focus, of course, is on treating patients with outpatient visits.

So they spend less time in the hospital. And on the research side, there were over 1,800 clinical research protocols conducted. It’s very common to see physicians who are treating patients at Memorial Hospital also being principal investigators on the research side. The Sloan Kettering Institute hosts a number of data scientists, PhD and MD-PhD students, and has a number of laboratories. The institute is very well known for discovering cancer genes, learning about signaling pathways in cells, and also understanding the immune response system, all of which are very important for treating cancer. So, given the large patient volume, the hospital naturally has very large and very rich data sets in multiple modalities, like clinical, radiological, pathological, and genomics resources.

And these resources are now being used for biomarker discovery, for better diagnoses, and for patient stratification, where patient populations are put into subgroups based on disease type. A new phase in this research is multimodal integration, where information is gleaned from across these different modalities and brought together to get a better signal for the type of information we are looking for. Of course, the big challenges in all of this are the regulatory concerns around patient privacy and data governance, and a number of technical barriers with the type of data we deal with.

So this is my team. We are, of course, on the research side in the Sloan Kettering Institute, and our focus is on building a scientific data management system that can satisfy the needs of the data scientists and principal investigators. In order to do this, we need to put on multiple hats. We are an infrastructure team, building the infrastructure out to support the research. We are also a data products team: we build data products that can be consumed by the analysts. And we also try to work very closely with the analysts and do some analysis ourselves, so we can better understand what their needs are. That gives us an understanding of what sort of data to build, or how to model the data for better consumption on the analysis side.

Both the analysis and the data products depend on the infrastructure, so being involved in those two aspects helps us build better infrastructure and choose the right tools, knowing which tools will work and which will not. There’s a bit of software engineering, data engineering, and data science that goes into this sort of work, and we’ve been branding ourselves as research software engineers. It’s not very easy, because we spread ourselves thin at times, but it seems to be working, and so far we’ve been able to build a system that is actually helping with analysis and accelerating research. The main focus is on trying to accelerate research. In terms of challenges in the medical space, these are, I think, the top three challenges as I have seen them over the last few years.

Obviously, very high-dimensional data, with examples like clinical, genomic, radiological, and pathological data, but not limited to these. These data sets can be pretty complex and require a lot of domain expertise to work with. The typical teams involved in these multi-year research projects tend to be very diverse, with different types of skill sets: engineers, scientists, radiologists, pathologists, administrative staff, and so on, all needing a view into the data. It’s a highly collaborative environment where the more access people have, the quicker research can happen. So that’s another major challenge: trying to make this data accessible to different skill sets. Then comes the iterative nature of research itself. It tends to be a very evolutionary, experimental process, and for that, data versioning is a huge requirement and also a big challenge.

Beyond these come other challenges of a more technical nature. We deal with unstructured data, binary large objects, as well as the related tabular data, and the two have to come together. We have data sets of all shapes and sizes, tens of millions of rows and sometimes thousands of columns, and that can pose its own set of challenges. The data sets tend to evolve, so it’s very hard to maintain a strict schema; the system has to be very flexible. And of course the data can be very messy, so there is a need for a lot of data curation, verification, and validation processes, and we need tooling to support that as well. Siloed data is sort of a necessity: you always need to have these data marts coming from the various departments, like radiology and pathology, and we need a means to bring this data together and make it available to researchers.
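The schema-flexibility and validation problems described here can be sketched with a toy example; the tables, column names, and required-field list below are hypothetical:

```python
import pandas as pd

# Two extracts of the same hypothetical data set whose schemas drifted:
# the newer extract adds a "stage" column.
v1 = pd.DataFrame({"patient_id": [1, 2], "diagnosis": ["A", "B"]})
v2 = pd.DataFrame({"patient_id": [3], "diagnosis": ["A"], "stage": ["II"]})

# Union the extracts without enforcing a strict shared schema; columns
# absent from one extract simply become NaN instead of raising an error.
combined = pd.concat([v1, v2], ignore_index=True)

# A minimal validation pass: flag rows missing required fields.
required = ["patient_id", "diagnosis"]
invalid = combined[combined[required].isna().any(axis=1)]
print(len(combined), len(invalid))  # 3 0
```

The design choice is the same one the talk argues for at platform scale: tolerate evolving schemas at ingest, then make curation and validation explicit, repeatable steps rather than ad hoc manual checks.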

And last but not least is privacy, which is probably the most important: protecting patient information and their privacy as this data moves into research. So, about a year and a half to two years back, we evaluated a number of technologies, and this was our trajectory over the last year and a half with Dremio. We started by looking at the community edition, playing around with it, and we ran some proofs of concept over a couple of months. We liked what we saw. Of course, our understanding of the tool was very limited at that time; it was in the initial stages. We got in touch with the sales team, and they helped us understand the tool a little better. In a very short period of time, I would say two to two and a half months, we were able to make an internal pitch for the tool to get the enterprise version going, so we could avail ourselves of all the features the tool had.

Around this time last year, we deployed the enterprise version on-prem. We started off with six projects that we were very involved with as research software engineers, and very quickly moved on to hosting other teams in self-service mode, where we were not really handholding them so much, but allowing them to onboard themselves onto the platform and start using it on their own. We’ve also onboarded an entire team as a separate tenant, so they take care of their own governance and we don’t have to worry about their data sources and their users. Right now we are up to 75 users. And of course, the professional services team from Dremio, with Kona, Nora, Sarah, and Ben Mars, have been really instrumental in this growth so far; they’ve helped grow the numbers for us. There’s obviously a learning curve in using the tool, understanding what features are available and how best to use them.

It’s very hard for us, as a small team, to educate the larger group of people interested in using the tool, and the professional services team from Dremio have really stepped in to help answer questions and help us understand how best to use the tool, along with the best practices. So, before we started using Dremio, this was our typical process, where each research team would literally apply this process independent of other teams. There was a lot of overlap, redundancy, and inefficiency in the way we went about bringing this data together and making it available to researchers. Obviously, we were pulling from file stores as well as database stores and the enterprise data warehouse. There was a lot of communication overhead for each team to make the requests for data. After having institutional review boards review the process and the cohorts that needed to be pulled, each team would build their own bespoke ETL processes without much reuse across teams.

Then this data would have to be cataloged, de-identified, and made available in yet another database system. Typically, that would be another inventory management system built in-house, without features like governance and access control built into it, which are very hard to implement in-house. As a consequence, data would then get pulled out of the inventory system and shared across the various team members as emailed Excel files, which was sort of defeating the purpose of having this data all in one place. It was very, very hard to track this data when there were different versions of Excel sheets moving around, and these sheets were not being committed back into the inventory system. So collaboration was difficult, bringing the data together was difficult, and it was not uncommon to see many weeks or months pass before the data could actually be consumed for research.

And now we are at this stage where our data lake consists of Dremio, MinIO, and Tableau. Dremio, of course, is for the tabular metadata that is being sourced from the various data marts and the institutional data warehouse, and MinIO stores the files, like pathology images and radiology scans. We are able to do this now at scale and make it available for a number of research projects, all using the same pipelines that run in an automated manner. We are actually still in the process of building all of this out at this point. But the real benefit is that these interfaces are user-friendly: there’s less of a need for researchers and the other skill sets involved in a project to export this data out into Excel sheets and then share them outside of this environment.

We are beginning to see less of that happen, which means there are fewer copies flying around. That’s, I think, one of the promises of Dremio, which we are beginning to see ourselves right now. Recently there was an independent Forrester report, which I read, that talks about the return on investment of using Dremio, and it resonated with my experience so far with how Dremio has been playing out for us: in terms of productivity, the almost nonexistent vendor lock-in, the reduction in ETL processes, and things like that.

The other big challenge we have is the lack of documentation behind data sets. In 2021, there was this really interesting paper by Timnit Gebru, who at one point was a research scientist at Google, which made the argument that in every other engineering discipline, whether it’s electrical, chemical, or mechanical engineering, when they build products it’s very common to see data sheets that spec out those products in detail and make it easy for the consumer of those products to know how to use them. But that’s not the case with data, and especially in the machine learning community there are these huge data sets, sometimes very complex, with relationships between fields that are not documented or well understood. Oftentimes data scientists are working with assumptions about the data which could potentially be incorrect and could affect the results they get out of it.

It’s also not uncommon to see this information being exchanged between the data producers and the data consumers in meetings, by word of mouth, on Zoom calls, which is a very inefficient way of going about it. So now, through Dremio, which has a wiki that sits literally adjacent to the data, we have the opportunity to create datasheets for data sets, describing the dataset composition and the collection process in detail. It’s a much easier and more efficient experience for the data scientists to go about doing their work without having to worry about making false assumptions, or having to spend time learning this information from the people who produce the data. So this is the new challenge we are trying to take on: producing these datasheets for data sets right in Dremio, adjacent to the data sets.
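A datasheet like this is ultimately just structured markdown that can live in a wiki page next to the data. A minimal sketch of rendering one, with section headings loosely adapted from the “Datasheets for Datasets” paper and entirely hypothetical dataset details:

```python
# Render a wiki-ready datasheet in markdown. The headings loosely follow
# the "Datasheets for Datasets" paper; the dataset name and details
# below are hypothetical examples, not a real MSK data set.
def render_datasheet(name, motivation, composition, collection):
    sections = [
        ("Motivation", motivation),
        ("Composition", composition),
        ("Collection process", collection),
    ]
    lines = [f"# Datasheet: {name}"]
    for heading, body in sections:
        lines.append(f"## {heading}")
        lines.append(body)
    return "\n\n".join(lines)

sheet = render_datasheet(
    name="pathology_slides",
    motivation="Supports biomarker discovery across ongoing studies.",
    composition="Whole-slide images plus linked tabular annotations.",
    collection="Scanned at 40x magnification; de-identified on ingest.",
)
print(sheet)
```

Keeping the datasheet adjacent to the data set, rather than in a separate document store, is what makes it likely to be read and kept up to date.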

So it’s readily available, and all of this is going to be on-prem. Of course, it would be a much richer experience if it were more GitHub-like; I think Dremio is trying to give that GitHub-like experience to data, and if this wiki were versioned markdown, it would be closer to GitHub. So, in general, this is a long list of areas where we’ve begun to see great improvement in our processes. The first one being that it satisfied our one-hour rule for testing a new technology: we have this rule in our team where, whenever we look at a new technology, we want to be able to get it up and running and try basic things with it within an hour. That worked for us with the Dremio community edition. A lot of our data is on-prem.

Most healthcare institutions want to keep their data on-prem. That’s beginning to slowly change at MSK now; there is a pretty big cloud presence and a movement toward the cloud, but we still do need to start on-prem and then move to the cloud, and Dremio has given us the opportunity to do that and shown us the path toward the cloud. Access control, which was pretty much always missing earlier with the in-house implementations of inventory systems, is now there. And I like to say that Dremio’s user interface is almost Excel-like, and this has an appeal. It’s still pretty challenging to get, say, a physician to go and look at the data in Dremio when they prefer the Excel file, but in my opinion it’s pretty close to that Excel experience, with the UI widgets, the SQL auto-completion, and all of that.

And that’s helped with data democratization. Fewer copies, fewer emails floating around: we are seeing that, and I think that’s also very promising. Personally, I think it’s a lot easier to inspect and curate data using SQL in the Dremio interface than by going to a pandas DataFrame and writing pandas code just to inspect these large data sets. That’s a big speedup that I’ve seen, and I’ve seen it from others using it as well. No one likes to see that spinning wheel while waiting for the data to load, and that hasn’t been a problem since we started using Dremio. It’s pretty performant; no complaints there. And of course, what we are trying to get into this year are the datasheets for data sets, and also data versioning with the Iceberg technology, which we are just getting our feet wet with.
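The kind of quick inspection being compared here can be sketched both ways; a toy plausibility check (the table, column names, and bounds are hypothetical) with the SQL version shown as a comment and the pandas version as code:

```python
import pandas as pd

# Hypothetical vitals extract. In a SQL engine the same check is a short
# ad hoc query, e.g.:
#   SELECT * FROM vitals WHERE heart_rate NOT BETWEEN 30 AND 220
vitals = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "heart_rate": [72, 999, 58],
})

# The pandas equivalent for flagging implausible readings.
suspect = vitals[~vitals["heart_rate"].between(30, 220)]
print(suspect)
```

Either form works on a small frame; the argument in the talk is that the SQL form stays this short even when the table has millions of rows and never leaves the governed environment.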

I think there are a lot of use cases we can satisfy and a lot of efficiencies we can build by taking on these two aspects. As for remaining challenges: again, it’s getting users to understand how to use the tool best. Evangelizing the tool within the organization and showing people the ropes can be a bit of a challenge, because everyone has a day job, they have a lot of things to do, and learning yet another new tool can be extra overhead for them. We rely quite a bit right now on the professional services team to help us with that. The SQL learning curve is also an issue, because we deal with different skill sets. It’s hard to tell a biologist or a physician to learn SQL in order to be able to go and look at the data and inspect it.

To some extent, I think Tableau and these sorts of visualization interfaces can help, but my pitch right now is: if you’re working with data, it really helps to know SQL, so you might as well learn it. It’s a declarative language, and not so hard to learn. Provenance is another big issue. Dremio has this beautiful provenance built into it, but sometimes data sets evolve and get transformed outside of Dremio using code, then come back into Dremio as physical data sets, and that cycle can go on two or three times, especially in this experimental manner. So we are trying to figure out how we can have one unified provenance structure that can then be exported out of Dremio and published outside of the Dremio system. And finally comes the growth. We are beginning to notice a lot of interest across the organization in using Dremio, and to allow it to grow without any bottlenecks or privacy and governance issues, we are trying to figure out a scheme where it can grow in a multi-tenant manner, where each team or group has its own tenancy and manages its own users and its own data sources. And I believe I just heard from Nora that version 24 has some features that are ready to exploit, and we’ll be looking at this very soon. Yeah, so that’s all I have.