Using Data & Analytics for Equitable Social Good

Session Abstract

Can AI, ML and Data Science help prevent children from getting lead poisoning? Can it reduce infant and maternal mortality? Can it reduce police violence and misconduct? Can it help cities better target limited resources to improve the lives of citizens and achieve equity? We’re all aware of the potential of ML and AI, but turning this potential into tangible and equitable social impact involves dealing with both technical and ethical challenges. In this talk, Rayid Ghani will discuss lessons learned from working on 60+ projects over the past few years with non-profits and governments on high-impact public policy and social challenges in criminal justice, public health, education, economic development, public safety, workforce training, and urban infrastructure. He will highlight opportunities as well as challenges around ethics, bias, and fairness that need to be tackled in order to have social and policy impact in a fair and equitable manner.

Video Transcript

Speaker 1:    Ladies and gentlemen, please welcome to the stage Rayid Ghani, distinguished career professor in machine learning at Carnegie Mellon University.

Speaker 2:    Thank you. Thanks for having me here. I’m excited to talk to all of you. What I’m going to be talking about is, using data and analytics, which we all do, for having social impact in a fair and equitable way. So, [00:00:30] what I’m going to do is start depressing all of you, because if I start bringing everybody down, then there’s only one way up. So, let’s do that. Let’s talk about some depressing stories.

The first [inaudible 00:00:46] is jail incarceration, and that’s depressing. If you look at the numbers, this is for the US and just for jails, about 11 million people go through jails every year in the US, and yes, it costs a lot of money, [00:01:00] but the depressing parts are the bottom three numbers. Two-thirds of people who go through jails suffer from mental illness. Substance abuse, there’s chronic health issues. And, there’s a lot of evidence that those are the underlying causes that actually have them cycle through jails. They get out, they come back in, they get out. So, this is some work we’ve been doing in Kansas, in a county called Johnson County. They have a high rate of jail [00:01:30] reincarceration, people who cycle through the system.

So, they came to us and said, “Look, we know we have a really high rate, we want to be able to reduce that. One tool we think we have is mental health outreach programs.” What they want us to help with is, to see, [inaudible 00:01:47] how do we prioritize people? They have limited resources to do this proactive outreach. How do we select people who can most benefit from it and will result in reducing their risk of recidivism?

So, what would we do? Well, we first combined a lot of data that we got [00:02:00] from them from different systems. We got data from them from the criminal justice system, from their jail stays, how long they were there, why they were there, got data from the police [inaudible 00:02:09] when were they arrested, what charges they were booked in. We got data from the mental health services team at the county that told us of their interactions with mental health system. Ambulance data, emergency room data, a lot of other things. We integrated that together, as a lot of you do in different projects, to then design [00:02:30] a system, a machine learning system that would predict somebody’s risk of coming back to jail.
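As an illustration of the integration step described here, the following is a minimal Python sketch that collapses event records from several systems into one feature row per person. The record layouts, field names, and the shared `person_id` key are assumptions made for the example; the talk does not describe the county's actual schemas, and real systems usually need record linkage before a shared key exists.

```python
from collections import defaultdict

# Hypothetical extracts from three county systems, keyed by a shared person_id.
jail_stays = [
    {"person_id": 1, "days": 12},
    {"person_id": 1, "days": 30},
    {"person_id": 2, "days": 3},
]
mental_health_contacts = [
    {"person_id": 1, "service": "crisis line"},
]
ambulance_runs = [
    {"person_id": 2, "reason": "overdose"},
    {"person_id": 3, "reason": "fall"},
]


def build_features(jail, mental_health, ambulance):
    """Collapse per-event records into one feature row per person."""
    features = defaultdict(lambda: {
        "jail_bookings": 0,
        "jail_days": 0,
        "mental_health_contacts": 0,
        "ambulance_runs": 0,
    })
    for stay in jail:
        row = features[stay["person_id"]]
        row["jail_bookings"] += 1
        row["jail_days"] += stay["days"]
    for contact in mental_health:
        features[contact["person_id"]]["mental_health_contacts"] += 1
    for run in ambulance:
        features[run["person_id"]]["ambulance_runs"] += 1
    return dict(features)


features = build_features(jail_stays, mental_health_contacts, ambulance_runs)
```

Rows like these would then be the input to whatever risk model is trained; the model itself is outside the scope of this sketch.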

Now, the [inaudible 00:02:38] that was accurate, but then the question [inaudible 00:02:40] we try to provide them with mental health outreach. So, we started a randomized control trial a couple of years ago, that, every month would provide them with a list of people who are high risk of recidivism and have mental health needs. They would then go off and connect with them through mental [00:03:00] health outreach programs. And, that’s what the last [inaudible 00:03:02] the results we’re seeing now, results that are actually reducing recidivism risk for the people that they intervened on.

More importantly, they’re also finding that, the highest risk people are the ones that benefit the most, because one of the questions was, who can we help? Who does this intervention help, and who does it not help? Because, the ones that it won’t help, we then have to figure out how to redesign new things. So, we’re seeing really promising results [00:03:30] by combining this data and machine learning and the randomized control trial.

Let’s look at another example. This is lead poisoning. It’s an issue in the US, it’s an issue all over the world. It’s an issue, because a lot of paint, pretty much in the US, every home built before 1977 has lead paint in there. And, that’s okay [inaudible 00:04:01] [00:04:00] inside the walls. But, homes get old, paint starts chipping, kids start crawling. And, as they crawl, they pick up lead dust and they put it in their mouth, and it goes into their bloodstream. What ends up happening is, the side effects are pretty horrible. Like, [inaudible 00:04:22] issues, learning disabilities, motor control issues. The horrible, depressing thing is, all of this is irreversible. [00:04:30] Once somebody has been exposed to lead, there’s really nothing you can do to prevent these things.

So, how do most governments deal with it today? Well, they wait until somebody gets lead poisoning. They test kids for lead in their blood, and if they find high levels of lead, then they go and look for problems in their house. And, guess what? They find them. [inaudible 00:04:54] lead poisoning. Again, this is work that Chicago started doing several years ago, and they came to us with a similar [00:05:00] question of, when we know we’re being [crosstalk 00:05:02] can be proactive. The problem is, we only have limited resources to find these problems and then fix these problems. So, how do we prioritize the homes that we check?

Again, we took data about all the blood tests that they had done over the last 10 years, all the inspections they had done for every house over the last 10 years, all the data about the property, even if it was rehab, when it was built, all that information [inaudible 00:05:27] tell us important things. Combine all that together [00:05:30] to predict, when a kid is turning two months, three months old, can we predict if they’re going to be high risk of lead poisoning in the next six to nine months, because they typically get that when they crawl. If we could tell the health department, when somebody is two or three months old, it gives them six months to then go and check in their home for lead issues. And, if they are there, fix them before it becomes risky for the kid.

So, that system, again, ran a randomized trial, found it was effective [00:06:00] at finding these problems before they became issues. That’s this [inaudible 00:06:07] in Chicago, but also in hospital systems who also work on pregnant women who are going for checkups, so the system will just raise an alert and say, “Hey, the kid that’s going to be born will be high risk.” The health department has even more time to plan [inaudible 00:06:21] need.

Third example, also depressing, horribly depressing is, police misconduct. It’s an issue in the US, it’s an issue in [00:06:30] other countries. It turns out, a lot of police departments in the US actually have these systems that they’ve built over the years called, early intervention systems that are supposed to raise alerts when an officer is at risk of doing something horrible, shooting, injuries, other things. Except, when they built and installed these systems, they never bothered to evaluate them.

So, we started working on this a few years ago. The first department we worked with is in North Carolina in [00:07:00] Charlotte-Mecklenburg. The first thing we did was, evaluate the system that they had deployed a few years back. We basically found that it was pretty much random. The idea was, a good system should be able to detect [inaudible 00:07:18] who were going to [inaudible 00:07:19] these horrible things, and also not have too many false positives.

It turns out, it was flagging about half the police officers in the department, which made it totally unreliable and useless. So again, [00:07:30] what we did was, we took all the data that they had about the police [inaudible 00:07:34], their HR. Combined that with the arrests that they were making, the dispatches they were sent on and complaints against them, investigations and put it all together to, again, build a system to predict which officers are going to be high risk of doing something like this in the next six months, 12 months, and then provide that information to the department, their supervise [inaudible 00:07:56], to reduce that risk. [00:08:00] Is it a stress situation? Is it a training situation? Is it putting them off duty for a while?
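The kind of evaluation described above, checking how many officers an alert system flags and how many of those flags are followed by an adverse incident, can be sketched in a few lines of Python. This is only an illustration: the officer IDs and outcomes are made up, and the talk does not say which metrics the team actually used.

```python
def evaluate_alerts(flagged, had_incident, all_officers):
    """Summarize an alert system against what happened afterwards."""
    flagged = set(flagged)
    had_incident = set(had_incident)
    true_positives = flagged & had_incident
    return {
        # Share of the department that was flagged at all.
        "flag_rate": len(flagged) / len(all_officers),
        # Share of flagged officers who went on to have an incident.
        "precision": len(true_positives) / len(flagged) if flagged else 0.0,
        # Share of officers with an incident who had been flagged.
        "recall": len(true_positives) / len(had_incident) if had_incident else 0.0,
    }


# Hypothetical department of ten officers; half of them are flagged.
officers = list(range(1, 11))
metrics = evaluate_alerts(flagged=[1, 2, 3, 4, 5], had_incident=[5, 9], all_officers=officers)
```

A system that flags half the department, as in this toy data, can still miss incidents while burying supervisors in false positives, which is the failure mode the transcript describes.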

A couple of indicators we found that put people at high risk were, if they were repeatedly dispatched to suicide attempts or domestic abuse cases involving children, that made them temporarily high risk. And, those are stress indicators that you could put somebody off when they’ve been to one of those incidents, don’t send them to another one for the next, some period of time to reduce that risk. [00:08:30] So, that system again, is now deployed in a few different departments and it’s being used to generate these alerts.

Very different example. A little bit less depressing, but still pretty bad, water pipes breaking. We’ve all probably experienced water pipes breaking, some more than others. This is some work we were doing in Syracuse, NY, which has, I don’t know about three water breaks a year. And, typically they’re not fatal, [00:09:00] unlike a lot of other things we’re talking about. But, water pipes are expensive to fix, because you have to dig up the street and it takes a long time. They affect people who live there because they don’t get any water, and they affect stores who are selling fruits, any fresh produce or frozen things, because the water supply goes away and they can’t sell those things. So, it’s inconvenient and it’s expensive.

So, we worked with Syracuse to build a system for them that predicted which pipes were likely to break. Unlike some of the other examples, [00:09:30] the idea here wasn’t, I predict that the pipe is going to break, so that they’re going to dig up the street and fix it. The idea was, that they could now coordinate with the roadworks department to figure out, okay, if a road was going to be repaired in three months, can they go a little bit deeper and look at the pipe and replace the pipe. Or, if it wasn’t designed, or wasn’t scheduled to be replaced [inaudible 00:09:52], but it’s still medium or high risk, can they put in a sensor? Prioritize which places to put sensors in so that they can detect those breaks early [00:10:00] and do something before it gets much worse.
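The decision logic described here, replacing a risky pipe when the road above it is about to be repaired anyway and otherwise placing a sensor, could look roughly like the Python sketch below. The risk threshold, the three-month window as a parameter, and the action names are assumptions chosen for illustration, not values from the Syracuse project.

```python
def plan_for_pipe(break_risk, months_until_roadwork, risk_threshold=0.5, roadwork_window=3):
    """Pick an action for one pipe segment.

    break_risk: predicted probability of a break (hypothetical model output).
    months_until_roadwork: months until scheduled road repair, or None if unscheduled.
    """
    if break_risk < risk_threshold:
        return "monitor"
    if months_until_roadwork is not None and months_until_roadwork <= roadwork_window:
        # The street will be dug up anyway, so replace the pipe at the same time.
        return "replace_during_roadwork"
    # Risky pipe with no nearby roadwork: detect breaks early instead.
    return "install_sensor"


actions = [
    plan_for_pipe(0.8, 2),     # risky, roadwork soon
    plan_for_pipe(0.8, None),  # risky, no roadwork scheduled
    plan_for_pipe(0.1, 1),     # low risk
]
```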

A last example I want to give, which is, again, a different example, this is work in Jakarta around fatal traffic accidents. The traffic there is bad, but what’s really bad is that, a few thousand people die every year because of traffic accidents. And, the city there decided, we want to figure out why those accidents were happening. What were the underlying root causes, so we could do something about them? So, they [00:10:30] put a bunch of cameras, they put a few thousand cameras in the city. A couple of years later, they came to us and said, “Well, we have a few thousand cameras in the city that are collecting video 24 hours a day. So, now we have thousands and thousands of hours of video. We don’t know what to do with it.” They were not asking for us to build them a system that would predict an accident or really anything [inaudible 00:10:52], but they wanted to know, “Can you give us data that tells us what’s happening, so that we can then look at it,” and by data they meant, rows and columns, [00:11:00] not videos, “that we can then correlate that with what happens in that area a few minutes later, a few hours later. What leads to accidents happening so we can start [inaudible 00:11:10].”

So, what we’re looking for is really things like these types of events, like non-vehicles. Food cart vendors in the street, or pedestrians going in the middle of traffic or motorcycles that are overloaded that are not balanced, that might actually fall, or cars going the wrong direction. So, what we needed [00:11:30] to do was really figure out, what are the different objects [inaudible 00:11:36] what direction they’re moving in, where they’re moving in, so that we can detect things like this. This is a two-way street, but you see cars coming around all the time. And, it turns out that this could lead to accidents happening just down the street. If they could detect these things or at least have this data, then they could start seeing, “Oh, maybe we need to put a median in the middle to stop people from crossing over, and that would reduce the type [00:12:00] of problems.”
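To make "rows and columns, not videos" concrete, here is a minimal Python sketch that turns per-object detections into countable event rows per camera and minute. It assumes an upstream detector and tracker already produced the object type and heading; that component, the field names, and the event labels are all hypothetical and not taken from the Jakarta project.

```python
from collections import Counter

VEHICLES = {"car", "motorcycle", "truck"}

# Hypothetical output of an object detector and tracker, one record per object.
detections = [
    {"camera": "cam_17", "minute": 0, "kind": "car", "heading": "north", "lane_heading": "north"},
    {"camera": "cam_17", "minute": 0, "kind": "car", "heading": "south", "lane_heading": "north"},
    {"camera": "cam_17", "minute": 0, "kind": "food_cart", "heading": "north", "lane_heading": "north"},
    {"camera": "cam_17", "minute": 1, "kind": "pedestrian", "heading": "east", "lane_heading": "north"},
    {"camera": "cam_17", "minute": 1, "kind": "pedestrian", "heading": "west", "lane_heading": "north"},
]


def summarize(detections):
    """Count notable events per (camera, minute) and return them as table rows."""
    counts = Counter()
    for d in detections:
        if d["kind"] not in VEHICLES:
            event = "non_vehicle_in_road"
        elif d["heading"] != d["lane_heading"]:
            event = "wrong_way_vehicle"
        else:
            continue  # ordinary traffic, nothing to record
        counts[(d["camera"], d["minute"], event)] += 1
    return [
        {"camera": camera, "minute": minute, "event": event, "count": n}
        for (camera, minute, event), n in sorted(counts.items())
    ]


rows = summarize(detections)
```

Rows in this shape can be joined against accident records by location and time, which is the kind of correlation the city asked for.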

So, stepping back a little bit, what do all these projects that I’ve talked about, have in common? There are many things, they’re all about some type of social issue or a problem people deal with and are trying to tackle it. But, one of the things that they have in common is, that they all have data coming in from a lot of different places and it’s all being connected together.

If you take a look at the police example, it’s coming from the [00:12:30] HR system, the training system, the learning management system, the arrest system, computer-aided dispatch, the CAD system, the internal affairs investigation system, the complaint system, and it’s being hooked together. And then, we’re building these machine learning models to predict who’s going to be at risk of doing something bad, and then, we’re trying to prevent that from happening.

But, the first part is always [inaudible 00:12:53] with the lead poisoning. We’re connecting all these things together by our inspection and blood tests and [00:13:00] what the house is and what the history is. A lot of us work on the [inaudible 00:13:06]. A lot of our time is spent, basically doing data integration. We also spend a lot of time thinking about how to best do it, what database to use, where do we store it, what linkage tools to use. What queries do I run to make them more efficient? What are the ways to make these queries much, much, much faster, easier to write, we spend a lot of time thinking about that.

[00:13:30] But, the question I want everybody to think about when they’re working on it is, “Well, why are we doing it?” Yes, we’re doing data integration, and yes, we’re using some tool that’s really efficient at storage and fast at querying and linking, but why are we doing all of that? We’re not doing it because we want to integrate data. We’re not doing it because we want to run SQL queries, we’re doing it because we want to have impact. In all of these projects, the goal is [00:14:00] impact. The goal isn’t to build a machine learning model or an AI model, because who cares? The goal is to prevent kids from getting lead poisoning, to prevent people from being subjected to police misconduct, to prevent fatal traffic accidents. So, it’s impact on the people we’re trying to support [inaudible 00:14:23], that’s what’s, not just motivating and driving all of this, but it’s also what should be in [00:14:30] front of us when we think about this, that’s the thing we should be thinking about.

The second thing that most of them have in common is, [inaudible 00:14:40] these three pieces. One is, we’re putting all this data together to understand what’s going on. What’s happened in the past. We’re then using it to predict what will happen in the future. Who’s going to be getting lead poisoning or who might go back to jail. But then, the third piece in all of these cases is, we’re trying to change that, we’re trying to act and [00:15:00] intervene and do something. So, in that case, it’s really about that action that they’re all supporting.

We’ve all probably [inaudible 00:15:13] that are admired and looked at, but never taken any action on, or maps, probably one of the more useless things that are built because most people don’t take actions with that. And, actions are [00:15:30] really [inaudible 00:15:32] critical. Quite frankly, probably more critical than all the analytics. You need the other things. Without doing all the integrations, you can’t take the action as effectively. The other things are necessary, but they’re not sufficient. It’s the action that makes the actual impact.

So, let’s take an example. Let’s say, I asked all of you, if I use some of this to build the system, to predict if somebody is going to commit a crime or not, [00:16:00] is doing that good or bad? Some people might say, “Well, it’s bad, because what if you’re wrong? What if somebody goes and arrests them because the system predicted that they were going to commit a crime?” Just because it happens in Minority Report, doesn’t mean it’s actually a good thing. Somebody might say, “Well, it’s good, because you’re trying to prevent crime and it’s good for public safety.” I would probably say, it’s neither good nor bad. At that level, it’s an analysis that you [00:16:30] do, that’s sitting on your computer. If you don’t take an action, nothing happens. So, in some ways, it’s neither good nor bad. The question is, if you use it to preemptively arrest somebody, because you predict they’ll commit a crime, well, that’s pretty bad.

But, if you use it to figure out, well, they actually have mental health needs and that they need social service programs. I’m going to start providing them services so that they don’t get into trouble and don’t get to jail, that’s really good. So, the analysis [00:17:00] and the prediction and the machine learning [inaudible 00:17:04] is often neither good nor bad, it’s what you do with it, what you allow others to do with it, that’s what makes things often good or bad. Those pieces basically depend on your values. It’s, what you’re building and how you’re building it and what it can be used for, is a matter of what your values are and the values of the organization that’s building it. [inaudible 00:17:26] system with [inaudible 00:17:28] changes, two lines of code [00:17:30] can be really, really good for the world or really, really bad for the world. So, it’s really important to think about, what are the values that I want to embed in the system. When I’m building the system, what should it be designed to do?

And, that’s what I want to go a little bit deeper is, often, when we’re doing things that are good [inaudible 00:17:53] nobody wants to do, most people don’t want to do evil. Some people might be exceptions, but we all want to do good. But, just because we think we’re doing [00:18:00] good, doesn’t mean we’re actually doing good. For example, if you built a system, you build a system that indicates, even, let’s say lead poisoning or the recidivism case in jail, let’s say we’re trying to build a system that an organization has limited resources, and they come to you and say, “I’d like to build a system that can make the best use of the resources I have. I have a very limited resource. I can only help a [00:18:30] hundred people a week. Can you help me figure out the most people I can help with those resources?” Well, if you build a system that’s designed to be really, really efficient, to help the most number of people, you might only help people that are cheapest to help or easiest to help. You might leave people behind who are more expensive or difficult to help. In one [inaudible 00:18:53] you’re going to help a hundred people who are really easy to help.

In the other case, you might only have 20 people, but they really needed that help and wouldn’t have survived without [00:19:00] you. So, the question really becomes, what are the values of this system? Is it efficiency or is it equity? If it’s equity, what does that actually mean? Who decides what those values are? Is it you or me who are building the system? [inaudible 00:19:17] people who asked to build the system? Is it the people who the system is being [inaudible 00:19:21] or a system who’s being used by? Those are things that we don’t often think about too much. We often say, “Give me your parameters and I’ll just put them in my code. [00:19:30] I’ll run the code and it’ll produce things.” Well, those producing things affect people’s lives. So, it’s important to think about, where are those parameters and values coming from and how do we decide on them and what is our role as people building these types of things, when this is happening?
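The tension between helping the most people and helping the people who need it most can be shown with a small Python sketch. The people, costs, and need scores below are invented, and the greedy selection rule is just one simple way to spend a fixed budget; it is not presented as the method used in any of the projects.

```python
def select(people, budget, key):
    """Greedily pick people in the order given by `key` until the budget runs out."""
    chosen, spent = [], 0
    for person in sorted(people, key=key):
        if spent + person["cost"] <= budget:
            chosen.append(person["name"])
            spent += person["cost"]
    return chosen


# Hypothetical caseload: cost is staff time, need is how much the help matters.
people = [
    {"name": "A", "cost": 1, "need": 0.2},
    {"name": "B", "cost": 1, "need": 0.3},
    {"name": "C", "cost": 1, "need": 0.1},
    {"name": "D", "cost": 3, "need": 0.9},
]
budget = 3

# Value choice 1: maximize head count by serving the cheapest cases first.
most_people = select(people, budget, key=lambda p: p["cost"])
# Value choice 2: serve the highest-need cases first, even if fewer people fit.
highest_need = select(people, budget, key=lambda p: -p["need"])
```

The only difference between the two outcomes is the sort key, which is where the value judgment lives: one run helps three low-need people, the other helps the single person with the greatest need.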

I’ll give you a concrete example instead of being abstract. One example, some work we were doing recently with Los Angeles city attorney’s office. It was a similar problem [00:20:00] to the Kansas example, where they have a team in their city attorney’s office, when somebody gets arrested for misdemeanors, they get booked, they go to court and somebody from the city attorney’s office has to show up and plead their case. And, what they say is “Look, we typically get an hour to get ready for this case. We can’t even connect all the data and look up all their history in all these three different systems and three different computers. So, we don’t have time to figure out, what’s the right thing to do for this person. [00:20:30] And, often this person needs some social service program, some assistance program, but we can’t even figure that out, so this person keeps cycling through the system.”

And so they asked us, “Is there a way you can tell us [inaudible 00:20:45] for the coming month, here are the people. We can prepare these case files and figure out programs for 150 people every month. Could you build a system for us that could identify 150 most likely people to come back, [00:21:00] arrested for misdemeanors, so we could pre-prepare their case files, figure out what program support they need. So, when they show up, we can go in front of the judge and design that program for them.” If they don’t show up, nothing happens.

So, we started building [inaudible 00:21:18] all the data, collect data from different sources, put it all together. The first version we built for them was designed for efficiency, which means, of the 150 people we gave them, 73% of them [00:21:30] were going to come back, ended up coming back for a misdemeanor or arrested for misdemeanor. So, it was 75% efficient, but it turns out it was more accurate for white people than Hispanic people. You look at the data [inaudible 00:21:46] the Hispanic recidivism rate was higher than white recidivism rate. So, what ends up happening in this case is, the system helps both people. Recidivism goes down for both groups, but because [00:22:00] it’s more accurate for white than Hispanic, the white recidivism rate goes down faster than the Hispanic recidivism rate. So, it takes an already disparate system, because you’re starting with different points, and it makes the disparities much worse.

So, then we went and designed option number two, which was designed for [inaudible 00:22:21] equal accuracy, equal efficiency for both groups. It was overall less accurate, less efficient, it was 2% less [00:22:30] efficient. But here’s what it did, because it was equally accurate for both, it brought down [inaudible 00:22:37] equally. So, it didn’t increase disparities, but it also didn’t decrease disparities.

So, we designed option number three, which is designing for equity, which costs about the same as equality, and 2% is pretty minimal. What that did was, it was more accurate for Hispanics. We tuned it. So, what that does, it, again, helps both, reduces both, but [00:23:00] reduces one faster than the other in order to get to equal recidivism rate. We gave them this menu and said, “Look, if you care about efficiency, it’s going to cost you this much, but it’s going to have this impact on increasing disparity. If you care about equality, option [inaudible 00:23:22] more expensive, keeps the status quo.” Option number three is, the equitable outcomes and it’s the same cost. And, they went with number three.
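One way to see the difference between these options is to measure how accurate the list is for each group under different selection rules. The Python sketch below does that on made-up data. The talk does not say how the team tuned its models, so the per-group list sizes used here are only an assumed mechanism for illustration, and the scores, groups, and outcomes are hypothetical.

```python
from collections import defaultdict


def top_k(people, k):
    """Pick the k highest-scored people overall (the efficiency-first option)."""
    return sorted(people, key=lambda p: p["score"], reverse=True)[:k]


def top_k_per_group(people, quotas):
    """Pick the highest-scored people within each group, up to that group's quota."""
    selected = []
    for group, k in quotas.items():
        members = [p for p in people if p["group"] == group]
        selected.extend(top_k(members, k))
    return selected


def precision_by_group(selected):
    """Share of selected people in each group who did in fact return."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p in selected:
        totals[p["group"]] += 1
        hits[p["group"]] += p["returned"]
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical scored population; `returned` is known only in hindsight.
people = [
    {"group": "W", "score": 0.90, "returned": 1},
    {"group": "W", "score": 0.80, "returned": 1},
    {"group": "W", "score": 0.70, "returned": 1},
    {"group": "W", "score": 0.60, "returned": 0},
    {"group": "H", "score": 0.85, "returned": 1},
    {"group": "H", "score": 0.75, "returned": 0},
    {"group": "H", "score": 0.65, "returned": 1},
    {"group": "H", "score": 0.55, "returned": 1},
]

efficiency_first = precision_by_group(top_k(people, 4))
shifted_toward_h = top_k_per_group(people, {"W": 1, "H": 3})
```

Reporting per-group precision next to overall precision for each candidate rule is what lets a policymaker compare the options as a menu rather than accept whichever list is most accurate overall.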

But, the point [00:23:30] here is really, reasoning at the level of, what are the goals you want to achieve with the system? What are the social goals? What are the policy goals? And then, having all the technical [inaudible 00:23:39]. A lot of technical algorithms designed in the back and doing all this work to [inaudible 00:23:43] for equity and all that stuff. But, on the front end, it’s a menu that policymakers are looking at and deciding, what do I care about more, and making those informed decisions, as opposed to, I’m going to use accuracy and build the most efficient system because that could have unintended [inaudible 00:24:03].

[00:24:00] So, all of these examples are ways of thinking about, what are our values? How do we turn those values into action? There’s all the data and machine learning and all the things in the middle, but really, we’re going from, here are values, how do I take action based on those values to generate the impact that I want to have on the world? The examples I’ve been talking about are big examples. It’s, working with these organizations and changing criminal justice [00:24:30] and traffic safety and policing, but what can all of us, what can all of you do to take your personal values and turn it into action and impact? So, I’m going to give you four ways that I think about, that we can do that ourselves.

First is, in the top left case, often, a lot of us, when we’re doing data engineering, we’re living in this world where, we get some data and we do things with data and we do some analysis and machine [00:25:00] learning. We’re detached from, how was this whole project [inaudible 00:25:05]? What was happening? What were the values the system needs to have? And then, what are the actions somebody is taking from it? We’re kind of in the middle, and somebody’s decided things here, and then somebody is using it to do something, but we can’t be agnostic to the first and the last pieces, because what we’re doing has impact on the world. So, what I want everybody to do is, to question those things. Think about, well, [00:25:30] what [inaudible 00:25:31] values, and then, what is it going to be used for? For two reasons. One is, it gives you more visibility, which means you can actually build, do the data and analysis and the machine learning stuff much, much, much better, because you have a sense of, what is it going to be used for. It’s not a generic system, it’s a very custom system.

The second thing is, that it allows you to figure out, is this embedding the values that I care about, and if not, I should do something about it. That’s the second piece of, [00:26:00] if you do see questionable things happening in this pipeline, then say something. Because, some of it may not be intentional. Like, the example I gave on the Los Angeles, if [inaudible 00:26:17] still optimizing for accuracy, which a lot of people do, not because they’re evil and they want to make things worse, but because that’s the standard thing to do. But now, if you think about it, “Well, that’s not the thing I want to do.” You might want to question, how did [00:26:30] we decide that we wanted to optimize for accuracy? Most likely, the answer would be, “Oh, we didn’t decide, we just kind of assumed.” So, if you see issues where, accountability and transparency issues and ethics issues coming up in the kinds of work you’re doing, raise your hand, say something. I think that’s a really good way to have that impact, because a lot of those times it might just be unintentional.

The third thing is, which a lot of you already probably do is [inaudible 00:26:59] pretty [00:27:00] much every example I gave for every project, we’re using open source code, most of it is built in Python and PostgreSQL and a bunch of Python packages. All of these organizations are using open-source packages to do this work. So, contribute to open source and all the contributions you make to open source code, that basically leads to other people being able to reuse [inaudible 00:27:30] and that [00:27:30] leads to [inaudible 00:27:31].

The last one is, volunteers. I’m just giving you one example, which is [inaudible 00:27:39] recently called [inaudible 00:27:41]. We’ve got government agency nonprofits, who are [inaudible 00:27:50]. Perfect. Yeah. Sorry about that. Somebody in the internet well didn’t like what I was saying, so they would probably cut down my network.

[00:28:00] So, I was just ending with the new platform we launched [inaudible 00:28:05] for good, that connects volunteers. Talked to engineers, data engineers, machine learning people, data science people, that can then help solve those problems. So, if you want to go check it out, it’s early, so be gentle, give us feedback. But, there are other opportunities for volunteering. There’s an organization called DataKind that you can, again, work with to connect you with opportunities where you can help these types of social good organizations. And again, that’s another [00:28:30] way to take, “Here are the values I care about. I care about these issues. Can I help people take action and have an impact?”

So, basically that’s all I had. I think, a lot of us working in this world today with data and AI and [inaudible 00:28:46] this opportunity to actually have an impact in the world. It’s up to us, how we use this opportunity and it depends on what our values are, what we want the world to look like. So, [00:29:00] I know we don’t have time for questions here, but my email address is here. Feel free to email me. Happy to have a conversation, answer any questions. Also, just put in the GitHub [inaudible 00:29:11] link for all the different projects. All of the code there is open source, if you want to take a look. But, thank you very much for having me here. Yeah. Thanks.