Renn: Hello, this is Aaron Renn, contributing editor at City Journal, and I’m here today with Harvard Business School professor Mike Luca. Professor Luca’s work is with cities and with companies, helping them become more data-driven, and a lot of it is actually at the intersection of those two. He works with cities, bringing them together with digital platform companies, to help them create value and policy solutions that neither would be able to create on their own.
So, very excited to speak with him today. Mike, thank you for coming in.
Luca: Thanks for having me.
Renn: One interesting project that you did is you partnered with Yelp and the San Francisco Health Department to do a number of interesting things. So, why don’t you tell us about that?
Luca: So actually the motivation for this project began in LA back in 1997. And what had happened in LA is there were a bunch of inspectors sitting in a room thinking about what to do with the hygiene violations that were coming in. So, it turns out that in LA, in San Francisco, and in just about every city, restaurants are inspected for cleanliness. So we can think about it this way: if there is a single rat, that would be a minor violation in a restaurant. If there’s a bunch of rats, that’s a major violation in a restaurant. And if there’s a dirty hallway, that would be something, I guess, counted as in between.
What happens is, if there are many rats or major violations, restaurants typically get closed down and are kept out of business until they have cleaned up the violations. When there are no violations, restaurants are sort of left alone and good to go. And the question that the LA inspectors were thinking about is what do you do when there is just a single violation or a minor thing that’s bad enough that customers might want to know but not so bad that you want to shut the restaurant down for it. So what LA ended up doing is creating a grade-card system. Every restaurant got assigned a letter grade of A, B, C, or essentially think of ‘don’t go here’, and every restaurant was required to post it visibly in its window. And what happened is this became a success story not just of disclosure in a health context but more generally of disclosure.
What they found is that restaurants started cleaning up their act. So more restaurants started getting As. Restaurants that were getting As got about 5% more sales than restaurants that were getting Bs. And this wasn’t just reflecting some sort of collusion between the inspectors and the restaurants, because there was actually a reduction in the number of reported cases of foodborne illness in LA relative to the neighboring counties.
And what happened is, fast forward to 2010, New York started trying to impose the same type of policies. So they required restaurants to post hygiene grades on their doors, and something that struck me as I was looking through this is that a lot of years had elapsed in between the time that LA had implemented their policy and New York had implemented their policy, and a lot had changed in those years. In 1997 when LA started making their posting public and mandatory, there was no Yelp. There was no Twitter. There was no other place to go and get information about restaurants, so posting things on doors seemed like a pretty natural policy.
By 2010, I noticed that a lot of things had changed about the way people were making decisions about what restaurants to go to, but the policy that was generating this information hadn’t changed a lot.
Renn: We all look at online sites to try and find a place to go for the night.
Luca: That’s exactly right. So I had done some research starting in 2009 that had shown that getting a one-star increase in your Yelp rating leads to about a 5% increase in sales for a business. So, we might think about this as a possible home for disclosure. So, what I started thinking about is what would a guidebook look like if we wanted to tell cities how they could update their disclosure and inspection policy for the digital age.
Now what we did is we came up with a two-part guide, essentially trying to create a two-way street between city policy makers and private companies that collected data and had access to lots of customers. And what we started thinking about is taking this hygiene context and digitizing everything that we’re doing. And in the first part of this we started thinking about how to take hygiene scores and post them directly onto Yelp’s website. And the logic was that posting on doors seems fine, but so many decisions were being made on Yelp that it seems natural to have this type of data sitting on Yelp’s website rather than on a government website or on the door of a restaurant.
Renn: Right, if you’re just walking past the street maybe looking for a place in the neighborhood that would be useful. Otherwise, you’re not going to see the grade until you’ve already made reservations and show up.
Luca: That’s exactly right. When we did this, we had the first city, which was San Francisco, go live in 2013, and the initial posting looked like a number, so 0 to 100 was the scoring system in San Francisco, and every city has its own slightly different scoring system. We posted it and there was some effect, but things got really interesting over the summer, when we started thinking at Yelp about what the optimal disclosure policy is. It may be that people are OK with a 97 or a 93, but maybe when it gets sufficiently bad they really want to know.
So, what we did is create a consumer alert, so that for restaurants in the bottom 5% of hygiene scores in all of San Francisco, users, instead of going to the Yelp page, went directly to a consumer alert that told them what the hygiene score was. And what we saw was a dramatic drop in the amount of business going to the dirtiest restaurants in San Francisco and therefore better sorting to the restaurants that are keeping clean.
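As a rough illustration of the bottom-5% rule described here, the sketch below picks a cutoff from a list of 0 to 100 hygiene scores and flags the restaurants at or below it. The restaurant names, the scores, and the helper `alert_cutoff` are all hypothetical; the interview does not say how Yelp computed the threshold or handled ties.

```python
def alert_cutoff(scores, percent=5):
    """Return the score at or below which a restaurant is in the bottom `percent`.

    Assumption: ties at the cutoff are all flagged, so slightly more than
    `percent` of restaurants can end up with an alert.
    """
    ranked = sorted(scores)
    # Integer arithmetic avoids float rounding; always flag at least one.
    k = max(1, len(ranked) * percent // 100)
    return ranked[k - 1]


# Hypothetical hygiene scores on San Francisco's 0-100 scale.
scores = {
    "r01": 97, "r02": 93, "r03": 88, "r04": 100, "r05": 91,
    "r06": 85, "r07": 58, "r08": 96, "r09": 79, "r10": 90,
    "r11": 94, "r12": 99, "r13": 83, "r14": 92, "r15": 72,
    "r16": 95, "r17": 87, "r18": 98, "r19": 89, "r20": 86,
}

cutoff = alert_cutoff(scores.values())
flagged = sorted(name for name, s in scores.items() if s <= cutoff)
print(cutoff, flagged)
```

With 20 restaurants, 5% is a single restaurant, so only the lowest score triggers the alert; every other page would show the plain score.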
Renn: Great. Now you also did some work, the other direction, working with the Yelp data to try to prove, at least prove out the concept of how that could be used to target inspections. How does that work?
Luca: There’s been a notion within the field of open data that what government should do is take their data, put it on a website, and let the market take the data and go do something with it. And it struck me when I was working in this area that that’s not exactly the whole story. Actually, government should also be looking to private companies for data that they have that can improve their operations. So, if we think about what a city does, cities do lots of things where they can benefit from having private data.
Cities want to forecast things. They want to know how business is going to be the next year, what employment is going to be. They want to know where they should go and send inspectors, and all these things are factors that sure, they could say something about through a random inspection or through their own internal forecasts, but they could say a whole lot more about if they were to leverage the types of data that online companies were keeping. So what I started thinking about is can you create an algorithm that tells cities what restaurants they should go and inspect based on what people are saying on Yelp? And sure, Yelp reviewers don’t look a lot like city inspectors. They don’t have coats and gloves and clipboards and all the things that we think about as this type of process, but they are talking about a lot of things that reflect their concerns in a restaurant. And what we saw is that looking at San Francisco data you could predict a large number of the violations just by looking at what people are saying on Yelp, without ever having to set foot in a restaurant.
Renn: So you were looking at the reviews and then determining there’s probably a health problem at this restaurant because we’re hearing people say it’s dirty or I’m getting sick or something of that nature.
Luca: Exactly, the algorithm that we used was actually taking all the text that was on Yelp. So, think about the process as a training process. So we know what the prior violations are. We know which restaurant had three violations, two violations, one violation, and zero violations, and we thought about that as being the left-hand side of the thing we are trying to predict, and then on the right-hand side we took every word that was written about that restaurant on Yelp, and we had no theoretical framework to say here are the words that are going to predict this. And the reason we didn’t want a theoretical framework to do that is because whatever thing we think might predict, restaurants might also think might predict it, and therefore it’s not going to be as effective as a tool of prediction. So, instead what we did is we did a machine learning algorithm that took all the words and pairs of words and phrases and used that to classify the likelihood of having different types of violations.
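The training setup described here (known violations on the left-hand side, every word and word pair from the reviews on the right-hand side, no hand-picked vocabulary) can be sketched in a few lines. This is a minimal illustration only: it uses a small naive Bayes classifier chosen for brevity, which the interview does not name as the actual model, and the reviews and labels are invented.

```python
import math
from collections import Counter


def features(text):
    """Split a review into single words plus adjacent word pairs."""
    words = text.lower().split()
    pairs = [a + " " + b for a, b in zip(words, words[1:])]
    return words + pairs


def train(examples):
    """Count features per label; no words are selected in advance."""
    counts, docs = {}, Counter()
    for text, label in examples:
        docs[label] += 1
        counts.setdefault(label, Counter()).update(features(text))
    vocab = {f for c in counts.values() for f in c}
    return counts, docs, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes log score."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_score = None, -math.inf
    for label in docs:
        score = math.log(docs[label] / total_docs)
        denom = sum(counts[label].values()) + len(vocab)
        for f in features(text):
            if f in vocab:  # ignore features never seen in training
                score += math.log((counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented reviews paired with invented inspection outcomes.
examples = [
    ("the kitchen looked dirty and I got sick", "violation"),
    ("sticky tables and a dirty bathroom", "violation"),
    ("saw a roach near the counter", "violation"),
    ("friendly staff and great pasta", "clean"),
    ("great coffee and a spotless dining room", "clean"),
    ("fresh food and fast service", "clean"),
]
model = train(examples)
print(predict(model, "dirty tables and I got sick"))
print(predict(model, "friendly staff and fresh pasta"))
```

The point of the design is that the model, not the analyst, decides which words matter, which matches the stated reason for avoiding a theoretical framework.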
Renn: Is that like spam filtering in an email?
Luca: So, there’s a lot of the same types of logic that goes into it. In fact, the first algorithm that we’ve done, part of the motivation for thinking about the algorithm we use was looking exactly at an algorithm that had been used to predict which reviews were real and which reviews were fake.
Renn: Ahh, very similar to a spam review. So, you were able to look retrospectively and find out which of these violations could have been predicted. Have you applied those algorithms to targeting future inspections in San Francisco or elsewhere?
Luca: So, we’ve done this in a couple of different contexts. In San Francisco we started by thinking about how much we could predict going forward on hygiene violations. And what we saw is this: suppose you classify restaurants into the cleanest half and the less clean half. If you were just to flip a coin to guess where you should go, think about doing a random inspection, you’d have a 50% chance of getting it right. But by using our algorithm you’d have about an 80% chance of correctly classifying which restaurants are in the cleanest half and which are in the less clean half.
Then we wanted to take this a little bit further, because our end game was to think about how to implement this in practice. So, my collaborators and I partnered with the City of Boston and Yelp to think about creating a Boston-specific algorithm, and we ran a tournament last summer where we allowed data scientists from around the world to enter their algorithms. We gave them access to the old San Francisco algorithm. We gave them access to all of the reviews that had been written for Boston restaurants, and we gave them access to the history of hygiene violations within the city of Boston. We then asked the City what are the violations that you care about and how much do you care about this type of violation versus that type of violation, and we had them rank everything so that they were essentially creating a scoring system for us. And once we had that, we ran a tournament that was funded by Yelp. They gave a $5,000 prize split between the top three performers in our tournament, and we used that as the basis for the algorithm that we created for Boston. And what we found when we evaluated the solutions is that, with a little tweaking on our end, you could cut the number of inspections by about 40% and capture the same number of hygiene violations. So if we think about it, the Department of Health is trying to figure out which restaurants actually have violations. You could be a lot more efficient at guessing it and figuring it out by having a targeted mechanism rather than letting an inspector figure out which restaurant they think they should be going to.
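To make the idea of targeted inspections concrete, the toy sketch below weights violations by severity, as the City's ranking is described as doing, and then compares how many inspections are needed to find the same weighted total when restaurants are visited in an arbitrary order versus in order of predicted risk. The weights, risk scores, and violation counts are all made up for illustration, and the numbers are not a reproduction of the Boston result.

```python
# Hypothetical severity weights standing in for the City's ranking.
WEIGHTS = {"minor": 1, "major": 3, "severe": 5}


def severity(violations):
    """Weighted violation score for one restaurant."""
    return sum(WEIGHTS[kind] * n for kind, n in violations.items())


def inspections_to_reach(order, severity_by_id, target):
    """Count inspections, in the given order, until `target` is found."""
    found = 0
    for count, rid in enumerate(order, start=1):
        found += severity_by_id[rid]
        if found >= target:
            return count
    return None  # target never reached


# (id, predicted risk from a model, violations an inspector would find)
restaurants = [
    ("a", 0.10, {}),
    ("b", 0.80, {"major": 1, "minor": 2}),
    ("c", 0.20, {"minor": 1}),
    ("d", 0.90, {"severe": 1}),
    ("e", 0.30, {}),
    ("f", 0.70, {"major": 1}),
    ("g", 0.15, {}),
    ("h", 0.60, {"minor": 2}),
    ("i", 0.05, {}),
    ("j", 0.40, {"minor": 1}),
]
severity_by_id = {rid: severity(v) for rid, _, v in restaurants}

# Baseline: inspect the first six restaurants in arbitrary (list) order.
arbitrary_order = [rid for rid, _, _ in restaurants]
baseline_found = sum(severity_by_id[rid] for rid in arbitrary_order[:6])

# Targeted: visit restaurants from highest to lowest predicted risk.
targeted_order = [rid for rid, risk, _ in
                  sorted(restaurants, key=lambda r: r[1], reverse=True)]
targeted_count = inspections_to_reach(targeted_order, severity_by_id,
                                      baseline_found)
print(baseline_found, targeted_count)
```

In this toy data the targeted order finds the same weighted total in four visits instead of six; how large the saving is in practice depends entirely on how well the predicted risk tracks real violations.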
Renn: You mentioned earlier that cities like to just post data on the web. Originally there was this thought: we’re going to have all this open data and the marketplace is just going to make use of it. And the first thing people did was come up with these transit tracker apps. Tell me when my bus is coming. People were like, wow, this is amazing, I know when my bus is going to be here. But after that there were very few really interesting apps that seemed to have come out of this, and it’s almost like a lot of the promise seems to have faded. That’s my impression anyway. What you’re doing is trying to collaborate more directly and have a bi-directional flow of data. How do you see that capturing the original promise of some of this open data? Are there any lessons we can share about how cities or companies should be thinking about this?
Luca: The transit tracker apps were pretty exciting, and it is true that there are a lot of data sets out there relative to the number of apps that are created off of them. There are also a lot of apps in absolute terms, but at the local and federal level there are about 400,000 data sets that people have reported to have put on some version of data.gov. That’s a lot of data. We certainly haven’t seen the promise of all this data realized, in the sense that there aren’t bunches of apps running around taking all of the data sets. Lots of these apps… or lots of the data just sits there on the government website not being used, and there are a few challenges and a few barriers to data getting used. There are some applications that go on behind the scenes that we just don’t hear about. So, for example, platforms like Google and even RentHop, which is a New York-based real estate listing website, pull in some of this data to make their businesses better. They don’t kind of make it into the limelight of what we see when we see data.gov. Now, that being said, there is a lot more that this data could be used for, and one of the barriers to using this data is that it’s not sufficient just to put a bunch of data sets on the web and hope that people are going to come and create apps off of them. Sure, that could happen sometimes, but in my experience and through this process one of the things I’ve noticed is that by having more focused collaborations around a target issue, a city policymaker or a national policymaker and a private company could collaborate to do more than just letting the data go and sit there.
Renn: Mike Luca, thank you very much. Appreciate you coming in today.
Luca: Thank you.