Jabril: Yeah, that was...
John-Green-bot: … the BEST movie ever!
Jabril: That’s not what I was gonna say. How about for the next movie night we pick a new movie that we’ll both probably like?
John-Green-bot: Maybe something romantic? How about Pride & Prejudice?
Jabril: Oh John-Green-bot... I'm going to need this. Okay, I think it’s time to make a movie recommender system AI. It’s the only hope for the future of John-Green-bot and my friendship… or at least our movie nights!
INTRO
Hey, I’m Jabril, and welcome to Crash Course AI. Last time, we introduced the idea of recommender systems, which are AIs that use information about something and its social ratings to recommend new things to people. These things can be ads, products, YouTube videos, or pretty much anything like that. Today, I’m going to build a recommender system for movies to hopefully find a new movie that both John-Green-bot and I want to watch for our next movie night.
Like in previous labs, I’ll be writing all of the code using a language called Python in a tool called Google Colaboratory. And as you watch this video, you can follow along with the code in your browser from the link we put in the description. In these Colaboratory files, there’s some regular text explaining what we’re trying to do, and pieces of code that you can run by pushing the play button. These pieces of code build on each other, so keep in mind that you have to run them in order from top to bottom, otherwise you might get an error. To actually run the code or make changes to it, you’ll have to either click “open in playground” at the top of the page or open the File menu and click “Save a Copy to Drive”. And just an FYI: you’ll need a Google account for this.
So, if I’m going to build a movie-recommending AI, the first thing I know is that AI systems need data.
I’ll need to find and import a dataset of movies, and ideally it’ll already have ratings given by lots of different people to lots of different movies, so I won’t have to go through and rate every single movie by myself. That would take a while.
Second, I’ll need to do some basic analysis. Let’s start by finding some generic recommendations, like the top-rated movies in both John-Green-bot’s and my favorite genres. Maybe we’ll get lucky and find a movie we both want to watch and haven’t seen yet on those lists. But… I don’t really have hope for that because we like such different movies.
So, third, John-Green-bot and I will need to personalize this dataset by providing some of our own movie ratings.
Fourth, I’ll use a technique known as user-user collaborative filtering to generate a set of recommendations for both me and John-Green-bot. Hopefully there will be SOME overlap on those recommendation lists.
Alright, let’s get started. The first step is getting data. And just like other labs, I’m not going to start from scratch. This time, I’m using an existing dataset published by MovieLens, which has about 100,000 user ratings for about 10,000 different movies. MovieLens has bigger datasets available, going up to tens of millions of ratings, but this smaller set should be enough to plan movie nights for John-Green-bot and me. I’m also going to use a library known as LensKit, which comes built-in with some nice tools for building recommender systems.
So now, I’ve got data, but let’s make sure I understand what data are even there. This code lets me see the first 10 rows of the ratings dataset. There's one important thing that I notice about this dataset right away: how it handles missing data. Like, for example, here I can see that user #1 gave a rating of 4.0 to item #1, and that they provided a rating of 4.0 to item #3. But I don’t see a rating for item #2 at all. Most people don’t watch most movies, so it makes sense that there would be missing data.
And storing a bunch of zeros would take a lot of space, so it’s good to know that MovieLens decided to avoid zeros in this dataset by not storing unrated items at all.
But the way it stores movie data isn’t super useful for this current problem, because I want to know what these movies are! Not just ID numbers like “item #2.” John-Green-bot and I can’t exactly search for “item #2” when we’re trying to rent a movie. Thankfully, the MovieLens dataset has more than just "ratings." It also contains a table called "movies" that has a bunch of information about each of these items, like titles and genres. So we can get a better sense of the data by joining the “ratings” and “movies” files. From now on, let’s include the genre and title whenever I print results, because that’s much clearer. So I’m done with Step 1!
Step 2 is getting some generic recommendations from the MovieLens dataset, just to see what happens. Let’s just average the ratings for each movie and print out a sorted list, with the best-rated movies at the top.
Uh... Paper Birds? Bill Hicks: Revelations? I have no idea what these movies are… but they’re supposed to be good? They all have a perfect 5.0 average rating. I would expect to see movies like Harry Potter or Titanic or I dunno… The Avengers? So let’s look at the data and see why these are perfect. Let’s add a count column to see how many people rated these movies so highly…
Okay, so these movies have a perfect 5.0 average rating because only one person actually rated each of these! That doesn’t really help me pick what to watch, because if I just wanted ONE person’s opinion, I’d ask a friend who knows me! We’re using the MovieLens dataset to get a more general idea of good movies. So let’s try sorting only movies with at least a certain number of ratings. This is kind of arbitrary, but I guess I'd want at least 20 people to weigh in before I trust an average rating.
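For anyone following along outside the notebook, the "average, then require a minimum number of ratings" idea can be sketched in plain Python. This is only an illustration, not the code from the Colaboratory file: the `ratings` rows, the movie names, and the `top_rated` helper are made up for the example, and the threshold is 2 instead of 20 so the toy data stays small.

```python
from collections import defaultdict

# Hypothetical (user, movie, rating) rows; the real MovieLens data uses numeric item IDs.
ratings = [
    (1, "Obscure Film", 5.0),
    (1, "Popular Film", 4.0),
    (2, "Popular Film", 5.0),
    (3, "Popular Film", 4.5),
    (2, "Other Film", 3.0),
    (3, "Other Film", 4.0),
]

def top_rated(rows, min_ratings):
    """Return (movie, mean rating, count) tuples, best mean first,
    keeping only movies with at least `min_ratings` ratings."""
    by_movie = defaultdict(list)
    for _user, movie, rating in rows:
        by_movie[movie].append(rating)
    stats = [
        (movie, sum(vals) / len(vals), len(vals))
        for movie, vals in by_movie.items()
        if len(vals) >= min_ratings
    ]
    return sorted(stats, key=lambda s: s[1], reverse=True)

# With no real threshold, a movie rated once can top the list.
unfiltered = top_rated(ratings, min_ratings=1)
# Requiring more ratings pushes the one-rating movie out.
filtered = top_rated(ratings, min_ratings=2)

print(unfiltered[0])
print(filtered[0])
```

In this toy data, the single 5.0 rating wins the unfiltered list, and the movie with three ratings wins once the threshold applies, which is the same effect described above.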
Okay, now I've heard of most of these movies and I trust that they’re actually sort of popular recommendations. But these movies are all sorts of genres, so maybe I can narrow the list down a little more based on what John-Green-bot and I usually watch. I like action movies and John-Green-bot likes romance movies. There's actually one movie that’s on both of our recommended lists: The Princess Bride!
Jabril: John-Green-bot, I’ve got the perfect movie. You’re gonna love it. It’s got a love story, sword fights, the greatest movie line of all time: “Hello, my name is Inigo Montoya, you killed my father, prepare to die...”
John-Green-bot: Seen it. Let’s watch something new.
*sighs*… our lists don’t have any other movies in common. So even though finding generic recommendations is sort of helpful, our AI system hasn’t found us a new movie to watch together. What we’re facing is the cold-start problem we talked about in the last video. The recommender system we’re programming doesn’t know anything about John-Green-bot and me to make personalized recommendations.
So for Step 3, it’s time to get personal! To personalize our recommender system AI, we need to give it our own movie data. Okay, we’ve got two spreadsheets now, but I don’t think that they’re in the right format for LensKit, so I need to check the documentation, which is linked in the description. It looks like I need to import our spreadsheets and store the data in item-rating pairs just like the original dataset. Thankfully, Python is great for changing data formats. As a sanity check to make sure I coded everything correctly, let’s print both of our ratings for The Princess Bride, since we know we’ve both seen it. This all looks reasonable, so we’re done with Step 3!
Remember, our goal is to program an AI to give us personalized movie recommendations based on our ratings. So, to make this happen, I’ll implement User-User Collaborative Filtering in Step 4.
There are techniques like Item-Item Collaborative Filtering, latent factor analysis, and others too, but the User-User approach is pretty common and a nice first step to understanding recommender systems.
In multiple episodes of Crash Course AI, we’ve talked about visualizing AI features on a graph, whether it’s petal lengths on a flower or weather and swimmers. As we add more features, we add more dimensions to that graph. In user-user collaborative filtering, each item is its own dimension. So if we have 10,000 movies in our dataset, that’s 10,000 dimensions. We’re not even going to try to visualize that, but we can understand the logic behind user-user collaborative filtering with a two-movie example.
To be totally honest, this is going to be a pretty simplified explanation of what the user-user algorithm does. Dealing with thousands of dimensions and lots of missing data requires a lot of clever linear algebra and statistics. But I can use the LensKit library to do this math and understand what’s happening conceptually, without diving under the hood.
So, okay, let’s say we have a graph where one axis is the movie Inception and the other axis is The Notebook. And for this example, we’ll plot social ratings on it from everyone who has seen and rated both movies, such as John-Green-bot, me, and a bunch of other people in the MovieLens dataset. Some people may really like or hate both movies. I like Inception but dislike The Notebook, and John-Green-bot is the opposite of me.
The user-user algorithm will try to cluster people who gave the movies similar ratings. This is a classic unsupervised learning approach, except there isn’t a “correct” size for these clusters, so I have to set parameters. First, I have to set a minimum neighborhood size, or the minimum number of people the algorithm should put in one cluster.
Like, for example, if I set the minimum neighborhood size to 5, when the algorithm looks for people similar to John-Green-bot, it may select this neat cluster here. But if I set the minimum neighborhood size higher, the algorithm may be forced to include some people who are less similar to each other and John-Green-bot. I also have to set a maximum neighborhood size, or the maximum number of people the algorithm should put in one cluster. Again, having clusters that are too big might give recommendations that are too generic and don’t consider individual taste enough.
After the algorithm has defined the cluster of people who like these movies just about as much as John-Green-bot, it can analyze how those users have rated movies that John-Green-bot hasn’t seen yet, such as Casablanca. Now, this is a classic supervised learning problem. The user-user algorithm trains on past data from users in the cluster to guess how highly John-Green-bot would rate Casablanca. It might predict something like “4.6.” And then the algorithm will do the same thing for all the other movies that John-Green-bot hasn’t seen but his cluster neighbors have. In the end, I want the algorithm to give us a sorted list of the top 10 movies John-Green-bot will probably like.
There isn’t really a “best” minimum and maximum neighborhood size. It really depends on what I want this AI to recommend. Different parameters have different pros and cons. A small neighborhood size would mean the AI considers fewer people who have more similar movie tastes, and it has less data to make predictions. So I’m more likely to run into the “Bill Hicks: Revelations” situation from earlier, where recommendations of surprising or obscure movies were based on what a few people like. A big neighborhood size would mean the AI considers more people who have less similar movie tastes, and it has more data to make predictions.
So I’m more likely to get movie recommendations that are generally popular and more widely known. Figuring out the best approach to clustering requires a lot of tinkering. But if someone did work on it, they could make a video streaming service that could recommend videos to billions of different people online. YouTube. It's a joke on YouTube if you didn't get it.
For this movie night AI, I’ll just set a minimum neighborhood size of 3 and maximum size of 15, because those seem reasonable. But feel free to play around with those values in your own code to see how it changes the recommendations. Now that the AI system has run the user-user collaborative filtering algorithm and has clusters, I can give it our personal ratings to get its top 10 recommended movies for both John-Green-bot and me! Now we’re talking… show me what to watch!
Remember, for each of us, the user-user algorithm finds a neighborhood of similar users based on their movie ratings compared to ours. The algorithm looks for movies that people in that neighborhood have seen, and rated, that we HAVEN’T seen yet. And based on the ratings in our neighborhoods, the algorithm will predict how we might rate each of those movies, and print a list of its “top 10” recommendations for us.
So now we have thoughtful movie recommendations by our newly programmed AI, but there’s still a huge problem. John-Green-bot and I have to AGREE on a movie to watch, and our “top 10” lists don’t overlap at all because we like such different things. We need another STEP!
This is the beauty of representing movies we like as lists of numbers! I can create a Jabril-Green-bot hybrid! Uh, but not a cyborg. Just a dataset. So if both of us have rated a movie, I’ll use the average of our ratings. Using the two-axis graph of Inception and The Notebook from before, this would place our Jabril-Green-bot hybrid around here. And if only one of us has rated a movie, I’ll just add that movie rating to the list.
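The averaging rule for the hybrid is simple enough to sketch in plain Python. This is a minimal illustration under a few assumptions rather than the notebook's own code: the `combine_ratings` helper and the dictionary layout are hypothetical, and the Inception and The Notebook ratings are made up. Only the two Princess Bride ratings (4.5 and 3.5) are the ones quoted in this video.

```python
def combine_ratings(a, b):
    """Merge two {movie: rating} dicts into one hybrid profile:
    average the rating when both people rated a movie,
    otherwise keep the one rating that exists."""
    combined = {}
    for movie in a.keys() | b.keys():
        if movie in a and movie in b:
            combined[movie] = (a[movie] + b[movie]) / 2
        else:
            # Exactly one of the two dicts contains this movie.
            combined[movie] = a.get(movie, b.get(movie))
    return combined

# Illustrative ratings; only The Princess Bride values come from the video.
jabril = {"The Princess Bride": 4.5, "Inception": 5.0}
john_green_bot = {"The Princess Bride": 3.5, "The Notebook": 5.0}

hybrid = combine_ratings(jabril, john_green_bot)
print(hybrid["The Princess Bride"])  # average of 4.5 and 3.5
print(sorted(hybrid))
```

A plain average treats both people's opinions equally. A group that wanted to avoid movies anyone strongly dislikes could swap in a different rule, such as taking the lower of the two ratings.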
I know this isn't a perfect strategy. Like, it's possible that I might hate some movie that I haven’t seen but John-Green-bot highly rated. But this keeps things simple, and it should give a reasonable estimate across both of our ratings.
Like always when I reorganize data with code, I should do a quick sanity check. Let’s look at The Princess Bride again because I rated it as a 4.5 and John-Green-bot rated it as a 3.5, so I'd expect our combined list would have it as a 4. Looks like everything checks out!
So now, I have a combined dataset of ratings that I can plug right into our user-user collaborative filtering model from earlier. And I SHOULD get a ranked list of 10 movies that we’ll both like! The number one recommendation is Submarine which seems to be a quirky movie from 2010. I've never heard of it, but I'm willing to give it a try. If that’s too obscure for John-Green-bot, we could pick a different recommendation from this list... like I’ve heard some good things about True Grit. In fact, all these movies seem like they might have some stuff we both like.
At this point, I could also go back to step 4.1 and select different settings for my clusters. Bigger neighborhoods would probably give me a more well-known list of movies. But that list may also be a little less tailored to our individual interests. Anyway, we know what we'll be watching this weekend.
Anyone can use our spreadsheets as a template to enter their own preferences and see some recommendations for themselves and their friends. Of course, these spreadsheets don’t have EVERY MOVIE EVER -- that’s just one of the limits of our smaller dataset. By using one of the bigger datasets from MovieLens, anyone can create a new set of spreadsheets for this project that does include more movies. But be warned that more movies will mean that all the math will take a LOT longer to do before you get your recommendations!
There’s also nothing that limits our algorithm to just two people!
You could combine a ten-person movie club into one rating dataset to see what results it comes up with. Next time, we’ll take a look at a different kind of recommendation that we use all the time: search engines.