This project may be done independently or in teams of two. We hope that teams will implement more sophisticated techniques and/or do more experimentation and testing, but we would not be surprised to see an ambitious singleton win the contest - the sky's the limit with this project.
Late Policy: All assignments and projects are due at 11:59pm on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have four free late days they may use over the quarter to turn in work late with no penalty: four 24-hour periods, no pro-rating. This late policy is enforced without exception. If you are working as a pair, each late day is taken from both students.
Honor Code: Under the Honor Code at Stanford, you are expected to
submit your own original work for assignments, projects, and exams. On many
occasions when working on assignments or projects (but never exams!) it is
useful to ask others -- the instructor, the TAs, or other students -- for
hints, or to talk generally about aspects of the assignment. Such activity
is both acceptable and encouraged, but you must indicate on all submitted
work any assistance that you received. Any assistance received that is not
given proper citation will be considered a violation of the Honor Code. In
any event, you are responsible for understanding, writing up, and being able
to explain all work that you submit. The course staff will pursue
aggressively all suspected cases of Honor Code violations, and they will be
handled through official University channels.
The goal of the project is to use past movie ratings to predict how users will rate movies they haven't watched yet. This type of prediction algorithm forms the underpinning of recommendation engines, such as the one used by Netflix. We're giving you a large set of ratings from real movie-rating data, but holding back 200 ratings for you to predict. (In machine-learning parlance, the data we provide is "labeled training data"; you will use it to come up with rating predictions for the 200 "unlabeled" data items.) You're welcome to use any techniques and tools to come up with your rating predictions. The students producing the most accurate predictions will be awarded coveted Stanford Engineering swag!
To get started, use the links on the filenames to download the following five files. Files movies and allData are in TSV format (tab-separated values), since movie names may have commas in them. All other files are in CSV format.
Note that reading in TSV files is slightly different from reading in CSV files. We've created a notebook in the Project2 folder on Instabase demonstrating a few different ways to load TSV files into a Jupyter Notebook. Please refer to LoadingData.ipynb to see how you can load TSV files using different tools (SQL, Pandas, and raw Python).
- users.csv - Information about 2353 movie watchers. Each line has three fields: userID, age (see notes below), gender ("F" or "M")
- movies.tsv - Information about 1465 movies. Each line has six fields: movieID, name, year, genre1, genre2, genre3. If a movie has fewer than three genres the extra fields are blank.
- ratings.csv - 31,620 movie ratings. Each line has three fields: userID, movieID, rating. The userID and movieID correspond to those in the users.csv and movies.tsv files, respectively. Ratings are integers in the range 1 to 5 (from worst to best).
- allData.tsv - For those who prefer having everything in one place, this file contains the combined information from the previous three files. Each line has 10 fields: userID, age, gender, movieID, name, year, genre1, genre2, genre3, rating
- predict.csv - Ratings for you to predict. Each line has three fields: userID, movieID, rating, with all ratings set initially to 0. There are no ratings for these userID-movieID pairs in ratings.csv.
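As one illustration of the raw-Python route, the standard csv module can read both formats; only the delimiter changes. This is a minimal sketch, not the contents of LoadingData.ipynb. It assumes each file starts with a header row naming its fields (check the downloaded files to confirm), and the helper name read_rows and the inline sample rows are made up for demonstration.

```python
import csv
import io

def read_rows(handle, delimiter=","):
    """Read a delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=delimiter))

# Made-up sample text standing in for ratings.csv and movies.tsv.
ratings_text = "userID,movieID,rating\n1,10,4\n2,10,5\n"
movies_text = (
    "movieID\tname\tyear\tgenre1\tgenre2\tgenre3\n"
    "10\tExample Movie, The\t1999\tDrama\t\t\n"
)

ratings = read_rows(io.StringIO(ratings_text))
movies = read_rows(io.StringIO(movies_text), delimiter="\t")

# The comma in the movie name survives because tabs separate the fields.
print(ratings[0]["rating"], movies[0]["name"])
```

With the real files, open("movies.tsv", newline="") would take the place of io.StringIO. Note that every value comes back as a string, so ratings need an explicit int(...) conversion before any arithmetic.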
This data is real: it's a subset of the movie ratings data from MovieLens, collected by GroupLens Research, which we anonymized for the project. We suggest you begin by browsing the data to see what's in the various fields. Note that the age field in the users.csv file doesn't contain exact ages but instead one of seven bucketized values as follows: 1 (age is under 18), 18 (age is 18-24), 25 (age is 25-34), 35 (age is 35-44), 45 (age is 45-49), 50 (age is 50-55), 56 (age is 56 or older).
Step 1: Create a Method to Generate Predictions
You are to generate two different types of rating predictions, fractional and integer, which will be evaluated with two different metrics. All students/teams should submit between one and three solutions for each of the two versions.
NOTE: Before you start developing your prediction process, make sure you read the section on creating a test set!
Task A: Fractional Ratings
In this task, you must predict a rating between 1 and 5 for each movie, where each rating may be a real number with any number of decimal places. For example, Netflix used to predict user ratings to one decimal place, i.e., for a given user and movie it might predict "3.8 stars". The evaluation metric for Task A is the average absolute distance from the actual rating. Note that actual ratings are always integers from 1 to 5, but each distance (and of course the average distance) may be a real number with any number of decimal places.
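The fractional-rating metric can be sketched in a few lines. This is only an illustration of "average absolute distance"; the function name and the sample numbers are made up, and the course's own scoring code may differ in details.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - true rating| over all pairs."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Made-up example: three fractional predictions against integer ratings.
mae = mean_absolute_error([3.8, 2.5, 5.0], [4, 2, 5])
print(round(mae, 4))
```

Lower is better: a value of 0 would mean every prediction matched the true rating exactly.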
Task B: Integer Ratings
In this task, you must predict a rating between 1 and 5 for each movie, where each rating is an integer. The evaluation metric for Task B is the number of correct ratings, which will be an integer between 0 and 200. It is perfectly acceptable to reuse your solution for Task A and simply round the predictions for Task B, but specialized solutions for the integer case are welcome as well.
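If you take the rounding route, one detail is worth knowing: Python's built-in round() rounds halves to the nearest even number, so round(2.5) gives 2. The sketch below rounds halves up instead and clamps the result to the valid range; which rounding rule is best for your predictions is a design choice, and the function names here are made up for illustration.

```python
import math

def to_integer_rating(x):
    """Round half up, then clamp to the valid range 1..5."""
    return min(5, max(1, math.floor(x + 0.5)))

def count_correct(predicted, actual):
    """Number of predictions that exactly match the true rating."""
    return sum(1 for p, a in zip(predicted, actual) if p == a)

# Made-up fractional predictions, including two that fall outside 1..5.
fractional = [3.8, 2.5, 5.3, 0.4]
integers = [to_integer_rating(x) for x in fractional]
correct = count_correct(integers, [4, 2, 5, 1])
print(integers, correct)
```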
Tools, Techniques, and Methodology
You can use any method you like to make your predictions. At one end of the spectrum you could implement sophisticated modeling of users and movies based on their features and apply machine-learning techniques to make predictions. At the other end, your predictions could be based on simple data analysis or calculations. Most students will probably do something in between. You don't have to use machine learning to get full credit, and many good solutions exist that don't use machine learning at all. For example, you can just compute overall average rating and predict that for every movie! There exist many creative ways to make good predictions without necessarily using ML.
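The simplest approach mentioned above, predicting the overall average for every movie, might look like the following sketch. The tuple layout and the sample ratings are assumptions made for illustration; a baseline like this is mainly useful as a reference point that other methods should beat.

```python
def overall_average(ratings):
    """ratings: list of (userID, movieID, rating) tuples."""
    values = [rating for _, _, rating in ratings]
    return sum(values) / len(values)

# Made-up known ratings and made-up (userID, movieID) pairs to predict.
known = [(1, 10, 4), (1, 11, 3), (2, 10, 5)]
to_predict = [(2, 11), (3, 10)]

baseline = overall_average(known)
predictions = [(user, movie, baseline) for user, movie in to_predict]
print(predictions)
```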
You are free to use any tools and/or languages that you like. Be aware that the course staff may not be able to provide a great deal of support if you choose to use tools or techniques we're not well-versed in.
If you've tried a few different methods and are torn about which one might be best, don't despair: we're allowing (but not expecting) each student or team to submit up to three solutions for each of the two tasks above. We also mention this in the grading discussion below, but we really want you to know that you ARE ALLOWED to submit very simple solutions, as long as you also explain the more complicated ones you tried that didn't do as well.
Step 2: Evaluate Your Prediction Process
Once you've designed and implemented a prediction method, how do you know how good it is? Do you just turn it in for us to evaluate, and hope for the best? We encourage you to take a more principled approach, specifically to extract one or more "test sets" from the ratings we're providing.
What's a Test Set?
A test set is a subset of the data that you put aside while working on your algorithm and then use to evaluate how well the algorithm performs. Before you develop your algorithm, you can split your labeled data (i.e., movies that have ratings) into two subsets: a main development set and a test set. You put aside your test set, and you use your main set to develop your algorithm, whether using machine learning or some other approach where you predict the labels. Again, you don't have to use machine learning to make your predictions! If you're using machine learning, the main set would be considered your "training set".
After you've developed one or more algorithms, you can run your prediction process on the test set to see which of your algorithms performs best overall. The reason to split this data out ahead of time is to create an environment where you can predict values on unseen data that you didn't use in creating your algorithm, since the final accuracy of your algorithm will be based on an unseen set of data points.
Creating a Test Set
A common way to create a test set is to extract a random sample of rows from your dataset. For example, if you want to set aside 10% of your data as a test set, you could assign random numbers from 0 to 1 to every row in the dataset and take the top 10% of values. You can store this test set separately, and then you use the remaining 90% of rows to think about how you want to make predictions for the unlabeled movies (predict.csv).
If you're using Python to do your predictions, take a look at the random.sample(population, k) method in the random module.
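A minimal sketch of a 10% split with random.sample follows. It samples row positions rather than rows so that the remaining 90% is easy to recover. The function name, the fixed seed, and the made-up rows are assumptions for illustration.

```python
import random

def split_test_set(rows, test_fraction=0.1, seed=0):
    """Return (main, test) lists; test holds a random fraction of rows."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    k = int(len(rows) * test_fraction)
    test_positions = set(rng.sample(range(len(rows)), k))
    test = [row for i, row in enumerate(rows) if i in test_positions]
    main = [row for i, row in enumerate(rows) if i not in test_positions]
    return main, test

# 100 made-up (userID, movieID, rating) rows.
rows = [(user, movie, 3) for user in range(10) for movie in range(10)]
main, test = split_test_set(rows)
print(len(main), len(test))
```

Fixing the seed means rerunning the code reproduces the same split, which keeps comparisons between methods fair.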
Other Tips for Creating Test Sets
- Please note that you can generate several different combinations of train/test sets. We will discuss this strategy in class as well. Refer to this Wikipedia page on k-fold cross validation if you want to know more.
- You can sort the data by an attribute that has no correlation with anything meaningful and take chunks out. You can refer to the code we provide in the Assignment4 Jupyter notebook, which uses Python for machine learning on the World Cup data. We only use several features of the World Cup data to make our prediction.
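The idea behind k-fold cross validation from the tips above can be sketched in plain Python: shuffle once, then let each of k chunks take a turn as the test set. This is an illustrative sketch with made-up names and data, not the approach used in the Assignment4 notebook.

```python
import random

def k_fold_splits(rows, k=5, seed=0):
    """Yield (main, test) pairs; each row lands in exactly one test fold."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    for i in range(k):
        test = shuffled[i::k]  # every k-th row, starting at position i
        main = [row for j, row in enumerate(shuffled) if j % k != i]
        yield main, test

rows = list(range(23))  # stand-in for 23 rating rows
folds = list(k_fold_splits(rows, k=5))
for main, test in folds:
    print(len(main), len(test))
```

Averaging a method's score across the folds gives a steadier estimate than a single split, at the cost of running the method k times.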
There are also many ways to generate several test sets rather than only one. For example, the pandas "sample" and "drop" functions can split the data into an 8:2 train-to-test split.
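A sketch of such an 80/20 split with pandas is shown here. DataFrame.sample draws the test rows and DataFrame.drop removes them by index to leave the training rows. The small table is made up, and the random_state value is an arbitrary choice that makes the split repeatable.

```python
import pandas as pd

# Made-up ratings table standing in for ratings.csv.
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieID": [10, 11, 10, 12, 11, 12, 10, 13, 12, 13],
    "rating": [4, 3, 5, 2, 4, 4, 1, 5, 3, 2],
})

# Draw 20% of the rows at random for the test set.
test = ratings.sample(frac=0.2, random_state=0)
# Everything not drawn becomes the training set.
train = ratings.drop(test.index)

print(len(train), len(test))
```

Changing random_state produces a different split, which is one way to generate several train/test combinations.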
If you are comfortable with Python and want to explore more advanced options, the scikit-learn library offers many of them. Here is an article you can refer to: cross validation in scikit.
Testing Your Results
Now that you have a test set, you can use the rest of the data to determine how you want to make predictions for each row. Once you have finished developing your algorithm, you can figure out how well it's working by using the prediction process you have developed to predict labels for all of the items in your test set. Then you can measure how close your predictions were to the true values of the ratings using the metrics for Task A and Task B. For example, on Task A, you should calculate the absolute distance to the true value for each of your predictions on the test set and report the average error. On Task B, you should calculate how many of your predictions were correct on the test set and report the percentage of correct rows.
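To make the workflow concrete, the sketch below builds one candidate method from the main set only (each movie's average rating, falling back to the overall average for unseen movies) and scores it on the held-out test rows with both metrics. The method, the function names, and the tiny data sets are all made up for illustration; it is one example of the process, not a recommended solution.

```python
from collections import defaultdict

def movie_mean_predictor(main_rows):
    """Build a predictor from (userID, movieID, rating) rows in the main set."""
    by_movie = defaultdict(list)
    for _, movie, rating in main_rows:
        by_movie[movie].append(rating)
    overall = sum(r for _, _, r in main_rows) / len(main_rows)
    means = {m: sum(v) / len(v) for m, v in by_movie.items()}
    # Fall back to the overall mean for movies absent from the main set.
    return lambda user, movie: means.get(movie, overall)

def evaluate(predict, test_rows):
    """Return (average absolute error, fraction of exact integer matches)."""
    errors = [abs(predict(u, m) - r) for u, m, r in test_rows]
    # int(x + 0.5) rounds positive values half up for the integer task.
    hits = [int(predict(u, m) + 0.5) == r for u, m, r in test_rows]
    return sum(errors) / len(errors), sum(hits) / len(hits)

main_rows = [(1, 10, 4), (2, 10, 5), (1, 11, 2), (3, 11, 3)]
test_rows = [(3, 10, 5), (2, 11, 2), (2, 12, 4)]
avg_error, accuracy = evaluate(movie_mean_predictor(main_rows), test_rows)
print(avg_error, accuracy)
```

Running evaluate on each candidate method with the same test rows gives numbers that can be compared directly and reported in the writeup.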
Step 3: Generate and Evaluate Final Predictions
When you have developed your algorithms for making predictions and chosen one that gives you good performance on your training and test sets, you should generate predictions for the unlabeled movies in predict.csv. You will submit your predictions along with your writeup (see the "Submission Instructions" section).
Here is a Jupyter notebook tutorial showing you how to generate a .csv file from your predictions.
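For those working in plain Python, writing the file can be sketched with the standard csv module. This is not the tutorial's code; the function name and the sample predictions are made up, and the header matches the three column names used in predict.csv.

```python
import csv
import io

def write_predictions(handle, predictions):
    """Write (userID, movieID, rating) rows under the expected header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["userID", "movieID", "rating"])
    writer.writerows(predictions)

# Made-up predictions written to an in-memory buffer for demonstration.
buffer = io.StringIO()
write_predictions(buffer, [(2, 11, 3.8), (3, 10, 4.25)])
output = buffer.getvalue()
print(output)
```

With a real file, open("V1predict.csv", "w", newline="") would take the place of io.StringIO.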
Also, we have created a leaderboard to allow you to see how accurate your predictions are and compare against others in the class. The leaderboard ranking will NOT BE FACTORED into the Project 2 grade -- it is purely for fun!
The leaderboard submission is through the Gradescope assignment titled "Project 2 Leaderboard". You can find the leaderboard itself by clicking the "Leaderboard" link on that Gradescope page. If you have not submitted yet, you can access the leaderboard directly with this link.
Leaderboard submission instructions:
- Create V1predict.csv, a copy of predict.csv with the "rating" column filled with your fractional predictions from Task A. Please keep the header, which should contain only three columns: "userID", "movieID", and "rating".
- Create V2predict.csv, the same as V1predict.csv but with the "rating" column containing your integer predictions from Task B.
- Submit V1predict.csv and V2predict.csv to the Gradescope assignment "Project 2 Leaderboard".
- Visit the leaderboard and view how your submission performed!
Submitting to the leaderboard is optional; however, the top scorers will be eligible for prizes!
Step 4: Writing Up Your Findings
Create a writeup describing your overall approach and the techniques and tools that you used.
A good target length for the writeup is 2 pages.
Please use the following outline:
- Header: At the top of your report, clearly state the name(s) and SUID(s) for yourself or you and your partner (if working as a group).
- Summary: In one short paragraph, describe at a high level how you approached the project, the techniques and tools you used, and your results.
- Data preparation: How did you process the dataset? What did you use to help process it?
- Techniques and tools: What technique / algorithm / strategy did you use to make predictions? Which tools and libraries did you use? Plain Python, pandas, sheets, etc.?
- Evaluation: How did you create and use your test set? For each of the techniques you tried, what were your error and accuracy measurements on the test set? If a technique works well, or has poor results, why do you think so?
- Description of Files Used: At the end of the writeup, create a list of any spreadsheets, code, Jupyter notebooks, or other artifacts you used to generate your predictions, with a clear description of what each one contains (similar to Project 1).
You will be evaluated on the process you take to develop your algorithms and how much effort you put into creating and improving them. We will specifically be looking for a writeup that covers all the major points listed above in the suggested outline for Step 4. The sections that most affect your score are "Techniques and tools" and "Evaluation". Note that you are allowed to submit simple solutions, especially if you find that they work well. And as a reminder, machine learning techniques are not necessary -- Professor Widom's submissions do not use machine learning and have generally placed in the top 5 in previous years.
Submission Instructions
Step 0. Carefully read the following instructions!
Step 1. Prepare the following files, named exactly as specified:
- project2.pdf - Main writeup containing description of your approach, as described above.
- V1predict.csv - A copy of predict.csv except values in the rating field should be real numbers in the range 1 to 5. Optionally if you want to submit up to two additional solutions for Version 1, please name the files V1predict2.csv and V1predict3.csv.
- V2predict.csv - A copy of predict.csv except values in the rating field should be integers between 1 and 5. Optionally if you want to submit up to two additional solutions for Version 2, please name the files V2predict2.csv and V2predict3.csv.
- All files or other artifacts you used to generate your solution, each of which should be listed in the Description of Files Used section of your project2.pdf writeup.
Step 2. Submit V1predict.csv, V2predict.csv, and all other files listed above to Gradescope entry "Project 2 All Files". Note that we will do a quick sanity check of your V1predict.csv and V2predict.csv files! Make sure the autograder confirms your prediction csv files are in the right format. For projects being done in pairs, only one partner needs to submit to Gradescope, and should add their partner's name to the submission under 'GROUP'. See the group submission video for details.
Step 3. Upload the writeup PDF to Gradescope under "Project 2 Report (PDF)". For groups of two, follow the same group submission instructions as in Step 2!
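Before uploading, a local format check along the following lines might catch obvious problems early. This is a hypothetical sketch and not the Gradescope autograder: the function name, the exact checks, and the treatment of a value such as "3.0" as an integer are all assumptions, so the autograder's confirmation remains the authoritative test.

```python
import csv
import io

def check_predictions(handle, integer_only=False, expected_rows=200):
    """Return a list of problems found; an empty list means none detected."""
    problems = []
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != ["userID", "movieID", "rating"]:
        problems.append(f"unexpected header: {header}")
    rows = [row for row in reader if row]  # ignore blank lines
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        try:
            value = float(row[2])
        except (IndexError, ValueError):
            problems.append(f"data row {number}: rating is not a number")
            continue
        if not 1 <= value <= 5:
            problems.append(f"data row {number}: rating {value} outside 1 to 5")
        elif integer_only and not value.is_integer():
            problems.append(f"data row {number}: rating {value} is not an integer")
    return problems

# Made-up two-row file; expected_rows is lowered to match the sample.
sample = "userID,movieID,rating\n1,10,4\n2,11,3.5\n"
as_fractional = check_predictions(io.StringIO(sample), expected_rows=2)
as_integer = check_predictions(io.StringIO(sample), integer_only=True, expected_rows=2)
print(as_fractional, as_integer)
```

For the real files, pass open("V1predict.csv", newline="") with the defaults, and open("V2predict.csv", newline="") with integer_only=True.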