Assignment 3
Warning: longer and more difficult than previous assignments!
Due Date: Sunday May 1 at 11:59 PM

Late Policy: All assignments and projects are due at 11:59pm on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have four free late days they may use to turn in work late with no penalty: four 24-hour periods, no pro-rating. This late policy is enforced without exception.

Honor Code: Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will aggressively pursue all suspected cases of Honor Code violations, and they will be handled through official University channels.

Datasets: This assignment includes some familiar datasets from past assignments as well as some new ones. All of the data files are included in the zipped folder you will download.

Setup Instructions: All of the files you need for this assignment have been gathered into one zipped folder: assignment3.zip (download link). Create a new directory for the assignment, download the zipped file and place it into the directory, unzip, and you're ready to start.

Submission Instructions: You must submit your work via Canvas under Assignment 3. Submissions via email will not be accepted. For the complete assignment you need to submit two separate files:
  1. Jupyter notebook - The assignment3.ipynb notebook included in the files downloaded for the assignment is set up for you to fill in and test your answers to all of the Python and SQL problems, i.e., all problems except those in Section 4. Filling in and submitting this notebook is the only format we will accept for these problems.
  2. pdf document - Section 4 Parts 1 and 2 are to be submitted together in a single one-page pdf file containing two screenshots (specified below). Please name the file Section4.pdf.

Section 1: Data Operations using Python (3 points per problem)

Files used for this section: assignment3.ipynb, Football.csv, Schoolkids.csv

Each of the problems in this section asks a question over the Football or Schoolkids data that you solved in one of the previous assignments using spreadsheets or SQL. Now you are to solve them using Python.

Problem 1 (from Assignment 1 Problem 1). On the Football data, find the average difference between game predictions and actual outcomes, across all games. The HomeScore and AwayScore are the actual scores, while Prediction is an estimate of how much the home team will win or lose by. In arithmetic terms, Outcome = (HomeScore - AwayScore), and Difference = (Prediction - Outcome). Your Python program should print a single positive or negative number for the average difference.
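As a sanity check on the arithmetic, here is a minimal sketch on two invented games; the real answer must of course come from Football.csv, not from made-up rows:

```python
# Illustration of the Difference arithmetic on a few made-up games
# (the real data comes from Football.csv; these rows are invented).
games = [
    {"Prediction": 3.5, "HomeScore": 21, "AwayScore": 14},   # outcome = 7
    {"Prediction": -2.0, "HomeScore": 10, "AwayScore": 17},  # outcome = -7
]

diffs = []
for g in games:
    outcome = g["HomeScore"] - g["AwayScore"]   # actual margin of victory
    diffs.append(g["Prediction"] - outcome)     # prediction error for this game

avg_diff = sum(diffs) / len(diffs)
print(avg_diff)   # average can be positive or negative
```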

Problem 2 (from Assignment 2 Problem 6). On the Football data, find all pairs of teams where the two teams played each other in 1998 and 1999 in the same configuration (the same team was home and the same team was away), and in 1998 the home team won while in 1999 the away team won. Your Python program should print all such pairs of teams.

Problem 3 (from Assignment 2 Problem 8). On the Schoolkids data, count how many 4th graders, 5th graders, and 6th graders think Looks are more important than Grades in determining popularity, separately for each of the three grades. Your Python program should print the three results, specifying which number goes with which grade.

Problem 4 (from Assignment 2 Problem 11). On the Schoolkids data, find the oldest students. For each one your Python program should print their gender, grade, age, and school.

The following two extra credit problems aren't a lot more difficult than the previous four -- they're provided to give you more practice with Python if you have extra time. As a reminder, we will keep track of extra credit points separately; earning extra credit points can serve to raise your final grade, but a lack of extra credit points will never lower your grade.

Extra Credit Problem EX1 (from Assignment 2 Problem 4). Find the average prediction for games where the home team won, and the average prediction for games where the away team won. Your Python program should print the two numbers, specifying which is for home wins and which is for away wins.

Extra Credit Problem EX2 (from Assignment 1 Problem 4). What is the overall ranking of Grades, Sports, Looks, and Money in perceived popularity impact across all of the students? Your Python program should print the four factors in order of overall importance, i.e., the factor with the highest importance should be first. [Here is a Python construct you may find useful for this problem: Suppose you have a list of tuples (or lists) assigned to a variable L, and you want to sort list L by the Nth component of each tuple (or list) in L, assigning the result to variable SL. (Note N is actually an integer >=0 identifying the component you want to sort by.) The following code does the trick: SL = sorted(L, key=lambda x: x[N])]
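To see the hint in action, here is a short runnable example on invented (factor, total) pairs; whether you want the ascending or the descending order depends on how you tally importance in the real data:

```python
# Demonstration of the sorted(..., key=...) hint from the problem statement.
# The totals below are invented purely to illustrate the construct.
L = [("Grades", 120), ("Sports", 95), ("Looks", 140), ("Money", 60)]
N = 1                                         # sort by component 1 (the total)
SL = sorted(L, key=lambda x: x[N])            # ascending by component N
SL_desc = sorted(L, key=lambda x: x[N], reverse=True)  # largest total first
print(SL_desc)
```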

Section 2: Data Mining using SQL (3 points per problem)

Files used for this section: assignment3.ipynb, Movies.db, optionally Shop.db

In this section and the next one you will perform data mining over real data about people's movie-watching habits. For SQL, the database file Movies.db (included in your downloads) contains one table Movies(uid,movie). Each row in the table represents the user with ID uid watching the movie named movie. In data mining terms, the items are movies, and a transaction (or "basket") is the set of movies watched by one user. There are about 3400 rows in the table covering about 1400 users and 125 movies.
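As a minimal sketch of these terms, the following builds baskets from a few invented (uid, movie) rows and computes the support of one two-item itemset; the real table is far larger:

```python
# Sketch: turning (uid, movie) rows into baskets, on a few invented rows.
rows = [(1, "Alien"), (1, "Brazil"), (2, "Alien"), (2, "Casablanca"), (2, "Brazil")]

baskets = {}
for uid, movie in rows:
    baskets.setdefault(uid, set()).add(movie)   # one basket (set of movies) per user

num_users = len(baskets)
# Support of an itemset = fraction of baskets containing all of its items.
support_alien_brazil = sum(1 for b in baskets.values()
                           if {"Alien", "Brazil"} <= b) / num_users
print(support_alien_brazil)   # both invented users watched Alien and Brazil -> 1.0
```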

Problem 1. In class we developed SQL queries to compute frequent itemsets of two items and three items over a very small set of shopping data. Copy the two queries from the lecture notes into your Jupyter notebook and modify them to compute frequent itemsets of two and three items over the movie data, using support threshold .03 (not 0.3!). Then write a third query that computes frequent itemsets of four items over the movie data, still using support threshold .03. If your queries are correct, you should find 20 itemsets of two, 14 itemsets of three, and 3 itemsets of four.
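The lecture queries are not reproduced here, but the general self-join shape for two-item itemsets can be sketched with sqlite3 on a tiny invented table; the toy count threshold below stands in for a real support threshold and is not what you should submit:

```python
import sqlite3

# A sketch of the self-join pattern for two-item itemsets, run on a tiny
# invented Movies table; the queries you copy from lecture differ in detail.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (uid INT, movie TEXT)")
conn.executemany("INSERT INTO Movies VALUES (?, ?)",
                 [(1, "Alien"), (1, "Brazil"), (2, "Alien"), (2, "Brazil"),
                  (3, "Alien"), (3, "Casablanca")])

pairs = conn.execute("""
    SELECT m1.movie, m2.movie, COUNT(*) AS cnt
    FROM Movies m1 JOIN Movies m2
      ON m1.uid = m2.uid AND m1.movie < m2.movie   -- count each pair once
    GROUP BY m1.movie, m2.movie
    HAVING COUNT(*) >= 2                           -- toy count threshold
""").fetchall()
print(pairs)
```

The `<` comparison on movie names is what keeps (A,B) and (B,A) from being counted as two different itemsets; the three- and four-item versions add more joins with the same idea.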

Problem 2. Write SQL queries to compute association rules over the movie data that have one item on the left-hand side and one item on the right-hand side, with support threshold .03 and confidence threshold 0.5. If your approach is correct, you should find 10 association rules.

Extra Credit Problem. Write SQL queries to compute association rules over the movie data that have two items on the left-hand side and one item on the right-hand side, with support threshold .02 and confidence threshold 0.8. If your approach is correct, you should find 7 association rules.

Section 3: Data Mining using Python (3 points per problem)

Files used for this section: assignment3.ipynb, Movies.csv, optionally Shop.csv

Problem 1. In class we developed Python code to compute and print frequent itemsets of two items and three items over a very small set of shopping data. Copy the code from the lecture notes into your Jupyter notebook and modify it to compute frequent itemsets of two and three items over the movie data in Movies.csv, using support threshold .03 (not 0.3!). Then extend the code to compute frequent itemsets of four items over the movie data, still using support threshold .03. If your code is correct, you should find 20 itemsets of two, 14 itemsets of three, and 3 itemsets of four.
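As a hedged illustration of the counting idea (not the lecture code itself), the following counts pair supports over a few invented baskets using itertools.combinations:

```python
from itertools import combinations
from collections import Counter

# Counting pair supports over invented baskets; the code you adapt from
# lecture will build the real baskets from Movies.csv instead.
baskets = [{"Alien", "Brazil"}, {"Alien", "Brazil", "Casablanca"}, {"Casablanca"}]

pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b), 2):   # sorting gives each pair one canonical form
        pair_counts[pair] += 1

threshold = 0.5 * len(baskets)                # toy support threshold of 0.5
frequent_pairs = [p for p, c in pair_counts.items() if c >= threshold]
print(frequent_pairs)
```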

Problem 2. Write Python code to compute association rules over the movie data that have one item on the left-hand side and one item on the right-hand side, with support threshold .03 and confidence threshold 0.5. If your approach is correct, you should find 10 association rules.
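A minimal sketch of the confidence computation, on invented baskets and with toy thresholds rather than the assignment's:

```python
# One-item -> one-item association rules on invented baskets:
# confidence(A -> B) = support({A, B}) / support({A}).
baskets = [{"Alien", "Brazil"}, {"Alien", "Brazil"}, {"Alien"}, {"Casablanca"}]
n = len(baskets)

def support(items):
    """Fraction of baskets that contain every item in the given set."""
    return sum(1 for b in baskets if items <= b) / n

rules = []
all_items = {m for b in baskets for m in b}
for a in all_items:
    for b in all_items:
        if a != b and support({a, b}) >= 0.3:      # toy support threshold
            conf = support({a, b}) / support({a})
            if conf >= 0.5:                        # toy confidence threshold
                rules.append((a, b, conf))
print(rules)
```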

Extra Credit Problem. Write Python code to compute association rules over the movie data that have two items on the left-hand side and one item on the right-hand side, with support threshold .02 and confidence threshold 0.8. If your approach is correct, you should find 7 association rules.

Section 4: Regression and Correlation in Google Sheets (2 points per part)

Files used for this section: Football.csv

In Assignment 1 Problem 13 you created a scatterplot from the Football data where the x-axis was Prediction and the y-axis was actual outcome (HomeScore minus AwayScore). Most of you concluded that there was at least some correlation between the two measures.

Part 1. To make that conclusion concrete, create the scatterplot again in Google Sheets, use the Trendline (Linear) feature to plot the linear regression line, and select the R2 checkbox to see what Google Sheets computes as the coefficient of determination.

Part 2. Correlation between two measures shouldn't be affected by which measure happens to be on the x-axis and which on the y-axis. Repeat Part 1 with the axes reversed. Did you get the same R2 value?
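As a quick numeric check of the intuition, on invented points: for a simple linear regression, R2 equals the squared Pearson correlation, which is symmetric in x and y.

```python
# R^2 for a simple least-squares line equals the squared Pearson
# correlation, so swapping the axes should not change it.
# The points below are invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy * sxy / (sxx * syy)   # r^2 of the least-squares fit

print(r_squared(x, y), r_squared(y, x))   # identical by symmetry
```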

Take a screenshot of each of your two plots including the regression line and the R2 value. Paste the two screenshots into a single document that can be saved as pdf, then see Submission Instructions above.

Section 5: Plotting and Regression in Python (6 points)

Files used for this section: assignment3.ipynb, datapoints.csv, lines.csv

For this problem you are given a set of (x,y) points in coordinate space, and three lines; we provide Python code that reads the points and lines from files. Your job is to write Python code that computes the sum of squares error (SSE) for each line with respect to the points, determines colors for the lines based on how closely they fit the points -- green for the closest fit (smallest SSE), orange for the middle, and red for the worst fit (largest SSE) -- then draws a plot showing the points and the colored lines.
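A minimal sketch of the SSE and color-assignment steps on invented points and lines; the provided notebook code reads the real ones from datapoints.csv and lines.csv:

```python
# SSE and color assignment on invented data; the real points and lines
# come from datapoints.csv and lines.csv via the provided reader code.
points = [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]   # (x, y) pairs
lines = [(2.0, 1.0), (1.0, 0.0), (2.0, 0.0)]                # (slope, intercept)

def sse(slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

errors = [sse(m, b) for m, b in lines]
order = sorted(range(len(lines)), key=lambda i: errors[i])   # best fit first
colors = [None] * len(lines)
for color, i in zip(["green", "orange", "red"], order):
    colors[i] = color
print(colors)
```

From there, matplotlib's plt.scatter and plt.plot can draw the points and each line in its assigned color.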

In the Jupyter notebook for this problem you will see clearly where you need to add code for the different steps of the problem. Have fun!