**Warning:** This assignment is longer and more difficult than previous assignments!

**Due Date:** Sunday, May 1 at 11:59 PM

**Late Policy:** All assignments and projects are due at 11:59pm on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have four free late days they may use to turn in work late with no penalty: four 24-hour periods, no pro-rating. This late policy is enforced without exception.

**Honor Code:** Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will aggressively pursue all suspected Honor Code violations, and they will be handled through official University channels.

**Datasets:** This assignment includes some familiar datasets from past assignments as well as some new ones. All of the data files are included in the zipped folder you will download.

**Setup Instructions:** All of the files you need for this assignment have been gathered into one zipped folder: assignment3.zip (download link). Create a new directory for the assignment, download the zipped file and place it into the directory, unzip, and you're ready to start.

**Submission Instructions:** You must submit your work via Canvas under Assignment 3. Submissions via email will not be accepted. For the complete assignment you need to submit two separate files:

- **Jupyter notebook** - The *assignment3.ipynb* notebook included in the files downloaded for the assignment is set up for you to fill in and test your answers to all of the Python and SQL problems, i.e., all problems except those in Section 4. Filling in and submitting this notebook is the only format we will accept for these problems.
- **pdf document** - Section 4 Problems 1 and 2 are to be submitted together in a single one-page pdf file containing two screenshots (specified below). Please name the file *Section4.pdf*.

### Section 1: Data Operations using Python (3 points per problem)

Files used for this section: *assignment3.ipynb*, *Football.csv*, *Schoolkids.csv*

Each of the problems in this section asks a question over the Football or Schoolkids data that you solved in one of the previous assignments using spreadsheets or SQL. Now you are to solve them using Python. Important notes:

- The skeleton Jupyter notebook we've provided already includes the code to read the two datasets into lists of rows in dictionary format. For each problem you just need to write the code to compute and print the answer.
- Your code should use regular Python only -- please don't use pandas or embedded SQL queries.
- Remember that all values are read from CSV files as strings. You will frequently need to convert values to integers or floats using the int() or float() functions -- your most common mistake may be forgetting to do so! (See the short sketch below.)
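
As a minimal illustration of the conversion pitfall (the row below is hypothetical, just shaped like what `csv.DictReader` produces -- every value is a string):

```python
# Hypothetical row in dictionary format; all values read from a CSV are strings.
row = {"HomeScore": "21", "AwayScore": "9"}

# Wrong: "21" - "9" would raise a TypeError, and string comparison is
# lexicographic: "9" > "21" is True even though 9 < 21 numerically!
home = int(row["HomeScore"])   # convert before doing arithmetic
away = int(row["AwayScore"])
print(home - away)             # 12
```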

**Problem 1** (*from Assignment 1 Problem 1*). On the Football data, find the average difference between game predictions and actual outcomes, across all games. The HomeScore and AwayScore are the actual scores, while Prediction is an estimate of how much the home team will win or lose by. In arithmetic terms, Outcome = (HomeScore - AwayScore) and Difference = (Prediction - Outcome). For example, if a game has Prediction 7 and ends 24-20 in favor of the home team, then Outcome = 4 and Difference = 3. Your Python program should print a single positive or negative number for the average difference.

**Problem 2** (*from Assignment 2 Problem 6*). On the Football data, find all pairs of teams where the two teams played each other in 1998 and 1999 in the same configuration (the same team was home and the same team was away), and in 1998 the home team won while in 1999 the away team won. Your Python program should print all such pairs of teams.

**Problem 3** (*from Assignment 2 Problem 8*). On the Schoolkids data, count how many 4th graders, 5th graders, and 6th graders think Looks are more important than Grades in determining popularity, separately for each of the three grades. Your Python program should print the three results, specifying which number goes with which grade.

**Problem 4** (*from Assignment 2 Problem 11*). On the Schoolkids data, find the oldest students. For each one, your Python program should print their gender, grade, age, and school.

*The following two extra credit problems aren't a lot more difficult than the previous four -- they're provided to give you more practice with Python if you have extra time. As a reminder, we will keep track of extra credit points separately; earning extra credit points can serve to raise your final grade, but a lack of extra credit points will never lower your grade.*

**Extra Credit Problem EX1** (*from Assignment 2 Problem 4*). On the Football data, find the average prediction for games where the home team won, and the average prediction for games where the away team won. Your Python program should print the two numbers, specifying which is for home wins and which is for away wins.

**Extra Credit Problem EX2** (*from Assignment 1 Problem 4*). What is the overall ranking of Grades, Sports, Looks, and Money in perceived popularity impact across all of the students? Your Python program should print the four factors in order of overall importance, i.e., the factor with the highest importance should be first.

[*Here is a Python construct you may find useful for this problem:* Suppose you have a list of tuples (or lists) assigned to a variable L, and you want to sort L by the Nth component of each tuple (or list), assigning the result to variable SL. (Note N is an integer >= 0 identifying the component you want to sort by.) The following code does the trick:

`SL = sorted(L, key=lambda x: x[N])`
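
For instance (the factor names and totals below are made up for illustration), note that `reverse=True` sorts largest-first, which is the order this problem asks for:

```python
# Hypothetical (factor, total) pairs -- not the real Schoolkids results.
L = [("Grades", 250), ("Sports", 310), ("Looks", 180)]
SL = sorted(L, key=lambda x: x[1], reverse=True)  # sort by component 1, largest first
print(SL)  # [('Sports', 310), ('Grades', 250), ('Looks', 180)]
```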

]

### Section 2: Data Mining using SQL (3 points per problem)

Files used for this section: *assignment3.ipynb*, *Movies.db*, optionally *Shop.db*

In this section and the next one you will perform data mining over real data about people's movie-watching habits. For SQL, the database file *Movies.db* (included in your downloads) contains one table `Movies(uid,movie)`. Each row in the table represents the user with ID `uid` watching the movie named `movie`. In data mining terms, the items are movies, and a transaction (or "basket") is the set of movies watched by one user. There are about 3400 rows in the table, covering about 1400 users and 125 movies.

**Problem 1.** In class we developed SQL queries to compute frequent itemsets of two items and three items over a very small set of shopping data. Copy the two queries from the lecture notes into your Jupyter notebook and modify them to compute frequent itemsets of two and three items over the movie data, using support threshold .03 (not 0.3!). Then write a third query that computes frequent itemsets of four items over the movie data, still using support threshold .03. If your queries are correct, you should find 20 itemsets of two, 14 itemsets of three, and 3 itemsets of four.

**Problem 2.** Write SQL queries to compute association rules over the movie data that have one item on the left-hand side and one item on the right-hand side, with support threshold .03 and confidence threshold 0.5. If your approach is correct, you should find 10 association rules. Some notes:

- You may want to work on this problem initially using the small shopping data (database file *Shop.db* included with the downloads), then modify your solution to use the movie data.
- Most likely you will need to use a series of queries, storing results from one query in a table and using them in the next query. Here's how to create a table and store results from a query into it:

  ```
  Create Table tablename(attr-name1, attr-name2, ..., attr-nameN);
  Insert Into tablename SQL-query
  ```

  (In general, semicolons are used to separate multiple SQL commands in one cell.) The only requirement is that the result of the `SQL-query` must have the same number of attributes as specified in the table creation statement. To delete a table, use one of:

  ```
  Drop Table tablename
  Drop Table If Exists tablename
  ```

  The first version raises an error if the table doesn't exist, while the second doesn't.
- One suggested approach (not required): first compute all items that have sufficient support to be the left-hand side of an association rule. Then separately compute all pairs of items, along with how many transactions each pair appears in. Finally, combine the two results to find all pairs where the first item has sufficient support and the pair satisfies the confidence requirement. Each pair in the final result is an association rule, where the first item is the left-hand side and the second item is the right-hand side.
- It's likely you will need to perform division in one or more of your queries to check support or confidence. *Be careful!* By default, when two integers are divided in SQL, the result is truncated to an integer. To get a floating point result when dividing X and Y (say), use `(1.0*X) / (1.0*Y)`.

**Extra Credit Problem.** Write SQL queries to compute association rules over the movie data that have two items on the left-hand side and one item on the right-hand side, with support threshold .02 and confidence threshold 0.8. If your approach is correct, you should find 7 association rules.

### Section 3: Data Mining using Python (3 points per problem)

Files used for this section: *assignment3.ipynb*, *Movies.csv*, optionally *Shop.csv*

**Problem 1.** In class we developed Python code to compute and print frequent itemsets of two items and three items over a very small set of shopping data. Copy the code from the lecture notes into your Jupyter notebook and modify it to compute frequent itemsets of two and three items over the movie data in *Movies.csv*, using support threshold .03 (not 0.3!). Then extend the code to compute frequent itemsets of four items over the movie data, still using support threshold .03. If your code is correct, you should find 20 itemsets of two, 14 itemsets of three, and 3 itemsets of four.
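
If the lecture-notes code enumerates itemsets with nested loops, one convenient way to extend it to four items is the standard-library function `itertools.combinations`, which yields all size-k subsets. A minimal sketch on a made-up basket (the items here are placeholders, not real movie titles):

```python
from itertools import combinations

# Hypothetical basket: the set of movies watched by one user (placeholder items).
basket = {"A", "B", "C", "D", "E"}

# Enumerate every 4-item subset in a canonical sorted order, so the same
# itemset can never be generated twice under two different orderings.
for itemset in combinations(sorted(basket), 4):
    print(itemset)  # e.g., ('A', 'B', 'C', 'D')
```

The same call with 2 or 3 in place of 4 enumerates pairs and triples.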

**Problem 2.** Write Python code to compute association rules over the movie data that have one item on the left-hand side and one item on the right-hand side, with support threshold .03 and confidence threshold .5. If your approach is correct, you should find 10 association rules. Some notes:

- You may want to work on this problem initially using the small shopping data (file *Shop.csv* included with the downloads), then modify your solution to use the movie data.
- You can use the same overall approach suggested for SQL (see Section 2 Problem 2 above), although some details will differ since the data is stored differently in Python than in SQL's table structure. As with frequent itemsets, make sure your program prints the association rules after they are computed -- if you want to get fancy you can even use an arrow! A sketch of the counting scaffolding appears after this list.
- It's likely you will need to perform division to check support or confidence. *Be careful!* In Python 2, dividing two integers with `/` performs floor division, producing an integer; to get a floating point result when dividing X and Y (say), use `float(X) / float(Y)`, or issue `from __future__ import division` in the notebook to make all division floating point. (In Python 3, `/` always produces a floating point result, and `//` is integer division.)
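
As one possible starting point -- a sketch under assumed data structures, not a required solution -- here is how the single-item and pair counts from the suggested approach might be tallied. The names `baskets`, `single_count`, and `pair_count` are ours, and the baskets shown are toy data; thresholds, rule generation, and printing are left to you:

```python
from collections import defaultdict
from itertools import combinations

# Assumed structure: uid -> set of movies watched (you would build this
# from Movies.csv); the baskets below are toy data, not the real dataset.
baskets = {1: {"A", "B"}, 2: {"A", "B", "C"}, 3: {"B", "C"}}

single_count = defaultdict(int)  # item -> number of baskets containing it
pair_count = defaultdict(int)    # (item1, item2) -> baskets containing both

for movies in baskets.values():
    for item in movies:
        single_count[item] += 1
    for pair in combinations(sorted(movies), 2):  # one canonical order per pair
        pair_count[pair] += 1

# The confidence of rule a -> b is then the count for the pair {a, b}
# divided by single_count[a], with the pair looked up in sorted order.
```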

**Extra Credit Problem.** Write Python code to compute association rules over the movie data that have two items on the left-hand side and one item on the right-hand side, with support threshold .02 and confidence threshold 0.8. If your approach is correct, you should find 7 association rules.

### Section 4: Regression and Correlation in Google Sheets (2 points per part)

Files used for this section: *Football.csv*

In Assignment 1 Problem 13 you created a scatterplot from the Football data where the x-axis was Prediction and the y-axis was actual outcome (HomeScore minus AwayScore). Most of you concluded that there was at least some correlation between the two measures.

**Part 1.** To make that conclusion concrete, create the scatterplot again in Google Sheets, use the Trendline (Linear) feature to plot the linear regression line, and select the R² checkbox to see what Google Sheets computes as the coefficient of determination.

**Part 2.** Correlation between two measures shouldn't be affected by which measure happens to be on the x-axis and which on the y-axis. Repeat Part 1 with the axes reversed. Did you get the same R² value?

Take a screenshot of each of your two plots, including the regression line and the R² value. Paste the two screenshots into a single document that can be saved as pdf, then see *Submission Instructions* above.

### Section 5: Plotting and Regression in Python (6 points)

Files used for this section: *assignment3.ipynb*, *datapoints.csv*, *lines.csv*

For this problem you are given a set of (x,y) points in coordinate space, and three lines; we provide Python code that reads the points and lines from files. Your job is to write Python code that computes the sum of squares error (SSE) for each line with respect to the points, determines colors for the lines based on how closely they fit the points -- green for the closest fit (smallest SSE), orange for the middle, and red for the worst fit (largest SSE) -- then draws a plot showing the points and the colored lines. Some details:

- The code we provide reads two files (included in your downloads): *datapoints.csv* contains a set of points, one per line of the file in x,y format; *lines.csv* contains three lines, one per line of the file in a,b format. Each a,b pair in file *lines.csv* represents the line defined by the equation y = ax + b. You will need to read the code to understand how points and lines are represented in the program's data structures.
- The SSE for each of the lines with respect to the set of points is the sum of the squared vertical distances from the points to the line. For a given point (X,Y) and line y = Ax + B, the squared vertical distance is `(A*X + B - Y)^2`. A sketch of this computation appears after this list.
- All points are in the positive coordinate space, i.e., x > 0 and y > 0.
- You may assume that the three lines all have different SSEs, i.e., your code does not need to account for ties.
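
As a starting point, here is a minimal sketch of the SSE computation. We're assuming `points` is a list of (x, y) float pairs and `lines` is a list of (a, b) float pairs -- the provided reader code may represent them differently, so adapt the names to what you find there. Ranking the three lines by SSE, assigning the colors, and drawing the plot are left to you:

```python
def sse(a, b, points):
    """Sum of squared vertical distances from each point to the line y = a*x + b."""
    return sum((a * x + b - y) ** 2 for x, y in points)

# Assumed structures with toy values -- not the contents of the real files.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # each (x, y) from datapoints.csv
lines = [(2.0, 0.0), (1.0, 1.0), (0.0, 3.0)]   # each (a, b) is the line y = a*x + b

errors = [sse(a, b, points) for (a, b) in lines]
print(errors)  # smallest SSE -> green, middle -> orange, largest -> red
```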