Assignment 2
Due Date: Thursday April 21 at 11:59 PM

Late Policy: All assignments and projects are due at 11:59pm on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have four free late days they may use to turn in work late with no penalty: four 24-hour periods, no pro-rating. This late policy is enforced without exception.

Honor Code: Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will pursue aggressively all suspected cases of Honor Code violations, and they will be handled through official University channels.

Datasets: This assignment uses the same "Football" and "Schoolkids" datasets that were used for Assignment 1. Please refer to the Assignment 1 specification and the Datasets page for more details.

Setup Instructions:
  1. You will need to complete the Prerequisite to Assignment 2 before you begin this assignment. The prerequisite ensures that Jupyter notebooks are installed and running properly on your computer. We've set a deadline of Thursday April 14 for all students to complete the prerequisite, to ensure there are no system problems down the road. See the Prerequisite web page for links to Jupyter setup instructions and other details.
  2. Download the following two files:
  3. Place the two downloaded files in the same directory. Now you're ready to launch the notebook for the assignment, open the database from the notebook, and get to work!
Submission Instructions: You must submit your work via Canvas under Assignment 2. Submissions via email will not be accepted. For the complete assignment you need to submit two separate files:
  1. Jupyter notebook - We set up the assignment2.ipynb notebook so that you can use it to fill in your answers to all of the SQL problems, in a form that's easy for us to read and check. Of course the notebook also allows you to test your SQL queries. A filled-in notebook is the only format we will accept for Problems 1-11 and optional problems EX1 and EX2.
  2. pdf document - Problems 12-13 and optional Problems EX3 and EX4 are to be submitted together in a single pdf file, including text and possibly some graphics. As in Assignment 1, it's critical that the answer to each problem is on a separate page. Answers spanning multiple pages are allowed, as long as the answer to the next problem starts on a new page. To make sure the format is correct, you may modify the template you used for Assignment 1, or here's one for Assignment 2: Google Doc

Part 1: SQL on Football data (3 points per problem)

For each problem, write a SQL query to find the answer. Make sure your query is general enough to work on any data in the tables, not just the current data.

Problem 1. Find all games where Seattle was the home or away team. For each game, return the two teams and scores.

Problem 2. Find all games where the home team scored more than 40 points and the away team scored fewer than 10 points. For each game, return the home and away team, their scores, and the prediction for the game; return the result sorted by the home-team score in descending order.

Problem 3. Find all teams who lost by more than 20 points as the home team during the first four weeks of some year (i.e., Week < 5). Return the teams in alphabetical order and return each team only once.

Problem 4. Find the average prediction for games where the home team won. Then in a separate query find the average prediction for games where the away team won.

Problem 5. In each of the three years (1998, 1999, 2000), how many games were there where the home team was predicted to win, i.e., prediction > 0, and the actual point spread (Homescore - Awayscore) was more than three times the prediction? Your result should have three tuples, each containing the year and number of games.

Problem 6. Find all pairs of teams where the two teams played each other in 1998 and 1999 in the same configuration (the same team was home and the same team was away), and in 1998 the home team won while in 1999 the away team won. Return the pairs of teams.

Part 2: SQL on Schoolkids data (3 points per problem)

For each problem, write a SQL query to find the answer. Make sure your query is general enough to work on any data in the tables, not just the current data.

Problem 7. Find all students whose goal is Sports but they ranked Sports as fourth in impact on popularity. Return the entire tuple for each such student.

Problem 8. Count how many 4th graders think Looks are more important than Grades. Return just one number. Then run the same query for 5th graders and again for 6th graders, returning one number each time.

Problem 9. For each of the three goals (Grades, Sports, Popular), what is the average age of students who have that goal? Your result should have three tuples, one each with the goal and the average age for that goal.

Problem 10. Find all 4th graders in Urban schools where there is no 5th grader in their same school who has the same goal and same ranking of popularity impact factors. Return each student's grade, school, and goal.

Problem 11. Find the oldest students. For each one return their gender, grade, age, and school.

Part 3: Correlation and Causation (4 points per problem)

Problem 12. In class we talked about four different relationships between two measures X and Y:
  1. X and Y are positively correlated: they tend to be higher and lower together, without X directly influencing Y or Y directly influencing X.
  2. X and Y are negatively correlated: when X is higher Y tends to be lower (and when X is lower Y tends to be higher), without X directly influencing Y or Y directly influencing X.
  3. X and Y are positively causally related: X being higher causes Y to be higher (and X being lower causes Y to be lower).
  4. X and Y are negatively causally related: X being higher causes Y to be lower (and X being lower causes Y to be higher).
Give one real-world example of each relationship, without using examples from class or those found on the web.
Problem 13. In the Schoolkids data, we might expect there to be some correlation between Gender and Goal. Since both measures are categorical, use the table method shown in class to determine whether there appears to be a correlation. You may use a spreadsheet, SQL, or any other tool you like to create the table. Include the table in your writeup (as an actual table or a screenshot), and argue based on the table whether or not there appears to be a correlation.


Extra Credit Problems (2 points each)

We will keep track of extra credit points separately. Earning extra credit points can serve to raise your final grade, but a lack of extra credit points will never lower your grade.

SQL Extra Credit

Problem EX1. On the Football data, find all teams whose average score playing as a home team is lower than its average score playing as an away team. For each team, return the team, its average score in home games, and its average score in away games.

Problem EX2. On the Schoolkids data, revisit Problem 8, but instead of running three separate queries, run one query whose result is a single tuple with six attributes:
  1. the integer 4
  2. the number of 4th graders who think Looks are more important than grades
  3. the integer 5
  4. the number of 5th graders who think Looks are more important than grades
  5. the integer 6
  6. the number of 6th graders who think Looks are more important than grades

Correlation and Causation Extra Credit

Problem EX3. Find an apparent correlation in the Schoolkids data, other than the possible correlation between Gender and Goal explored in Problem 13. State the correlation and then show why it appears to hold. You may use a visualization, spreadsheet, SQL, or any other tool you like to do the analysis, but do make your argument based purely on data analysis, not what you know about the meaning of the data.

Problem EX4. Find an apparent correlation in the Football data. State the correlation and then show why it appears to hold. You may use a visualization, spreadsheet, SQL, or any other tool you like to do the analysis, but do make your argument based purely on data analysis, not what you know about the meaning of the data.