Assignment 1
Due Date: Sunday April 10 at 11:59 PM

Late Policy: All assignments and projects are due at 11:59pm on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have four free late days they may use to turn in work late with no penalty: four 24-hour periods, no pro-rating. This late policy is enforced without exception.

Honor Code: Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will pursue aggressively all suspected cases of Honor Code violations, and they will be handled through official University channels.

Datasets: This assignment makes use of the following two datasets; more information is provided on the Datasets page, but what's here is sufficient for you to complete the assignment.
Setup Instructions: We recommend using Google Sheets for Problems 1-13. If you use Excel or another spreadsheet program we may not be able to help you with special features, and you should explain any nonstandard features that you use.
  1. Download the two datasets by clicking on the links above.
  2. Go to the Google Drive website. You may use your personal Gmail account to log in, or you can use your SUNet ID by entering your Stanford email address -- Google should automatically redirect you to WebLogin.
  3. To upload the CSV files, drag them from your desktop or file browser to the main Google Drive page.
  4. To open a newly uploaded CSV file, right-click the CSV, select "Open with" and choose "Google Sheets". After you open your CSV file in Google Sheets once, it will save a new version that opens automatically in Sheets from then on.
  5. You're now ready to solve the problems -- have fun!
Submission Instructions: You are to create one PDF file containing all of your answers for this assignment, then submit it to your Canvas account. (You may submit multiple times, however we will use your last submission only, both for grading and for the late policy.)

Part 1: Spreadsheet Data Analysis (3 points per problem)

For each of these problems you will submit an answer obtained by performing spreadsheet operations on the provided data, and a description of the how you manipulated the spreadsheet(s) to obtain your answer.

Problem 1 (Football data). Find the average difference between game predictions and actual outcomes, across all games. The HomeScore and AwayScore columns are the actual scores, while the Prediction column is an estimate of how much the home team will win or lose by, i.e., it's an estimate of HomeScore minus AwayScore. For example, if the Home team beats the Away team by 15 points and the Prediction is 10, then the difference between Prediction and outcome is -5; if the Home team loses to the Away team by 8 points and the Prediction is -2, then the difference between Prediction and outcome is 6. In arithmetic terms, Outcome = (HomeScore - AwayScore), and Difference = (Prediction - Outcome). Give a single positive or negative number for the average difference, then explain how you manipulated the spreadsheet to arrive at the number.

Problem 2 (Football data). Which weeks have the smallest and largest point spreads on average across all games played in that week across all three years? The point spread is the difference between the higher and lower actual scores, i.e., |HomeScore - AwayScore|, always a positive number. (Hint: You might want to use the =abs() function.) Give the smallest and largest point-spread weeks, along with the average point spread across all games played in that week in all years, then explain how you manipulated the spreadsheet to arrive at the results.

Problem 3 (Football data). Find the teams that scored the least and the most total points. Make sure to include both home and away games when computing total points. Give the two teams and their total points, then explain how you manipulated the spreadsheet to arrive at the answer.

Problem 4 (Schoolkids data). What is the overall ranking of Grades, Sports, Looks, and Money in perceived popularity impact across all of the students? Note that every row represents one student's ranking from 1 (most important) to 4 (least important) on how important the four factors (Grades, Sports, Looks and Money) are in impacting popularity. State the four factors in order of overall importance (the factor with the highest importance should be first), and explain how you manipulated the spreadsheet to arrive at the answer.

Problem 5 (Schoolkids data). Which area type puts the least emphasis overall on Looks: Rural, Suburban, or Urban? Give your answer and explain how you manipulated the spreadsheet to arrive at your answer.

Problem 6 (Schoolkids data). Find the school with the highest girl-to-boy ratio. Provide the school name and the ratio, and explain how you manipulated the spreadsheet to arrive at your answer.

Problem 7 (Schoolkids data). Considering only boys who are older than 10, what is their most common goal? Give the goal and the number of boys greater than 10 who have that goal, then explain how you manipulated the spreadsheet to arrive at the answer.

Part 2: Data Visualization (3 points per problem)

For each of these problems you will submit a graphic: a bar graph, pie chart, scatterplot, or other visualization. You may either copy and paste visualizations from Google Sheets into your document, or do a screenshot - either is fine. Problems 14 and 15 will also include a text section covering tool used and what the visualization is showing.

Problem 8 (Football data, bar graph). For Arizona, Atlanta, Baltimore, Buffalo, and Carolina, create a bar graph showing their average score as a home team and average score as an away team.

Problem 9 (Schoolkids data, bar graph). Create a stacked bar graph that shows, for each of the three school Types (Rural, Suburban, Urban), the total number of students divided by how many have the three goals (Sports, Popular, Grades). Your graph should have three bars, with each bar divided into three colors.

Problem 10 (Schoolkids data, pie chart). Create a pie chart showing the relative percentage of students attending schools that are Rural, Suburban, and Urban.

Problem 11 (Schoolkids data, pie chart). Create a pie chart that shows the relative percentage of students who rank as most important Grades, Sports, Looks, and Money.

Problem 12 (Football data, pie chart). Every game is either a home win (HomeScore > AwayScore), an away win (HomeScore < AwayScore), or a tie (HomeScore = AwayScore). Create a pie chart that shows the relative percentage of home wins, away wins, and ties across all games played. (Hint: you may find the =if() function helpful.)

Problem 13 (Football data, scatterplot). Create a scatterplot where the x-axis is the Prediction and the y-axis is the actual outcome (HomeScore minus AwayScore). There should be one dot in the scatterplot for every game played. What do you conclude from the scatterplot?

Problems 14-15 (Personal choice). Create two interesting visualizations of your own. You may use the Football and/or Schoolkids dataset, and you may use Google Sheets, Raw, Excel, or any other tool of your choosing. For each one, mention what tool you used, and explain what your visualization is showing.