Assignment 3

Due Date: Sunday Apr 30 at 11:59 PM

Honor Code: Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will pursue aggressively all suspected cases of Honor Code violations, and they will be handled through official University channels.

Datasets: This assignment includes the Football dataset from past assignments as well as actual Yelp data! All of the data files are included in the zipped folder you will download.

Setup Instructions: All of the files you need for this assignment have been gathered into one zipped folder: assignment3.zip. Create a new directory for the assignment, download the zipped file and place it into the directory, unzip, and you're ready to start.

Submission Instructions: You must submit your completed Jupyter notebook, assignment3.ipynb, via Canvas under Assignment 3. Submissions via email will not be accepted.

Important notes:

  • Your code should use regular Python only -- please don't use pandas or embedded SQL queries.
  • Remember that all values are read from CSV files as strings. You will frequently need to convert values to integers or floats using the functions int() or float(). Your most common mistake may be forgetting to do so!
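As a minimal sketch of why the conversion matters, the snippet below reads a tiny in-memory table with Python's csv module. The two rows are made-up sample data standing in for a real file, and reading through csv.DictReader is only one possible approach, not a required one.

```python
import csv
import io

# Made-up two-row sample standing in for a real CSV file (an assumption,
# not actual assignment data).
sample = io.StringIO("HomeScore,AwayScore\n21,14\n7,10\n")

rows = list(csv.DictReader(sample))

# Every value arrives as a string, so "+" concatenates instead of adding.
raw = rows[0]["HomeScore"] + rows[0]["AwayScore"]

# Converting with int() first gives the intended arithmetic.
total = int(rows[0]["HomeScore"]) + int(rows[0]["AwayScore"])

print(raw)    # the two strings glued together
print(total)  # the numeric sum
```

If a result looks like two numbers stuck together, a missing int() or float() call is the likely cause.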

Section 1: Data Operations with Python on Football data

Problem 1 [3 points]. [Football.csv] Find the average difference between game predictions and actual outcomes, across all games. The HomeScore and AwayScore are the actual scores, while Prediction is an estimate of how much the home team will win or lose by. In arithmetic terms, Outcome = (HomeScore - AwayScore), and Difference = (Prediction - Outcome). Your Python program should print a single positive or negative number for the average difference.
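To make the sign convention concrete, here is a small worked example for a single game. The scores and prediction are invented for illustration, and the function name difference is a hypothetical helper, not something provided with the assignment.

```python
def difference(home_score, away_score, prediction):
    """Return Prediction - Outcome for one game, per the definitions above."""
    outcome = home_score - away_score  # positive when the home team wins
    return prediction - outcome

# Invented game: the home team was predicted to win by 3.5 and won 24-17.
# Outcome = 24 - 17 = 7, so Difference = 3.5 - 7 = -3.5.
print(difference(24, 17, 3.5))
```

A negative difference here means the home team did better than predicted; a positive one means it did worse.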

Problem 2 [3 points]. [Football.csv] Find all pairs of teams where the two teams played each other in 1998 and 1999 in the same configuration (the same team was home and the same team was away), and in 1998 the home team won while in 1999 the away team won. Your Python program should print all such pairs of teams.

Section 2: Data Operations with Python on Yelp data

Problem 1 [3 points]. [Restaurants.csv] Find the category that has the largest number of restaurants. Your program should print out the category you found, and the number of restaurants in that category.
Hint: Create a dictionary that maps from each category to how many restaurants belong to that category. So the category is the key, and the number of restaurants is the value.
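The counting pattern in the hint can be sketched as follows. The list of categories is made-up sample data; in your solution the categories would come from the file instead.

```python
# Made-up category values, one per restaurant (an assumption for illustration).
categories = ["Pizza", "Thai", "Pizza", "Sushi", "Pizza", "Thai"]

counts = {}
for category in categories:
    # dict.get returns 0 the first time a category is seen.
    counts[category] = counts.get(category, 0) + 1

# max with key=counts.get picks the key whose count is largest.
top = max(counts, key=counts.get)
print(top, counts[top])
```

Note that max returns only one key, so ties (if any) would need separate handling.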

Problem 2 [3 points]. [Restaurants.csv] Find all the unique cities that are represented in the table. Your program should print out all cities with no duplicates.

Problem 3 [12 points]. In this problem, we want to find the top-rated restaurant in each city. However, we only want to include restaurants that have gotten at least 5 reviews. We broke down this problem into subparts for you:

  • Part a [2 points]. [Restaurants.csv] For each restaurant, find the city it is in. Your program should print out all restaurants and their cities, one restaurant per line.
    Hint: Create a dictionary mapping each restaurant to its city.
  • Part b [2 points]. [Reviews.csv] For each restaurant, compute the average star rating. Your program should print out all restaurants and their average star ratings.
  • Part c [2 points]. [Reviews.csv] For each restaurant, compute the number of reviews. Your program should print out all restaurants and their corresponding number of reviews.
  • Part d [2 points]. For each city, find all restaurants in that city. Your program should print out each city and its restaurants.
    Hint: Create a dictionary from city to a list of restaurants.
  • Part e [4 points]. For each city, find the restaurant with the highest average star rating that has at least 5 reviews. Your program should print out all cities and the top-rated restaurant in each city.
    Hint: Use the dictionaries you created in parts a), b), c) and d).
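The "dictionary from city to a list of restaurants" idea from part d can be sketched like this. The restaurant and city names are invented, and building the grouping with dict.setdefault is just one option among several.

```python
# Invented restaurant -> city mapping, shaped like the dictionary from part a.
restaurant_city = {
    "A Cafe": "Springfield",
    "B Diner": "Shelbyville",
    "C Grill": "Springfield",
}

city_restaurants = {}
for restaurant, city in restaurant_city.items():
    # setdefault creates an empty list the first time a city appears,
    # then returns that list so the restaurant can be appended to it.
    city_restaurants.setdefault(city, []).append(restaurant)

for city, restaurants in city_restaurants.items():
    print(city, restaurants)
```

The same loop-and-accumulate shape works for the per-restaurant totals in parts b and c, with numbers in place of lists.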

Problem 4. Imagine you are the owner of some restaurant and you want to know which restaurants are competing in your neighborhood. You are also curious about how much competition other restaurants have. So you decide to make an 'oracle' (a Python function!), that can give you a list of competing restaurants for any restaurant you are interested in.

  • Part a [2 points]. [Restaurants.csv] For each restaurant, find its geographic location as a (latitude, longitude) tuple, and store this information in a dictionary mapping from restaurant to location tuple for use in part b). Your program should print out all restaurants and their corresponding geographic locations.
    Hint: Make sure to cast your latitude and longitude values to the float type.
  • Part b [4 points]. [Restaurants.csv] Write a function that takes the name of a restaurant as an argument and returns a list of all restaurants in its neighborhood (within 0.1 latitude units and 0.1 longitude units of the reference restaurant). You should use the dictionary you created in part a) in your function. Then run your function on 'Bonchon' and on 'Pizza Pan'. Your program should print the restaurants in the neighborhood of 'Bonchon', and the restaurants in the neighborhood of 'Pizza Pan'.

    Hint: We created a helper function in_neighborhood(lat1, lon1, lat2, lon2), which returns True if one location is in the neighborhood of the other, and False otherwise. You can use this function to check whether two locations are close enough to be in the same neighborhood.
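For intuition only, a check like the one the helper performs might look like the sketch below. This is an assumption about its behavior based on the 0.1-unit rule in part b, not the provided implementation; the name in_neighborhood_sketch, the inclusive boundary, and the coordinates are all invented for illustration. Use the provided in_neighborhood in your solution.

```python
def in_neighborhood_sketch(lat1, lon1, lat2, lon2, limit=0.1):
    """Hypothetical stand-in: True if both coordinates differ by at most limit.

    Whether the real helper treats the boundary as inclusive is an assumption.
    """
    return abs(lat1 - lat2) <= limit and abs(lon1 - lon2) <= limit

# Invented coordinates: 0.05 apart in both directions, so in the neighborhood.
print(in_neighborhood_sketch(37.40, -122.10, 37.45, -122.15))

# Invented coordinates: 0.2 apart in latitude, so not in the neighborhood.
print(in_neighborhood_sketch(37.40, -122.10, 37.60, -122.10))
```

Since every restaurant is trivially in its own neighborhood under this rule, consider whether the reference restaurant should appear in its own list of competitors.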