Assignment 4: R, Classification, Clustering, Sampling
Due Date: Sunday May 29 at 11:59 PM

Late Policy: All assignments and projects are due at 11:59 PM on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have four free late days they may use to turn in work late with no penalty: four 24-hour periods, no pro-rating. This late policy is enforced without exception.

Honor Code: Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will pursue aggressively all suspected cases of Honor Code violations, and they will be handled through official University channels.

Setup Instructions: Before starting this assignment you need to make sure the programming language R and development environment RStudio are installed on your computer, and download a zipped folder containing the datasets used in the assignment.
  1. R - If you don't already have R installed on your computer, go to this page, find the link corresponding to your OS (Windows, Mac OS X, or Linux), and follow the corresponding download and installation instructions.
  2. RStudio - To install RStudio, go to this page, download the correct installer for your OS, open it and follow the setup instructions.
  3. Datasets - Download this zipped file and unzip it; there are four CSV files in the folder. Keep track of where you've unzipped your data so you can use it in RStudio.

Submission Instructions: You must submit your work via Canvas under Assignment 4. Submissions via email will not be accepted. For the complete assignment, you need to submit one file for each of the five problems, plus optionally one file for each extra credit problem. Use the following exact filenames:

Problem 1: Naive Bayes Classifier (5 points)

(Schoolkids data) This problem takes you step by step through building a Naive Bayes classifier for predicting a student's Goal from their Gender, Grade, and school Type, using the entire Schoolkids dataset as training data. For the probability calculations you can use any tool you like (spreadsheets, SQL, Python, R, or something else), but we do expect you to state what you used and show your work. Note that this problem parallels the computation we did in class for the weather data (reflected in the lecture notes and Bayes.csv file).

(a) Category probabilities - For each of the three categories for Goal -- Grades, Popular, Sports -- compute the probability of that category, i.e., the fraction of the Schoolkids data items in that category. Your answer should consist of one number for each category, with the three numbers summing to 1.0. Don't forget to state what tool you used to compute your answer, and demonstrate how the tool was used.
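
If you choose R as your tool, one way to compute category fractions is with table() and prop.table(). The sketch below is only an illustration: the small data frame is made up, and in your own work you would load the real Schoolkids CSV (for example with read.csv) instead.

```r
# Toy stand-in for the Schoolkids data (invented values, not the real dataset).
kids <- data.frame(
  Goal = c("Grades", "Sports", "Grades", "Popular",
           "Grades", "Sports", "Popular", "Grades")
)

# Count the items in each Goal category, then divide by the total count.
priors <- prop.table(table(kids$Goal))
print(priors)

# Sanity check: the category probabilities should sum to 1.
print(sum(priors))
```

On this toy data, half of the items fall in the Grades category and a quarter in each of the other two.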

(b) Conditional probabilities within categories - Now separately consider each of the three categories for Goal. For each one, compute the probabilities within that category for each of the possible values for each of the three features -- Gender, Grade, and Type. For each category, you should have 8 numbers (so your answer will consist of 24 numbers total): the fraction of girls versus boys within that category (summing to 1.0); the fraction of grades 4, 5, and 6 within that category (summing to 1.0); the fraction of Rural, Suburban, and Urban within that category (summing to 1.0). Again, don't forget to state what tool you used to compute your answer, and demonstrate how the tool was used.
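
The within-category fractions can be sketched in R by cross-tabulating Goal against each feature and normalizing each row. The data frame and its value encodings (such as "girl" and "boy") are assumptions made up for illustration; the real Schoolkids file may encode values differently.

```r
# Toy stand-in for the Schoolkids data (invented values, not the real dataset).
kids <- data.frame(
  Gender = c("girl", "boy", "girl", "boy", "girl", "boy", "girl", "boy"),
  Grade  = c(4, 5, 5, 6, 4, 6, 5, 4),
  Type   = c("Rural", "Suburban", "Suburban", "Urban",
             "Urban", "Rural", "Suburban", "Rural"),
  Goal   = c("Grades", "Sports", "Grades", "Popular",
             "Grades", "Sports", "Popular", "Grades")
)

features <- c("Gender", "Grade", "Type")

# For each feature, cross-tabulate Goal (rows) against the feature's values
# (columns), then normalize each row so it sums to 1 within its category.
cond <- lapply(features, function(f) {
  prop.table(table(kids$Goal, kids[[f]]), margin = 1)
})
names(cond) <- features

print(cond$Gender)  # fraction of girls versus boys within each Goal
print(cond$Grade)   # fraction of grades 4, 5, and 6 within each Goal
print(cond$Type)    # fraction of Rural, Suburban, and Urban within each Goal
```

Note that a feature value that never occurs within a category gets probability 0 there, which forces any product containing it to 0.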

(c) Category assignment - Now you're ready to predict a category for new items. Suppose new Item #1 is a 5th grade girl from a Suburban school, and Item #2 is a 4th grade boy from a Rural school. Using the probabilities from parts (a) and (b), compute the most likely categories for Items #1 and #2. Show your calculations!
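
As a rough illustration of the calculation for a single item, the R sketch below multiplies each category's probability by the matching conditional probabilities and picks the largest product. All of the numbers are placeholders invented for this example; they are not the answers for the Schoolkids data.

```r
# Placeholder probabilities, invented for illustration only.
# Substitute the numbers you computed in the earlier parts.
prior      <- c(Grades = 0.5, Popular = 0.3, Sports = 0.2)  # P(Goal)
p_girl     <- c(Grades = 0.6, Popular = 0.5, Sports = 0.4)  # P(girl | Goal)
p_grade5   <- c(Grades = 0.3, Popular = 0.4, Sports = 0.3)  # P(grade 5 | Goal)
p_suburban <- c(Grades = 0.4, Popular = 0.3, Sports = 0.5)  # P(Suburban | Goal)

# Naive Bayes score for a 5th grade girl from a Suburban school:
# the category probability times the per-feature conditional probabilities.
score <- prior * p_girl * p_grade5 * p_suburban
print(score)

# The predicted category is the one with the largest score.
prediction <- names(which.max(score))
print(prediction)
```

With these placeholder values the scores are 0.036, 0.018, and 0.012, so the sketch predicts Grades.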

Problem 2: Plotting and Regression Using R (5 points)

(Football data) Write R code that reads the Football data and creates a scatterplot as follows:

Hints for the point coloring requirement:

Problem 3: kNN Classification Using R (5 points)

(Schoolkids data) Write R code that creates a classifier for the Schoolkids data using R's k-nearest-neighbors (knn) function, taking into account more features than the three that you used in Problem 1. We've created two new data files, SchoolkidsTrain.csv and SchoolkidsTest.csv: SchoolkidsTest contains 20 rows extracted from the original Schoolkids data, and SchoolkidsTrain contains the rest. In both of the data files, the School column has been removed, and Goal is shifted to the last (9th) column. We suggest you start with the R code we showed in class for weather categorization, then modify it to read from the two separate files and work with the schoolkid features. Initially use all eight features in the data to predict each student's Goal. (Note: For fun, the first two items in the test set have the same feature values you used in part (c) of Problem 1. Which do you like better -- Naive Bayes or k-nearest-neighbors?)

Once you have your code running, experiment with different values of k. In addition, try using different subsets of the features for prediction instead of all eight features. How high can you get the accuracy? In the program you submit, make sure to use the setting for k and the set of features that give the highest accuracy you were able to find.
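
For orientation, here is a minimal sketch of the train/test workflow, assuming the knn function in question is knn() from the class package. The two tiny data frames and the Age column are hypothetical stand-ins; your program would read SchoolkidsTrain.csv and SchoolkidsTest.csv instead. Keep in mind that knn() computes distances, so the feature columns you pass it need to be numeric.

```r
library(class)  # provides knn(); assumed here to be the function used in class

# Hypothetical stand-ins for the train and test files (invented values).
train <- data.frame(
  Grade = c(4, 4, 5, 5, 6, 6),
  Age   = c(9, 10, 10, 11, 11, 12),
  Goal  = c("Grades", "Grades", "Popular", "Popular", "Sports", "Sports")
)
test <- data.frame(
  Grade = c(4, 6),
  Age   = c(9, 12),
  Goal  = c("Grades", "Sports")
)

# Choose which feature columns to use and how many neighbors to consult.
feature_cols <- c("Grade", "Age")
k <- 1

# Predict a Goal for every test row from its k nearest training rows.
pred <- knn(train[, feature_cols, drop = FALSE],
            test[, feature_cols, drop = FALSE],
            cl = factor(train$Goal), k = k)

# Accuracy is the fraction of test rows whose prediction matches the true Goal.
accuracy <- mean(as.character(pred) == test$Goal)
print(accuracy)
```

Changing k or the contents of feature_cols is all it takes to try a different setting.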

Extra credit extension (3 points): Add code that automatically iterates through values for k and/or combinations of features to find the settings that give the highest accuracy. Please include comments explaining your strategy.
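
One possible strategy, sketched here under the same assumptions (knn() from the class package, hypothetical toy data, an invented Age column), is an exhaustive search: loop over every non-empty feature subset and a range of k values, and remember the best accuracy seen. This is only one approach, and with eight features you may want to think about how many combinations that implies.

```r
library(class)  # provides knn()

# Hypothetical stand-ins for the train and test files (invented values).
train <- data.frame(
  Grade = c(4, 4, 5, 5, 6, 6),
  Age   = c(9, 10, 10, 11, 11, 12),
  Goal  = c("Grades", "Grades", "Popular", "Popular", "Sports", "Sports")
)
test <- data.frame(
  Grade = c(4, 6),
  Age   = c(9, 12),
  Goal  = c("Grades", "Sports")
)
features <- c("Grade", "Age")

set.seed(1)  # knn() breaks voting ties at random, so fix the seed
best <- list(acc = -1, k = NA, cols = NULL)

for (m in seq_along(features)) {                          # subset size
  for (cols in combn(features, m, simplify = FALSE)) {    # each subset of that size
    for (k in 1:3) {                                      # each candidate k
      pred <- knn(train[, cols, drop = FALSE], test[, cols, drop = FALSE],
                  cl = factor(train$Goal), k = k)
      acc <- mean(as.character(pred) == test$Goal)
      # Keep the first setting that strictly improves on the best so far.
      if (acc > best$acc) best <- list(acc = acc, k = k, cols = cols)
    }
  }
}

cat("best accuracy:", best$acc, "k:", best$k, "features:", best$cols, "\n")
```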

Problem 4: Clustering Using R (5 points)

(Football data) Write R code that clusters the Football data based on two features, HomeScore and AwayScore, and creates a plot with HomeScore on the x-axis, AwayScore on the y-axis, and point colors showing the clusters. (Note the similarity to what was shown in class for clustering the weather data based on Longitude and Latitude.) Use only the games played in the first two weeks of 1998, i.e., only the first 30 rows of the Football data. To reference rows 1-30 in a dataframe D use "D[1:30,]". Using your program, experiment with different numbers of clusters from 2 to 8. Choose the number you like best and use it in the program that you submit; put a comment at the top of your code briefly justifying your choice.
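
A minimal sketch of this kind of clustering, assuming kmeans() from base R is an acceptable clustering function, might look like the following. The scores are randomly generated stand-ins rather than the real Football data, and k = 2 is an arbitrary starting point, not a recommendation.

```r
set.seed(42)  # kmeans() starts from random centers; fix the seed for repeatability

# Randomly generated stand-in for the Football data (not real scores).
football <- data.frame(
  HomeScore = round(c(rnorm(20, 14, 3), rnorm(20, 34, 3))),
  AwayScore = round(c(rnorm(20, 10, 3), rnorm(20, 27, 3)))
)

# Keep only the first 30 rows, as the problem requires.
games <- football[1:30, ]

# Cluster on the two score columns; nstart repeats the random initialization.
k <- 2
fit <- kmeans(games[, c("HomeScore", "AwayScore")], centers = k, nstart = 10)

# Plot the games, coloring each point by its assigned cluster number.
plot(games$HomeScore, games$AwayScore, col = fit$cluster, pch = 19,
     xlab = "HomeScore", ylab = "AwayScore")
```

Rerunning with k set to each value from 2 to 8 lets you compare the resulting plots side by side.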

Problem 5: Sampling Coded in R (5 points)

(Football data) Suppose you need to reduce the size of the football data through sampling. You decide to try several methods (note that the samples will not all be the same size):
  1. Random selection with 10% probability, i.e., each game has a 10% chance of being included in the sample
  2. Every 10th game in the dataset, i.e., games 10, 20, 30, 40, etc.
  3. Games played in "representative" weeks 4, 9, and 14
  4. Games played in "representative" year 1999
  5. Games involving "representative" teams Dallas, Denver, and Detroit (as home or away)
Write R code that creates the five samples, then prints a few simple statistics to compare the samples against the original data:

The output of your program should consist of a total of 24 numbers: for each of the four statistics, you should give the result of that statistic on the original full dataset and on each of the five samples. You should structure the output so it's easy for you (and the grader!) to compare the statistics across the different samples.

Just based on eyeballing the results, are any samples clearly better than others? Which one(s) do you like best? Put your answers in a comment at the top of your code.
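
The five sampling methods can each be written as a row selection on a data frame. The sketch below uses a randomly generated stand-in for the Football data; the column names Year, Week, HomeTeam, and AwayTeam are assumptions, so check them against the real file. The sample size and mean HomeScore printed at the end are placeholder statistics chosen for illustration, not the ones required by the problem.

```r
set.seed(7)

# Randomly generated stand-in for the Football data (hypothetical columns).
n <- 200
teams <- c("Dallas", "Denver", "Detroit", "Miami", "Chicago", "Seattle")
football <- data.frame(
  Year      = sample(1998:2000, n, replace = TRUE),
  Week      = sample(1:16, n, replace = TRUE),
  HomeTeam  = sample(teams, n, replace = TRUE),
  AwayTeam  = sample(teams, n, replace = TRUE),
  HomeScore = rpois(n, 22),
  AwayScore = rpois(n, 19)
)

rep_teams <- c("Dallas", "Denver", "Detroit")

samples <- list(
  # 1. Each game is kept independently with probability 0.10.
  random10  = football[runif(nrow(football)) < 0.10, ],
  # 2. Rows 10, 20, 30, ...
  every10th = football[seq(10, nrow(football), by = 10), ],
  # 3. Games played in weeks 4, 9, and 14.
  weeks     = football[football$Week %in% c(4, 9, 14), ],
  # 4. Games played in 1999.
  year1999  = football[football$Year == 1999, ],
  # 5. Games with a representative team as home or away.
  teams     = football[football$HomeTeam %in% rep_teams |
                       football$AwayTeam %in% rep_teams, ]
)

# Placeholder statistics: one column for the full data and one per sample.
all_sets <- c(list(full = football), samples)
summary_table <- sapply(all_sets, function(d) {
  c(n = nrow(d), mean_home = mean(d$HomeScore))
})
print(round(summary_table, 2))
```

Putting the full dataset and the samples in the columns of one table makes the statistics easy to compare across a row.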

Some coding hints:

Extra credit extension (3 points): Add code to determine computationally which sample is best. Make sure to include comments so we know what calculations you're performing to make the determination.
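
One way to make "best" concrete, offered only as a sketch, is to measure how far each sample's statistics fall from the full dataset's and pick the sample with the smallest total relative error. The matrix below holds invented placeholder values with generic names (stat1, s1, and so on); other distance measures are equally defensible.

```r
# Invented placeholder statistics: one row per statistic,
# one column for the full dataset and one per sample.
stats <- rbind(
  stat1 = c(full = 21.0, s1 = 20.1, s2 = 22.5, s3 = 21.2),
  stat2 = c(full = 18.0, s1 = 19.0, s2 = 17.0, s3 = 18.1)
)

# Relative error of each sample's statistic against the full-data value.
# The full-data vector has one entry per row, so it recycles down each column.
rel_err <- abs(stats[, -1] - stats[, "full"]) / abs(stats[, "full"])

# Total error per sample; the smallest total is the closest match overall.
total_err <- colSums(rel_err)
print(round(total_err, 4))

best_sample <- names(which.min(total_err))
print(best_sample)
```

With these placeholder values the third sample has the smallest total error.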