Assignment 4

Logistics

Due Date

Thursday, Feb 28 at 11:59 PM

Policies

Late Policy: All assignments and projects are due at 11:59 PM on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have four free late days they may use to turn in work late with no penalty: four 24-hour periods, no pro-rating. This late policy is enforced without exception.

Honor Code: Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not properly cited will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will aggressively pursue all suspected cases of Honor Code violations, and they will be handled through official University channels.

Datasets:

This assignment includes familiar datasets from past assignments (World Cup and Titanic). See setup instructions below.

Setup Instructions:

We will primarily be using Instabase for this assignment, although the first part of the assignment will require Google Sheets.

  1. For part 1, you will need to open Teams.csv and/or Players.csv in Google Sheets (see below for additional details).
  2. For parts 3 and 4, you will need the files found at the following link copied to your personal Instabase account: Assignment 4.
    To copy files to your personal Instabase account, select all the files (select the first one, then shift+click on the last one), then press Actions > Copy. Copy the 5 files into a folder in your own repository. You now have a private copy of the assignment to work on.
  3. Navigate to the folder where you copied the files. It should contain PythonMLAssign.ipynb and RAssign.ipynb, along with the necessary datasets. Right-click on the notebook you want to work on and select Open With > Jupyter. (If you simply double-click on it, Instabase will display the file but will not start Jupyter.) It can sometimes take a minute or so for a new Jupyter server to start up on your behalf. Once it does, you are ready to go! In each notebook you will see clearly where you need to add code for the different problems.

Submission Instructions:

There are three parts to submit on Gradescope: a combined PDF with your answers from parts 1 and 2 and your notebooks from parts 3 and 4, your .ipynb file from part 3, and your .ipynb file from part 4:
  1. Download each of your Jupyter Notebooks as PDFs. With your Notebook pulled up in Instabase, open the print menu for your browser (File > Print). Change the printer to "Save to PDF", and print (this saves your Notebook as a PDF file). Check that all of your answers are still there.
  2. Next, download each of your Jupyter Notebooks as .ipynb files. In the menu bar, choose File > Download As > Notebook (.ipynb).
  3. Merge a PDF with your answers from parts 1 and 2 with the PDFs of your Jupyter Notebooks. You can easily find free software online for merging PDF files.
  4. Go to the CS102 class in Gradescope and click on Assignment 4: Combined PDF.
  5. Upload your PDF and tag the pages corresponding to each question and your answer. You may submit as many times as you like before the submission deadline, and we will use your latest submission for both grading and the late policy.
  6. Go back to the Gradescope CS102 class and click on Assignment 4: PythonML .ipynb File. Upload the .ipynb file you downloaded.
  7. Go back to the Gradescope CS102 class and click on Assignment 4: R .ipynb File. Upload the .ipynb file you downloaded.

For more detailed instructions on submitting to Gradescope, take a look at the Gradescope FAQ.


Part 1: Regression using Google Sheets

Find correlations in the World Cup data: look in Teams.csv and in Players.csv.
If you use scatterplots, use linear trendlines only.
If you use the correl() or rsq() functions, apply them to entire columns only.

Problem 1. What is the strongest positive correlation you can find?

Problem 2. What is the strongest negative correlation you can find?

Note: Specify the two columns and either the r or R^2 value.
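
To make the difference between r and R^2 concrete, here is a minimal Python sketch of the computation. It is an illustration only, not part of the assignment, and Part 1 should still be done in Google Sheets. The helper name pearson_r and the two sample columns are made up for this example and do not come from Teams.csv or Players.csv. For a linear fit on two columns, R^2 is the square of r, so R^2 is never negative and does not show the direction of the relationship by itself.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for two equal-length columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("columns must have the same length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a column is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values chosen for illustration; not taken from the World Cup data.
col_a = [1, 2, 3, 4, 5]
col_b = [10, 8, 7, 4, 1]

r = pearson_r(col_a, col_b)
r_squared = r ** 2  # always between 0 and 1, so the sign of r is lost

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```

In this sketch r is negative while R^2 is positive. If you report R^2 for Problem 2, check the slope of the linear trendline (or the sign of r) to confirm the correlation is actually negative.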

Part 2: Naive Bayes on World Cup Data

See PDF Handout: A4Bayes.pdf

Part 3: Using Python for Machine Learning

See Notebook: PythonMLAssign.ipynb

Part 4: The R Language

See Notebook: RAssign.ipynb