Assignment 4 - Machine Learning and R

Logistics

Due Date: Monday, May 25th at 11:59pm

Policies

Late Policy: All assignments and projects are due at 11:59pm on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have four free late days they may use to turn in work late with no penalty: four 24-hour periods, no pro-rating. This late policy is enforced without exception.

Honor Code: Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will pursue aggressively all suspected cases of Honor Code violations, and they will be handled through official University channels.

Datasets:

This assignment includes familiar datasets from past assignments (Players.csv, Teams.csv, Titanic.csv). All of the data files should be included in our assignment repo.

Setup Instructions:

We will be using Instabase as usual.

You will need the files found at the following link copied to your personal Instabase account: Assignment 4. To copy files to your personal Instabase account, first select all of the files by pressing the "Shift" key while simultaneously clicking on each of the files. This will cause an "Actions" dropdown menu to appear above the file names. Click on this dropdown menu, choose "Copy To", and then copy the files over to your own private Instabase folder (you can create a new private folder at this step if you want). Once you've done this, you should have a private copy of the assignment to work on.
Navigate to the folder where you copied the files. It should contain PythonMLAssign.ipynb, RAssign.ipynb, and three familiar csv files. Right-click or control-click on a notebook and select Open With > Jupyter. (If you simply double-click on it, the file will be displayed but Jupyter will not start.) Sometimes it will take a minute or so for a new Jupyter server to start up on your behalf. Once it does, you are ready to go! In each notebook you will see clearly where you need to add code for the different steps of each problem.

Submission Instructions:

There are three parts to submit on Gradescope: a combined PDF with your answers from parts 1 and 2, your .ipynb file from part 3, and your .ipynb file from part 4:

Download each of your Jupyter Notebooks as .ipynb files. In the menu bar, choose File > Download As > Notebook (.ipynb). Please be sure to remove all print statements you used for debugging from your Notebook before submitting. Note: Please make sure the downloaded file name ends with .ipynb and not .json, .html, or another extension. Additionally, please be sure to Run All Cells before submitting. We will take points off if this is not the case. If you are having trouble with the .ipynb format, it may be easier to download these notebooks with Google Chrome.
Please submit a PDF with your answers from parts 1 and 2. If necessary, you can easily find free software online for merging PDF files.
Go to the CS102 class in Gradescope and click on Assignment 4: Parts 1 and 2: Combined PDF.
Upload your PDF and tag the pages corresponding to each question and your answer.
Go back to the Gradescope CS102 class and click on Assignment 4: PythonML .ipynb File. Upload the .ipynb file you downloaded.
Go back to the Gradescope CS102 class and click on Assignment 4: R .ipynb File. Upload the .ipynb file you downloaded.
Make sure that you submit .ipynb files and not .json or .html files!
Please be sure to tag the pages of your assignment (particularly for Parts 1 and 2); otherwise we will take points off your submission.
Lastly, please be sure to run all cells of your .ipynb before submission and delete all print statements you used for debugging; otherwise we will take up to 10% off! You can check your .ipynb submissions by clicking on your submission and then clicking Code on the top right. This will display your notebook so you can look over it carefully and make sure you don't have extra print statements or cells that were not run.

For more detailed instructions on submitting to Gradescope, take a look at the Gradescope FAQ.

Part 1: Regression using Google Sheets

Find correlations in the World Cup data: find correlations in Teams.csv and find correlations in Players.csv.
If you use scatterplots, use linear trendlines only.
If you use the correl() or rsq() functions, apply them to entire columns only.

Problem 1. What is the strongest positive correlation you can find?

Problem 2. What is the strongest negative correlation you can find?

Note: Specify the two columns and either the r or R^2 value.
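
When choosing whether to report r or R^2, it may help to remember that for a linear trendline on two columns R^2 is simply r squared, so R^2 by itself does not tell you whether the correlation is positive or negative. The following is a minimal Python sketch of that relationship. The numbers are made up for illustration and are not taken from Teams.csv or Players.csv, and pearson_r is a hypothetical helper, not something provided by the assignment.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r of two equal-length columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up columns (NOT the World Cup data).
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 5, 4, 6]          # tends to rise with xs
ys_down = [-y for y in ys_up]    # same shape, flipped to fall with xs

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)

# R^2 is the square of r, so both columns give the same R^2
# even though one correlation is positive and the other negative.
print("r (rising):", r_up, " R^2:", r_up ** 2)
print("r (falling):", r_down, " R^2:", r_down ** 2)
```

If you report R^2 for Problem 2, check the slope of the trendline (or the sign of r) to confirm that the correlation is actually negative.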

Part 2: Naive Bayes on World Cup Data

See second slide of PDF Handout: A4Bayes.pdf

Note: Make sure to combine Part 1 and Part 2 into a single PDF for submission.

Part 3: Using Python for Machine Learning

See Notebook in Instabase Repo: PythonMLAssign.ipynb

Note: For the minutes-passes linear regression, print how many passes you predict for 100, 200, and 300 minutes. Similarly, for the interactive number-of-passes predictor question, report the predicted passes for Kuzmanovic, Messi, and Kadir.
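
As a rough illustration of what "predict passes for 100, 200, and 300 minutes" means, here is a minimal sketch of fitting a least-squares line and evaluating it at those three values. It is written in plain Python with made-up numbers; the notebook may use a different library and will use the real Players.csv data, so treat fit_line and the sample values below as assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up sample (NOT from Players.csv): exactly one pass per three minutes.
minutes = [90, 180, 270, 360]
passes = [30, 60, 90, 120]

slope, intercept = fit_line(minutes, passes)

# Evaluate the fitted line at the minute values named in the note above.
predictions = {m: slope * m + intercept for m in (100, 200, 300)}
for m, p in predictions.items():
    print(f"{m} minutes -> {p:.1f} predicted passes")
```

With real data the points will not fall exactly on a line, so the predictions will be estimates rather than exact values.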

Part 4: The R Language

See Notebook in Instabase Repo: RAssign.ipynb