Assignment 5


Due Date: Thursday, March 14th at 11:59 PM


Late Policy: All assignments and projects are due at 11:59pm on the due date. Each assignment and project may be turned in up to 24 hours late for a 10% penalty and up to 48 hours late for a 30% penalty. No assignments or projects will be accepted more than 48 hours late. Students have five free late days they may use to turn in work late with no penalty: five 24-hour periods, no pro-rating. This late policy is enforced without exception.

Honor Code: Under the Honor Code at Stanford, you are expected to submit your own original work for assignments, projects, and exams. On many occasions when working on assignments or projects (but never exams!) it is useful to ask others -- the instructor, the TAs, or other students -- for hints, or to talk generally about aspects of the assignment. Such activity is both acceptable and encouraged, but you must indicate on all submitted work any assistance that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding, writing up, and being able to explain all work that you submit. The course staff will pursue aggressively all suspected cases of Honor Code violations, and they will be handled through official University channels.


This assignment includes some familiar datasets from past assignments as well as some new ones. All of the data files are included in our assignment repo, so there's no need to have them available locally. You'll find descriptions of the new datasets in the assignment notebooks.

Setup Instructions:

We will be using Instabase again. The entire assignment will consist of Jupyter Notebooks with exercises that employ the skills we've learned in class.

  1. You will need the files found at the following link copied to your personal Instabase account: Assignment 5. To copy files to your personal Instabase account, click on the drop-down menu above the folder listing, choose Select All, then click on the Copy link that appears. You should now have a private copy of the assignment to work on.
  2. Navigate to the repository where you copied the files. The folder should contain 4 separate notebooks: MiningPythonAssign.ipynb, MiningSQLAssign.ipynb, NetworksAssign.ipynb, and UnstructuredAssign.ipynb. There is a different component of the assignment in each of these notebooks. To open a notebook, right-click or control-click on its file and select Open With > Jupyter. (If you simply double-click on it, it will show you the file but will not run Jupyter notebooks.) Sometimes it will take a minute or so for a new Jupyter server to start up on your behalf. Once it does, you are ready to go! In each notebook you will see clearly where you need to add code for the different steps of each problem.

Assignment Details:

There are 3 parts to this assignment: Data Mining, Network Analysis, and Unstructured Data (Text and Image). There is a Jupyter Notebook to complete for each of these 3 parts, all of which can be found in the Assignment 5 class Instabase drive. Note: You only need to submit ONE of the notebooks for the Data Mining part (either MiningPythonAssign.ipynb or MiningSQLAssign.ipynb). You are not required to do both, and you may choose whichever you prefer. There are also extra credit opportunities in some of the notebooks (MiningPythonAssign.ipynb has an extra credit question not available in MiningSQLAssign.ipynb). Start early! This is a longer assignment that covers a broad range of topics!

Part 1: Data Mining

Choose one of MiningPythonAssign.ipynb or MiningSQLAssign.ipynb

Dataset used: Movies.csv (not to be confused with Project #2 movies.tsv)

Part 2: Network Analysis

See Notebook: NetworksAssign.ipynb

Datasets used: Friends.csv, Follows.csv, Dolphins.csv, Dolphins2.csv, Follows2.csv

Part 3: Unstructured Data (Text and Image)

See Notebook: UnstructuredAssign.ipynb

Datasets used: Wines10K.csv, flags (folder)

Submission Instructions:

There are two parts to submit on Gradescope: 1) a combined PDF with your answers from parts 1, 2 and 3, and 2) a zip file containing all your .ipynb files from parts 1, 2 and 3.
  1. Download each of your Jupyter Notebooks as PDFs. With your Notebook pulled up in Instabase, open the print menu for your browser (File > Print). Change the printer to "Save to PDF", and print (this saves your Notebook as a PDF file). Check that all of your answers are still there.
  2. Next, download each of your Jupyter Notebooks as .ipynb files. In the menu bar, choose File > Download As > Notebook (.ipynb). Then compress all your .ipynb files into a single zip file.
  3. Merge the PDFs of your Jupyter Notebooks from parts 1, 2 and 3 into a single combined PDF containing all of your answers. You can easily find free software online for merging PDF files.
  4. Go to the CS102 class in Gradescope and click on Assignment 5: Combined PDF.
  5. Upload your PDF and tag the pages corresponding to each question and your answer. You may submit as many times as you like before the submission deadline, and we will use your latest submission for both grading and the late policy.
  6. Go back to the Gradescope CS102 class and click on Assignment 5: Combined .ipynb Files. Upload the zip file that contains all your .ipynb files.
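
If you prefer to build the zip file from step 2 with a script instead of your operating system's compress option, the following is a minimal sketch using only Python's standard library. It assumes the downloaded .ipynb files sit together in one local folder; the function name zip_notebooks and the archive name assignment5_notebooks.zip are hypothetical and not required by the course.

```python
import zipfile
from pathlib import Path


def zip_notebooks(folder, zip_name="assignment5_notebooks.zip"):
    """Compress every .ipynb file in `folder` into a single zip archive.

    `zip_name` is a hypothetical default; any file name can be used.
    Returns the archive path and the list of notebook names it contains.
    """
    folder = Path(folder)
    # Collect the notebooks in a stable, alphabetical order.
    notebooks = sorted(folder.glob("*.ipynb"))
    zip_path = folder / zip_name
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for notebook in notebooks:
            # arcname keeps only the file name, so the archive has no nested folders.
            archive.write(notebook, arcname=notebook.name)
    return zip_path, [notebook.name for notebook in notebooks]
```

For example, calling zip_notebooks(".") from the folder that holds your downloaded notebooks writes the archive into that same folder. Check the returned list of names to confirm that every notebook you intend to submit was included.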