Final Project

In the final project for this course, you will apply the techniques learned in this class to analyze a data set of personal interest to you. Your goal should be to create an original project that you would be proud to show off to a potential employer. You are encouraged to upload your project to GitHub.

Requirements

You must work on this project with a partner (in a group of 2).

  • The data that you analyze should be complex to collect or to clean in some way. All of the following would satisfy this requirement:
    • data that has to be scraped from a website or a REST API
    • textual data
    • geospatial data
    • data from multiple sources that has to be joined
    A CSV file that you downloaded from Kaggle would not satisfy this requirement.
  • Your analysis should tell a clear story through visuals. It is not enough to do data analysis; you must weave the analysis into a compelling story.
  • You are encouraged to try fitting machine learning models, but only if they fit the story you want to tell.
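
To give a sense of what "complex to collect or to clean" can look like in practice, here is a minimal Python sketch of one case from the list above: joining data from two sources. It is only an illustration under stated assumptions, not a required approach. The two sources, the field names, the numbers, and the helper names (`normalize`, `join_sources`) are all hypothetical. The JSON string stands in for what a REST API might return, and the CSV string stands in for a file from a second source.

```python
import csv
import io
import json

# Hypothetical stand-ins for two sources. All names and values are made up
# for illustration; a real project would fetch or load real data instead.
API_RESPONSE = (
    '[{"city": "Palo Alto", "stations": 3},'
    ' {"city": "Menlo Park", "stations": 1}]'
)
CSV_EXPORT = "City,Population\n palo alto ,68000\nMenlo Park,33000\nAtherton,7000\n"


def normalize(name):
    """Clean a join key: trim surrounding whitespace and ignore case."""
    return name.strip().lower()


def join_sources(api_text, csv_text):
    """Inner-join the two sources on the cleaned city name."""
    # Index the JSON records by cleaned key for fast lookup.
    stations = {normalize(r["city"]): r["stations"] for r in json.loads(api_text)}
    joined = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = normalize(row["City"])
        if key in stations:  # keep only cities present in both sources
            joined.append({
                "city": key,
                "population": int(row["Population"]),
                "stations": stations[key],
            })
    return joined


rows = join_sources(API_RESPONSE, CSV_EXPORT)
for row in rows:
    print(row)
```

Even in this toy case, the keys do not match until they are cleaned, and one city appears in only one source. Deciding how to handle mismatches like these is the kind of work the complexity requirement has in mind.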

Then, you will turn your work into a poster. Here is a template that you can (but are not required to) use. Your poster will need to be printed on 24" x 36" paper. The most basic printing options are fine (e.g., matte paper, no lamination, not mounted). We will supply a board and an easel during the poster session; all you need to bring is the poster.

Here are some printing options:

  • For students presenting in the Friday session, we will print and bring your poster for you if it is uploaded to Canvas by 9 AM on Wednesday, 3/18.
  • Copy Factory at 3929 El Camino Real: $29.46 with Stanford discount (use coupon code STANFORD at checkout)
  • FedEx at 249 California Ave: $34.05 with Stanford discount (e-mail poster to usa5101@fedex.com and place order in person, showing your student ID)
  • Staples at 700 Menlo Park: $36.05
  • Tech Desk on campus: $50 plus tax

You will present this poster at one of two sessions. (The second session is the registrar-scheduled final exam time for this course. The first session is provided as a convenience for students with conflicts.)

Please sign up here for a poster session.

You will also upload your poster and code to Canvas.

Rubric

Each criterion is scored at one of the levels listed below.

  • Research Question
    • 10 points: Interesting research question that could be the basis of a publication.
    • 8 points: Clear, well-motivated research question.
    • 6 points: Research question is fuzzy or not motivated.
    • 3 points: Research question is not well defined.
    • 0 points: No clear research question.
  • Data Collection
    • 10 points: Data collection is extraordinarily complex.
    • 8 points: Data collection meets the complexity requirement.
    • 6 points: Data collection was simplistic but challenging in some way.
    • 3 points: Superficial data collection (e.g., downloaded data set from Kaggle).
    • 0 points: No data collection.
  • Data Visualization
    • 10 points: Unusually appealing and/or insightful visualizations.
    • 8 points: Data visualizations were clean, labeled, and insightful.
    • 6 points: Visualizations were technically correct, but not insightful.
    • 3 points: Poor data visualizations that were incorrect (e.g., bar plot for a quantitative variable).
    • 0 points: No visualizations were provided.
  • Data Analysis
    • 10 points: Correctly applied a broad range of techniques from this class and perhaps a few beyond this class, in technically challenging situations.
    • 8 points: Correctly applied a broad range of techniques from this class.
    • 6 points: Applied techniques incorrectly, or applied only a limited set of techniques.
    • 3 points: Data analysis was done, but the approach was fundamentally flawed.
    • 0 points: No data analysis.
  • Storytelling
    • 10 points: Wove visualizations and analysis into a compelling story.
    • 8 points: Visualizations and analyses told a coherent story.
    • 6 points: Visualizations and analyses seemed scattered, with the main thread unclear.
    • 3 points: Visualizations and analyses were not tied to a main thread.
    • 0 points: No attempt to tell a story.
  • Real-World Application
    • 10 points: Project generates insights with immediate real-world impact.
    • 8 points: Project generates insights that clearly have the potential to be useful.
    • 6 points: With some tweaking, project could have generated useful insights.
    • 3 points: The insights generated are not clearly useful.
    • 0 points: No insights were generated from this project.
  • Poster
    • 10 points: Poster goes above and beyond.
    • 8 points: Poster is clean, with a good balance of text and visuals.
    • 6 points: Poster content is satisfactory, but a bit lacking in professionalism (e.g., too much text, blurry images).
    • 3 points: Poster layout is sloppy.
    • 0 points: No poster was made.
  • Presentation
    • 10 points: Presentation was highly engaging and memorable. Fielded tough questions.
    • 8 points: Gave a good summary of the poster and answered questions well.
    • 6 points: Presentation was unclear, or speakers had difficulty answering questions.
    • 3 points: Presentation was unclear, and speakers had difficulty answering questions.
    • 0 points: Did not attend presentation session.
  • Peer Reviews
    • 10 points: Completed required peer reviews and provided insightful feedback that even the instructors missed.
    • 8 points: Completed required peer reviews and provided good feedback about each poster.
    • 6 points: Completed required peer reviews, but provided perfunctory feedback.
    • 3 points: Completed some, but not all peer reviews. Feedback was perfunctory.
    • 0 points: Did not complete peer reviews.
  • Submission
    • 10 points: Poster and code submitted on time, well-organized.
    • 0 points: Did not submit poster or code.

If you have a different idea for a data science project that does not fit neatly with the above requirements, please talk to Professor Sun.

Where to Find Datasets

The best data set is one that you are passionate about. I recommend that you start by finding a question you want to answer and then finding data to answer that question, rather than starting with a data set. That said, here are some helpful websites with large collections of data.

Example Projects

Posters and GitHub Repositories