Final Project

In the final project for this course, you will apply the techniques learned in this class to analyze a data set of personal interest to you. Your goal should be to create an original project that you would be proud to show off to a potential employer. You are encouraged to upload your project to GitHub.

Requirements

You are expected to work on this project with a partner (in a group of 2).

Stage 1. Collecting and Analyzing Data

  • The data that you analyze should be complex to collect or to clean in some way. All of the following would satisfy this requirement:
    • data that has to be scraped from a website or a REST API
    • textual data
    • geospatial data
    • data from multiple sources that has to be joined
    A CSV file that you downloaded from Kaggle would not satisfy this requirement.
  • Your analysis should tell a clear story through visuals. It is not enough to do data analysis; you must weave the analysis into a compelling story.
  • You are encouraged to try fitting machine learning models, but only if it fits with the story you want to tell.
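To make the complexity requirement concrete, here is a minimal sketch of one qualifying case from the list above: data from multiple sources that has to be joined. Everything in it is made up for illustration. The city names, the numbers, and the helper names `normalize` and `join_sources` are assumptions, and the JSON string only stands in for what a REST API might return. The point is that the join keys rarely match exactly across sources, so part of the work is cleaning them first.

```python
import csv
import io
import json

# Hypothetical source 1: CSV text, standing in for a downloaded file.
CSV_TEXT = """city,population
Springfield,120000
Shelbyville,80000
Ogdenville,45000
"""

# Hypothetical source 2: JSON shaped like a REST API response.
# Note the inconsistent casing and stray whitespace in the first record.
JSON_TEXT = """[
  {"name": "springfield ", "parks": 12},
  {"name": "Shelbyville", "parks": 5},
  {"name": "Capital City", "parks": 30}
]"""


def normalize(name):
    """Make keys comparable across sources: trim whitespace, ignore case."""
    return name.strip().lower()


def join_sources(csv_text, json_text):
    """Inner-join the two sources on the normalized city name."""
    parks_by_city = {
        normalize(rec["name"]): rec["parks"] for rec in json.loads(json_text)
    }
    joined = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = normalize(row["city"])
        if key in parks_by_city:  # keep only cities present in both sources
            joined.append({
                "city": row["city"],
                "population": int(row["population"]),
                "parks": parks_by_city[key],
            })
    return joined


rows = join_sources(CSV_TEXT, JSON_TEXT)
for r in rows:
    # Residents per park, a quantity that needs both sources.
    print(r["city"], r["population"] / r["parks"])
```

Cities that appear in only one source are dropped by this inner join. In a real project you would want to check how many rows are lost that way and why.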

Rubric

A project that meets our expectations will earn 8/10 in each category. To earn a higher score, you have to exceed our expectations. We cannot tell you how to do a great data science project, in the same way that there is no recipe for writing a great novel. We have taught you the tools; now your task is to do something surprising, creative, and human with them. This is the value of data science in an age of AI.

Research Question
  • 10 points: Interesting research question that could be the basis of a publication.
  • 8 points: Clear, well-motivated research question.
  • 6 points: Research question is fuzzy or not motivated.
  • 3 points: Research question is not well defined.
  • 0 points: No clear research question.

Data Collection
  • 10 points: Data collection is extraordinarily complex.
  • 8 points: Data collection meets the complexity requirement.
  • 6 points: Data collection was simplistic but challenging in some way.
  • 3 points: Superficial data collection (e.g., downloaded a CSV from Kaggle).
  • 0 points: No data collection.

Data Visualization
  • 10 points: Unusually appealing and/or insightful visualizations.
  • 8 points: Data visualizations were clean, labeled, and insightful.
  • 6 points: Visualizations were technically correct, but not insightful.
  • 3 points: Poor data visualizations that were incorrect (e.g., bar plot for a quantitative variable).
  • 0 points: No visualizations were provided.

Data Analysis
  • 10 points: Correctly applied a broad range of techniques from this class and perhaps some beyond this class, in technically challenging situations.
  • 8 points: Correctly applied a broad range of techniques from this class.
  • 6 points: Applied techniques incorrectly, or applied only a limited set of techniques.
  • 3 points: Data analysis was done, but the approach was fundamentally flawed.
  • 0 points: No data analysis.

Storytelling
  • 10 points: Weaved visualizations and analysis into a compelling story.
  • 8 points: Visualizations and analyses told a logical story.
  • 6 points: Visualizations and analyses seemed scattered, with the main thread unclear.
  • 3 points: Visualizations and analyses were not tied to a main thread.
  • 0 points: No attempt to tell a story.

Real-World Application
  • 10 points: Project generates insights with tangible real-world impact.
  • 8 points: Project generates insights that clearly have the potential to be useful.
  • 6 points: With some tweaking, project could have generated useful insights.
  • 3 points: The insights generated are not clearly useful.
  • 0 points: No insights were generated from this project.

Poster
  • 10 points: Poster goes above and beyond in integrating design with content.
  • 8 points: Poster communicates the content clearly, with a good balance of text and visuals.
  • 6 points: Poster content is satisfactory, but a bit lacking in professionalism (e.g., too much text, blurry images).
  • 3 points: Poster layout is sloppy.
  • 0 points: No poster was made.

Presentation
  • 10 points: Memorable presentation that was engaging. Fielded tough questions.
  • 8 points: Clear, concise high-level summary of the poster (3 minutes or less) and answered questions well.
  • 6 points: One of the following issues: high-level summary was unclear or too long, questions were not answered well.
  • 3 points: Multiple issues: high-level summary was unclear or too long, questions were not answered well.
  • 0 points: Did not attend presentation session.

Where to Find Datasets

The best data set is one that you are passionate about. I recommend that you start by finding a question you want to answer and then finding data to answer that question, rather than starting with a data set. That said, here are some helpful websites with large collections of data.

Example Projects

Posters and GitHub Repositories