
Machine Learning Datasets

May 19th, 2016

Update (May/19): We fixed the number of features in the Ancestry dataset.



Figure: You will program classifiers for heart disease, predicting ancestry and predicting movie taste.


This page describes, in slightly more detail, the datasets for the machine learning programming part of the final CS109 assignment. All of the datasets are formatted in exactly the same way; see the problem set handout for details on the format. You don't need to know the details of the features or the prediction task to complete pset6. This information is provided simply to give you a deeper understanding of the tasks you are working on.


Heart Dataset

Task:

Your task is to assist a doctor in predicting whether or not a patient has heart disease (specifically, a myocardial perfusion diagnosis). Your prediction will be based on partial diagnoses made from images of different parts of the heart.

Values:

A heart is scanned and 2D images are generated for five different parts of the heart:
  • Area A: Near the heart's apex (4 ROIs)
  • Area B: In the middle of the left ventricle, or "LV" (5 ROIs)
  • Area C: Near the heart's base (5 ROIs)
  • Area D: In the center of the LV cavity for the horizontal long axis view (4 ROIs)
  • Area E: In the center of the LV cavity for the vertical long axis view (4 ROIs)

There are 4 or 5 regions of interest (ROIs) in each image. A cardiologist makes a partial diagnosis for each of the 22 ROIs. These partial diagnoses were mechanical to generate and could be performed by a trained nurse. 0 is a negative diagnosis (healthy), 1 is a positive diagnosis (unhealthy). Each column represents a partial diagnosis by a cardiologist for a particular ROI.

Figure: The 22 ROIs.
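As a quick sanity check on the feature count, the per-area ROI counts can be tallied in a few lines of Python. This is only a sketch: the names `ROIS_PER_AREA` and `area_for_column` are hypothetical, and the mapping from column index to area assumes the columns are ordered A through E, which this page does not state.

```python
# ROI counts for each heart area, as listed in the description.
ROIS_PER_AREA = {"A": 4, "B": 5, "C": 5, "D": 4, "E": 4}

# Total number of binary input features (one per ROI).
total_rois = sum(ROIS_PER_AREA.values())


def area_for_column(col):
    """Return the area letter for a 0-based feature column.

    ASSUMPTION: columns are ordered A, B, C, D, E. The real column
    order is not given on this page, so this mapping is hypothetical.
    """
    if col < 0:
        raise ValueError("column index must be non-negative")
    start = 0
    for area, count in ROIS_PER_AREA.items():
        if col < start + count:
            return area
        start += count
    raise ValueError("column index out of range")


print(total_rois)           # number of features per patient
print(area_for_column(21))  # area of the last column under the assumed order
```

Under the assumed ordering, a helper like this could be used to check whether positive partial diagnoses cluster in one area, but you do not need it to train a classifier.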

Prediction:

The variable you are predicting is the overall diagnosis of heart disease. The labels were generated by a team of cardiologists based on detailed analysis of each case.

This dataset was collected by Kurgan et al. and is hosted by the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/SPECT+Heart


Ancestry Dataset

Task:

Your task is to predict the ethnicity of a person who has sent in their DNA based on Single Nucleotide Polymorphisms (SNPs).

Values:

This dataset contains the genetic variation found in people sampled by the 1000 Genomes Project, which sequenced DNA from different ethnic groups around the world. Each input vector represents the DNA at specific locations in the genome for one individual. There are 20 binary input features, and each feature represents a particular location in the human genome. 0 indicates that the individual's DNA at the given location matches the human reference genome. 1 indicates that the individual's DNA does not match the human reference genome. Though the particular locations in DNA may have semantic meanings, for this task all you know is that each column is a distinct nucleotide index.

The output class value represents the super population (ethnicity) of each individual. The super populations contained in this dataset are East Asian and Ad Mixed American, encoded in binary. The training data set contains 283 data vectors, and the testing data set contains 184 data vectors.
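To make the 0/1 encoding concrete, here is a minimal Python sketch of how such features could be derived. It is not the pipeline used to build this dataset: the function `encode_snps`, the toy sequences, and the one-base-per-location simplification are all assumptions made for illustration.

```python
def encode_snps(individual, reference):
    """Encode each location as 0 (matches reference) or 1 (differs).

    ASSUMPTION: each location holds a single base for the individual.
    This is a simplification for illustration only.
    """
    if len(individual) != len(reference):
        raise ValueError("sequences must cover the same locations")
    return [0 if a == b else 1 for a, b in zip(individual, reference)]


# Hypothetical toy data: six locations instead of the dataset's 20.
reference = "ACGTAC"
individual = "ACTTAG"

features = encode_snps(individual, reference)
print(features)  # [0, 0, 1, 0, 0, 1]
```

The resulting list plays the role of one input vector: one binary value per genome location.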

Prediction:

The variable you are predicting is the super population of the user.

Thanks to Jim Notwell and Gill Bejerano from the Stanford Computer Science and Genetics departments for this dataset.


Netflix Dataset

Task:

Your task is to predict if a user would like Miss Congeniality based on their ratings for the 30 most rated movies. This data is from real users on Netflix.

Values:

Each row in the train and test sets represents one user. Each column represents one movie. All users in the dataset rated all movies in the dataset. Each entry in this dataset is binary. A value of 1 indicates a rating of 4 or 5 (they liked the movie). A value of 0 indicates a rating of 1, 2 or 3 (they didn't really like it). Each feature represents the rating for a particular movie:
  • 1, Independence Day (1996)
  • 2, The Patriot (2000)
  • 3, The Day After Tomorrow (2004)
  • 4, Pirates of the Caribbean: The Curse of the Black Pearl (2003)
  • 5, Pretty Woman (1990)
  • 6, Forrest Gump (1994)
  • 7, The Green Mile (1999)
  • 8, Con Air (1997)
  • 9, Twister (1996)
  • 10, Sweet Home Alabama (2002)
  • 11, Pearl Harbor (2001)
  • 12, Armageddon (1998)
  • 13, The Rock (1996)
  • 14, What Women Want (2000)
  • 15, Bruce Almighty (2003)
  • 16, Ocean's Eleven (2001)
  • 17, The Bourne Identity (2002)
  • 18, The Italian Job (2003)
  • 19, I, Robot (2004)
  • 20, American Beauty (1999)
  • 21, How to Lose a Guy in 10 Days (2003)
  • 22, Lethal Weapon 4 (1998)
  • 23, Shrek 2 (2004)
  • 24, Lost in Translation (2003)
  • 25, Top Gun (1986)
  • 26, Pulp Fiction (1994)
  • 27, Gone in 60 Seconds (2000)
  • 28, The Sixth Sense (1999)
  • 29, Lord of the Rings: The Two Towers (2002)
  • 30, Men of Honor (2000)
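The binarization rule described above (4 or 5 stars becomes 1, 1 to 3 stars becomes 0) can be sketched in Python as follows. The function name `binarize_rating` and the sample ratings are hypothetical; the provided data files are already binary, so you do not need to run this step yourself.

```python
def binarize_rating(stars):
    """Map a 1-5 star rating to the binary value used in this dataset."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    # 4 or 5 stars counts as "liked" (1); 1, 2 or 3 stars does not (0).
    return 1 if stars >= 4 else 0


# Hypothetical raw ratings from one user for five movies.
raw_ratings = [5, 3, 4, 1, 2]

features = [binarize_rating(s) for s in raw_ratings]
print(features)  # [1, 0, 1, 0, 0]
```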

Prediction:

The variable you are predicting is the binary value for the user's rating of Miss Congeniality (2000).

Credit: This dataset was curated by Chris Piech, but it is based on data originally made for the "Netflix Prize". The Netflix Prize data was initially retracted because of concerns over user privacy. Reed Hastings, the CEO of Netflix, gave the official thumbs up for CS109 to release this anonymized subset of data. Thanks to Matt Chen for his help in getting the Netflix Prize data.