Project in Mining Massive Data Sets
When dealing with these datasets, please be careful and responsible. The datasets are meant to be used strictly for the purposes of the class project and nothing else. This means: (1) Do not do anything "funny" with the datasets; (2) Do not try to break the anonymization; (3) Do not share the data outside the class; (4) Do not copy the data off Amazon EC2; (5) After the class is over, destroy all data.
Stanford CS341 only datasets
- See the slides from the QA session for more information about what datasets are available to work with.
- Additional information about the Yume (i.e., Cookieless Fingerprinting) dataset is now available.
- A full breakdown of the fields available in the Synapse.org (i.e., Voice Analysis for Parkinson's) dataset is also available.
Let us know if you need more info on these datasets. We will upload the datasets to EC2.
- DBpedia. Richly labeled network containing data extracted
from Wikipedia (based on infoboxes), with multiple types of nodes and edges.
About 2.6 million concepts described by 247 million triples, including
abstracts in 14 different languages. http://dbpedia.org.
Other OpenLinkedData datasets are available at http://esw.w3.org/DataSetRDFDumps.
Some project ideas:
- Detecting missing links (and relation types)
- Classifying nodes into the ontology
Antonellis and Jawed Karim offer a file that contains information about
the search queries that were used to reach pages on the Stanford Web
server. See http://www.stanford.edu/~antonell/tags_dataset.html
- SNAP network datasets. 60 large social and information network datasets
- Ratings and purchases (movies, music, etc.)
- Yahoo! Webscope Catalog of datasets
- Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, Competition Data
Jure Leskovec will have to apply for any sets you want, and we must
agree not to distribute them further. There may be a delay, so get
requests in early.
- 1.usa.gov data set
It would enable questions around link propagation, half-life by
referrer, geographical analysis, and I'm sure a ton of other fun