Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2020
Christopher Potts

Assignment 6: Language dataset hackathon

Distributed 2020-11-09
Due 2020-11-16

Contents

  1. Overview
    1. Requirements
    2. Ideas
  2. Set-up
  3. Age of acquisition dataset
  4. Concreteness dataset
  5. Sentiment dataset
  6. Beautiful words
  7. Novels from Project Gutenberg
  8. Potentially useful code
    1. Project Gutenberg iterator
    2. Sentence tokenizing using NLTK
    3. Word counts
    4. egrep

Overview

Requirements

Ideas

Examples of things you might do (not meant to be restrictive!):

Set-up

Download the hackathon data distribution:

http://web.stanford.edu/class/linguist278/data/hackathon.zip

and unzip it in the same directory as this notebook. (If you want to put it somewhere else, just be sure to change data_home in the next cell.)
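A minimal sketch of the set-up cell; the filename passed to os.path.join is just an illustration, not an actual file in the distribution:

```python
import os

# Directory containing the unzipped hackathon files; edit this if you
# unzipped the archive somewhere else.
data_home = "hackathon"

# Build paths to files inside the distribution like this:
example_path = os.path.join(data_home, "example.txt")  # hypothetical filename
```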

Age of acquisition dataset

From Age-of-acquisition ratings for 30 thousand English words (Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert, Behavior Research Methods, 2014):

  1. Word: The word (str)
  2. OccurTotal: token count in their data
  3. OccurNum: The number of participants who gave an age-of-acquisition rating rather than answering "Unknown"
  4. Rating.Mean: mean age of acquisition, in years
  5. Rating.SD: standard deviation of the distribution of ages of acquisition
  6. Frequency: token count of Word in the SUBTLEX-US corpus
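A sketch of working with these columns in pandas. The tiny in-memory DataFrame below stands in for the real file (whose exact filename in the distribution isn't repeated here); the column names follow the overview above:

```python
import pandas as pd

# Toy stand-in for the age-of-acquisition ratings file:
aoa = pd.DataFrame({
    "Word": ["dog", "ephemeral"],
    "Rating.Mean": [2.5, 12.3],
    "Rating.SD": [0.8, 3.1],
})

# One natural first step: which words are learned earliest?
earliest = aoa.sort_values("Rating.Mean").head()
```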

Concreteness dataset

We've worked with this dataset before. It's presented in Concreteness ratings for 40 thousand generally known English word lemmas (Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman, Behavior Research Methods, 2014). Overview:

  1. Word: The word (str)
  2. Bigram: Whether it is a single word or a two-word expression
  3. Conc.M: The mean concreteness rating
  4. Conc.SD: The standard deviation of the concreteness ratings (float)
  5. Unknown: The number of persons indicating they did not know the word
  6. Total: The total number of persons who rated the word
  7. Percent_known: Percentage of participants who knew the word
  8. SUBTLEX: The SUBTLEX-US frequency count
  9. Dom_Pos: The part-of-speech where known
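A small sketch of filtering on these columns; the toy DataFrame is a stand-in for the real file, with column names taken from the overview above:

```python
import pandas as pd

# Toy stand-in for the concreteness ratings:
conc = pd.DataFrame({
    "Word": ["apple", "justice"],
    "Bigram": [0, 0],
    "Conc.M": [4.9, 1.5],
    "Percent_known": [1.0, 0.85],
})

# Restrict to single words that nearly all participants knew:
known = conc[(conc["Bigram"] == 0) & (conc["Percent_known"] >= 0.95)]
```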

Sentiment dataset

The dataset Norms of valence, arousal, and dominance for 13,915 English lemmas (Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert, Behavior Research Methods, 2013) contains rich sentiment information for more than 13K words. The following code reads in the full dataset and then restricts to just the mean ratings for the three core semantic dimensions:

  1. Word: The word (str)
  2. Valence (positive/negative)
  3. Arousal (intensity)
  4. Dominance
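A sketch of the column restriction described above. The column names ("V.Mean.Sum", etc.) follow the conventions of the published norms but should be checked against the file in the distribution; the toy DataFrame stands in for the real data:

```python
import pandas as pd

# Toy stand-in for the full valence/arousal/dominance norms:
ratings = pd.DataFrame({
    "Word": ["happy", "angry"],
    "V.Mean.Sum": [8.21, 2.53],
    "V.SD.Sum": [0.92, 1.64],
    "A.Mean.Sum": [6.05, 6.26],
    "D.Mean.Sum": [7.22, 4.11],
})

# Restrict to just the mean ratings for the three core dimensions:
means = ratings[["Word", "V.Mean.Sum", "A.Mean.Sum", "D.Mean.Sum"]]
```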

Beautiful words

I took the 100 Most Beautiful Words (of which there are 107) and enriched them:

  1. Word: The word (str).
  2. Pronunciation: CMU Pronouncing Dictionary representation.
  3. Morphology: Celex morphological representations.
  4. Frequency: frequency according to the Google N-gram Corpus.
  5. Category: 'most-beautiful' or 'regular'

The 'regular' examples are 107 randomly selected non-proper-names.

Maybe there's something interesting here?
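For instance, one could compare frequencies across the two categories. A sketch on toy data (the words and frequencies here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the enriched beautiful-words dataset:
words = pd.DataFrame({
    "Word": ["cynosure", "table", "ephemeral", "run"],
    "Frequency": [120, 900000, 4300, 2000000],
    "Category": ["most-beautiful", "regular", "most-beautiful", "regular"],
})

# Are "beautiful" words rarer than regular ones?
mean_freq = words.groupby("Category")["Frequency"].mean()
```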

Novels from Project Gutenberg

The Gutenberg metadata has been removed from these files, and the first line gives the title, author, and publication year in a systematic pattern.
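A sketch of parsing that first line with a regular expression. The exact pattern used in the distributed files may differ; this assumes a line shaped like "Emma, by Jane Austen (1815)", so check a file and adjust the regex accordingly:

```python
import re

# Hypothetical first-line format; verify against the actual files.
first_line = "Emma, by Jane Austen (1815)"
match = re.match(r"(.+), by (.+) \((\d{4})\)", first_line)
title, author, year = match.groups()
```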

Potentially useful code

Project Gutenberg iterator

You might want to modify this, depending on how you want to process these texts (by word? sentence? chapter?).
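A minimal line-level version of such an iterator, assuming the metadata occupies only the first line; the demo file created below stands in for one of the distributed novels:

```python
import os

def gutenberg_lines(path):
    """Yield the stripped, non-empty body lines of one novel, skipping
    the first line (which holds the title/author/year metadata)."""
    with open(path, encoding="utf8") as f:
        next(f)  # skip the metadata line
        for line in f:
            line = line.strip()
            if line:
                yield line

# Tiny demo on a file we create here:
with open("demo_novel.txt", "w", encoding="utf8") as f:
    f.write("Emma, by Jane Austen (1815)\n\nIt was a truth.\n")

body = list(gutenberg_lines("demo_novel.txt"))
```

To iterate by sentence or chapter instead, replace the per-line loop with a pass over the whole text.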

Sentence tokenizing using NLTK

Word counts

From assignment 2.
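A sketch in the spirit of that assignment, using collections.Counter with lowercase whitespace tokenization and a simple punctuation strip (the original assignment's tokenization may differ):

```python
from collections import Counter

def word_counts(text):
    """Count words after lowercasing, splitting on whitespace, and
    stripping common edge punctuation."""
    words = [w.strip('.,;:!?"\'').lower() for w in text.split()]
    return Counter(w for w in words if w)

counts = word_counts("The cat sat. The cat ran!")
```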

egrep

From assignment 5.
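A sketch of an egrep-style helper built on Python's re module, returning the lines that match an extended regular expression (the assignment 5 version may have a different signature):

```python
import re

def egrep(pattern, lines):
    """Return the lines in which the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

hits = egrep(r"colou?r", ["colour me", "red", "color wheel"])
```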