CS224N Section 3
Corpora, et cetera
Bill MacCartney
22 April 2005

These are very sketchy notes from the section -- we spent most of our time
visiting websites, looking through AFS corpora directories, and playing with
tools like tgrep.


* Project Ideas ============================================================

See http://nlp.stanford.edu/courses/cs224n/
See the project ideas from the handout.
Don't restrict your attention to topics we cover in the first half of the
quarter!


* Corpora ==================================================================

Tons of resources!  http://www.stanford.edu/dept/linguistics/corpora/
Corpus TA: Neil Snider
LDC: http://www.ldc.upenn.edu/Catalog/
Stanford corpora: /afs/ir/data/linguistic-data/

** Treebanks ---------------------------------------------------------------

*** Penn Treebank ..........................................................

There are two releases, PTB2 and PTB3.  Use PTB3.

Penn Treebank parsed WSJ trees:
  /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/

Penn Treebank tag set:
  http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
  http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html

PTB contains:
  50,000 sentences (1,000,000 words) of WSJ text from 1989
  30,000 sentences (400,000 words) of the Brown corpus

PTB WSJ contains sections 0 through 24, ~2400 sentences each, but section 24
is half size -- ~50,000 sentences total.

Convention in the parsing world:
  sections 2-21:          training (39,832 sentences)
  section 0, 22, or 24:   development testing
  section 23:             final test data
Sections 0 and 1 are perceived to be less reliable -- annotators warming up.

PTB3 adds some new material vs. PTB2, but NO BUG FIXES.

*** Other parsed corpora ...................................................
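(Aside on the PTB splits above: the train/dev/test convention is easy to get
wrong, so here is a small sketch of it in Python. The function name is mine,
and lumping sections 0, 22, and 24 together as "dev" is a simplification --
in practice papers pick one of the three.)

```python
def wsj_split(section):
    """Map a WSJ section number (0-24) to its conventional role
    in the parsing literature (sketch, not any standard API)."""
    if not 0 <= section <= 24:
        raise ValueError("WSJ sections run from 0 to 24")
    if 2 <= section <= 21:
        return "train"          # 39,832 sentences
    if section == 23:
        return "test"           # final test data -- don't peek!
    if section in (0, 22, 24):
        return "dev"            # development testing
    return "unused"             # section 1: annotators warming up
```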
Switchboard: telephone conversations
NEGRA: German newspaper text
TIGER
ICE-GB

** POS-tagged corpora ------------------------------------------------------

PTB
BNC
  100m words
  wide sample of British English: newspapers, books, letters

** Speech ------------------------------------------------------------------

BNC
  10m words

** Dialog ------------------------------------------------------------------

Penn Treebank Switchboard corpus

** Foreign languages -------------------------------------------------------

Penn Arabic Treebank Corpus
  734 stories (140,000 words)
Penn Chinese Treebank Corpus
  50,000 sentences
NEGRA corpus
  20,000 sentences (350,000 words) of German newspaper text
  syntactically annotated (parsed)
  tgrep2able
TIGER corpus
  40,000 sentences (700,000 words)
  syntactically annotated (parsed)

** Spam --------------------------------------------------------------------

** Semantics ---------------------------------------------------------------

WordNet
  150,000 nouns, verbs, adjectives, and adverbs grouped into "synsets"
    with glosses and sentence frames
  includes a hypernym (kind-of) hierarchy rooted at 'entity'
  also antonyms, holonyms & meronyms, polysemy
  good tutorial: http://www.brians.org/Projects/Technology/Papers/Wordnet/
PropBank
  adds predicate-argument relations to PTB syntax trees
  100,000 annotated verb tokens, 3,200 types
  covers all 1m words of the WSJ section except be, do, have
FrameNet
  similar

** Other -------------------------------------------------------------------

Question answering: TREC
WSD: Senseval
The web!  Google Web API, Google scraping.  Wikipedia.


* Software =================================================================

** Corpus tools ------------------------------------------------------------

*** Tgrep2 .................................................................

Go to /afs/ir/data/linguistic-data/Treebank/tgrep2able/
Read the README.

To use tgrep2:

- log in to e.g. firebird or raptor -- not elaine!
  (tgrep2 needs Linux)

- put tgrep2 in your path:
    alias tgrep2 /afs/ir/data/linguistic-data/bin/linux_2_4/tgrep2
    setenv PATH /afs/ir/data/linguistic-data/bin/linux_2_4:$PATH

- go to the tgrep2able data directory:
    /afs/ir/data/linguistic-data/Treebank/tgrep2able
  It holds preprocessed tgrep2able versions of:
    WSJ, Brown, Switchboard, NEGRA, Chinese Treebank

- try some commands:
    tgrep2 -c wsj_mrg.t2c.gz 'Greenspan'
    tgrep2 -c wsj_mrg.t2c.gz 'NP < VP' | less

- operators:
    A < B      A immediately dominates B
    A << B     A dominates B
    A <- B     B is the last child of A
    A <<, B    B is a leftmost descendant of A
    A <<` B    B is a rightmost descendant of A
    A . B      A immediately precedes B
    A .. B     A precedes B
    A $ B      A and B are sisters
    A $. B     A and B are sisters and A immediately precedes B
    A $.. B    A and B are sisters and A precedes B

- useful options:
    -t    show only terminal nodes
    -w    show the tree for the whole sentence
    -a    show all matches in a sentence

- a neat example, searching for NPs that are coordinations of plural nouns:
    tgrep2 -at '/NP*/ <1 NNS <2 (CC < and) <3 NNS'

- environment variables:
    setenv TGREP2_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz

- check the manual:
    /afs/ir/data/linguistic-data/Treebank/tgrep2able/tgrep2-manual.pdf

*** Other search tools .....................................................
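(Aside: the structural matching that tgrep2 and tools like it perform can be
sketched in miniature over a PTB-style bracketing. This is a hand-rolled toy
implementing only the `A < B` operator, not any tool's real API.)

```python
import re

def parse(s):
    """Parse a PTB-style bracketing like '(S (NP ...) (VP ...))'
    into nested (label, children) tuples; leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def walk(i):
        label = tokens[i + 1]           # tokens[i] is the "("
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1
    tree, _ = walk(0)
    return tree

def immediately_dominates(tree, a, b):
    """True if some node labeled a has a child labeled b (tgrep2's A < B)."""
    label, children = tree
    if label == a and any(isinstance(c, tuple) and c[0] == b for c in children):
        return True
    return any(isinstance(c, tuple) and immediately_dominates(c, a, b)
               for c in children)
```

So `immediately_dominates(t, 'NP', 'NN')` plays the role of the query
`NP < NN`; the other operators would be variations on the same recursion.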
Tregex
View: searches the BNC
  http://view.byu.edu/

** Parsers -----------------------------------------------------------------

The Stanford Parser
  several parsers in one: PCFG parser, lexicalized PCFG parser,
    dependency parser
  also a German parser based on NEGRA
  also a Chinese parser
  Java 1.4, includes source code
  can be controlled with a simple shell script
  produces PTB-style parses for sentences
Collins's parser
Charniak's parser
MiniPar

** Taggers -----------------------------------------------------------------

Stanford POS tagger

** Language models ---------------------------------------------------------

CMU-Cambridge Statistical Language Modeling toolkit

** Named entity recognizers ------------------------------------------------

LingPipe

** Machine learning tools --------------------------------------------------

Stanford Classifier
  conditional loglinear (aka maxent) model for classification
LIBSVM
SVMlight
countless open-source libraries/tools for Bayesian learning, neural
  networks, decision trees, loglinear models, etc.
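(To give a feel for the conditional loglinear (maxent) models mentioned
above, here is a toy binary version trained by stochastic gradient ascent on
the conditional log-likelihood. The feature names and data are invented for
the example; this is a sketch, not the Stanford Classifier's API.)

```python
import math

def train_loglinear(data, epochs=200, lr=0.5):
    """Train a tiny binary conditional loglinear (maxent) model,
    P(y=1|x) = 1 / (1 + exp(-w.x)), by stochastic gradient ascent
    on the conditional log-likelihood.  data: [(feature_dict, 0 or 1)]."""
    w = {}
    for _ in range(epochs):
        for feats, y in data:
            z = sum(w.get(f, 0.0) * v for f, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))    # model's current P(y=1|x)
            for f, v in feats.items():        # gradient step: (y - p) * x
                w[f] = w.get(f, 0.0) + lr * (y - p) * v
    return w

def classify(w, feats):
    """Return the more probable label under the trained weights."""
    z = sum(w.get(f, 0.0) * v for f, v in feats.items())
    return 1 if z > 0 else 0
```

The multiclass case generalizes this by keeping one weight vector per class
and normalizing with a softmax instead of the logistic function.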