CS224N Section 3
Corpora, et cetera
Bill MacCartney
22 April 2005

These are very sketchy notes from the section -- we spent most of our time
visiting websites, looking through AFS corpora directories, and playing with
tools like tgrep.


* Project Ideas ============================================================

See http://nlp.stanford.edu/courses/cs224n/
See the project ideas from the handout.
Don't restrict your attention to topics we cover in the first half of the
quarter!


* Corpora ==================================================================

Tons of resources!  http://www.stanford.edu/dept/linguistics/corpora/
Corpus TA: Neil Snider
LDC: http://www.ldc.upenn.edu/Catalog/
Stanford corpora: /afs/ir/data/linguistic-data/

** Treebanks ---------------------------------------------------------------

*** Penn Treebank ..........................................................

There are two releases, PTB2 and PTB3.  Use PTB3.

Penn Treebank parsed WSJ trees:
  /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/

Penn Treebank tag set:
  http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
  http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html

PTB contains:
  50,000 sentences (1,000,000 words) of WSJ text from 1989
  30,000 sentences (400,000 words) of the Brown corpus

PTB WSJ contains sections 0 through 24, ~2400 sentences each, but section 24
is half size -- ~50,000 sentences total.

Convention in the parsing world:
  sections 2-21:          training (39,832 sentences)
  section 0, 22, or 24:   development testing
  section 23:             final test data
Sections 0 and 1 are perceived to be less reliable -- annotators warming up.

PTB3 adds some new material vs. PTB2, but NO BUG FIXES.

*** Other parsed corpora ...................................................
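(Aside on the PTB splits above: the train/dev/test convention is easy to get
wrong, so here is a small sketch of it in Python. The function name is mine,
and lumping sections 0, 22, and 24 together as "dev" is a simplification --
in practice papers pick one of the three.)

```python
def wsj_split(section):
    """Map a WSJ section number (0-24) to its conventional role
    in the parsing literature (sketch, not any standard API)."""
    if not 0 <= section <= 24:
        raise ValueError("WSJ sections run from 0 to 24")
    if 2 <= section <= 21:
        return "train"          # 39,832 sentences
    if section == 23:
        return "test"           # final test data -- don't peek!
    if section in (0, 22, 24):
        return "dev"            # development testing
    return "unused"             # section 1: annotators warming up
```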
Switchboard: telephone conversations
NEGRA: German newspaper text
TIGER
ICE-GB

** POS-tagged corpora ------------------------------------------------------

PTB
BNC
  100m words
  wide sample of British English: newspapers, books, letters

** Speech ------------------------------------------------------------------

BNC
  10m words

** Dialog ------------------------------------------------------------------

Penn Treebank Switchboard corpus

** Foreign languages -------------------------------------------------------

Penn Arabic Treebank Corpus
  734 stories (140,000 words)
Penn Chinese Treebank Corpus
  50,000 sentences
NEGRA corpus
  20,000 sentences (350,000 words) of German newspaper text
  syntactically annotated (parsed)
  tgrep2able
TIGER corpus
  40,000 sentences (700,000 words)
  syntactically annotated (parsed)

** Spam --------------------------------------------------------------------

** Semantics ---------------------------------------------------------------

WordNet
  150,000 nouns, verbs, adjectives, and adverbs grouped into "synsets"
    with glosses and sentence frames
  includes a hypernym (kind-of) hierarchy rooted at 'entity'
  also antonyms, holonyms & meronyms, polysemy
  good tutorial: http://www.brians.org/Projects/Technology/Papers/Wordnet/
PropBank
  adds predicate-argument relations to PTB syntax trees
  100,000 annotated verb tokens, 3,200 types
  covers all 1m words of the WSJ section except be, do, have
FrameNet
  similar

** Other -------------------------------------------------------------------

Question answering: TREC
WSD: Senseval
The web!  Google Web API, Google scraping.  Wikipedia.


* Software =================================================================

** Corpus tools ------------------------------------------------------------

*** Tgrep2 .................................................................

Go to /afs/ir/data/linguistic-data/Treebank/tgrep2able/
Read the README.

To use tgrep2:

- log in to e.g. firebird or raptor -- not elaine!
  (tgrep2 needs Linux)

- put tgrep2 in your path:
    alias tgrep2 /afs/ir/data/linguistic-data/bin/linux_2_4/tgrep2
    setenv PATH /afs/ir/data/linguistic-data/bin/linux_2_4:$PATH

- go to the tgrep2able data directory:
    /afs/ir/data/linguistic-data/Treebank/tgrep2able
  It holds preprocessed tgrep2able versions of:
    WSJ, Brown, Switchboard, NEGRA, Chinese Treebank

- try some commands:
    tgrep2 -c wsj_mrg.t2c.gz 'Greenspan'
    tgrep2 -c wsj_mrg.t2c.gz 'NP < VP' | less

- operators:
    A < B      A immediately dominates B
    A << B     A dominates B
    A <- B     B is the last child of A
    A <<, B    B is a leftmost descendant of A
    A <<` B    B is a rightmost descendant of A
    A . B      A immediately precedes B
    A .. B     A precedes B
    A $ B      A and B are sisters
    A $. B     A and B are sisters and A immediately precedes B
    A $.. B    A and B are sisters and A precedes B

- useful options:
    -t    show only terminal nodes
    -w    show the tree for the whole sentence
    -a    show all matches in a sentence

- a neat example, searching for NPs that are coordinations of plural nouns:
    tgrep2 -at '/NP*/ <1 NNS <2 (CC < and) <3 NNS'

- environment variables:
    setenv TGREP2_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz

- check the manual:
    /afs/ir/data/linguistic-data/Treebank/tgrep2able/tgrep2-manual.pdf

*** Other search tools .....................................................
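(Aside: the structural matching that tgrep2 and tools like it perform can be
sketched in miniature over a PTB-style bracketing. This is a hand-rolled toy
implementing only the `A < B` operator, not any tool's real API.)

```python
import re

def parse(s):
    """Parse a PTB-style bracketing like '(S (NP ...) (VP ...))'
    into nested (label, children) tuples; leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def walk(i):
        label = tokens[i + 1]           # tokens[i] is the "("
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1
    tree, _ = walk(0)
    return tree

def immediately_dominates(tree, a, b):
    """True if some node labeled a has a child labeled b (tgrep2's A < B)."""
    label, children = tree
    if label == a and any(isinstance(c, tuple) and c[0] == b for c in children):
        return True
    return any(isinstance(c, tuple) and immediately_dominates(c, a, b)
               for c in children)
```

So `immediately_dominates(t, 'NP', 'NN')` plays the role of the query
`NP < NN`; the other operators would be variations on the same recursion.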
Tregex
View: searches the BNC
  http://view.byu.edu/

** Parsers -----------------------------------------------------------------

The Stanford Parser
  several parsers in one: PCFG parser, lexicalized PCFG parser,
    dependency parser
  also a German parser based on NEGRA
  also a Chinese parser
  Java 1.4, includes source code
  can be controlled with a simple shell script
  produces PTB-style parses for sentences
Collins's parser
Charniak's parser
MiniPar

** Taggers -----------------------------------------------------------------

Stanford POS tagger

** Language models ---------------------------------------------------------

CMU-Cambridge Statistical Language Modeling toolkit

** Named entity recognizers ------------------------------------------------

LingPipe

** Machine learning tools --------------------------------------------------

Stanford Classifier
  conditional loglinear (aka maxent) model for classification
LIBSVM
SVMlight
countless open-source libraries/tools for Bayesian learning, neural
  networks, decision trees, loglinear models, etc.
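(To give a feel for the conditional loglinear (maxent) models mentioned
above, here is a toy binary version trained by stochastic gradient ascent on
the conditional log-likelihood. The feature names and data are invented for
the example; this is a sketch, not the Stanford Classifier's API.)

```python
import math

def train_loglinear(data, epochs=200, lr=0.5):
    """Train a tiny binary conditional loglinear (maxent) model,
    P(y=1|x) = 1 / (1 + exp(-w.x)), by stochastic gradient ascent
    on the conditional log-likelihood.  data: [(feature_dict, 0 or 1)]."""
    w = {}
    for _ in range(epochs):
        for feats, y in data:
            z = sum(w.get(f, 0.0) * v for f, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))    # model's current P(y=1|x)
            for f, v in feats.items():        # gradient step: (y - p) * x
                w[f] = w.get(f, 0.0) + lr * (y - p) * v
    return w

def classify(w, feats):
    """Return the more probable label under the trained weights."""
    z = sum(w.get(f, 0.0) * v for f, v in feats.items())
    return 1 if z > 0 else 0
```

The multiclass case generalizes this by keeping one weight vector per class
and normalizing with a softmax instead of the logistic function.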