CS224N Section 3
Corpora, et cetera
Bill MacCartney
5 May 2006

Today we'll be visiting websites, looking through AFS corpora directories,
and playing with tools like tgrep.

Project Ideas ===================================================================

  See http://nlp.stanford.edu/courses/cs224n/
  See project ideas from the handout
  Don't restrict your attention to topics we cover in the first half of the
    quarter!

Corpora =========================================================================

  Tons of resources!  See links at http://nlp.stanford.edu/links/statnlp.html

Corpora@Stanford ------------------------------------------------------------

  http://www.stanford.edu/dept/linguistics/corpora/
  Corpus TA: Liz Coppock
  Stanford corpora: /afs/ir/data/linguistic-data/
    You need to belong to a special group for access
    Pi-chuan is gonna hook you up

LDC = Linguistic Data Consortium --------------------------------------------

  http://www.ldc.upenn.edu/Catalog/

Treebanks -------------------------------------------------------------------

  Penn Treebank ...........................................................

  There's PTB2 and PTB3.  Use PTB3.
  Penn Treebank parsed WSJ trees:
    /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/
  Penn Treebank tag set:
    http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
    http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html
  PTB contains:
    50,000 sentences (1,000,000 words) of WSJ text from 1989
    30,000 sentences (400,000 words) of Brown corpus
  PTB WSJ contains sections 0 through 24, ~2400 sentences each, but
    section 24 is half size.  ~50,000 sentences total.
  Convention in parsing world:
    sections 2-21: training (39,832 sentences)
    section 0 or 22 or 24: development testing
    section 23: final test data
  Sections 0 and 1 perceived to be less reliable -- annotators warming up.
  PTB3 adds some new stuff vs. PTB2, but NO BUG FIXES.

  Other parsed corpora ....................................................
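The train/dev/test convention above is easy to encode when you set up an experiment. A minimal Python sketch, assuming the AFS WSJ path given above (the `wsj_split` helper and directory naming are illustrative, not a standard tool):

```python
# Sketch: map a WSJ section number (0-24) to its conventional role in
# parsing experiments: sections 2-21 train, 0/22/24 dev, 23 final test.
# WSJ_ROOT is the AFS path from the notes; sections live in two-digit
# subdirectories (00, 01, ..., 24) in the standard PTB layout.

WSJ_ROOT = "/afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj"

def wsj_split(section):
    """Return the conventional role of a WSJ section."""
    if 2 <= section <= 21:
        return "train"
    if section in (0, 22, 24):
        return "dev"
    if section == 23:
        return "test"
    return "unused"   # e.g. section 1 -- annotators still warming up

# Collect the directories for the standard 20-section training set.
train_dirs = ["%s/%02d" % (WSJ_ROOT, s) for s in range(25)
              if wsj_split(s) == "train"]
```

Whichever of 0, 22, or 24 you pick for development, hold section 23 out until the very end so your final test numbers stay honest.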
  BLLIP: like PTB, but 30m words, parsed automatically by Charniak
  Switchboard: telephone conversations
  NEGRA: German newspaper text
  TIGER
  ICE-GB

POS-tagged corpora ----------------------------------------------------------

  PTB
  BNC
    100m words
    wide sample of British English: newspapers, books, letters

Named Entity Recognition ----------------------------------------------------

  Message Understanding Conference (MUC)
    see e.g.
    /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys.980205
  CoNLL shared task in NER

Speech ----------------------------------------------------------------------

  BNC: 10m words

Dialog ----------------------------------------------------------------------

  Penn Treebank Switchboard corpus

Foreign languages -----------------------------------------------------------

  Penn Arabic Treebank Corpus
    734 stories (140,000 words)
  Penn Chinese Treebank Corpus
    50,000 sentences
  NEGRA corpus
    20,000 sentences (350,000 words) of German newspaper text
    syntactically annotated (parsed)
    tgrep2able
  TIGER corpus
    40,000 sentences (700,000 words)
    syntactically annotated (parsed)

Spam/Email ------------------------------------------------------------------

  The Enron corpus
    /afs/ir/data/linguistic-data/Enron-Email-Corpus/maildir/skilling-j/
  TREC Spam track
    http://trec.nist.gov/data/spam.html

Question Answering (QA) -----------------------------------------------------

  E.g. "What film introduced Jar Jar Binks?"
  E.g. "How much is the Sacajawea coin worth?"
  The TREC competition
    http://trec.nist.gov/data/qa.html
    5 years' worth of data online

Word Sense Disambiguation (WSD) ---------------------------------------------

  Senseval: http://www.senseval.org/

Semantics -------------------------------------------------------------------

  WordNet .................................................................
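The CoNLL NER shared-task data is distributed in a simple column format: one token per line with whitespace-separated tag columns (the last being the named-entity tag), and blank lines between sentences. A small reader is easy to sketch in Python (the sample text below is illustrative, not taken from the actual shared-task files):

```python
# Sketch: reading CoNLL-style NER data. Each non-blank line holds a token
# followed by tag columns; we keep the token and the final (NE) column.
# Blank lines separate sentences.

def read_conll(text):
    """Return a list of sentences, each a list of (token, ne_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():              # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            cols = line.split()
            current.append((cols[0], cols[-1]))
    if current:                           # flush a trailing sentence
        sentences.append(current)
    return sentences

sample = """U.N. NNP I-NP I-ORG
official NN I-NP O

Ekeus NNP I-NP I-PER
heads VBZ I-VP O
"""
sents = read_conll(sample)
```

Keeping only the token and NE columns like this is enough for scoring baselines; a real system would also want the POS and chunk columns.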
  Website: http://wordnet.princeton.edu/
  Online: http://wordnet.princeton.edu/perl/webwn
  150,000 nouns, verbs, adjectives, adverbs
  grouped into "synsets" with glosses, sentence frames
  includes hypernym (kind-of) hierarchy rooted at 'entity'
  also antonyms, holonyms & meronyms, polysemy
  good tutorial: http://www.brians.org/Projects/Technology/Papers/Wordnet/
  neat visual interface: http://www.visualthesaurus.com/?vt
  Problems with WordNet:
  - fine-grained senses
  - sense ordering sometimes funny (see "airline")

  PropBank ................................................................

  http://www.cs.rochester.edu/~gildea/PropBank/Sort/
  adds predicate-argument relations to PTB syntax trees
  "semantic role labeling"
  100,000 annotated verb tokens, 3,200 types
  covers all verbs in the 1m words of the WSJ section, except be, do, have

  FrameNet ................................................................

  similar to PropBank, but organized around semantic frames

  Lexical FreeNet .........................................................

  http://www.lexfn.com/

Other lexical resources -----------------------------------------------------

  Dekang Lin's "thesaurus" (distributional similarity scores)
    http://www.cs.ualberta.ca/~lindek/downloads.htm

Other -----------------------------------------------------------------------

  The web!
    Google Web API, Google scraping
  Wikipedia

Software ========================================================================

Corpus tools ----------------------------------------------------------------

  Tgrep2 ..................................................................

  Go to /afs/ir/data/linguistic-data/Treebank/tgrep2able/
  Read README
  To use tgrep2:
  - log in to e.g. firebird, raptor -- not elaine!
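Walking the hypernym hierarchy up to 'entity' is the bread-and-butter WordNet operation. A toy Python sketch of the idea (the mini-hierarchy below is made up for illustration, not real WordNet data, and real WordNet maps senses to *sets* of hypernyms, not single parents):

```python
# Toy sketch of WordNet's hypernym (kind-of) hierarchy, rooted at 'entity'.
# Each entry maps a word to one hypothetical hypernym; real WordNet works
# over synsets and allows multiple hypernyms per sense.

HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
    "animal": "organism",
    "organism": "entity",
}

def hypernym_chain(word):
    """Follow kind-of links upward until we hit the root."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain
```

Chains like this are what make WordNet useful as a crude semantic-similarity measure: two words are related to the extent that their chains meet low in the hierarchy.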
    (needs linux)
  - put tgrep2 in your path:
      alias tgrep2 /afs/ir/data/linguistic-data/bin/linux_2_4/tgrep2
      setenv PATH /afs/ir/data/linguistic-data/bin/linux_2_4:$PATH
  - go to the tgrep2able data directory
      /afs/ir/data/linguistic-data/Treebank/tgrep2able
      preprocessed tgrep2able versions of:
        WSJ, Brown, Switchboard, NEGRA, Chinese Treebank
  - try some commands
      tgrep2 -c wsj_mrg.t2c.gz 'Greenspan'
      tgrep2 -c wsj_mrg.t2c.gz 'NP < VP' | less
  - operators
      A < B      A immediately dominates B
      A << B     A dominates B
      A <- B     B is the last child of A
      A <<, B    B is a leftmost descendant of A
      A <<` B    B is a rightmost descendant of A
      A . B      A immediately precedes B
      A .. B     A precedes B
      A $ B      A and B are sisters
      A $. B     A and B are sisters and A immediately precedes B
      A $.. B    A and B are sisters and A precedes B
  - useful options
      -t  show only terminal nodes
      -w  show tree for whole sentence
      -a  show all matches in a sentence
  - neat example
      to search for NPs that are coordinations of plural nouns:
      tgrep2 -at '/NP*/ <1 NNS <2 (CC < and) <3 NNS'
  - environment variables
      setenv TGREP2_CORPUS \
        /afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz
  - check the manual
      /afs/ir/data/linguistic-data/Treebank/tgrep2able/tgrep2-manual.pdf

  Other search tools: .....................................................

  Tregex
  View: searches BNC
    http://view.byu.edu/

Parsers ---------------------------------------------------------------------

  The Stanford Parser
    several parsers in one: PCFG parser, lexicalized PCFG parser,
      dependency parser
    also a German parser based on NEGRA
    also a Chinese parser
    Java 1.4, includes source code
    can control with simple shell script
    get PTB-style parses for sentences
    online interface: http://josie.stanford.edu:8080/parser/
  Collins's parser
  Charniak's parser

  MiniPar .................................................................
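To see what a pattern like 'NP < VP' actually asks, here is the immediate-dominance operator reimplemented over a tiny PTB-style bracketed tree in Python. This is a sketch for intuition only, not part of tgrep2; the parser and example tree are made up:

```python
# Sketch: the tgrep2 pattern 'A < B' means some node labeled A immediately
# dominates (is the parent of) a node labeled B. We parse a small
# PTB-style bracketed tree into (label, children) pairs and check that.

def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Parse one '(LABEL child ...)' expression; leaves become (word, [])."""
    assert tokens.pop(0) == "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))
        else:
            children.append((tokens.pop(0), []))   # bare word = leaf
    tokens.pop(0)                                  # consume ")"
    return (label, children)

def matches_child(tree, a, b):
    """True iff some node labeled a immediately dominates a node labeled b."""
    label, children = tree
    if label == a and any(child[0] == b for child in children):
        return True
    return any(matches_child(child, a, b) for child in children)

tree = parse(tokenize("(S (NP (DT the) (NN cat)) (VP (VBD sat)))"))
```

So `matches_child(tree, "S", "VP")` corresponds to the tgrep2 query 'S < VP'; the dominance operator '<<' would be the transitive closure of the same check.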
  http://www.cs.ualberta.ca/~lindek/minipar.htm
  Rion's MINIPAR visualizations:
    http://ai.stanford.edu/~rion/parsing/minipar_viz.html

Taggers ---------------------------------------------------------------------

  Stanford POS tagger

Language models -------------------------------------------------------------

  CMU-Cambridge Statistical Language Modeling toolkit

Named entity recognizers ----------------------------------------------------

  LingPipe

Machine learning tools ------------------------------------------------------

  Stanford Classifier
    classification with a conditional loglinear (aka maxent) model
  Dekang Lin's HMM and MaxEnt packages, in C++:
    http://www.cs.ualberta.ca/~lindek/downloads.htm
  Weka: Java library containing (nearly) every machine learning algorithm
    Naive Bayes, perceptron, decision tree, MaxEnt, SVM, etc.
    http://www.cs.waikato.ac.nz/ml/weka/
  LIBSVM
  SVMlight
  countless open-source libraries/tools for Bayesian learning, neural
    networks, decision trees, loglinear models, etc.
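The simplest of the models these toolkits offer is easy to write from scratch. A toy add-one-smoothed Naive Bayes text classifier in Python, in the spirit of what Weka's NaiveBayes provides in industrial form (training data below is made up; a spam/ham task echoes the Enron and TREC Spam resources above):

```python
import math
from collections import Counter, defaultdict

# Toy multinomial Naive Bayes with add-one (Laplace) smoothing.
# Illustrative only: real systems add tokenization, feature selection, etc.

def train(docs):
    """docs: list of (label, text). Returns (label counts, word counts, vocab)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        label_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify(text, label_counts, word_counts, vocab):
    """Pick argmax_label log P(label) + sum_w log P(w | label)."""
    total_docs = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("spam", "cheap pills"), ("spam", "cheap cheap now"),
        ("ham", "meeting now"), ("ham", "project meeting")]
model = train(docs)
```

Working in log space, as here, avoids underflow when documents get long; that is the same trick every toolkit above uses.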