CS224N Section 3
Corpora, et cetera
Bill MacCartney
5 May 2006

Today we'll be visiting websites, looking through AFS corpora directories,
and playing with tools like tgrep.

Project Ideas ===================================================================

  See http://nlp.stanford.edu/courses/cs224n/
  See project ideas from the handout
  Don't restrict your attention to topics we cover in the first half of the
    quarter!

Corpora =========================================================================

  Tons of resources!  See links at http://nlp.stanford.edu/links/statnlp.html

Corpora@Stanford ------------------------------------------------------------

  http://www.stanford.edu/dept/linguistics/corpora/
  Corpus TA: Liz Coppock
  Stanford corpora: /afs/ir/data/linguistic-data/
    You need to belong to a special group for access
    Pi-chuan is gonna hook you up

LDC = Linguistic Data Consortium --------------------------------------------

  http://www.ldc.upenn.edu/Catalog/

Treebanks -------------------------------------------------------------------

  Penn Treebank ...........................................................

  There's PTB2 and PTB3.  Use PTB3.
  Penn Treebank parsed WSJ trees:
    /afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj/
  Penn Treebank tag set:
    http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
    http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html
  PTB contains:
    50,000 sentences (1,000,000 words) of WSJ text from 1989
    30,000 sentences (400,000 words) of Brown corpus
  PTB WSJ contains sections 0 through 24, ~2400 sentences each, but
    section 24 is half size.  ~50,000 sentences total.
  Convention in parsing world:
    sections 2-21: training (39,832 sentences)
    section 0 or 22 or 24: development testing
    section 23: final test data
  Sections 0 and 1 perceived to be less reliable -- annotators warming up.
  PTB3 adds some new stuff vs. PTB2, but NO BUG FIXES.

  Other parsed corpora ....................................................
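The train/dev/test convention above is easy to encode when you set up an experiment. A minimal Python sketch, assuming the AFS WSJ path given above (the `wsj_split` helper and directory naming are illustrative, not a standard tool):

```python
# Sketch: map a WSJ section number (0-24) to its conventional role in
# parsing experiments: sections 2-21 train, 0/22/24 dev, 23 final test.
# WSJ_ROOT is the AFS path from the notes; sections live in two-digit
# subdirectories (00, 01, ..., 24) in the standard PTB layout.

WSJ_ROOT = "/afs/ir/data/linguistic-data/Treebank/3/parsed/mrg/wsj"

def wsj_split(section):
    """Return the conventional role of a WSJ section."""
    if 2 <= section <= 21:
        return "train"
    if section in (0, 22, 24):
        return "dev"
    if section == 23:
        return "test"
    return "unused"   # e.g. section 1 -- annotators still warming up

# Collect the directories for the standard 20-section training set.
train_dirs = ["%s/%02d" % (WSJ_ROOT, s) for s in range(25)
              if wsj_split(s) == "train"]
```

Whichever of 0, 22, or 24 you pick for development, hold section 23 out until the very end so your final test numbers stay honest.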
  BLLIP: like PTB, but 30m words, parsed automatically by Charniak
  Switchboard: telephone conversations
  NEGRA: German newspaper text
  TIGER
  ICE-GB

POS-tagged corpora ----------------------------------------------------------

  PTB
  BNC
    100m words
    wide sample of British English: newspapers, books, letters

Named Entity Recognition ----------------------------------------------------

  Message Understanding Conference (MUC)
    see e.g.
    /afs/ir/data/linguistic-data/MUC_7/muc_7/data/training.ne.eng.keys.980205
  CoNLL shared task in NER

Speech ----------------------------------------------------------------------

  BNC: 10m words

Dialog ----------------------------------------------------------------------

  Penn Treebank Switchboard corpus

Foreign languages -----------------------------------------------------------

  Penn Arabic Treebank Corpus
    734 stories (140,000 words)
  Penn Chinese Treebank Corpus
    50,000 sentences
  NEGRA corpus
    20,000 sentences (350,000 words) of German newspaper text
    syntactically annotated (parsed)
    tgrep2able
  TIGER corpus
    40,000 sentences (700,000 words)
    syntactically annotated (parsed)

Spam/Email ------------------------------------------------------------------

  The Enron corpus
    /afs/ir/data/linguistic-data/Enron-Email-Corpus/maildir/skilling-j/
  TREC Spam track
    http://trec.nist.gov/data/spam.html

Question Answering (QA) -----------------------------------------------------

  E.g. "What film introduced Jar Jar Binks?"
  E.g. "How much is the Sacajawea coin worth?"
  The TREC competition
    http://trec.nist.gov/data/qa.html
    5 years' worth of data online

Word Sense Disambiguation (WSD) ---------------------------------------------

  Senseval: http://www.senseval.org/

Semantics -------------------------------------------------------------------

  WordNet .................................................................
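The CoNLL NER shared-task data is distributed in a simple column format: one token per line with whitespace-separated tag columns (the last being the named-entity tag), and blank lines between sentences. A small reader is easy to sketch in Python (the sample text below is illustrative, not taken from the actual shared-task files):

```python
# Sketch: reading CoNLL-style NER data. Each non-blank line holds a token
# followed by tag columns; we keep the token and the final (NE) column.
# Blank lines separate sentences.

def read_conll(text):
    """Return a list of sentences, each a list of (token, ne_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():              # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            cols = line.split()
            current.append((cols[0], cols[-1]))
    if current:                           # flush a trailing sentence
        sentences.append(current)
    return sentences

sample = """U.N. NNP I-NP I-ORG
official NN I-NP O

Ekeus NNP I-NP I-PER
heads VBZ I-VP O
"""
sents = read_conll(sample)
```

Keeping only the token and NE columns like this is enough for scoring baselines; a real system would also want the POS and chunk columns.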
  Website: http://wordnet.princeton.edu/
  Online: http://wordnet.princeton.edu/perl/webwn
  150,000 nouns, verbs, adjectives, adverbs
  grouped into "synsets" with glosses, sentence frames
  includes hypernym (kind-of) hierarchy rooted at 'entity'
  also antonyms, holonyms & meronyms, polysemy
  good tutorial: http://www.brians.org/Projects/Technology/Papers/Wordnet/
  neat visual interface: http://www.visualthesaurus.com/?vt
  Problems with WordNet:
  - fine-grained senses
  - sense ordering sometimes funny (see "airline")

  PropBank ................................................................

  http://www.cs.rochester.edu/~gildea/PropBank/Sort/
  adds predicate-argument relations to PTB syntax trees
  "semantic role labeling"
  100,000 annotated verb tokens, 3,200 types
  covers all verbs in the 1m words of the WSJ section, except be, do, have

  FrameNet ................................................................

  similar to PropBank, but organized around semantic frames

  Lexical FreeNet .........................................................

  http://www.lexfn.com/

Other lexical resources -----------------------------------------------------

  Dekang Lin's "thesaurus" (distributional similarity scores)
    http://www.cs.ualberta.ca/~lindek/downloads.htm

Other -----------------------------------------------------------------------

  The web!
    Google Web API, Google scraping
  Wikipedia

Software ========================================================================

Corpus tools ----------------------------------------------------------------

  Tgrep2 ..................................................................

  Go to /afs/ir/data/linguistic-data/Treebank/tgrep2able/
  Read README
  To use tgrep2:
  - log in to e.g. firebird, raptor -- not elaine!
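Walking the hypernym hierarchy up to 'entity' is the bread-and-butter WordNet operation. A toy Python sketch of the idea (the mini-hierarchy below is made up for illustration, not real WordNet data, and real WordNet maps senses to *sets* of hypernyms, not single parents):

```python
# Toy sketch of WordNet's hypernym (kind-of) hierarchy, rooted at 'entity'.
# Each entry maps a word to one hypothetical hypernym; real WordNet works
# over synsets and allows multiple hypernyms per sense.

HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
    "animal": "organism",
    "organism": "entity",
}

def hypernym_chain(word):
    """Follow kind-of links upward until we hit the root."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain
```

Chains like this are what make WordNet useful as a crude semantic-similarity measure: two words are related to the extent that their chains meet low in the hierarchy.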
    (needs linux)
  - put tgrep2 in your path:
      alias tgrep2 /afs/ir/data/linguistic-data/bin/linux_2_4/tgrep2
      setenv PATH /afs/ir/data/linguistic-data/bin/linux_2_4:$PATH
  - go to the tgrep2able data directory
      /afs/ir/data/linguistic-data/Treebank/tgrep2able
      preprocessed tgrep2able versions of:
        WSJ, Brown, Switchboard, NEGRA, Chinese Treebank
  - try some commands
      tgrep2 -c wsj_mrg.t2c.gz 'Greenspan'
      tgrep2 -c wsj_mrg.t2c.gz 'NP < VP' | less
  - operators
      A < B      A immediately dominates B
      A << B     A dominates B
      A <- B     B is the last child of A
      A <<, B    B is a leftmost descendant of A
      A <<` B    B is a rightmost descendant of A
      A . B      A immediately precedes B
      A .. B     A precedes B
      A $ B      A and B are sisters
      A $. B     A and B are sisters and A immediately precedes B
      A $.. B    A and B are sisters and A precedes B
  - useful options
      -t  show only terminal nodes
      -w  show tree for whole sentence
      -a  show all matches in a sentence
  - neat example
      to search for NPs that are coordinations of plural nouns:
      tgrep2 -at '/NP*/ <1 NNS <2 (CC < and) <3 NNS'
  - environment variables
      setenv TGREP2_CORPUS \
        /afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz
  - check the manual
      /afs/ir/data/linguistic-data/Treebank/tgrep2able/tgrep2-manual.pdf

  Other search tools: .....................................................

  Tregex
  View: searches BNC
    http://view.byu.edu/

Parsers ---------------------------------------------------------------------

  The Stanford Parser
    several parsers in one: PCFG parser, lexicalized PCFG parser,
      dependency parser
    also a German parser based on NEGRA
    also a Chinese parser
    Java 1.4, includes source code
    can control with simple shell script
    get PTB-style parses for sentences
    online interface: http://josie.stanford.edu:8080/parser/
  Collins's parser
  Charniak's parser

  MiniPar .................................................................
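To see what a pattern like 'NP < VP' actually asks, here is the immediate-dominance operator reimplemented over a tiny PTB-style bracketed tree in Python. This is a sketch for intuition only, not part of tgrep2; the parser and example tree are made up:

```python
# Sketch: the tgrep2 pattern 'A < B' means some node labeled A immediately
# dominates (is the parent of) a node labeled B. We parse a small
# PTB-style bracketed tree into (label, children) pairs and check that.

def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Parse one '(LABEL child ...)' expression; leaves become (word, [])."""
    assert tokens.pop(0) == "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))
        else:
            children.append((tokens.pop(0), []))   # bare word = leaf
    tokens.pop(0)                                  # consume ")"
    return (label, children)

def matches_child(tree, a, b):
    """True iff some node labeled a immediately dominates a node labeled b."""
    label, children = tree
    if label == a and any(child[0] == b for child in children):
        return True
    return any(matches_child(child, a, b) for child in children)

tree = parse(tokenize("(S (NP (DT the) (NN cat)) (VP (VBD sat)))"))
```

So `matches_child(tree, "S", "VP")` corresponds to the tgrep2 query 'S < VP'; the dominance operator '<<' would be the transitive closure of the same check.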
  http://www.cs.ualberta.ca/~lindek/minipar.htm
  Rion's MINIPAR visualizations:
    http://ai.stanford.edu/~rion/parsing/minipar_viz.html

Taggers ---------------------------------------------------------------------

  Stanford POS tagger

Language models -------------------------------------------------------------

  CMU-Cambridge Statistical Language Modeling toolkit

Named entity recognizers ----------------------------------------------------

  LingPipe

Machine learning tools ------------------------------------------------------

  Stanford Classifier
    classification with a conditional loglinear (aka maxent) model
  Dekang Lin's HMM and MaxEnt packages, in C++:
    http://www.cs.ualberta.ca/~lindek/downloads.htm
  Weka: Java library containing (nearly) every machine learning algorithm
    Naive Bayes, perceptron, decision tree, MaxEnt, SVM, etc.
    http://www.cs.waikato.ac.nz/ml/weka/
  LIBSVM
  SVMlight
  countless open-source libraries/tools for Bayesian learning, neural
    networks, decision trees, loglinear models, etc.
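The simplest of the models these toolkits offer is easy to write from scratch. A toy add-one-smoothed Naive Bayes text classifier in Python, in the spirit of what Weka's NaiveBayes provides in industrial form (training data below is made up; a spam/ham task echoes the Enron and TREC Spam resources above):

```python
import math
from collections import Counter, defaultdict

# Toy multinomial Naive Bayes with add-one (Laplace) smoothing.
# Illustrative only: real systems add tokenization, feature selection, etc.

def train(docs):
    """docs: list of (label, text). Returns (label counts, word counts, vocab)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        label_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify(text, label_counts, word_counts, vocab):
    """Pick argmax_label log P(label) + sum_w log P(w | label)."""
    total_docs = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("spam", "cheap pills"), ("spam", "cheap cheap now"),
        ("ham", "meeting now"), ("ham", "project meeting")]
model = train(docs)
```

Working in log space, as here, avoids underflow when documents get long; that is the same trick every toolkit above uses.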