TGrep & TGrep2

TGrep and its improved successor TGrep2 are Unix-based tools that let you search syntactically annotated and POS-annotated corpora on AFS. The syntax takes some getting used to, but it is worth it, since the searches you can do with these tools are quite powerful. If you prefer a graphical interface, you can use TIGERsearch, which has the same search options. Below you will find information on setting up TGrep2 and TGrep, the TGrep syntax with some examples, and the differences between TGrep2 and TGrep.

Tip-1: Much of the information on this page is summarized in this handout on TGrep 1 by Tatiana Nikitina and Jeanette Pettibone (PDF file; ~130KB), which provides a short intro and a condensed summary of the TGrep syntax.

Tip-2: New presentation available:

Setting up TGrep2

By Susanne Riehemann (ed. by Florian Jaeger): To use TGrep2 you need to be logged in to a firebird or raptor computer. TGrep2 is compiled only for Linux and does not work on the elaines! Use Samson or any SSH software to connect to one of the firebirds, i.e. firebird1.stanford.edu through firebird15.stanford.edu.

If you want to be able to use the command "tgrep2" without typing its full path name you need to add

    /afs/ir/data/linguistic-data/bin/linux_2_4

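to your PATH variable. Assuming you use csh or tcsh, you can do this at the prompt by entering the line below. To make the change permanent, put the line at the end of the .login file in your home directory and log in again.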

    setenv PATH /afs/ir/data/linguistic-data/bin/linux_2_4:$PATH
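
The setenv line above is csh/tcsh syntax. If your login shell is a Bourne-style shell (sh, bash, zsh), which is an assumption here since this page only covers csh and tcsh, a rough equivalent might look like the following sketch. The variable name TGREP2_BIN is purely illustrative.

```shell
# Assumption: a Bourne-style shell (sh, bash, zsh), not csh/tcsh.
# TGREP2_BIN is an illustrative variable name; the directory is the one given above.
TGREP2_BIN=/afs/ir/data/linguistic-data/bin/linux_2_4

# Prepend the directory only if it is not already on PATH,
# so that repeated logins do not keep growing the variable.
case ":$PATH:" in
    *":$TGREP2_BIN:"*) ;;
    *) PATH="$TGREP2_BIN:$PATH" ;;
esac
export PATH
```

In such a shell these lines would typically go into a startup file such as ~/.profile instead of .login.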

The files for the tgrep2able versions of the Brown, Switchboard, WSJ, NEGRA, and Chinese Treebank corpora are in

    /afs/ir/data/linguistic-data/Treebank/tgrep2able

If you go there, you can type commands like

    tgrep2 -c wsj_mrg.t2c.gz 'NP < VP'

This short form works only if you are in that directory and have added the bin directory to your PATH variable as described above. Otherwise the command would be "/afs/ir/data/linguistic-data/bin/linux_2_4/tgrep2 -c /afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz 'NP < VP'".
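
If you would rather not depend on your PATH or your current directory, the full paths can be wrapped in a small shell function. This is only a sketch for a Bourne-style shell (an assumption; the rest of this page uses csh/tcsh), and the names search_wsj, TGREP2_BIN and TGREP2_DATA are hypothetical helpers, not part of TGrep2.

```shell
# Locations quoted from this page.
TGREP2_BIN=/afs/ir/data/linguistic-data/bin/linux_2_4
TGREP2_DATA=/afs/ir/data/linguistic-data/Treebank/tgrep2able

# search_wsj is a hypothetical helper name, not a TGrep2 command.
# "$1" is the search pattern, e.g. 'NP < VP'.
search_wsj() {
    "$TGREP2_BIN/tgrep2" -c "$TGREP2_DATA/wsj_mrg.t2c.gz" "$1"
}

# Only attempt a search on machines where the binary is actually available.
if [ -x "$TGREP2_BIN/tgrep2" ]; then
    search_wsj 'NP < VP' | head
fi
```

The -x check simply skips the search on machines where the tgrep2 binary is not present.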

There is a manual for TGrep2, which can also be found in

    /afs/ir/data/linguistic-data/Treebank/tgrep2able/tgrep2-manual.pdf

See also the TGrep syntax and some examples, The Treebank Page, Switchboard Tags, and Jeanette Pettibone's page on Treebank tags, but note that some of this information may be specific to TGrep and not TGrep2.

Setting up TGrep

By Susanne Riehemann (ed. by Florian Jaeger): To use TGrep you need to be logged in to an epic computer. It does not work on the elaines! Use Samson or any SSH software to connect to one of the epics, i.e. epic1.stanford.edu through epic28.stanford.edu.

Assuming you use csh or tcsh, you need to do the following to set everything up properly. You can enter these commands at the prompt, but if you want to avoid having to do this every time you use TGrep, put them at the end of the .login file in your home directory (the first time you do this you will need to log in again):

    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/swbd_mrg.crp
    setenv PATH /afs/ir/data/linguistic-data/bin/sun4x_57:$PATH
    setenv MANPATH /afs/ir/data/linguistic-data/man:$MANPATH

Then, you should be able to do:

    tgrep 'NP < VP'

which finds NPs that immediately dominate VPs, and have it work! You can find usage instructions in 'man tgrepdoc', while 'man tgrep' tells you about the command flags. See also the notes below on the TGrep syntax and some examples, The Treebank Page, Switchboard Tags, and Jeanette Pettibone's page on Treebank tags.

This sets things up for Switchboard (merged). On AFS there are now four TGrep indices, covering the parsed sections of Switchboard, the WSJ (two versions: one with POS tags, one without), and Brown (only a small fragment was treebanked). You can specify a corpus on the command line, or change the value of TGREP_CORPUS to one of the following:

    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/swbd_mrg.crp
    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/brown_mrg.crp
    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/wsj_mrg.crp
    setenv TGREP_CORPUS /afs/ir/data/linguistic-data/Treebank/tgrepabl/wsj_skel.crp
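
For users of a Bourne-style shell (sh, bash, zsh), the same switch between the four indices could be sketched as a small function. This is an assumption-laden illustration: the page itself only documents csh/tcsh, and use_corpus and TGREP_DIR are hypothetical names, not part of TGrep.

```shell
# Directory holding the four TGrep indices listed above.
TGREP_DIR=/afs/ir/data/linguistic-data/Treebank/tgrepabl

# use_corpus is a hypothetical helper, not a TGrep command.
# It accepts only the four index names given on this page.
use_corpus() {
    case "$1" in
        swbd_mrg|brown_mrg|wsj_mrg|wsj_skel)
            TGREP_CORPUS="$TGREP_DIR/$1.crp"
            export TGREP_CORPUS
            ;;
        *)
            echo "unknown corpus: $1" >&2
            return 1
            ;;
    esac
}

# Example: switch to the WSJ index and show the resulting setting.
use_corpus wsj_mrg
echo "$TGREP_CORPUS"
```

After a call such as use_corpus wsj_mrg, later tgrep commands in the same session should pick up the new corpus through TGREP_CORPUS.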

Note that we now have TGrep2 as well!

The TGrep syntax and some examples

Searching for NPs that immediately dominate VPs:

    tgrep 'NP < VP'

If you want to look at this output as it is generated, use:

    tgrep 'NP < VP' | more

If you want to save the output to a file, use:

    tgrep 'NP < VP' > filename

Some useful command-line options

If you want to see only the terminal nodes of the tree, use:

    tgrep -t 'NP < VP'

If you want to see the tree for the whole sentence in which the match occurred, use:

    tgrep -w 'NP < VP'

If there are multiple matches for the pattern in a sentence, you can find them all with:

    tgrep -a 'NP < VP'

These switches can be combined, e.g. if you want to see only the terminal nodes (the words) of the whole sentence in which the match occurred, use:

    tgrep -tw 'NP < VP'

Some useful operators

A < B      A immediately dominates B
A << B     A dominates B
A <- B     B is the last child of A
A <<, B    B is a leftmost descendant of A
A <<` B    B is a rightmost descendant of A
A . B      A immediately precedes B
A .. B     A precedes B
A $ B      A and B are sisters
A $. B     A and B are sisters and A immediately precedes B
A $.. B    A and B are sisters and A precedes B
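
To make the table more concrete, here is a small sketch that pairs a few patterns with glosses derived from the operator definitions above. It is written for a Bourne-style shell (an assumption; the setup sections use csh/tcsh), the labels PP and VBD are assumed to be Treebank tags, and the searches are only run where tgrep is installed and TGREP_CORPUS is set as described in the setup section.

```shell
# One "pattern|gloss" pair per line; glosses follow the operator table.
patterns='NP << PP|NPs that dominate a PP at any depth
NP <- PP|NPs whose last child is a PP
NP $. VP|NPs that immediately precede a sister VP
VP <<, VBD|VPs that have a VBD as a leftmost descendant'

printf '%s\n' "$patterns" | while IFS='|' read -r pattern gloss; do
    # Show the command that corresponds to each pattern.
    echo "tgrep -t '$pattern'    # $gloss"
    # Run the search only where tgrep is actually available.
    if command -v tgrep >/dev/null 2>&1; then
        tgrep -t "$pattern" </dev/null | head -n 5
    fi
done
```

Quoting each pattern in single quotes keeps the shell from interpreting characters such as <, $ and the backquote.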

An example using these operators: To search for NPs that are coordinations of plural nouns (here <1, <2, and <3 pick out the first, second, and third child of the node):

    tgrep -at '/NP*/ <1 NNS <2 (CC < and) <3 NNS'

If you've done any interesting TGrep searches for your research, please send the commands to the Corpus TA, so that other people can learn by example.

Differences between TGrep2 and TGrep

The manual for TGrep2, locally stored at:

    /afs/ir/data/linguistic-data/Treebank/tgrep2able/tgrep2-manual.pdf

also contains a list of new features and changes relative to the first version of TGrep in section 6 (pp. 16-17). The main difference is that TGrep2 allows reference to edge labels (and, as far as I can see, secondary edge labels). Those labels are usually used to mark up additional information about a phrase, including:
  • That a to-PP is dative, or locative, etc.
  • The grammatical function of a constituent
  • X-bar tags (e.g. headedness)
  • expletive 'es' (for German)
  • etc.

From that it should be clear that this is a considerable improvement over TGrep. Another tool that allows searches that refer to any kind of edge labels is TIGERsearch. TGrep2 has also been improved in terms of the control it gives you over the form of the output and the speed of searches, and it now accepts search patterns from files as well as from the command line.

Note that some features of the old TGrep are no longer supported in TGrep2 (for a list, cf. the manual).