Corpus-tools & other useful software (Corpora@Stanford)
TGrep and its improved successor TGrep2 are Unix-based tools that let you search syntactically annotated and POS-annotated corpora on AFS. The syntax takes some getting used to, but it is worth it, since the searches you can do with these tools are quite powerful. If you prefer a graphical interface, you can use TIGERSearch, which has the same search options. Below you will find information on getting started with TGrep2, getting started with TGrep, the TGrep syntax with some examples, and what is new in TGrep2.
Tip-1: Much of the information on this page is summarized in this handout on TGrep 1 by Tatiana Nikitina and Jeanette Pettibone (PDF file; ~130KB), which provides a short intro and a concise summary of the TGrep syntax.
Tip-2: New presentation available:
By Susanne Riehemann (ed. by Florian Jaeger): To use TGrep2 you need to be logged in to a firebird or raptor computer. TGrep2 is compiled only for Linux and doesn't work on the elaines! Use Samson or any ssh software to connect to one of the firebirds, i.e. firebird1.stanford.edu through firebird15.stanford.edu.
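If you want to be able to use the command "tgrep2" without typing its full path name, you need to add the directory that contains it (/afs/ir/data/linguistic-data/bin/linux_2_4, going by the full path quoted further down this page) to your PATH variable and then log in again.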
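As a minimal sketch of the PATH change, assuming an sh or bash login shell: the directory name is taken from the full tgrep2 path quoted on this page, and the variable name TGREP2_BIN is only an illustration. The csh/tcsh line in the final comment is an assumed equivalent, not necessarily the exact command this page originally showed.

```shell
# Directory that contains the tgrep2 binary (from the full path quoted on this page).
TGREP2_BIN=/afs/ir/data/linguistic-data/bin/linux_2_4   # illustrative variable name

# Append the directory to PATH only if it is not already listed there.
case ":$PATH:" in
  *":$TGREP2_BIN:"*) ;;                       # already present, nothing to do
  *) PATH="$PATH:$TGREP2_BIN"; export PATH ;; # append and export
esac

# csh/tcsh users would instead add a line like this to ~/.login (assumed equivalent):
#   setenv PATH ${PATH}:/afs/ir/data/linguistic-data/bin/linux_2_4
```

Checking before appending keeps PATH from growing each time the startup file is read.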
The files for the tgrep2able versions of the Brown, Switchboard, WSJ, NEGRA, and Chinese Treebank corpora are in /afs/ir/data/linguistic-data/Treebank/tgrep2able (the directory used in the full command below).
If you go there, you can type commands like tgrep2 -c wsj_mrg.t2c.gz 'NP < VP'.
This short form only works if you have added the tgrep2 directory to your PATH variable as described above; otherwise the command would be "/afs/ir/data/linguistic-data/bin/linux_2_4/tgrep2 -c /afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz 'NP < VP'".
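The two ways of calling tgrep2 can be combined in a small wrapper. This is only a sketch: the paths, the -c flag, and the pattern come from the command quoted above, while the variable names and the head -n 20 limit are assumptions added for illustration.

```shell
BIN_DIR=/afs/ir/data/linguistic-data/bin/linux_2_4
CORPUS_DIR=/afs/ir/data/linguistic-data/Treebank/tgrep2able
PATTERN='NP < VP'   # NPs that immediately dominate a VP

# Prefer tgrep2 from PATH; otherwise fall back to the full path name.
if command -v tgrep2 >/dev/null 2>&1; then
  TGREP2=tgrep2
else
  TGREP2="$BIN_DIR/tgrep2"
fi

# Run the search only if both the binary and the corpus file are reachable.
if [ -x "$(command -v "$TGREP2" 2>/dev/null)" ] && [ -r "$CORPUS_DIR/wsj_mrg.t2c.gz" ]; then
  "$TGREP2" -c "$CORPUS_DIR/wsj_mrg.t2c.gz" "$PATTERN" | head -n 20   # first 20 output lines only
else
  echo "tgrep2 or the corpus file is not reachable from this machine" >&2
fi
```

Keeping the pattern in single quotes stops the shell from treating < as input redirection.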
There is a manual for TGrep2, which can also be found in
See also the TGrep syntax and some examples, The Treebank Page, Switchboard Tags, and Jeanette Pettibone's page on Treebank tags, but note that some of this information may be specific to TGrep and not TGrep2.
By Susanne Riehemann (ed. by Florian Jaeger): To use TGrep you need to be logged in to an epic computer - this doesn't work on the elaines! Use Samson or any ssh software to connect to one of the epics, i.e. epic1.stanford.edu through epic28.stanford.edu
Assuming you use csh or tcsh, you need to do the following to set everything up properly. (You can do these at the prompt, but if you want to avoid having to do this every time you use TGrep, put them at the end of the .login file in your home directory. The first time you do this you'll need to log in again.):
Then you should be able to run tgrep 'NP < VP', which finds NPs that immediately dominate VPs, and have it work! You can find usage instructions in 'man tgrepdoc', while 'man tgrep' tells you about the command flags. See also the notes below on the TGrep syntax and some examples, The Treebank Page, Switchboard Tags, and Jeanette Pettibone's page on Treebank tags.
This sets things up for Switchboard (merged). On AFS there are now 4 TGrep indices, covering the parsed sections of Switchboard, the WSJ (2 versions: one with PoS tags, one without), and Brown (only a small fragment was treebanked). To use a different one, change the value of TGREP_CORPUS above accordingly, or specify a corpus on the command line.
Note that we now have TGrep2 as well!
If you want to look at this output as it is generated, use:
If you want to save the output to a file, use:
If you want to see the tree for the whole sentence in which the match occurred, use:
If there are multiple matches for the pattern in a sentence, you can find them all with:
These switches can be combined, e.g. if you want to see the whole sentence that was matched, use:
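A minimal sketch of the first two cases (paging and saving), assuming only ordinary shell piping and redirection and a TGREP_CORPUS that has been set as described above. The output file name is hypothetical. The switches for whole-sentence output and multiple matches are listed in 'man tgrep' and are not guessed at here.

```shell
PATTERN='NP < VP'
OUT=np_vp_matches.txt   # hypothetical output file name

if command -v tgrep >/dev/null 2>&1; then
  # Page through the matches as they are generated.
  tgrep "$PATTERN" | more
  # Save the matches to a file instead of printing them.
  tgrep "$PATTERN" > "$OUT"
else
  echo "tgrep is not available on this machine" >&2
fi
```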
A < B     A immediately dominates B
A << B    A dominates B
A <- B    B is the last child of A
A <<, B   B is a leftmost descendant of A
A <<` B   B is a rightmost descendant of A
A . B     A immediately precedes B
A .. B    A precedes B
A $ B     A and B are sisters
A $. B    A and B are sisters and A immediately precedes B
A $.. B   A and B are sisters and A precedes B
Some examples using these operators: To search for NPs that are coordinations of plural nouns:
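One way such a search might be written is sketched here. The pattern is an assumption and not necessarily the example this page originally showed. It relies on the Penn Treebank tags NNS (plural noun) and CC (coordinating conjunction) and on parentheses to group nested relations. The corpus path and the -c flag are taken from the TGrep2 command quoted earlier on this page.

```shell
# Assumed pattern: an NP that immediately dominates a plural noun (NNS)
# whose next sister is a conjunction (CC), which in turn is immediately
# followed by another plural-noun sister.
PATTERN='NP < (NNS $. (CC $. NNS))'
CORPUS=/afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz

if command -v tgrep2 >/dev/null 2>&1 && [ -r "$CORPUS" ]; then
  tgrep2 -c "$CORPUS" "$PATTERN" | head -n 20   # show the first 20 output lines
else
  # Without access to the tool or corpus, just show the command that would be run.
  printf 'would run: tgrep2 -c %s %s\n' "$CORPUS" "'$PATTERN'"
fi
```

The single quotes matter here too: they keep the shell from interpreting <, $, and the parentheses.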
If you've done any interesting TGrep searches for your research, please send the commands to (Corpus TA), so other people can learn by example.
The manual for TGrep2, locally stored at:
From that it should be clear that TGrep2 is a considerable improvement over TGrep. Another tool that allows searches that refer to any kind of edge labels is TIGERSearch. TGrep2 has also been improved in terms of the control it gives you over the form of the output, the speed of searches, and in that it now accepts search patterns from files as input rather than only from the command line.
Note that some features of the old TGrep are no longer supported in TGrep2 (for a list, see the manual).