An introduction to Gsearch

By Susanne Riehemann (ed. by Florian Jaeger): Gsearch allows syntactic pattern matching over tagged corpora. It can also be set up to tag corpora on the fly by calling out to an external tagging program, but this page does not cover that (if you happen to have experience with it, please contact the Corpus TA). The instructions on this page assume that you are using a Sun machine (i.e. a Leland machine, or a CSLI/ling one).

More detailed documentation than this page provides is available at:

    /afs/ir/data/linguistic-data/src/gsearch/doc/

Getting started

First, make sure you have the following included in your PATH variable ('echo $PATH' shows you what is included in your PATH):
    /afs/ir/data/linguistic-data/bin/sun4x_57

You can do this by typing the following, which prepends the above-mentioned path to your PATH variable (setenv is csh/tcsh syntax):

    setenv PATH /afs/ir/data/linguistic-data/bin/sun4x_57:$PATH
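
The setenv command above works in csh and tcsh. If your login shell is a Bourne-style shell (sh, ksh, bash) instead, something along the following lines should have the same effect. This is only a sketch: the variable name GSEARCH_BIN is made up for illustration, and the check simply avoids adding the directory a second time if you run it again.

```shell
# Directory from the instructions above.
GSEARCH_BIN=/afs/ir/data/linguistic-data/bin/sun4x_57

# Prepend the directory only if it is not already on PATH,
# so that re-running this does not create duplicate entries.
case ":$PATH:" in
  *":$GSEARCH_BIN:"*) ;;                      # already present: do nothing
  *) PATH="$GSEARCH_BIN:$PATH"; export PATH ;;
esac
```

Afterwards, 'echo $PATH' should show the directory at the front of the list.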

Corpora available for Gsearch

Next, to find the corpus names you will need for your searches, look at the following file, in which they are defined:

    /afs/ir/data/linguistic-data/src/gsearch/config/Setup

Note, however, that I only edited it quickly to get a few things working, so not everything defined there exists or works. The following corpora currently do:

  • bnc
    • bnc_1
    • bnc_10
    • bncsamp
  • brown
  • wsj

To search for a word:

    gsearch bnc_1 - '<word="suspicious">'

Examples for searches

To search for a tagged word, here 'butter' as a verb (note: the BNC uses a different tag set from the UPenn one):

    gsearch bnc_1 - "<tag=V.* & word=butter>"

To search for more than a word or tag, you need a grammar. There are some sample ones in:

    /afs/ir/data/linguistic-data/src/gsearch/Demo

Here is an example of a search using the grammar in GrammarBNC, in which the patterns used below (including pp) are defined. The search looks for 'show' as a noun followed by two prepositional phrases (pps); the command is a single line:

    gsearch bnc_1 /afs/ir/data/linguistic-data/src/gsearch/Demo/GrammarBNC "<tag=NN.* & word=show>" pp pp
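
Because the grammar path makes this command long, it can help to keep the pieces in shell variables. The following is a minimal sketch for a Bourne-style shell (sh, ksh, bash). The names GS_DEMO, GS_GRAMMAR, QUERY and the helper gs_query are made up for illustration and are not part of Gsearch. The sketch only prints the command so that you can inspect it before running anything.

```shell
# Location of the sample grammars, as given above.
GS_DEMO=/afs/ir/data/linguistic-data/src/gsearch/Demo
GS_GRAMMAR=$GS_DEMO/GrammarBNC

# Hypothetical helper: build a token pattern from a tag regex and a word.
gs_query() {
  printf '<tag=%s & word=%s>' "$1" "$2"
}

QUERY=$(gs_query 'NN.*' show)

# Print the full command for inspection.
printf 'gsearch bnc_1 %s "%s" pp pp\n' "$GS_GRAMMAR" "$QUERY"

# To run the search itself, use:
#   gsearch bnc_1 "$GS_GRAMMAR" "$QUERY" pp pp
```

The quoting matters: the pattern contains spaces, '<', '>' and '&', so it must reach gsearch as a single quoted argument.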

You will notice that the search is nothing but pattern matching. It does no disambiguation, so in the example above the PPs in the output often aren't actually modifying 'show' but something else. C'est la vie.

Note that Gsearch corpora aren't pre-indexed: if you search the whole BNC, it actually searches through a gigabyte of data exhaustively. So expect it to take a while.
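
Since a search over the whole BNC can run for a long time, one option is to start it in the background and save the results to a file. The sketch below is for a Bourne-style shell and rests on two assumptions: that gsearch prints its matches to standard output, and that the standard nohup utility is available on your machine. The function name run_search and the output file name are made up for illustration.

```shell
# Hypothetical wrapper: run a long command in the background,
# writing its output to a file and its error messages to <file>.err.
run_search() {
  out=$1; shift
  nohup "$@" > "$out" 2> "$out.err" &
  echo "started job $! (output in $out)"
}

# Example use (assumes gsearch writes matches to standard output):
#   run_search suspicious.txt gsearch bnc - '<word="suspicious">'
```

It is probably worth trying a pattern on a small corpus such as bncsamp first, and only then starting the long run.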