Corpus-tools & other useful software

Corpora@Stanford


The grep-family

The primary Unix tools for searching corpora are the members of the "grep" family of programs: grep, egrep, sgrep, cgrep and agrep.

General help on grep is available online. For detailed descriptions of these commands you can also read the manual pages under Unix/Linux (e.g. while logged into AFS). To do that, type man followed by the name of the command, for example:

    man grep

Tip: be sure to check out Jeanette Pettibone's grep help.

Examples of usage - grep

    grep 'istic ' 072.Indept.55.corp
returns a list of the lines in the file 072.Indept.55.corp that contain the string 'istic' followed by a space character.
    grep -c 'istic ' 072.Indept.55.corp
returns the number of lines (-c for "count") containing at least one instance of the string 'istic '.
    grep 'istic' *
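searches all files in the current directory and returns the lines containing at least one instance of the string 'istic'. When more than one file is searched, each line is prefixed with the name of the file it came from.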
    grep 'istic' * | more
pipes the output through more so that only one screenful is displayed at a time.
    grep 'as far as .* goes' * | more
searches all files in the current directory for lines containing the string 'as far as ' followed by any sequence of characters (possibly empty) followed by ' goes'. The period stands for 'any single character' and the star stands for 'zero or more repetitions of the preceding item'. See the manpage for a description of the kinds of regular expressions you can use.
    grep 'may [a-z]* be' 003.Guardn.02.corp > ~/out.txt
returns all lines containing 'may ' followed by a sequence of lower-case alphabetic characters (roughly a word) followed by ' be', and sends the output to the file 'out.txt' in your home directory. (The corpora directories are write-protected, so output has to go elsewhere.) Note that this will overwrite an existing file with the same name without warning, so check first whether you already have a file with the name you're planning to use.
    grep 'to\.' *
searches for 'to' followed by a period. The period has to be 'escaped', i.e. preceded by a backslash, because it has a special meaning in regular expressions.
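    grep 'to[.,;?!-]' *
searches for 'to' followed by any one of the characters in the square brackets. Inside square brackets the period and the question mark are already literal, so they need no backslash (a backslash placed there would itself be treated as one of the characters to match). The dash has to be last so it won't be interpreted as a range. The exclamation mark can't be last.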
    grep "don't" *
searches for the string "don't". When the search string contains a single quote, you have to enclose it in double quotes instead.
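
The grep patterns above can be tried out on a small throwaway file before running them on a real corpus. The following is only a sketch: the temporary directory, the file name sample.corp and its five lines are invented for illustration and are not part of the corpora.

```shell
# Build a tiny practice file (name and contents are made up for this sketch).
demo_dir=$(mktemp -d)
cat > "$demo_dir/sample.corp" <<'EOF'
A realistic plan may well be needed.
This is an artistic effort.
He walked to. Then he stopped.
As far as the weather goes, it may be fine.
I don't know where to go.
EOF

# -c counts the lines containing 'istic' followed by a space.
istic_count=$(grep -c 'istic ' "$demo_dir/sample.corp")

# 'may ', then a run of lower-case letters, then ' be'.
may_be=$(grep 'may [a-z]* be' "$demo_dir/sample.corp")

# Escaped period: only a literal '.' after 'to' counts.
to_escaped=$(grep -c 'to\.' "$demo_dir/sample.corp")

# Unescaped period: any character after 'to' counts ('to.', 'to ', 'top').
to_unescaped=$(grep -c 'to.' "$demo_dir/sample.corp")

echo "lines with 'istic ':      $istic_count"    # 2
echo "line with may ... be:     $may_be"
echo "lines matching 'to\\.':    $to_escaped"     # 1
echo "lines matching 'to.':     $to_unescaped"   # 2

# Remove the practice directory again.
rm -r "$demo_dir"
```

Note that 'may be' with no word in between is not matched by 'may [a-z]* be', because the pattern requires a space on both sides of the (possibly empty) run of letters.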

Examples of usage - egrep

    egrep 'qualit(y|ies)' *
returns lines containing either the string 'quality' or the string 'qualities'. egrep allows full (extended) regular expressions, including disjunctions written with the vertical bar. On current systems, grep -E accepts the same patterns.
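
As a quick check of the disjunction syntax, the sketch below runs the same pattern with grep -E, which takes the same extended regular expressions as egrep. The temporary file and its three lines are made up for illustration.

```shell
# Three invented sample lines in a temporary file.
sample=$(mktemp)
printf '%s\n' 'The quality of mercy' 'Several fine qualities' 'No match here' > "$sample"

# (y|ies) matches either ending, so the first two lines are counted.
alt_count=$(grep -E -c 'qualit(y|ies)' "$sample")
echo "$alt_count"    # 2

rm -f "$sample"
```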

Examples of usage - sgrep

    sgrep 'gander' *
returns contiguous sequences of characters between periods (roughly, "sentences") containing the string 'gander'. A few '.' characters are not counted as sentence-ending periods: those after Dr, Mrs, Mr, Prof and single capital letters A-Z. sgrep uses Perl regular expressions, which give even more flexibility.

Examples of usage - cgrep

    cgrep -4 ' Tom ' *
returns the lines containing ' Tom ' together with 4 lines of context before and after each matching line. Adding the -p option displays only context that belongs to the same paragraph as the match.
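
On a machine where cgrep is not installed, a similar effect can be had with the -C option of GNU and BSD grep. This is a substitute technique, not cgrep itself, and it has no counterpart to the paragraph option -p. The sketch uses an invented five-line file and one line of context rather than four to keep the output short.

```shell
# Five invented lines; the match is in the middle.
sample=$(mktemp)
printf '%s\n' 'line one' 'line two' 'Then Tom spoke' 'line four' 'line five' > "$sample"

# -C 1 prints one line of context before and after each matching line.
context=$(grep -C 1 ' Tom ' "$sample")
echo "$context"

rm -f "$sample"
```

With this input the command prints the matching line plus 'line two' and 'line four', i.e. three lines in total.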

Examples of usage - agrep

    agrep -2 'misspell' *
returns all lines containing strings that match the string 'misspell' with up to 2 errors, where an error is an inserted, deleted or substituted character.