The grep-family

The primary Unix tools for searching corpora are the members of the "grep" family of programs:

grep
egrep
sgrep
cgrep
agrep
tgrep (a separate tutorial is available)

General help on grep is available online. You can also read the manual pages under Unix/Linux (e.g. while logged into AFS) for detailed descriptions of these commands. To do so, type man followed by the name of the command, for example man grep.

Tip: be sure to check out Jeanette Pettibone's grep help.

Examples of usage - grep

grep 'istic ' 072.Indept.55.corp returns a list of the lines in the file 072.Indept.55.corp that contain the string 'istic' followed by a space character.

grep -c 'istic ' 072.Indept.55.corp returns the number of lines (-c for "count") containing at least one instance of the string 'istic '.
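The difference between listing and counting can be tried on a throwaway file. This is a minimal sketch: the file sample.corp and its three lines are made up for illustration and are not taken from any corpus.

```shell
# Build a tiny stand-in corpus file (hypothetical content).
workdir=$(mktemp -d)
printf '%s\n' \
  'a realistic estimate' \
  'the statistics were optimistic and simplistic ' \
  'nothing relevant here' > "$workdir/sample.corp"

# The lines that contain 'istic' followed by a space.
matches=$(grep 'istic ' "$workdir/sample.corp")

# The number of such lines.
count=$(grep -c 'istic ' "$workdir/sample.corp")

echo "$matches"
echo "$count"   # prints 2
```

The second line contains two hits ('optimistic ' and 'simplistic ') but is counted once, because -c counts matching lines, not individual occurrences. 'statistics' does not match, since 'istic' is followed by 's' there and not by a space.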

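grep 'istic' * searches all files in the current directory and returns the lines containing at least one instance of the string 'istic'. When more than one file is searched, each line is prefixed with the name of the file it came from.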

grep 'istic' * | more pipes the output through more so that only one screenful is displayed at a time.

grep 'as far as .* goes' * | more searches all files in the current directory for lines containing the string 'as far as ' followed by any number of characters followed by ' goes'. The period stands for 'any character' and the star stands for 'any number of times, including zero'. See the manpage for a description of the kinds of regular expressions you can use.
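To see what '.*' does and does not require, here is a small sketch on invented sample lines (the file name sample.txt is hypothetical):

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'as far as the budget goes, we are fine' \
  'as far as I can tell' \
  'as far as  goes' > "$workdir/sample.txt"

# Count the lines matching 'as far as ' + anything + ' goes'.
hits=$(grep -c 'as far as .* goes' "$workdir/sample.txt")
echo "$hits"   # prints 2
```

The first line matches with 'the budget' filling the gap. The third line also matches, because '.*' is allowed to match nothing at all, leaving just the two spaces. The second line has no ' goes' and is skipped.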

grep 'may [a-z]* be' 003.Guardn.02.corp > ~/out.txt returns all lines containing 'may' followed by a space, a sequence of lower-case alphabetic characters (roughly a word), another space and 'be', and sends the output to the file 'out.txt' in your home directory. (The corpora directories are write-protected, so output has to go somewhere else.) Note that this will overwrite an existing file with the same name without warning, so check first whether you already have a file with the name you're planning to use.
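One way to avoid overwriting a file by accident is to test for it before redirecting. The following is only a sketch under assumptions: the sample lines are invented, and the output goes to a temporary directory rather than to your home directory.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'it may well be true' \
  'it may be false' \
  'maybe not' > "$workdir/sample.corp"

out="$workdir/out.txt"

# Only redirect if the output file does not exist yet.
if [ -e "$out" ]; then
  echo "refusing to overwrite $out" >&2
else
  grep 'may [a-z]* be' "$workdir/sample.corp" > "$out"
fi

cat "$out"
```

Only 'it may well be true' ends up in the output file. 'it may be false' is not matched, because the pattern asks for two spaces (one before and one after the word in the middle) and 'may be' has only one.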

grep 'to\.' * searches for 'to' followed by a period. The period has to be 'escaped', i.e. preceded by a backslash, because it has a special meaning in regular expressions.

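grep 'to[.,;?!-]' * searches for 'to' followed by any of the characters in the square brackets. Inside square brackets the period and the question mark are already literal, so they need no backslash (a backslash placed there is itself treated as one of the characters to match). The dash has to be last so it won't be interpreted as a range. The exclamation mark can't be last.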
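The effect of a backslash inside square brackets can be checked with a short experiment. This sketch assumes a POSIX-conforming grep (GNU grep behaves this way); the sample lines are made up.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'where to?' \
  'a place to, and from' \
  'to be or not' \
  'C:\to\do' > "$workdir/sample.txt"

# Bracket expression without backslashes.
clean=$(grep -c 'to[.,;?!-]' "$workdir/sample.txt")

# Same set with backslashes: the backslash becomes a member of the set.
with_backslash=$(grep -c 'to[\.,;\?!-]' "$workdir/sample.txt")

echo "$clean"            # prints 2
echo "$with_backslash"   # prints 3
```

The extra hit in the second count is the line 'C:\to\do', where 'to' is followed by a backslash.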

grep "don't" * searches for the string don't. When the string you search for contains a single quote, you have to use double quotes on the outside.

Examples of usage - egrep

egrep 'qualit(y|ies)' * returns lines containing either the string 'quality' or the string 'qualities'. egrep (equivalent to grep -E) allows extended regular expressions, including disjunctions.
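The sketch below compares the same pattern under grep -E and under plain grep. It uses grep -E in place of egrep on the assumption that the two behave identically on your system; the sample lines are invented.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'high quality' \
  'fine qualities' \
  'qualified staff' > "$workdir/sample.txt"

# Extended syntax: ( | ) form a disjunction.
n=$(grep -E -c 'qualit(y|ies)' "$workdir/sample.txt")

# Basic syntax: ( | ) are ordinary characters, so nothing matches.
# '|| true' keeps the zero-match exit status from stopping a strict script.
plain=$(grep -c 'qualit(y|ies)' "$workdir/sample.txt" || true)

echo "$n"       # prints 2
echo "$plain"   # prints 0
```

'qualified staff' is not matched in either case, since 'qualif' continues with neither 'y' nor 'ies'.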

Examples of usage - sgrep

sgrep 'gander' * returns contiguous sequences of characters between periods (roughly, "sentences") containing the string 'gander'. A few '.' characters are not counted as periods: those after Dr, Mrs, Mr, Prof, A-Z. sgrep uses Perl regular expressions, which give even more flexibility.

Examples of usage - cgrep

cgrep -4 ' Tom ' * returns the matching lines together with 4 lines of context before and after them. The -p option displays only context that belongs to the current paragraph.
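If cgrep is not available on your machine, GNU and BSD grep offer a similar (not identical) context feature through the -C option. Note that this is a substitute technique, not cgrep itself, and it has no counterpart to -p. A sketch on invented sample lines, with one line of context instead of four:

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'line one' \
  'line two' \
  'then Tom arrived' \
  'line four' \
  'line five' > "$workdir/sample.txt"

# One line of context before and after each matching line.
context=$(grep -C 1 ' Tom ' "$workdir/sample.txt")
echo "$context"
```

This prints 'line two', 'then Tom arrived' and 'line four'.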

Examples of usage - agrep

agrep -2 'misspell' * returns all lines containing strings that match the string 'misspell' with up to 2 errors.