The grep family

The primary Unix tools for searching corpora are the members of the "grep" family of programs. The ones covered below are grep, egrep, sgrep, cgrep, and agrep.

General help on grep is available online. You can also read the manual pages under Unix/Linux (e.g. while logged into AFS) for detailed descriptions of these commands. To do so, type man followed by the command name (e.g. man grep).


Tip: be sure to check out Jeanette Pettibone's grep help.

Examples of usage - grep

    grep 'istic ' 072.Indept.55.corp
returns a list of the lines in the file 072.Indept.55.corp that contain the string 'istic' followed by a space character.
    grep -c 'istic ' 072.Indept.55.corp
returns the number of lines (-c for "count") containing at least one instance of the string 'istic '.
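If you want to try these two commands without access to the corpus files, the following is a minimal sketch. The file sample.corp and its three lines are made up for illustration and are not part of the corpora.

```shell
# Build a tiny stand-in corpus in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
printf '%s\n' \
  'a realistic plan was drawn up' \
  'the statistics were published' \
  'an optimistic view prevailed' > "$workdir/sample.corp"

# Lines containing 'istic' followed by a space.
# 'statistics' does not match: 'istic' is followed by 's', not a space.
matches=$(grep 'istic ' "$workdir/sample.corp")
printf '%s\n' "$matches"

# -c prints the number of matching lines instead of the lines themselves.
count=$(grep -c 'istic ' "$workdir/sample.corp")
printf 'matching lines: %s\n' "$count"

# Clean up the temporary directory.
rm -r "$workdir"
```

Note that -c counts lines, not occurrences: a line with two instances of the string still counts once.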
    grep 'istic' *
searches all files in the current directory and returns the lines containing at least one instance of the string 'istic'.
    grep 'istic' * | more
pipes the output through 'more' so that only one screenful is displayed at a time.
    grep 'as far as .* goes' * | more
searches all files in the current directory for lines containing the string 'as far as ' followed by any number of characters followed by ' goes'. The period stands for 'any single character' and the star stands for 'zero or more repetitions of the preceding item'; see the manpage for a description of the kinds of regular expressions you can use.
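The behaviour of '.*' can be checked on a small made-up file, as in the sketch below. The file name and sentences are hypothetical, and the -o option is an assumption about your grep: GNU and BSD grep support it, but POSIX does not require it.

```shell
# Hypothetical three-line test file in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'as far as the budget goes, we are fine' \
  'as far as I know, nobody objected' \
  'it goes as far as the river' > "$workdir/sample.corp"

# '.*' matches any run of characters on the same line, so only lines
# with 'as far as ' somewhere before ' goes' are returned.
hits=$(grep 'as far as .* goes' "$workdir/sample.corp")
printf '%s\n' "$hits"

# -o prints only the part of the line that matched the pattern.
matched=$(grep -o 'as far as .* goes' "$workdir/sample.corp")
printf '%s\n' "$matched"

rm -r "$workdir"
```

The third line is not returned because ' goes' appears before 'as far as ', not after it.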
    grep 'may [a-z]* be' 003.Guardn.02.corp > ~/out.txt
returns all lines containing 'may', a space, a sequence of zero or more lower-case letters (roughly a word), a space, and 'be', and sends the output to the file 'out.txt' in your home directory. (The corpora directories are write-protected.) Note that this overwrites any existing file with the same name without warning, so check first whether you already have a file with the name you're planning to use.
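One way to guard against the overwrite problem is to test for the output file before redirecting. The sketch below assumes an sh-compatible shell (such as bash) and uses a made-up input file and a temporary output location rather than your home directory.

```shell
# Hypothetical input file and output path in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'it may well be true' 'it may be late' 'maybe not' > "$workdir/sample.corp"
out="$workdir/out.txt"

# Only run the search if the output file does not exist yet.
if [ -e "$out" ]; then
  printf 'refusing to overwrite %s\n' "$out" >&2
else
  grep 'may [a-z]* be' "$workdir/sample.corp" > "$out"
fi

# '>>' appends to an existing file instead of replacing it.
grep -c 'may [a-z]* be' "$workdir/sample.corp" >> "$out"

saved=$(cat "$out")
printf '%s\n' "$saved"

rm -r "$workdir"
```

Only 'it may well be true' matches here: the pattern requires a space on both sides of the middle word, so 'may be' with a single space is not found. In sh-compatible shells, running set -C beforehand also makes '>' refuse to overwrite an existing file.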
    grep 'to\.' *
searches for 'to' followed by a period. The period has to be 'escaped', i.e. preceded by a backslash, because it has a special meaning in regular expressions.
    grep 'to[\.,;\?!-]' *
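searches for 'to' followed by any of the characters in the square brackets. The dash has to be last so it won't be interpreted as a range. The exclamation mark can't be last. Inside square brackets the period and the question mark are already literal, so the backslashes are not needed there; in POSIX-style grep a backslash inside brackets is itself treated as one of the characters to match.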
    grep "don't" *
When you search for a string containing a single quote, use double quotes on the outside.
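The escaping and quoting rules from the last three examples can be compared side by side on a small made-up file. This is only a sketch: the file contents are hypothetical, and the bracket expression is written without backslashes, which is an alternative spelling rather than the form used above.

```shell
# Hypothetical four-line test file in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'where she went to.' \
  'going to, or coming from' \
  'the road to town' \
  "they don't know" > "$workdir/sample.corp"

# An escaped period matches only a literal '.'.
literal_dot=$(grep -c 'to\.' "$workdir/sample.corp")

# An unescaped period matches any character, so 'to,' and 'to ' match too.
any_char=$(grep -c 'to.' "$workdir/sample.corp")

# Inside brackets, '.' and '?' are literal without backslashes; '-' is kept last.
punct=$(grep -c 'to[.,;?!-]' "$workdir/sample.corp")

# Double quotes on the outside let the pattern contain a single quote.
apostrophe=$(grep -c "don't" "$workdir/sample.corp")

# Print the four line counts in the order computed above.
printf '%s %s %s %s\n' "$literal_dot" "$any_char" "$punct" "$apostrophe"

rm -r "$workdir"
```

With this data the counts are 1, 3, 2 and 1, which shows how much broader the unescaped period is than the escaped one.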

Examples of usage - egrep

    egrep 'qualit(y|ies)' *
returns lines containing either the string 'quality' or the string 'qualities'. egrep accepts extended regular expressions, which include disjunctions written with '|'; it is equivalent to grep -E.
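The difference between basic and extended regular expressions can be seen by running the same pattern with and without -E. The sketch below uses grep -E, the POSIX spelling of egrep, on a made-up file; some recent grep releases print a warning when the egrep name is used.

```shell
# Hypothetical three-line test file in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'the quality of the recording' \
  'several qualities were listed' \
  'a qualitative study' > "$workdir/sample.corp"

# -E selects extended regular expressions, where ( | ) express alternatives.
both=$(grep -E -c 'qualit(y|ies)' "$workdir/sample.corp")

# Without -E the parentheses and bar are ordinary characters, so nothing
# matches. grep exits with status 1 when there is no match, hence '|| true'.
neither=$(grep -c 'qualit(y|ies)' "$workdir/sample.corp" || true)

printf 'with -E: %s, without -E: %s\n' "$both" "$neither"

rm -r "$workdir"
```

'qualitative' is not counted because 'qualit' is followed there by neither 'y' nor 'ies'.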

Examples of usage - sgrep

    sgrep 'gander' *
returns contiguous sequences of characters between periods (roughly, "sentences") containing the string 'gander'. A few '.' characters are not counted as periods: those after Dr, Mrs, Mr, Prof, and A-Z. sgrep uses Perl regular expressions, which give even more flexibility.

Examples of usage - cgrep

    cgrep -4 ' Tom ' *
returns each matching line together with 4 lines of context before and after it. The -p option displays only context that belongs to the current paragraph.
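On systems where cgrep is not installed, a similar effect can usually be had with the -C option of grep itself. This is an assumption about your grep: GNU and BSD grep support -C, but POSIX does not require it, and it has no counterpart to the paragraph option -p. The file below is made up for illustration.

```shell
# Hypothetical five-line test file in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'line one' \
  'line two' \
  'then Tom arrived' \
  'line four' \
  'line five' > "$workdir/sample.corp"

# -C 1 prints one line of context before and after each matching line.
context=$(grep -C 1 ' Tom ' "$workdir/sample.corp")
printf '%s\n' "$context"

rm -r "$workdir"
```

Replacing -C 1 with -C 4 gives four lines of context on each side, like the cgrep example above.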

Examples of usage - agrep

    agrep -2 'misspell' *
returns all lines containing strings that match the string 'misspell' with up to 2 errors, where an error is an inserted, deleted, or substituted character.