The grep family
The primary Unix tools for searching corpora are the members of the
"grep" family of programs: grep, egrep, sgrep, cgrep, and agrep.
General help on grep is available online. You can also see the manual
pages under Unix/Linux (e.g. while logged into AFS) for detailed
descriptions of these commands. For that, type man followed by the
command name, for example: man grep
Tip: be sure to check out Jeanette Pettibone's grep help.
Examples of usage - grep
grep 'istic ' 072.Indept.55.corp
returns a list of the lines in the file 072.Indept.55.corp that
contain the string 'istic' followed by a space character.
grep -c 'istic ' 072.Indept.55.corp
returns the number of lines (-c for "count") containing at
least one instance of the string 'istic '.
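The two commands above can be tried out on a small throwaway file. This is only a sketch: sample.corp and its contents are hypothetical stand-ins for a real corpus file such as 072.Indept.55.corp.

```shell
# Build a throwaway stand-in for a corpus file (hypothetical content).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'a realistic plan' \
  'the artistic director' \
  'a statistic.' \
  'nothing here' > "$tmpdir/sample.corp"

# Lines containing 'istic' followed by a space.
matches=$(grep 'istic ' "$tmpdir/sample.corp")
printf '%s\n' "$matches"

# Number of such lines (-c counts matching lines, not individual occurrences).
count=$(grep -c 'istic ' "$tmpdir/sample.corp")
echo "$count"

rm -r "$tmpdir"
```

The line 'a statistic.' is not counted, because its 'istic' is followed by a period rather than a space.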
With * in place of the file name, grep searches all files in the
directory and returns the lines containing at least one instance of
the string 'istic'. Adding | more pipes the output through more so
that only one screenful is displayed at a time.
grep 'as far as .* goes' * | more
searches all files in the directory for lines containing the
string 'as far as ' followed by any number of characters followed
by ' goes'. The period stands for 'any character' and the star
stands for 'repeated any number of times, including zero'. See the
manpage for a description of the kinds of regular expressions you
can use.
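As a rough illustration of the '.*' pattern and of searching every file with *, the sketch below sets up two hypothetical files (a.corp and b.corp are made-up names) in a temporary directory. The pipe to more is left out so that the example runs non-interactively.

```shell
# Two small hypothetical corpus files in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'as far as the budget goes, we are fine' \
  'as far as I know' \
  'it goes as far as that' > "$tmpdir/a.corp"
printf '%s\n' 'as far as it goes' > "$tmpdir/b.corp"

# '*' expands to every file in the directory; with more than one file,
# grep prefixes each matching line with the name of the file it came from.
hits=$(cd "$tmpdir" && grep 'as far as .* goes' *)
printf '%s\n' "$hits"

rm -r "$tmpdir"
```

Only lines in which ' goes' comes after 'as far as ' are returned, so 'it goes as far as that' is skipped.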
grep 'may [a-z]* be' 003.Guardn.02.corp > ~/out.txt
returns all lines in which 'may' is followed by a space, a sequence
of lower-case alphabetic characters (roughly a word), another space,
and 'be', and sends the output to the file 'out.txt' in your
home directory. (The corpora directories are write-protected.)
Note that this will overwrite an existing file with the same
name without warning, so check first whether you already have
a file with the name you're planning to use.
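One way to make the "check first" step routine is to test for the output file before redirecting. The following is a minimal sketch, assuming a Bourne-style shell; the file names and contents are hypothetical, and a temporary directory stands in for your home directory.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'it may well be true' 'it may be' 'maybe not' > "$tmpdir/sample.corp"

# Pretend out.txt already holds results from an earlier search.
out="$tmpdir/out.txt"
echo 'earlier results' > "$out"

# Redirect only if the target file does not exist yet.
if [ -e "$out" ]; then
  echo "refusing to overwrite $out" >&2
  result=skipped
else
  grep 'may [a-z]* be' "$tmpdir/sample.corp" > "$out"
  result=written
fi
kept=$(cat "$out")

# With an unused name, the same check lets the search go ahead.
out2="$tmpdir/out2.txt"
[ -e "$out2" ] || grep 'may [a-z]* be' "$tmpdir/sample.corp" > "$out2"
saved=$(cat "$out2")

rm -r "$tmpdir"
```

In Bourne-style shells, set -o noclobber gives similar protection by making > fail when the target file already exists.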
A pattern such as 'to\.' searches for 'to' followed by a period. The
period has to be 'escaped', i.e. preceded by a backslash, because it
otherwise has a special meaning in regular expressions.
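The effect of the backslash can be seen by counting matches with and without it. The sample lines below are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' \
  'where we are going to.' \
  'a tool for the job' \
  'going to town' > "$tmpdir/sample.corp"

# Escaped: only 'to' followed by a literal period.
escaped=$(grep -c 'to\.' "$tmpdir/sample.corp")

# Unescaped: 'to' followed by any character at all ('to.', 'too', 'to ').
unescaped=$(grep -c 'to.' "$tmpdir/sample.corp")

echo "escaped: $escaped, unescaped: $unescaped"
rm -r "$tmpdir"
```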
A pattern with square brackets after 'to' searches for 'to' followed
by any one of the characters in the square brackets. A literal dash
has to be last so it won't be interpreted as part of a range. The
exclamation mark can't be last.
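A bracket expression along these lines might look like the sketch below. The particular set of characters ('!', '.', ',' and '-') and the sample lines are assumptions chosen for illustration, not the original example.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' \
  'to-day' \
  'to, and fro' \
  'go to! he cried' \
  'tomorrow' \
  'where to?' \
  'that is where we went to.' > "$tmpdir/sample.corp"

# 'to' followed by any ONE of ! . , or -
# The dash comes last so it is taken literally rather than as a range,
# and the exclamation mark is kept away from the end of the list.
count=$(grep -c 'to[!.,-]' "$tmpdir/sample.corp")
echo "$count"

rm -r "$tmpdir"
```

Inside square brackets the period is an ordinary character, so it needs no backslash there.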
When you search for a single quote, you have to use double
quotes on the outside of the pattern.
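For instance, a search for a contraction could be written as in this sketch (the sample lines are hypothetical):

```shell
tmpdir=$(mktemp -d)
printf '%s\n' "I don't know" 'I do not know' "they didn't go" > "$tmpdir/sample.corp"

# The pattern contains a single quote, so it is wrapped in double quotes.
with_dont=$(grep -c "don't" "$tmpdir/sample.corp")

# Any line containing a single quote at all.
any_quote=$(grep -c "'" "$tmpdir/sample.corp")

echo "don't: $with_dont, any single quote: $any_quote"
rm -r "$tmpdir"
```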
Examples of usage - egrep
A pattern such as 'quality|qualities' returns lines containing either
the string 'quality' or the string 'qualities'. egrep allows full
(extended) regular expressions, including disjunctions.
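A disjunction of this kind can be sketched as follows. The example uses grep -E, the POSIX spelling that accepts the same extended regular expressions as egrep (recent GNU versions treat the egrep name as obsolescent); the sample lines are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' \
  'high quality goods' \
  'many fine qualities' \
  'a qualified yes' > "$tmpdir/sample.corp"

# '|' separates the alternatives.
either=$(grep -E -c 'quality|qualities' "$tmpdir/sample.corp")

# The same search with the shared prefix factored out.
grouped=$(grep -E -c 'qualit(y|ies)' "$tmpdir/sample.corp")

echo "$either $grouped"
rm -r "$tmpdir"
```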
Examples of usage - sgrep
Given the pattern 'gander', sgrep returns contiguous sequences of
characters between periods (roughly, "sentences") containing the
string 'gander'. A few '.' characters are not counted as periods:
those after Dr, Mrs, Mr, Prof, and A-Z.
sgrep uses Perl regular expressions, which give even more flexibility.
Examples of usage - cgrep
With a context size of 4, cgrep returns 4 lines of context before and
after each matching line. The -p option displays only context that
belongs to the current paragraph.
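Where cgrep is not installed, a similar effect can be approximated with the -C option of GNU and BSD grep. This is a swapped-in technique, not cgrep itself: -C is not part of POSIX grep, and the sketch makes no attempt to reproduce the paragraph-aware -p behaviour. The sample file is hypothetical.

```shell
# Twelve numbered lines, with the search term on line 7.
tmpdir=$(mktemp -d)
i=1
while [ "$i" -le 12 ]; do
  if [ "$i" -eq 7 ]; then
    echo "line $i gander"
  else
    echo "line $i"
  fi
  i=$((i + 1))
done > "$tmpdir/sample.corp"

# -C 4 prints up to 4 lines before and after each matching line.
context=$(grep -C 4 'gander' "$tmpdir/sample.corp")
printf '%s\n' "$context"

rm -r "$tmpdir"
```

Here the output runs from 'line 3' through 'line 11', nine lines in all.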
Examples of usage - agrep
With an error limit of 2, agrep returns all lines containing strings
that match the string 'misspell' with up to 2 errors.