The grep family
The primary Unix tools for searching corpora are the members of the
"grep" family of programs: grep, egrep, sgrep, cgrep and agrep.
General help on grep is available online. You can
also see the manual pages under Unix/Linux (e.g. while logged into AFS)
for detailed descriptions of these commands. For that, type man followed
by the name of the command, e.g.:
man grep
Tip: be sure to check out Jeanette Pettibone's grep help.
Examples of usage - grep
grep 'istic ' 072.Indept.55.corp
returns a list of the lines in the file 072.Indept.55.corp that contain
the string 'istic' followed by a space character.
grep -c 'istic ' 072.Indept.55.corp
returns the number of lines (-c for "count") containing at least one
instance of the string 'istic '.
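The two commands above can be tried out on a small file before running them on a corpus. The following is a minimal sketch; the temporary directory, the file name sample.txt and its three lines are made up for illustration and are not part of any corpus.

```shell
# Hypothetical three-line file standing in for a corpus file.
tmpdir=$(mktemp -d)
printf '%s\n' 'a realistic plan' 'an artistic touch' 'statistics class' > "$tmpdir/sample.txt"

# Lines containing 'istic' followed by a space.
matches=$(grep 'istic ' "$tmpdir/sample.txt")
printf '%s\n' "$matches"

# Number of such lines. 'statistics class' is not counted,
# because there 'istic' is followed by 's', not by a space.
count=$(grep -c 'istic ' "$tmpdir/sample.txt")
echo "$count"

rm -r "$tmpdir"
```

Note that -c counts matching lines, not matches: a line with two instances of 'istic ' still counts once.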
grep 'istic' *
searches all files in the current directory and returns the lines
containing at least one instance of the string 'istic'.
grep 'istic' * | more
pipes the output through more so that only one screenful is
displayed at a time.
grep 'as far as .* goes' * | more
searches all files in the directory for lines containing the string 'as
far as ' followed by any number of characters followed by ' goes'. The
period stands for 'any character' and the star stands for 'zero or more
repetitions of the preceding item'. See the manpage for a description of
the kinds of regular expressions you can use.
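To see what the '.*' in the middle of the pattern does, here is a small sketch on made-up data. The file name and its contents are hypothetical.

```shell
# Hypothetical sample lines.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'as far as that goes' \
    'as far as the plan really goes' \
    'as far as I know' > "$tmpdir/sample.txt"

# '.*' matches any run of characters, so both a single word and
# a longer phrase between 'as far as ' and ' goes' are accepted.
hits=$(grep -c 'as far as .* goes' "$tmpdir/sample.txt")
echo "$hits"

rm -r "$tmpdir"
```

The third line is not matched because it never reaches ' goes'.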
grep 'may [a-z]* be' 003.Guardn.02.corp > ~/out.txt
returns all lines containing 'may' followed by a sequence of lower-case
alphabetic characters between spaces (roughly a word) followed by 'be'
and sends the output to the file 'out.txt' in your home directory. (The
corpora directories are write-protected.) Note that '>' overwrites an
existing file with the same name without warning, so check first
whether you already have a file with the name you're planning to use.
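One way to guard against overwriting an existing output file is to test for it before redirecting. This is only a sketch: the sample file and the output path below are placeholders in a temporary directory, not the corpus file or ~/out.txt.

```shell
# Hypothetical sample file and output location.
tmpdir=$(mktemp -d)
printf '%s\n' 'it may well be true' 'it may be true' 'maybe not' > "$tmpdir/sample.txt"
out="$tmpdir/out.txt"

# Only redirect if the output file does not exist yet.
if [ -e "$out" ]; then
    echo "refusing to overwrite $out" >&2
else
    grep 'may [a-z]* be' "$tmpdir/sample.txt" > "$out"
fi

saved=$(cat "$out")
echo "$saved"

rm -r "$tmpdir"
```

Only 'it may well be true' is saved. 'it may be true' does not match, because the pattern requires a space on both sides of the (possibly empty) word. As an alternative, running set -o noclobber in the shell makes '>' refuse to overwrite existing files.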
A pattern such as 'to\.' searches for 'to' followed by a period. The
period has to be 'escaped', i.e. preceded by a backslash, because it has
a special meaning in regular expressions.
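The difference between an escaped and an unescaped period can be checked on a small made-up file. The file name and its lines are hypothetical.

```shell
# Hypothetical sample lines.
tmpdir=$(mktemp -d)
printf '%s\n' 'where to.' 'go to town' 'tool' > "$tmpdir/sample.txt"

# Escaped: only a literal period after 'to' matches.
escaped=$(grep -c 'to\.' "$tmpdir/sample.txt")

# Unescaped: the period matches any character, so 'to ' and 'too' match as well.
unescaped=$(grep -c 'to.' "$tmpdir/sample.txt")

echo "$escaped $unescaped"

rm -r "$tmpdir"
```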
Putting a set of characters in square brackets after 'to' searches for
'to' followed by any one of those characters. Within the brackets, the
dash has to be last so it won't be interpreted as a range. The
exclamation mark can't be last.
When you search for a single quote, you have to use double quotes on
the outside of the pattern.
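A short sketch of the quoting rule, using a made-up file and the pattern dog's as an example:

```shell
# Hypothetical sample lines.
tmpdir=$(mktemp -d)
printf '%s\n' "the dog's bone" 'the dogs bone' > "$tmpdir/sample.txt"

# The pattern contains a single quote, so it is wrapped in double quotes.
apostrophe=$(grep -c "dog's" "$tmpdir/sample.txt")
echo "$apostrophe"

rm -r "$tmpdir"
```

Inside double quotes the shell still interprets characters such as $ and the backquote, so patterns containing those need extra care.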
Examples of usage - egrep
A pattern with a disjunction, e.g. 'quality|qualities', returns lines
containing either the string 'quality' or the string 'qualities'. egrep
allows extended regular expressions, including disjunctions.
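The disjunction can be tried out as follows. The sketch uses grep -E, which is the POSIX spelling of egrep and accepts the same extended regular expressions. The sample file and the particular pattern 'qualit(y|ies)' are assumptions made for the example.

```shell
# Hypothetical sample lines.
tmpdir=$(mktemp -d)
printf '%s\n' 'high quality' 'many qualities' 'they qualify' > "$tmpdir/sample.txt"

# (y|ies) is a disjunction: either ending is accepted after 'qualit'.
either=$(grep -E -c 'qualit(y|ies)' "$tmpdir/sample.txt")
echo "$either"

rm -r "$tmpdir"
```

'they qualify' is not matched, since 'qualif' does not continue with 'y' or 'ies' directly after 'qualit'.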
Examples of usage - sgrep
A search for 'gander' with sgrep returns contiguous sequences of
characters between periods (roughly, "sentences") containing the string
'gander'. A few '.' characters are not counted as periods: those after
Dr, Mrs, Mr, Prof, A-Z. sgrep uses Perl regular expressions, which give
even more flexibility.
Examples of usage - cgrep
With a context size of 4, cgrep returns 4 lines of context before and
after the matching lines. The -p option displays only context that
belongs to the current paragraph.
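Where cgrep is not installed, a similar effect can be approximated with the -C option of grep, which GNU and BSD versions provide (it is not part of POSIX, and it has no equivalent of cgrep's -p). The sketch below is not cgrep itself, and the twelve-line file is made up.

```shell
# Hypothetical file with twelve numbered lines.
tmpdir=$(mktemp -d)
i=1
while [ "$i" -le 12 ]; do
    echo "line $i"
    i=$((i + 1))
done > "$tmpdir/sample.txt"

# Show the matching line plus 4 lines of context before and after it.
context=$(grep -C 4 'line 6$' "$tmpdir/sample.txt")
printf '%s\n' "$context"

# Count the lines printed: lines 2 to 10.
ctx_lines=$(printf '%s\n' "$context" | grep -c '')
echo "$ctx_lines"

rm -r "$tmpdir"
```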
Examples of usage - agrep
With the error limit set to 2, a search for 'misspell' returns all lines
containing strings that match the string 'misspell' with up to 2 errors.