Corpora@Stanford

Corpus Tools

This page contains a list of programs and scripts that you may find useful in your corpus-based research. It is by no means complete! If you need further advice, please feel free to contact the Corpus TA for help.

Overview

This page is primarily intended as an inventory of "low-level" tools, i.e. tools that help you with basic corpus tasks such as searching a corpus, creating subcorpora, sorting and saving your results, and getting frequency lists. We have focused on programs that are locally available, but the page also links to some scripts available on the net. Please be aware that there is simply no way to keep this list complete and up to date. If you do not find what you are looking for on this page, consider one of the following steps: ask the Corpus TA for help, see whether the particular corpus you're interested in comes with search tools of its own, search on Google for resources we've missed on this page, or write your own search tool! Also, keep in mind that many of the computational projects at PARC or CSLI work with corpora and may have their own specialized tools.
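
If you do end up writing your own tool, a frequency list is a common first step. The following is only a minimal sketch in Python, not one of the tools listed on this page. The function name `frequency_list` and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions made for illustration; adjust them to the format of your corpus.

```python
import re
from collections import Counter


def frequency_list(text, top=None):
    """Return (word, count) pairs, most frequent first."""
    # Assumption: a "word" is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # most_common(None) returns every word; pass an integer to truncate.
    return Counter(words).most_common(top)


# Made-up sample text, used only to show the call.
sample = "The cat saw the dog. The dog ran."
for word, count in frequency_list(sample, top=3):
    print(f"{count}\t{word}")
```

To run this on a real corpus file, read the file contents into a string and pass that string to `frequency_list`.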

This page has three parts:

Locally available corpus-software
Other software (useful but not specifically designed for corpora)
Scripts and little helpers

Locally available corpus-software

The table below summarizes the software tools available at Stanford. The first column gives the name of the software, the second a short description of the tool, and the third all locations where the software is installed. The fourth column specifies whether the software requires the corpus to be in a special format ('special') or whether the tool works with any format ('any'). For more detail, see the introduction page for the tool, which should also state where you can find the specific corpora if needed (or ask the Corpus TA). The fifth column lists links to manuals, if available. The sixth column lists links to tutorials and, where available, summaries and introductions written specifically for Stanford. Click on the tutorial link (if available) to learn more about the specific software. These introductions pay particular attention to the local hardware setup and tell you how to prepare your account (if necessary) to use the respective tool.

Jan Strunk's code to perform linguistic searches using Google.
Overview of search tools for syntactically annotated corpora (from the 02/06/2004 tutorial), available as a PDF and as an interactive HTML presentation.

Name	Description	Where	Format	External Info	Internal Info
Stuttgart Corpus Workbench (CQP, XKWIC)	Regular expression searches, sorting, frequencies, subcorpora	Turing	special	yes	yes
Gsearch	Tag and word searches; syntactic searches with self-defined grammar	AFS	special		yes
Proposition Bank Java API (Scott Cotton)	browse and annotate Proposition Bank corpus	AFS	special	yes	yes
CorpusSearch Lite [v1.1]	search corpora in the Penn Treebank format. It is not corpus specific, but will work on any corpus in the correct format. It can be used to search any of the English Parsed Corpora series.	AFS	special	yes	-
TIGERsearch [v2.1]	searches or browses syntactically & POS-tagged corpora; graphical user interface; graphic tree display	AFS	special	yes	yes
TGrep & TGrep2 [v1 and v2]	searches syntactically & POS-tagged corpora	AFS	special	yes	yes
The grep-family: grep, egrep, sgrep, cgrep, agrep	non-syntactic regular expression searches of text-files	AFS	any	yes	yes
UNIX commands: wc, freq, cat	word, line, and character counts; frequency lists; concatenation of text files	AFS	any	-	-
Thorsten Brants's part-of-speech tagger (TnT)	POS tagging; preparation of corpora	AFS	any		yes
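
For readers who have not used the grep family before, the kind of non-syntactic regular expression search listed in the table can be sketched in a few lines of Python. This is an illustration only and does not reproduce the behaviour of any tool above. The function name `kwic`, the context width, and the sample sentences are assumptions.

```python
import re


def kwic(lines, pattern, width=30):
    """Yield (line_number, left_context, match, right_context) for each regex hit."""
    regex = re.compile(pattern)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        for m in regex.finditer(line):
            # Keep up to `width` characters of context on either side of the hit.
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            yield number, left, m.group(0), right


# Made-up sample lines; any iterable of strings works, including an open file.
corpus = [
    "She was walking to the station.",
    "They walked home and talked all night.",
]
for number, left, hit, right in kwic(corpus, r"\bwalk\w*"):
    print(f"{number:>4}  {left:>30} [{hit}] {right}")
```

Because `kwic` accepts any iterable of lines, an open text file can be passed in place of the `corpus` list.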

Other software

Below is a list of software that is not specifically designed for corpus-based research but is nevertheless often useful, together with where you can find it. For example, you may work with sound files, annotation, and transcription, or simply need a good editor that can open different file formats (e.g. to search in those files). The list is by no means complete, so ask your colleagues, fellow students, or the Corpus TA for more leads.

Name	Description	Where	Manual
Xwaves	sound file player; manipulation; reads annotated files	Phonetics Lab
Praat	sound file player; phonetic analysis; manipulation; reads/creates annotated files	Phonetics Lab	guide, tutorials

Scripts and little helpers

If you find any useful scripts that are not listed here, please let us know and, if possible, send along a short description of the script (2-5 lines). This will help others a lot.

Brett Kessler's Search
From Brett Kessler's webpage: This program searches text corpora for arbitrary regular expressions and produces a report in HTML format. It can read local files, or those available by HTTP or FTP, and it knows how to unpack ZIP files. It requires Perl 5, and the following network modules: Net::FTP, LWP::Simple, and LWP::UserAgent.
Chris Manning's sgrep.prl script
Based on an earlier version by Tom Veatch. Performs whole-sentence matching on newswire corpora.
Chris Manning's extractbody.prl script
Extracts a particular SGML element (given stand-off annotation) and can thus be used to prefilter LDC newswire.
Jason Brenier's ExtractUnitAcoustics script
On AFS at /afs/ir/data/linguistic-data/Switchboard/swbd-tools/ExtractUnitAcoustics
Extracts acoustic features from Switchboard data. See 00README.TXT file in the directory for instructions.
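
To give a rough idea of what prefiltering SGML newswire involves, here is a minimal Python sketch that pulls the body of a named element out of a document. It is not Chris Manning's extractbody.prl and it ignores stand-off annotation entirely. The function name `extract_element`, the element names, and the sample document are assumptions, and a regular expression is a simplification that only works when the element is not nested inside itself.

```python
import re


def extract_element(sgml, tag):
    """Return the stripped text content of every <tag>...</tag> element."""
    name = re.escape(tag)
    # Non-greedy match across lines; \b keeps <DOC from matching <DOCNO>.
    pattern = re.compile(
        rf"<{name}\b[^>]*>(.*?)</{name}>",
        re.DOTALL | re.IGNORECASE,
    )
    return [body.strip() for body in pattern.findall(sgml)]


# Made-up sample in a newswire-like layout (an assumption, not real corpus data).
sample = """<DOC>
<DOCNO> EX-0001 </DOCNO>
<TEXT>
First story body.
</TEXT>
</DOC>
<DOC>
<DOCNO> EX-0002 </DOCNO>
<TEXT>
Second story body.
</TEXT>
</DOC>"""

print(extract_element(sample, "TEXT"))
```

For corpora with nested or irregular markup, a proper SGML parser is the safer choice.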
