Corpus-tools & other useful software

Corpora@Stanford

Overview

This page lists programs and scripts that you may find useful in your corpus-related research. It is primarily intended as an inventory of "low-level" tools, i.e. tools that help you with basic corpus tasks such as searching a corpus, creating subcorpora, sorting your results, saving your results, getting frequency lists, etc. We have focused on programs that are locally available, but some links to scripts available on the net can also be found. Please be aware that there is simply no way to keep this list complete and up-to-date. If you do not find what you are looking for on this page, consider one of the following steps: ask the Corpus TA for help, or look at our favorites, the top 10 info-sources, for further information. Also, keep in mind that many of the computational projects at PARC or CSLI work with corpora and may have their own specialized tools.

This page has three parts: locally available corpus software, other software, and scripts and little helpers.

Note that this page also contains links to tutorials and manuals.

Locally available corpus-software

The table below summarizes the software tools available at Stanford. Its columns are:

  • Name: the name of the software.
  • Description: a short description of the tool.
  • Where: all locations (AFS, CC a.k.a. Corpus PC, etc.) where the software is installed.
  • Format: whether the software requires the corpus to be in a special format ('special') or works with any format ('any'). For more detail, see the introduction page for the tool, which should also state where you can find the specific corpora if needed (or ask the Corpus TA).
  • Info (external): links to manuals, if available. Most of them, along with additional materials, are also available in printed form at the corpus computer in the computer cluster and as files on the corpus computer.
  • Intro (internal): links to tutorials and, if available, summaries/introductions written specifically for you as an audience.

Click on the tutorial link (if available) to learn more about a specific tool. These introductions pay particular attention to the local hardware setup and tell you how to prepare your account (if necessary) to use the respective tool.

  • NEW: Jan Strunk's presentation on the new online search engine from the University of Maryland, The Linguist's Search Engine. [ppt] - the presentation contains many useful links and a quick overview (from the 02/10/2004 meeting).
  • Also, check out Jan's page, which contains a link to another tool he developed for doing enhanced searches on Google.
  • NEW: Overview of search tools for syntactically annotated corpora (from the 02/06/2004 tutorial) available as pdf and as interactive html presentation.
| Name | Description | Where | Format | Info (external) | Intro (internal) |
|---|---|---|---|---|---|
| Stuttgart Corpus Workbench (CQP, XKWIC) | Regular expression searches, sorting, frequencies, subcorpora | Turing | special | yes | yes |
| Gsearch | Tag and word searches; syntactic searches with self-defined grammar | AFS | special | yes | |
| Proposition Bank Java API (Scott Cotton) | browse and annotate Proposition Bank corpus | AFS | special | yes | yes |
| COSMAS II Online Client | searches huge German online corpus COSMAS II | CC | special | yes | - |
| WordSmith | searches plain & tagged corpora; word & frequency lists, etc. | CC | any/special | yes | - |
| MonoConc [v2.0 Pro] | searches plain & tagged corpora; frequency lists, etc. | CC | special | tour, manual | - |
| CorpusSearch Lite [v1.1] | search corpora in the Penn Treebank format. It is not corpus specific, but will work on any corpus in the correct format. It can be used to search any of the English Parsed Corpora series. | AFS | special | yes | - |
| TIGERsearch [v2.1] | searches or browses syntactically & POS-tagged corpora; graphical user interface; graphic tree display | CC, AFS | special | yes | yes |
| TGrep & TGrep2 [v1 and v2] | searches syntactically & POS-tagged corpora | AFS | special | yes | yes |
| The grep-family: grep, egrep, sgrep, cgrep, agrep | non-syntactic regular expression searches of text-files | AFS | any | yes | yes |
| UNIX commands: wc, freq, cat | non-syntactic regular expression searches of text-files | AFS | any | - | - |
| Thorsten Brants's part-of-speech tagger (TnT) | POS tagging; preparation of corpora | AFS | any | yes | |
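
To give a feel for the kind of "low-level" tasks these tools cover, here is a minimal Python sketch of two of them: a line-based regular expression search (roughly what the grep family does) and a word frequency list. It is not one of the locally installed tools; the function names `grep_lines` and `frequency_list`, the sample text, and the very simple tokenization (lowercase letters and apostrophes only) are all assumptions made for illustration.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs, most frequent first; ties in alphabetical order."""
    # Assumption: a "word" is a run of letters or apostrophes, case-folded.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def grep_lines(pattern, text):
    """Return (line_number, line) for every line that matches the regular expression."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


# Hypothetical three-line "corpus" used only for demonstration.
sample = "The cat sat on the mat.\nThe dog chased the cat.\nA bird sang."

print(frequency_list(sample)[:3])
print(grep_lines(r"\bcat\b", sample))
```

For real corpora you would read the text from files and most likely want a more careful tokenizer; the dedicated tools in the table handle tagged and parsed formats that this sketch ignores.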

Other software

Below we have compiled a list of software that is not specifically designed for corpus-based research but is nevertheless often useful, together with where you can find it. For example, you may work with sound files, annotation & transcription, or simply need a good editor that allows you to open different file formats (e.g. to do searches in those files). The list below is by no means complete (e.g. the obvious media players on the Corpus PC are not listed), so ask your colleagues, fellow students, or the Corpus TA (you guessed it).

| Name | Description | Where | Manual (external) |
|---|---|---|---|
| UltraEdit | Opens all kinds of text files (Unix, Windows, Mac); powerful search functions; RegExp and more; etc. | Corpus PC | |
| Xwaves | sound file player; manipulation; reads annotated files | Phonetics Lab | |
| Praat | sound file player; phonetic analysis; manipulation; reads/creates annotated files | Corpus PC/phonetics lab | guide, tutorials |

Scripts & little helpers

A word about scripts: if you find any that are not listed here that you consider useful, please let us know and maybe send a short description of the script along (2-5 lines) - this will help others a lot. Think science =).

  • Brett Kessler's Search
      From Brett Kessler's webpage: This program searches text corpora for arbitrary regular expressions and produces a report in HTML format. It can read local files, or those available by HTTP or FTP, and it knows how to unpack ZIP files. It requires Perl 5, and the following network modules: Net::FTP, LWP::Simple, and LWP::UserAgent.
  • Chris Manning's sgrep.prl script (based on an earlier version by Tom Veatch): does whole sentence matching of newswire corpus.
  • Chris Manning's extractbody.prl script: will take out a particular SGML element (given stand-off annotation), and thus can prefilter LDC newswire.
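
The scripts above combine three ideas: pulling one element out of SGML-style markup, matching whole sentences rather than lines, and writing the hits to an HTML report. The Python sketch below illustrates those ideas only; it is not a port of any of the Perl scripts and makes no claim about how they work internally. In particular, it reads inline tags rather than stand-off annotation, uses a naive sentence splitter, and all function names and the sample document are hypothetical.

```python
import html
import re


def extract_element(sgml, tag):
    """Return the text inside every <tag>...</tag> element (case-insensitive)."""
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    return [match.group(1).strip() for match in pattern.finditer(sgml)]


def split_sentences(text):
    """Naive splitter (assumption): a sentence ends at '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def matching_sentences(pattern, text):
    """Return every whole sentence that contains a match for the regular expression."""
    regex = re.compile(pattern)
    return [s for s in split_sentences(text) if regex.search(s)]


def html_report(pattern, sentences):
    """Render the matching sentences as a small HTML fragment with escaped text."""
    items = "\n".join("<li>{}</li>".format(html.escape(s)) for s in sentences)
    return "<h1>Matches for {}</h1>\n<ul>\n{}\n</ul>".format(html.escape(pattern), items)


# Hypothetical newswire-like document, invented for this example.
doc = (
    "<DOC><HEADLINE>Markets rise</HEADLINE>"
    "<TEXT>Stocks rose on Monday. Bonds fell & traders sold. "
    "Stocks may rise again.</TEXT></DOC>"
)

body = extract_element(doc, "TEXT")[0]
hits = matching_sentences(r"\bStocks\b", body)
print(html_report(r"\bStocks\b", hits))
```

Filtering to the body element first keeps headlines and other metadata out of the search, which is the prefiltering role described for extractbody.prl above.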