This page is NO LONGER KEPT UP-TO-DATE as of 04/01/2004

Corpora@Stanford

Overview

Besides the corpora that we own on CD (which you can get from the Corpus TA), many corpora are installed and ready to use on either the AFS space or the corpus computer (CC). Some additional speech corpora may be available in the phonetics lab (see also Tip-2 below). Please contact the Phonetics RA or the Corpus TA if you have questions about speech corpora. This page is not intended to give an overview of online corpora available outside of Stanford, although it does include a very small selection. We strongly encourage you to take advantage of the variety of freely accessible online corpora; for links to sites that survey the colorful world of online corpora, please browse our subjectively compiled list of the top 10 info-sources "out there".

This page has the following parts:

Recent acquisitions
Corpora on AFS
Corpora on CC
Corpora on CD/DVD
Corpora in archives
Corpora on the WWW

In addition, you will often find the most recently acquired corpora summarized at the top of this page.

Tip-1: Since this page is not a database, you may find it useful to use your browser's search function if you are interested in a specific corpus, corpora for a specific language, or any other keyword (e.g. "syntactically annotated"). Try it!

Tip-2: Interested in prosody? See Florian Jaeger's page of annotated links to prosodically annotated speech corpora.

Tip-3: If you cannot find the corpus you were looking for, see if we can order it! Our department is a member of the LDC (Linguistic Data Consortium), which gives us free access to a lot of corpora. The department may also be able to order other corpora. Simply tell the Corpus TA what you need, but first have a look at the information on "ordering corpora from the LDC", or browse the web to see whether what you need can be found online (and maybe for free). Also keep in mind that our inventory may be outdated, since our resources for maintaining this list are limited.

Recently acquired corpora (as of 02/04/04)

We've acquired a fair number of corpora and tools recently. Notably, we now have several new treebanks at Stanford, and we have updated some older corpora to newer versions:

Winter 2004

ICSI Meeting transcripts [AFS] - information not entered below yet
Arabic Gigaword [DVD | CC]
Chinese Gigaword [DVD | CC]
The IViE Corpus (English Intonation in the British Isles) [CC]
Bavarian Archive for Speech Signals (BAS) annotation of VerbMobil 1 & 2 data [AFS]
Proposition Bank [AFS]
SLX Corpus of Classic Sociolinguistic Interviews [DVD]
Santa Barbara Corpus of Spoken American English Part-II [DVD | CC]
ECI Multilingual Text [AFS | CD]
English Gigaword [DVD | CC]
UN Parallel Text (Complete) [CD]
The AQUAINT Corpus of English News Text [CD]

Fall 2003
Topic Detection and Tracking (TDT3) Multilanguage Text 2 [AFS]
LUCY, initial release [AFS]
SUSANNE Corpus, Release 5 [AFS]
CHRISTINE, Stage I, Release 2 [AFS]
The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) [AFS]
Corpus of Spoken Professional American English (tagged & untagged) [CC]

Summer 2003
TIGER release 1.0 [CC]
Penn Chinese Treebank [AFS | CC]
Penn Arabic Treebank [AFS | CC]
NEGRA treebank (German) [AFS | CC]
TIGER corpus (German) [AFS | CC]
Prague Dependency Treebank (Czech) [AFS]

Corpora on AFS space (as of 01/16/2004)

Air Traffic Control Corpus - Transcripts only, LDC94S14A:
/afs/ir/data/linguistic-data/Air-Traffic-Control
Arabic Treebank, LDC2003T06:
/afs/ir/data/linguistic-data/Arabic-Treebank
Bavarian Archive for Speech Signals (BAS) annotation of VerbMobil 1 & 2 data [BAS website]:
/afs/ir/data/linguistic-data/Verbmobil-Dialogs/BAS-VM-annotation/
BLLIP-WSJ (Brown Laboratory for Linguistic Information Processing) LDC2000T43:
/afs/ir/data/linguistic-data/BLLIP-WSJ
Boston University Radio Speech Corpus, LDC96S36:
/afs/ir/data/linguistic-data/Boston-University-Radio
BNC World Edition (license conditions and installation of SARA software being studied)
/afs/ir/data/linguistic-data/BNC-world
Broadcast News Transcripts (CSR-VI), LDC98T28:
/afs/ir/data/linguistic-data/Broadcast-News-Transcripts
CALLHOME:
/afs/ir/data/linguistic-data/CALLHOME
- CALLHOME American English Lexicon, LDC97L20:
  /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Lexicon
- CALLHOME American English Transcripts, LDC97T14:
  /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Transcripts
- CALLHOME Egyptian Arabic Lexicon, LDC97L19:
  /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Arabic-Lexicon
- CALLHOME German Lexicon, LDC97L18:
  /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Lexicon
- CALLHOME German Transcripts, LDC97T15:
  /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Transcripts
- CALLHOME Japanese Lexicon, LDC96L17:
  /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Japanese-Lexicon
- CALLHOME Mandarin Chinese Lexicon, LDC96L15:
  /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Lexicon
- CALLHOME Spanish Lexicon, LDC96L16:
  /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Lexicon
- CALLHOME Spanish Transcripts, LDC96T17:
  /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Transcripts
CELEX 2, LDC96L14:
[special license condition: one license per research group]
/afs/ir/data/linguistic-data/CELEX
Chinese Treebanks
- Chinese Treebank 2, LDC2001T11:
  /afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank-2
- Chinese Treebank 3, LDC2003E06:
  /afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank-3
CHRISTINE, Stage I, Release 2 [CHRISTINE project]:
/afs/ir/data/linguistic-data/CHRISTINE/
CMU Pronouncing Dictionary:
/afs/ir/data/linguistic-data/CMU-Pronouncing-Dict
ECI Multilingual Text, LDC94T5:
/afs/ir/data/linguistic-data/ECI-Multilingual
EXCITE:
/afs/ir/data/linguistic-data/IR/EXCITE
Hansard French/English, LDC95T20:
/afs/ir/data/linguistic-data/Hansard-French
HCRC Maptask, LDC93S12:
/afs/ir/data/linguistic-data/HCRC-Maptask-Transcripts
Hong Kong Hansards Parallel Text, LDC2000T50:
/afs/ir/data/linguistic-data/Hansard-Hong-Kong
Hong Kong Laws, LDC2000T47:
/afs/ir/data/linguistic-data/Hong-Kong-Laws
Hong Kong News, LDC2000T46:
/afs/ir/data/linguistic-data/Hong-Kong-News
Hub-5 Spanish Transcripts, LDC98T27:
/afs/ir/data/linguistic-data/Hub5-Spanish-Transcripts
ICAME:
/afs/ir/data/linguistic-data/ICAME
ICE-GB (International Corpus of English - The British Component):
/afs/ir/data/linguistic-data/ICE-GB (If you want to borrow the CD to install the search software on your Windows PC let me know. It doesn't work for Macs or Unix computers.)
IE (Information Extraction):
/afs/ir/data/linguistic-data/IE
- Corporate Acquisitions Annotated Reuters Texts:
  /afs/ir/data/linguistic-data/IE/CorpAcq-Reuters-Freitag
- Kristie Seymore's Information Extraction Data:
  /afs/ir/data/linguistic-data/IE/Kristie-Seymore-IE
- MUC3-4 (Message Understanding Conference):
  /afs/ir/data/linguistic-data/IE/MUC/MUC3-4
- MUC-6 (Message Understanding Conference) Text collection, LDC96T10:
  /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC96T10-Muc6
- MUC-6 (Message Understanding Conference), LDC2003T13:
  /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC2003T13-muc6
- Mooney Job Data:
  /afs/ir/data/linguistic-data/IE/Mooney-Job-Data
- Census 1990 Names:
  /afs/ir/data/linguistic-data/IE/census1990names
Japanese Business News, LDC95T8:
/afs/ir/data/linguistic-data/Japanese-Business-News
LUCY, initial release (copyright-free version) [LUCY project]:
/afs/ir/data/linguistic-data/LUCY
North American News Text Corpus, LDC95T21:
/afs/ir/data/linguistic-data/North-American-News
PPCME2 [PPCME2 website] [requires membership in a special group]:
/afs/ir/data/linguistic-data/TREC/PPCME2
Prague Dependency Treebank (Czech), LDC2001T10:
/afs/ir/data/linguistic-data/PragueDependencyTreebank_v1.0
Proposition Bank (experimental pre-release; a treebank enriched with predicate structure) [Proposition Bank website] [related tools]:
/afs/ir/data/linguistic-data/PropBank
Remedia Story Comprehension: (use requires special permission)
/afs/ir/data/linguistic-data/QA/Remedia-Story-Comprehension
Reuters Corpus
/afs/ir/data/linguistic-data/Reuters-Corpus
SAID (A Syntactically Annotated Idiom Dataset), LDC2003T10
/afs/ir/data/linguistic-data/SAID
Santa Barbara Corpus of Spoken American English, LDC2000S85:
/afs/ir/data/linguistic-data/Santa-Barbara
Spanish Broadcast News, LDC98T29:
/afs/ir/data/linguistic-data/Spanish-Broadcast-News
SPINE, Speech in Noisy Environments, LDC2000S87 and LDC2000T49:
/afs/ir/data/linguistic-data/SPINE
SUSANNE Corpus, Release 5 [SUSANNE project]:
/afs/ir/data/linguistic-data/SUSANNE
Switchboard Transcripts, LDC93S7-T:
/afs/ir/data/linguistic-data/Switchboard/Switchboard-Transcripts
TDT Pilot Study, LDC98T25 [special user agreement]:
/afs/ir/data/linguistic-data/TDT-Pilot-Study
TDT2 Careful Transcription, LDC2000T44:
/afs/ir/data/linguistic-data/TDT2-Careful
TDT2 Multilanguage Text 4, LDC2001T57
/afs/ir/data/linguistic-data/TDT2-Multilingual
TDT3 Multilanguage Text 2, LDC2001T58
/afs/ir/data/linguistic-data/TDT2-Multilingual
Text Categorization:
/afs/ir/data/linguistic-data/TextCat
- 20Newsgroups:
  /afs/ir/data/linguistic-data/TextCat/20Newsgroups
- DavidLewis (Reuters, TREC-AP):
  /afs/ir/data/linguistic-data/TextCat/DavidLewis
- Spam Filtering:
  /afs/ir/data/linguistic-data/TextCat/Spam-Filtering
TIDIGITS, LDC93S10:
/afs/ir/data/linguistic-data/TIDIGITS
TIMIT, LDC93S1:
/afs/ir/data/linguistic-data/TIMIT
Tipster Complete, LDC93T3A [each user needs to sign license]:
/afs/ir/data/linguistic-data/Tipster
TRAINS, LDC95S25:
/afs/ir/data/linguistic-data/TRAINS
TREC (Information Retrieval Text Research Collection):
/afs/ir/data/linguistic-data/TREC/
- TREC-1 (=Tipster 1): [each user needs to sign license]
  /afs/ir/data/linguistic-data/TREC/TREC-1
- TREC-2 (=Tipster 2): [each user needs to sign license]
  /afs/ir/data/linguistic-data/TREC/TREC-2
- TREC-3 (=Tipster 3): [each user needs to sign license]
  /afs/ir/data/linguistic-data/TREC/TREC-3
- TREC-4:
  /afs/ir/data/linguistic-data/TREC/TREC-4
- TREC-5:
  /afs/ir/data/linguistic-data/TREC/TREC-5
- OHSUMED-TREC-9:
  /afs/ir/data/linguistic-data/TREC/OHSUMED-TREC-9
Treebank Release 2 and 3, LDC95T7 and LDC99T42:
/afs/ir/data/linguistic-data/Treebank
UMLS (Unified Medical Language System):
/afs/ir/data/linguistic-data/UMLS
Verbmobil Dialogs:
/afs/ir/data/linguistic-data/Verbmobil-Dialogs
Voicemail Corpus Part 1, Speech and Transcripts, LDC98S77:
/afs/ir/data/linguistic-data/Voicemail1
WSD (Word Sense Disambiguation):
/afs/ir/data/linguistic-data/WSD
- DSO Sense-Tagged, LDC97T12:
  /afs/ir/data/linguistic-data/WSD/DSO-Sense-Tagged
- Leacock's Data:
  /afs/ir/data/linguistic-data/WSD/leacock
- Pedersen's Data:
  /afs/ir/data/linguistic-data/WSD/pedersen
- Senseval1:
  /afs/ir/data/linguistic-data/WSD/senseval/senseval1
York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) [YCOE project] [special user agreement]:
/afs/ir/data/linguistic-data/YCOE
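
Since this inventory may be outdated, it can help to check whether one of the directories listed above is actually visible from your account before planning work around it. The following Python sketch is only an illustration: the function name `corpus_status` and its return labels are made up for this example. It assumes that AFS is mounted on the machine you are logged into. Because AFS permissions are governed by access control lists, the read check here is only a rough indication.

```python
import os

# Root directory shared by the AFS paths in the list above.
AFS_ROOT = "/afs/ir/data/linguistic-data"


def corpus_status(relative_dir, root=AFS_ROOT):
    """Classify a corpus directory as 'ok', 'no-access', or 'missing'.

    Hypothetical helper, not an official tool. 'missing' also covers
    directories that exist but are not visible to the current user.
    """
    path = os.path.join(root, relative_dir)
    if not os.path.isdir(path):
        return "missing"
    # Rough check only: AFS access control lists may still deny access.
    if os.access(path, os.R_OK | os.X_OK):
        return "ok"
    return "no-access"


if __name__ == "__main__":
    # Directory names taken from the list above.
    for name in ["TIMIT", "Treebank", "CALLHOME/CALLHOME-English-Lexicon"]:
        print(name, corpus_status(name))
```

If a directory is reported as missing or inaccessible, ask the Corpus TA; some corpora require a signed license or membership in a special group, as noted in the list.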

Corpora on the Corpus Computer

In addition to the corpora on AFS, a number of corpora are stored only on the corpus computer. All of these corpora are stored on the D partition of the corpus computer. This section will be revised further, and more details about the available corpora will be added soon:

Name	Annotation	Language(s)	Format	Associated tools
Aleksova's corpus	-	Bulgarian (spoken)	Winword files	-
Arabic Gigaword		Arabic
ATIS	Syntax, POS, some argument structure	English	TIGER XML, MRG	TIGERSearch
Bavarian Archive for Speech Signals (BAS; annotations only)	Prosody, syntax, POS, transcribed	German, English, Japanese	BAS format	-
Brown Corpus	Syntax, POS, some argument structure	English	TIGER XML, MRG	TIGERSearch
Chinese Gigaword		Chinese
Chinese Treebank	Syntax, POS, some argument structure	Chinese	TIGER XML, MRG	TIGERSearch
Corpus of Spoken Professional American English	POS	American English (spoken)	SGML-tagged, plain text	MonoConc
English Gigaword		English
IMS German radio news (Nachrichten) corpus	Prosodically annotated & transcribed speech files	German (spoken)	ToBI annotation	-
IViE	Prosody, phonetic, etc.	British dialects	-	-
NEGRA	Syntax (LFG-based), POS, some argument structure	German	TIGER XML, NEGRA format	TIGERSearch
Santa Barbara Corpus of Spoken American English Part-II	speech, intonation, transcribed	English	text, CHAT-format	TIGERSearch
Switchboard Corpus	Syntax, POS, some argument structure	English (spoken)	TIGER XML, MRG	TIGERSearch
TIGER Treebank [Version 1]	Syntax (LFG-based), POS, some argument structure	German	TIGER XML, NEGRA format	TIGERSearch
TIGER sample corpora	Syntax, POS, some argument structure	English	TIGER XML, MRG	TIGERSearch
YCOE	Syntax, POS, CAT, lemma	Old English	TIGER XML, NEGRA format	TIGERSearch
Wall Street Journal	Syntax, POS, some argument structure	English	TIGER XML, MRG	TIGERSearch

Corpora only available on CD, DVD, or as packed archive on AFS
(as of 02/04/2004)

You can check out these CDs from us or ask the corpus TA to install their content on the corpus computer or AFS.

ACL/DCL, Association For Computational Linguistics Data Collection Initiative, CD-ROM 1, LDC93T1, 1991, 1 disc
The AQUAINT Corpus of English News Text, LDC2002T31, 2 CDs
Arabic Gigaword LDC2003T12, 1 DVD
ATCO Complete, LDC94S14A:
ATCO, Air Traffic Control Corpus, Dallas Fort Worth (DFW), NIST Speech Discs 16-1.1, 16-2.1, 16-3.1, 1994, NIST/LDC, 3 discs
ATCO, Air Traffic Control Corpus, Logan International (BOS), NIST Speech Discs 16-4.1, 16-5.1, 1994, NIST/LDC, 2 discs
ATCO, Air Traffic Control Corpus, Washington National (DCA), NIST Speech Discs 16-6.1, 16-7.1, 16-8.1, 1994, NIST/LDC, 3 discs
ATIS0 Complete, LDC93S4A:
ATIS0, Air Travel Information System, Spontaneous Speech Pilot Corpus and Relational Database, NIST Speech Disc 5-1.1, NTIS PB91-505354, DARPA, 1990, 1 disc
ATIS0, Air Travel Information System, Read Versions of Spontaneous Data, NIST Speech Disc 5-2.1, NTIS PB91-505362, DARPA, 1990, 1 disc
ATIS0, Air Travel Information System, Speaker-Dependent Training Data, NIST Speech Discs 5-3.1, 5-4.1, 5-5.1, 5-6.1, NTIS PB91-505370, DARPA, 1991, 4 discs
ATIS2, Air Travel Information System, Multi-Site Speech Collection, NIST Speech Discs 12-1.1 to 12-4.1, LDC93S5, 1990, 4 discs
BLLIP-WSJ (Brown Laboratory for Linguistic Information Processing), LDC2000T43, 2 CDs
Boston U. Radio Speech Corpus, LDC96S36, 4 discs
British National Corpus Sampler, 1999
CALLFRIEND American English Non Southern Dialect, 60 Telephone Conversations, LDC96S46, 3 discs
CALLFRIEND American English Southern Dialect, 60 Telephone Conversations, LDC96S47, 3 discs
CALLFRIEND Japanese, LDC96S53, 3 discs
CALLFRIEND Hindi, LDC96S52, 3 discs
CALLFRIEND Tamil, LDC96S59, 3 discs
CALLHOME American English, 120 Telephone Conversations, LDC97S42, 3 discs
CALLHOME German, 100 Telephone Conversations, LDC97S43, 3 discs
CALLHOME Japanese, LDC96S37, 3 discs
CELEX, The CELEX Lexical Database, Release 2 (Dutch Version 3.1, English Version 2.5, German Version 2.5), LDC/Centre for Lexical Information, Max Planck Institute for Psycholinguistics, Nijmegen, LDC96L14, 1995, 1 disc [special user agreement]
Chinese Gigaword LDC2003T09, 1 DVD
CSR-II (WSJ1) Complete, LDC94S13A: WSJ1, Continuous Speech Recognition Corpus, NIST/LDC, 1993, 34 discs
CTIMIT, Cellular Telephone Acoustic-Phonetic Continuous Speech Corpus, LDC96S30, 1995, 1 disc
DCIEM/HCRC, LDC96S38, 12 parts
ECI Multilingual Text, LDC94T5, 1 CD
English Gigaword, LDC2003T05, 1 DVD
FFMTIMIT, Acoustic-Phonetic Continuous Speech Corpus Secondary (Far Field) Microphone Recordings, NIST Speech Disc 21-1.1, NTIS Order No. PB95-504569, LDC96S32, 1 disc
Hansard French/English, LDC95T20, 1 disc
HCRC Map Task Corpus, Discs 1-4 of 8, Human Communication Research Centre, University of Edinburgh, LDC93S12, 1992, 8 discs
Hong Kong Hansards Parallel Text, LDC2000T50, 1 disc
ICE-GB (International Corpus of English, British Component), 1 disc
Japanese Business News Text, LDC95T8, 1 disc
JURIS, Justice Retrieval and Inquiry System, LDC98T32, 2 discs
NTIMIT, Telephone Network Acoustic-Phonetic Continuous Speech Corpus, NIST Speech Discs 10-1.1/10-2.1, NTIS Order No. PB92-502087, LDC93S2, 1992, 2 discs
RM1, Resource Management, Continuous Speech Database, LDC93S3B:
Speaker-Dependent Training Data, NIST Corpus 2-1.1 and 2-2.1, 1989, NTIS Order No. PB89-226666, DARPA, 2 discs
Speaker-Independent Training Data, NIST Disc 2-3.1, NTIS Order No. PB90-500539, 1989, DARPA, 1 disc
Development Test and Evaluation Test Data and Scoring Software, NIST Speech Disc 2-4.2, 1992, DARPA, 1 disc
RM1, Resource Management, Continuous Speech Database, Isolated- and Spelled-Word Data, NIST Speech Disc 2-5.1, 1996, DARPA, LDC96S39, 1 disc (2 copies)
RM2, Extended Resource Management, Continuous Speech Speaker-Dependent Corpus (RM2), NIST Speech Discs 3-1.2 and 3-2.2, NTIS Order No. PB90-501776, LDC93S3C, 1990, 2 discs
Santa Barbara Corpus of Spoken American English, LDC2000S85, 3 discs
Santa Barbara Corpus of Spoken American English Part-II, LDC2003S06, 1 DVD
SLX Corpus of Classic Sociolinguistic Interviews, LDC2003T15, 1 DVD
SPIDRE, Speaker Identification Research Corpus, NIST speech discs 18-1.1 and 18-2.1, 1994, LDC94S15, 2 discs
SPINE, Speech in Noisy Environments, LDC2000S87, 4 discs
Switchboard Corpus, Recorded Telephone Conversations, NIST, 26 discs, 1992, obsolete
Switchboard Corpus, Excerpts, Credit Card Conversations, NIST Speech Disc 8-1.2, LDC93S8, 1992, 1 disc
Switchboard-1 Release 2, LDC97S62, 23 parts
The Penn Treebank Project, Preliminary Release 0.5, 1992, LDC, 1 disc, obsolete
The Penn Treebank Project, Release 2, 1995, LDC95T7, 1 disc
The Penn Treebank Project, Release 3, LDC99T42, 1 disc
The Prague Dependency Treebank 1.0 (Czech), LDC2001T10, 1 disc
Topic Detection and Tracking (TDT2), LDC2000S92, 2 discs [special user agreement]
Topic Detection and Tracking (TDT3) Multilanguage Text 2, LDC2001T58, 1 disc
TIDIGITS, Studio Quality Speaker-Independent Connected-Digit Corpus, NIST Speech Discs 4-1, 4-2, 4-3, NTIS PB-91-506592, LDC93S10, 1991
TIMIT, Acoustic-Phonetic Continuous Speech Corpus, NIST Speech Disc 1-1.1, NTIS Order No. PB91-505065, LDC93S1, 1990, 1 disc
Tipster Complete, LDC93T3A, 3 discs [special user agreement]
TRAINS Spoken Dialog Corpus, LDC95S25
TREC (Text Research Collection) Vol. 4, 1 disc
TREC (Text Research Collection) Vol. 5, 1 disc
UN Parallel Text (Complete), LDC94T4A, 3 CDs
VAHA, Voice Across Hispanic America, LDC96S41, 2 discs
Voicemail Corpus Part 1, Speech and Transcripts, LDC98S77, 1 disc
WSJ0, Continuous Speech Recognition Corpus, NIST, LDC93S6A, 1993, 15 discs
1997 Broadcast News Speech Corpus (CSR-VI: Hub 4), LDC98S71, 1997, 18 discs
1998 Speaker Recognition Evaluation, NIST/LDC, LDC98S76, 1998, 6 discs

Corpora only available in archive form
(as of 02/04/2004)

Some corpora are distributed by the LDC via FTP, so we do not have CDs for them. Those that we have downloaded but not yet installed are listed here. The archives are stored on AFS under /afs/ir/data/linguistic-data/ldc/LDC-tarfiles/ unless mentioned otherwise, and all filenames contain the LDC catalogue number, which should make it easy to identify the corpus.

Note that the just-mentioned directory contains several other files which are archives of corpora that are already installed under AFS and therefore not currently listed in this section.
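
Because every archive filename contains the LDC catalogue number, a short script can locate the archive for a given corpus. The following Python sketch is only an illustration: the function name `find_archives` is made up for this example, and it assumes that you have read access to the directory and that the catalogue number appears verbatim (in any letter case) in the filename.

```python
from pathlib import Path

# Archive directory as stated in this section.
LDC_TARFILES = Path("/afs/ir/data/linguistic-data/ldc/LDC-tarfiles")


def find_archives(catalogue_number, directory=LDC_TARFILES):
    """Return sorted paths of files whose name contains the catalogue number.

    Hypothetical helper: matching is a case-insensitive substring test,
    which assumes the number is written verbatim in the filename.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return []  # e.g. AFS is not mounted or the directory is not visible
    needle = catalogue_number.lower()
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and needle in p.name.lower()
    )


if __name__ == "__main__":
    # LDC2002T07 is the RST Discourse Treebank in the list below.
    for path in find_archives("LDC2002T07"):
        print(path)
```

An empty result can mean either that the archive is not there or that the directory is not reachable from your machine, so check with the Corpus TA before concluding that a corpus is unavailable.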

LDC2003T07 Arabic Treebank: Part 1 - 10K-word English Translation
LDC2002L27 Chinese-English Translation Lexicon Version 3.0
LDC2001T02 Message Understanding Conference (MUC) 7
LDC2002T01 Multiple-Translation Chinese Corpus
LDC2003T17 Multiple-Translation Chinese (MTC), Part 2
LDC2002T07 RST Discourse Treebank
LDC2003T10 SAID

Corpora on the WWW (a very small collection)

This is only a very small collection of online corpora. Please see the top 10 info-sources page for links to sites with far more information.

British National Corpus
COBUILD Corpus
Lexis-Nexis Academic Universe
OED new version (slow and requires a graphical browser) or old version (fast and lynx-friendly)
DIALOGUE DIVERSITY CORPUS: Version 2.0
TITUS corpus and search engine [signed license conditions & user agreement need to be faxed to the number stated on the user agreement].
COSMAS II is a German giga corpus with almost 2 billion (!) text words. It is accessible via the COSMAS II Online Client. Unfortunately, all help and information available for this corpus is given in German.
Right on your doorstep you can find the Text@Humanities project of the Humanities Digital Information Service, a collection of online searchable (literary) texts drawn from American English, Irish English, other varieties of English, German, French and Spanish.
The W3-Corpora Search Engine is an online search engine on a large collection of corpora (still in prototype stage but looks promising).