This page is NO LONGER KEPT UP-TO-DATE as of 04/01/2004

Corpora@Stanford

Overview

Besides the corpora that we own on CD (which you can get from the Corpus TA), many corpora are installed and ready to use on either the AFS space or the corpus computer (CC). Some additional speech corpora may be available in the phonetics lab (see also the tip below). Please contact the Phonetics RA or the Corpus TA if you have questions about speech corpora. This page is not intended to give an overview of online corpora outside of Stanford, although a very small selection is provided at the end. We strongly encourage you to take advantage of the variety of freely accessible online corpora: for links to sites that give an overview of the colorful world of online corpora, please browse our subjectively construed list of the top 10 info-sources "out there".

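This page has the following main parts: corpora on AFS space, corpora on the corpus computer (CC), corpora only available on CD or DVD, corpora only available in archive form, and corpora on the WWW.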

In addition, you will often find the most recently acquired corpora summarized at the top of this page.

Tip-1: Since this page is not a database you may find it useful to just use your browser search function if you are interested in a specific corpus, or corpora for a specific language, or any other key word (e.g. "syntactically annotated"). Try it!

Tip-2: Interested in prosody? See Florian Jaeger's page of annotated links to prosodically annotated speech corpora.

Tip-3: If you cannot find the corpus you were looking for, see if we can order it! Our department is a member of the LDC (Linguistic Data Consortium), which gives us free access to a lot of corpora. We may also be able to order other corpora. Simply tell the Corpus TA what you need, but first have a look at the information on "ordering corpora from the LDC", or browse the web to see whether what you need can be found online (and maybe for free). Also keep in mind that our inventory may be outdated, since our resources to maintain this list are limited.

Recently acquired corpora (as of 02/04/04)

We've acquired a fair number of corpora and tools recently. Notably, we now have several new treebanks at Stanford, and we have updated some older corpora to newer versions:

    Winter 2004
  • ICSI Meeting transcripts [AFS] - information not entered below yet
  • Arabic Gigaword [DVD | CC]
  • Chinese Gigaword [DVD | CC]
  • The IViE Corpus (English Intonation in the British Isles) [CC]
  • Bavarian Speech Archive (BAS) annotation of VerbMobil 1 & 2 data [AFS]
  • Proposition Bank [AFS]
  • SLX Corpus of Classic Sociolinguistic Interviews [DVD]
  • Santa Barbara Corpus of Spoken American English Part-II [DVD | CC]
  • ECI Multilingual Text [AFS | CD]
  • English Gigaword [DVD | CC]
  • UN Parallel Text (Complete) [CD]
  • The AQUAINT Corpus of English News Text [CD]

    Fall 2003
  • Topic Detection and Tracking (TDT3) Multilanguage Text 2 [AFS]
  • LUCY, initial release [AFS]
  • SUSANNE Corpus, Release 5 [AFS]
  • CHRISTINE, Stage I, Release 2 [AFS]
  • The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) [AFS]
  • Corpus of Spoken Professional American English (tagged & untagged) [CC]

    Summer 2003
  • TIGER release 1.0 [CC]
  • Penn Chinese Treebank [AFS | CC]
  • Penn Arabic Treebank [AFS | CC]
  • NEGRA treebank (German) [AFS | CC]
  • TIGER corpus (German) [AFS | CC]
  • Prague Dependency Bank (Czech) [AFS]


[ Corpora on AFS | CC | CD/DVD | archives | the WWW | goto top ]

Corpora on AFS space (as of 01/16/2004)

  • Air Traffic Control Corpus - Transcripts only, LDC94S14A:
    /afs/ir/data/linguistic-data/Air-Traffic-Control
  • Arabic Treebank, LDC2003T06:
    /afs/ir/data/linguistic-data/Arabic-Treebank
  • Bavarian Speech Archive (BAS) annotation of VerbMobil 1 & 2 data BAS website
    /afs/ir/data/linguistic-data/Verbmobil-Dialogs/BAS-VM-annotation/
  • BLLIP-WSJ (Brown Laboratory for Linguistic Information Processing) LDC2000T43:
    /afs/ir/data/linguistic-data/BLLIP-WSJ
  • Boston University Radio Speech Corpus, LDC96S36:
    /afs/ir/data/linguistic-data/Boston-University-Radio
  • BNC World Edition (license conditions and installation of SARA software being studied)
    /afs/ir/data/linguistic-data/BNC-world
  • Broadcast News Transcripts (CSR-VI), LDC98T28:
    /afs/ir/data/linguistic-data/Broadcast-News-Transcripts
  • CALLHOME:
    /afs/ir/data/linguistic-data/CALLHOME
    • CALLHOME American English Lexicon, LDC97L20:
      /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Lexicon
    • CALLHOME American English Transcripts, LDC97T14:
      /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Transcripts
    • CALLHOME Egyptian Arabic Lexicon, LDC97L19:
      /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Arabic-Lexicon
    • CALLHOME German Lexicon LDC97L18:
      /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Lexicon
    • CALLHOME German Transcripts LDC97T15:
      /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Transcripts
    • CALLHOME Japanese Lexicon LDC96L17:
      /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Japanese-Lexicon
    • CALLHOME Mandarin Chinese Lexicon LDC96L15:
      /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Lexicon
    • CALLHOME Spanish Lexicon, LDC96L16:
      /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Lexicon
    • CALLHOME Spanish Transcripts, LDC96T17:
      /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Transcripts
  • CELEX 2, LDC96L14:
    [special license condition: one license per research group]
    /afs/ir/data/linguistic-data/CELEX
  • Chinese Treebanks
    • Chinese Treebank (1), LDC2000T48:
      /afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank
    • Chinese Treebank 2, LDC2001T11:
      /afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank-2
    • Chinese Treebank 3, LDC2003E06:
      /afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank-3
  • CHRISTINE, Stage I, Release 2 CHRISTINE project
    /afs/ir/data/linguistic-data/CHRISTINE/
  • CMU Pronouncing Dictionary:
    /afs/ir/data/linguistic-data/CMU-Pronouncing-Dict
  • ECI Multilingual Text LDC94T5
    /afs/ir/data/linguistic-data/ECI-Multilingual
  • EXCITE:
    /afs/ir/data/linguistic-data/IR/EXCITE
  • Hansard French/English, LDC95T20:
    /afs/ir/data/linguistic-data/Hansard-French
  • HCRC Maptask, LDC93S12:
    /afs/ir/data/linguistic-data/HCRC-Maptask-Transcripts
  • Hong Kong Hansards Parallel Text, LDC2000T50:
    /afs/ir/data/linguistic-data/Hansard-Hong-Kong
  • Hong Kong Laws, LDC2000T47:
    /afs/ir/data/linguistic-data/Hong-Kong-Laws
  • Hong Kong News, LDC2000T46:
    /afs/ir/data/linguistic-data/Hong-Kong-News
  • Hub-5 Spanish Transcripts, LDC98T27:
    /afs/ir/data/linguistic-data/Hub5-Spanish-Transcripts
  • ICAME:
    /afs/ir/data/linguistic-data/ICAME
  • ICE-GB (International Corpus of English - The British Component):
    /afs/ir/data/linguistic-data/ICE-GB (If you want to borrow the CD to install the search software on your Windows PC let me know. It doesn't work for Macs or Unix computers.)
  • IE (Information Extraction):
    /afs/ir/data/linguistic-data/IE
    • Corporate Acquisitions Annotated Reuters Texts:
      /afs/ir/data/linguistic-data/IE/CorpAcq-Reuters-Freitag
    • Kristie Seymore's Information Extraction Data:
      /afs/ir/data/linguistic-data/IE/Kristie-Seymore-IE
    • MUC3-4 (Message Understanding Conference):
      /afs/ir/data/linguistic-data/IE/MUC/MUC3-4
    • MUC-6 (Message Understanding Conference) Text collection, LDC96T10:
      /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC96T10-Muc6
    • MUC-6 (Message Understanding Conference), LDC2003T13:
      /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC2003T13-muc6
    • Mooney Job Data:
      /afs/ir/data/linguistic-data/IE/Mooney-Job-Data
    • Census 1990 Names:
      /afs/ir/data/linguistic-data/IE/census1990names
  • Japanese Business News, LDC95T8:
    /afs/ir/data/linguistic-data/Japanese-Business-News
  • LUCY, initial release (copyright free version) LUCY project
    /afs/ir/data/linguistic-data/LUCY
  • North American News Text Corpus, LDC95T21:
    /afs/ir/data/linguistic-data/North-American-News
  • PPCME2 PPCME2 website [requires membership in a special group]:
    /afs/ir/data/linguistic-data/TREC/PPCME2
  • Prague Dependency Bank (Czech) LDC2001T10
    /afs/ir/data/linguistic-data/PragueDependencyTreebank_v1.0
  • Proposition Bank (experimental pre-release) Proposition Bank website (predicate structure enriched treebank) [related tools]
    /afs/ir/data/linguistic-data/PropBank
  • Remedia Story Comprehension: (use requires special permission)
    /afs/ir/data/linguistic-data/QA/Remedia-Story-Comprehension
  • Reuters Corpus
    /afs/ir/data/linguistic-data/Reuters-Corpus
  • SAID (A Syntactically Annotated Idiom Dataset), LDC2003T10
    /afs/ir/data/linguistic-data/SAID
  • Santa Barbara Corpus of Spoken American English, LDC2000S85:
    /afs/ir/data/linguistic-data/Santa-Barbara
  • Spanish Broadcast News, LDC98T29:
    /afs/ir/data/linguistic-data/Spanish-Broadcast-News
  • SPINE, Speech in Noisy Environments, LDC2000S87 and LDC2000T49:
    /afs/ir/data/linguistic-data/SPINE
  • SUSANNE Corpus, Release 5 SUSANNE project
    /afs/ir/data/linguistic-data/SUSANNE
  • Switchboard Transcripts, LDC93S7-T:
    /afs/ir/data/linguistic-data/Switchboard/Switchboard-Transcripts
  • TDT Pilot Study, LDC98T25 [special user agreement]:
    /afs/ir/data/linguistic-data/TDT-Pilot-Study
  • TDT2 Careful Transcription, LDC2000T44:
    /afs/ir/data/linguistic-data/TDT2-Careful
  • TDT2 Multilanguage Text 4, LDC2001T57
    /afs/ir/data/linguistic-data/TDT2-Multilingual
  • TDT3 Multilanguage Text 2, LDC2001T58
    /afs/ir/data/linguistic-data/TDT2-Multilingual
  • Text Categorization:
    /afs/ir/data/linguistic-data/TextCat
    • 20Newsgroups:
      /afs/ir/data/linguistic-data/TextCat/20Newsgroups
    • DavidLewis (Reuters, TREC-AP):
      /afs/ir/data/linguistic-data/TextCat/DavidLewis
    • Spam Filtering:
      /afs/ir/data/linguistic-data/TextCat/Spam-Filtering
  • TIDIGITS, LDC93S10:
    /afs/ir/data/linguistic-data/TIDIGITS
  • TIMIT, LDC93S1:
    /afs/ir/data/linguistic-data/TIMIT
  • Tipster Complete, LDC93T3A [each user needs to sign license]:
    /afs/ir/data/linguistic-data/Tipster
  • TRAINS, LDC95S25:
    /afs/ir/data/linguistic-data/TRAINS
  • TREC (Information Retrieval Text Research Collection):
    /afs/ir/data/linguistic-data/TREC/
  • Treebank Release 2 and 3, LDC95T7 and LDC99T42:
    /afs/ir/data/linguistic-data/Treebank
  • UMLS (Unified Medical Language System):
    /afs/ir/data/linguistic-data/UMLS
  • Verbmobil Dialogs:
    /afs/ir/data/linguistic-data/Verbmobil-Dialogs
  • Voicemail Corpus Part 1, Speech and Transcripts, LDC98S77:
    /afs/ir/data/linguistic-data/Voicemail1
  • WSD (Word Sense Disambiguation):
    /afs/ir/data/linguistic-data/WSD
    • DSO Sense-Tagged, LDC97T12:
      /afs/ir/data/linguistic-data/WSD/DSO-Sense-Tagged
    • Leacock's Data:
      /afs/ir/data/linguistic-data/WSD/leacock
    • Pedersen's Data:
      /afs/ir/data/linguistic-data/WSD/pedersen
    • Senseval1:
      /afs/ir/data/linguistic-data/WSD/senseval/senseval1
  • York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), YCOE project [special user agreement]
    /afs/ir/data/linguistic-data/YCOE
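
To check quickly whether a corpus directory from the list above is reachable from your account, a minimal Python sketch along the following lines may help. It assumes you are logged in to a machine with AFS mounted and that you have read permission for the corpus in question. The helper names corpus_dir and list_files are hypothetical and not part of any Stanford tool.

```python
import os

# Root directory taken from the listing above.
AFS_ROOT = "/afs/ir/data/linguistic-data"


def corpus_dir(name, root=AFS_ROOT):
    """Build the path for a corpus, e.g. 'TIMIT' or 'CALLHOME/CALLHOME-German-Lexicon'."""
    return os.path.join(root, *name.split("/"))


def list_files(path, limit=20):
    """Return up to `limit` file paths below `path`, or [] if it is missing or unreadable."""
    found = []
    if not os.path.isdir(path):
        return found
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in sorted(filenames):
            found.append(os.path.join(dirpath, filename))
            if len(found) >= limit:
                return found
    return found


if __name__ == "__main__":
    # Print the first few files of one corpus as a smoke test.
    for path in list_files(corpus_dir("TIMIT"), limit=5):
        print(path)
```

An empty result does not necessarily mean the corpus is gone: some corpora listed above require a signed license or membership in a special group before the directory becomes readable.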



Corpora on the Corpus Computer

In addition to the corpora on AFS, a couple of corpora are only stored on the corpus computer. All corpora are stored on the D-partition of the corpus computer. This section will undergo further revision and more details about the available corpora will be added soon:

Name | Annotation | Language(s) | Format | Associated tools
-----|------------|-------------|--------|-----------------
Aleksova's corpus | - | Bulgarian (spoken) | Winword files | -
Arabic Gigaword | | Arabic | |
ATIS | Syntax, POS, some argument structure | English | TIGER XML, MRG | TIGERSearch
Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese | BAS format | -
Brown Corpus | Syntax, POS, some argument structure | English | TIGER XML, MRG | TIGERSearch
Chinese Gigaword | | Chinese | |
Chinese Treebank | Syntax, POS, some argument structure | Chinese | TIGER XML, MRG | TIGERSearch
Corpus of Spoken Professional American English | POS | American English (spoken) | SGML-tagged, plain text | MonoConc
English Gigaword | | English | |
IMS German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) | ToBI annotation | -
IViE | Prosody, phonetic, etc. | British dialects | - | -
NEGRA | Syntax (LFG-based), POS, some argument structure | German | TIGER XML, NEGRA format | TIGERSearch
Santa Barbara Corpus of Spoken American English Part-II | Speech, intonation, transcribed | English | Text, CHAT format | TIGERSearch
Switchboard Corpus | Syntax, POS, some argument structure | English (spoken) | TIGER XML, MRG | TIGERSearch
TIGER Treebank [Version 1] | Syntax (LFG-based), POS, some argument structure | German | TIGER XML, NEGRA format | TIGERSearch
TIGER sample corpora | Syntax, POS, some argument structure | English | TIGER XML, MRG | TIGERSearch
YCOE | Syntax, POS, CAT, lemma | Old English | TIGER XML, NEGRA format | TIGERSearch
Wall Street Journal | Syntax, POS, some argument structure | English | TIGER XML, MRG | TIGERSearch



Corpora only available on CD, DVD, or as packed archive on AFS
(as of 02/04/2004)

You can check out these CDs from us or ask the corpus TA to install their content on the corpus computer or AFS.

  • ACL/DCI, Association For Computational Linguistics Data Collection Initiative, CD-ROM 1, LDC93T1, 1991, 1 disc
  • The AQUAINT Corpus of English News Text, LDC2002T31, 2 CDs
  • Arabic Gigaword LDC2003T12, 1 DVD
  • ATCO Complete, LDC94S14A:
    ATCO, Air Traffic Control Corpus, Dallas Fort Worth (DFW), NIST Speech Discs 16-1.1, 16-2.1, 16-3.1, 1994, NIST/LDC, 3 discs
    ATCO, Air Traffic Control Corpus, Logan International (BOS), NIST Speech Discs 16-4.1, 16-5.1, 1994, NIST/LDC, 2 discs
    ATCO, Air Traffic Control Corpus, Washington National (DCA), NIST Speech Discs 16-6.1, 16-7.1, 16-8.1, 1994, NIST/LDC, 3 discs
  • ATIS0 Complete, LDC93S4A:
    ATIS0, Air Travel Information System, Spontaneous Speech Pilot Corpus and Relational Database, NIST Speech Disc 5-1.1, NTIS PB91-505354, DARPA, 1990, 1 disc
    ATIS0, Air Travel Information System, Read Versions of Spontaneous Data, NIST Speech Disc 5-2.1, NTIS PB91-505362, DARPA, 1990, 1 disc
    ATIS0, Air Travel Information System, Speaker-Dependent Training Data, NIST Speech Discs 5-3.1, 5-4.1, 5-5.1, 5-6.1, NTIS PB91-505370, DARPA, 1991, 4 discs
  • ATIS2, Air Travel Information System, Multi-Site Speech Collection, NIST Speech Discs 12-1.1 to 12-4.1, LDC93S5, 1990, 4 discs
  • BLLIP-WSJ (Brown Laboratory for Linguistic Information Processing), LDC2000T43, 2 CDs
  • Boston U. Radio Speech Corpus, LDC96S36, 4 discs
  • British National Corpus Sampler, 1999
  • CALLFRIEND American English Non Southern Dialect, 60 Telephone Conversations, LDC96S46, 3 discs
  • CALLFRIEND American English Southern Dialect, 60 Telephone Conversations, LDC96S47, 3 discs
  • CALLFRIEND Japanese, LDC96S53, 3 discs
  • CALLFRIEND Hindi, LDC96S52, 3 discs
  • CALLFRIEND Tamil, LDC96S59, 3 discs
  • CALLHOME American English, 120 Telephone Conversations, LDC97S42, 3 discs
  • CALLHOME German, 100 Telephone Conversations, LDC97S43, 3 discs
  • CALLHOME Japanese, LDC96S37, 3 discs
  • CELEX, The CELEX Lexical Database, Release 2 (Dutch Version 3.1, English Version 2.5, German Version 2.5), LDC/Centre for Lexical Information Max Planck Institute for Psycholinguistics Nijmegen, LDC96L14, 1995, 1 disc [special user agreement]
  • Chinese Gigaword LDC2003T09, 1 DVD
  • CSR-II (WSJ1) Complete, LDC94S13A: WSJ1, Continuous Speech Recognition Corpus, NIST/LDC, 1993, 34 discs
  • CTIMIT, Cellular Telephone Acoustic-Phonetic Continuous Speech Corpus, LDC96S30, 1995, 1 disc
  • DCIEM/HCRC, LDC96S38, 12 parts
  • ECI Multilingual Text, LDC94T5, 1 CD
  • English Gigaword, LDC2003T05, 1 DVD
  • FFMTIMIT, Acoustic-Phonetic Continuous Speech Corpus Secondary (Far Field) Microphone Recordings, NIST Speech Disc 21-1.1, NTIS Order No. PB95-504569, LDC96S32, 1 disc
  • Hansard French/English, LDC95T20, 1 disc
  • HCRC Map Task Corpus, Discs 1-4 of 8, Human Communication Research Centre, University of Edinburgh, LDC93S12, 1992, 8 discs
  • Hong Kong Hansards Parallel Text, LDC2000T50, 1 disc
  • ICE-GB (International Corpus of English, British Component), 1 disc
  • Japanese Business News Text, LDC95T8, 1 disc
  • JURIS, Justice Retrieval and Inquiry System, LDC98T32, 2 discs
  • NTIMIT, Telephone Network Acoustic-Phonetic Continuous Speech Corpus, NIST Speech Discs 10-1.1/10-2.1, NTIS Order No. PB92-502087, LDC93S2, 1992, 2 discs
  • RM1, Resource Management, Continuous Speech Database, LDC93S3B:
    Speaker-Dependent Training Data, NIST Corpus 2-1.1 and 2-2.1, 1989, NTIS Order No. PB89-226666, DARPA, 2 discs
    Speaker Independent Training Data, NIST Disc 2-3.1, NTIS Order No. PB90-500539, 1989, DARPA, 1 disc
    Development Test and Evaluation Test Data and Scoring Software, NIST Speech Disc 2-4.2, 1992, DARPA, 1 disc
  • RM1, Resource Management, Continuous Speech Database, Isolated - and Spelled - Word Data, NIST Speech Disc 2-5.1, 1996, DARPA, LDC96S39, 1 disc (2 copies)
  • RM2, Extended Resource Management, Continuous Speech Speaker-Dependent Corpus (RM2), NIST Speech Discs 3-1.2 and 3-2.2, NTIS Order No. PB90-501776, LDC93S3C, 1990, 2 discs
  • Santa Barbara Corpus of Spoken American English, LDC2000S85, 3 discs
  • Santa Barbara Corpus of Spoken American English Part-II, LDC2003S06, 1 DVD
  • SLX Corpus of Classic Sociolinguistic Interviews, LDC2003T15, 1 DVD
  • SPIDRE, Speaker Identification Research Corpus, NIST speech discs 18-1.1 and 18-2.1, 1994, LDC94S15, 2 discs
  • SPINE, Speech in Noisy Environments, LDC2000S87, 4 discs
  • Switchboard Corpus, Recorded Telephone Conversations, NIST, 26 discs, 1992, obsolete
  • Switchboard Corpus, Excerpts, Credit Card Conversations, NIST Speech Disc 8-1.2, LDC93S8, 1992, 1 disc
  • Switchboard-1 Release 2, LDC97S62, 23 parts
  • The Penn Treebank Project, Preliminary Release 0.5, 1992, LDC, 1 disc, obsolete
  • The Penn Treebank Project, Release 2, 1995, LDC95T7, 1 disc
  • The Penn Treebank Project, Release 3, LDC99T42, 1 disc
  • The Prague Dependency Bank 1.0 (Czech), LDC2001T10, 1 disc
  • Topic Detection and Tracking (TDT2), LDC2000S92, 2 discs [special user agreement]
  • Topic Detection and Tracking (TDT3) Multilanguage Text 2, LDC2001T58, 1 disc
  • TIDIGITS, Studio Quality Speaker-Independent Connected-Digit Corpus, NIST Speech Discs 4-1, 4-2, 4-3, NTIS PB-91-506592, LDC93S10, 1991
  • TIMIT, Acoustic-Phonetic Continuous Speech Corpus, NIST Speech Disc 1-1.1, NTIS Order No. PB91-505065, LDC93S1, 1990, 1 disc
  • Tipster Complete, LDC93T3A, 3 discs [special user agreement]
  • TRAINS Spoken Dialog Corpus, LDC95S25
  • TREC (Text Research Collection) Vol. 4, 1 disc
  • TREC (Text Research Collection) Vol. 5, 1 disc
  • UN Parallel Text (Complete), LDC94T4A, 3 CDs
  • VAHA, Voice Across Hispanic America, LDC96S41, 2 discs
  • Voicemail Corpus Part 1, Speech and Transcripts, LDC98S77, 1 disc
  • WSJ0, Continuous Speech Recognition Corpus, NIST, LDC93S6A, 1993, 15 discs
  • 1997 Broadcast News Speech Corpus (CSR-VI: Hub 4), LDC98S71, 1997, 18 discs
  • 1998 Speaker Recognition Evaluation, NIST/LDC, LDC98S76, 1998, 6 discs



Corpora only available in archive form
(as of 02/04/2004)

Some corpora are distributed via ftp by the LDC. Thus we don't have any CDs for them and if we have downloaded them but not yet installed them, they are listed here. The archives are stored on AFS under /afs/ir/data/linguistic-data/ldc/LDC-tarfiles/ if not mentioned otherwise, and all filenames contain the LDC catalogue number which should make the identification of the corpus unproblematic.

Note that the just-mentioned directory contains several other files which are archives of corpora that are already installed under AFS and therefore not currently listed in this section.
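
Since the archive filenames contain the LDC catalogue number, a short script can map archives to corpora. The following is only a sketch: the regular expression is inferred from the catalogue numbers that appear on this page (e.g. LDC93S1, LDC94S14A, LDC2003T06), and the example filenames are hypothetical, not actual files in the archive directory.

```python
import re

# Pattern inferred from catalogue numbers on this page: "LDC", a 2- or 4-digit
# year, one uppercase type letter, 1-3 digits, and an optional trailing letter.
# The match is case-sensitive.
LDC_ID = re.compile(r"LDC(?:\d{4}|\d{2})[A-Z]\d{1,3}[A-Z]?")


def catalogue_number(filename):
    """Return the first LDC catalogue number found in `filename`, or None."""
    match = LDC_ID.search(filename)
    return match.group(0) if match else None


def index_archives(filenames):
    """Map each catalogue number to the archive filenames that mention it."""
    index = {}
    for filename in filenames:
        number = catalogue_number(filename)
        if number is not None:
            index.setdefault(number, []).append(filename)
    return index


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    names = ["LDC2003T06.tar.gz", "ldc-LDC94S14A-transcripts.tgz", "README"]
    for number, files in sorted(index_archives(names).items()):
        print(number, files)
```

To use it on the real directory, pass the result of os.listdir() for the archive directory to index_archives and compare the catalogue numbers against the lists on this page.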




Corpora on the WWW (a very small collection)

This is only a very small collection of online corpora; please see the top 10 info-sources page for links to sites with far more information.

  • British National Corpus
      By Beth Levin: You can find information on doing searches in the SARA Manual (http://thetis.bl.uk/CHAP4/). This manual is intended for the full version of the BNC, but it contains information on pattern searching in section 3. With some ingenuity, you can even do searches by lexical category. You only get 50 randomly chosen examples of the pattern you are searching for at a time, so if there are 80-100 or more examples of the pattern, search for it again, since you may get some more examples. (We now have the BNC in AFS space, but we haven't installed the SARA server yet. You can, however, use gsearch with it.)

      Note - You may also find it useful to have the BNC Basic Tagset open in a separate window while doing your first searches in the BNC.

  • COBUILD Corpus
      By Beth Levin: This web site also has material from a whole range of written and spoken sources, used in the development of the Collins COBUILD dictionaries and ESL materials. It allows for some relatively sophisticated searching; click on "query syntax" for details. Particularly nice is "@": read@ searches for all the inflected forms of read! You only get 40 examples at a time, but can partially get around this in the same way as with the BNC. The other shortcoming is that the window of text is very small -- often not even a whole sentence. If you want more text, you have to search for the string of text defining the left or right edge of the example.

      Note - There is also the Cobuild Concordance and Collocations Sampler.
  • Lexis-Nexis Academic Universe
      By Beth Levin: This web site has material from major newspapers, wire services, and television news programs. Although it is not a "balanced" corpus, it has so much text that you can find things you won't find anywhere else (well, maybe through a web search!); for example, outgraze, outdefend, outweed or he reads himself quasi-blind and the fans spin themselves dizzy. The other major drawback is that it is not designed for linguists, so you cannot take full advantage of what's there and you also need to work around search procedures that were designed for research into current events.

      From the Lexis-Nexis home page choose "General News". From "General News" choose the "More Options" tab; using the "Basic" tab will only allow searches of the title and first paragraph of an article! On the "More Options" search page, always be sure to click on "Headline" and then select "Full text" from the popup menu that comes up. You can select from a range of sources; the default is "Major Newspapers". Similarly, you can select from a range of dates. When you get the list of results, click on "Expanded List"; this will show you the part of the text that contains your search pattern for each result.

      There are tips at the bottom of the search page that tell you how to construct searches. A few pointers: You can use "!" as a wild card for truncation on the right edges of words. Also helpful is "pre/n", for some "n", which allows you to search for a word that precedes another within a window of n words; "read! pre/2 way" will find patterns such as reads her way or reading our slow way, etc. (But "read!" will also find reader.) Another disadvantage: Lexis-Nexis has a large list of stop words -- words that you can't search for -- including just about any determiner, auxiliary, and preposition.

  • OED new version (slow and requires a graphical browser) or old version (fast and lynx-friendly)
  • DIALOGUE DIVERSITY CORPUS: Version 2.0
      From their website: The DDC gives direct access to a set of dialogue transcripts (13 sources, more than 12 hours of dialogue, all in English.). It also gives a set of links and methods for indirect access to hundreds of additional dialogues (principally in English.) Many sources provide speech data as well as transcripts. The emphasis is on free or inexpensive access.

      Volume 2.0 presents access to hundreds of dialogues that were not represented in the original release in October 2002. It is more diverse in terms of situations and dynamic patterns. Access to oral history interviews, the Watergate tapes (by several paths), diverse regional varieties of English (both British and international), the just-emerging American National Corpus (ANC), the U. S. Supreme Court, and other originally non-linguistic sources are presented for the first time.

      The dialogues in this corpus occurred in a very diverse collection of interactive situations. Thus it is a data resource for studies of the breadth of coverage of particular dialogue models, and for studies that compare dialogue from different situations.

  • TITUS corpus and search engine [signed license conditions & user agreement need to be faxed to the number stated on the user agreement].
      By Florian Jaeger: TITUS is a large collection of Indo-European text that can be searched with the TITUS search engine. Several types of searches are available. You can search specific texts, restrict the search to specific languages (e.g. Farsi) or language groups (e.g. Avestan, Lithuanian, or simply Old Prussian) or combinations of these restrictions. Wildcards (e.g. '*') or logical operators (e.g. 'AND' or 'OR') can be used, too. The output is a list of texts that contain matches to your search. The site also contains lexica, links to several other text databases, tutorials, etc. A rich source for anyone working on Indo-European languages.
  • COSMAS II is a German giga corpus with almost 2 billion (!) text words. It is accessible via the COSMAS II Online Client. Unfortunately, all help and information available for this corpus is given in German.
      By Florian Jaeger: COSMAS II contains morphosyntactically annotated parts, speech, formal writings (news texts, novels, etc.), as well as special sociolinguistic corpora. You can load subcorpora (so-called 'archives') and construct searches using a text or a graphical interface, both of which take some getting used to (but it's worth it). The searches have almost regular-expression power: you can search for tagged morphosyntactic information, save your searches, and filter your results. An inflection and derivation operator, '&', is also available.
  • Right on your doorstep you can find the Text@Humanities project of the Humanities Digital Information Service, a collection of online searchable (literary) texts drawn from American English, Irish, other varieties of English, German, French and Spanish.
  • The W3-Corpora Search Engine is an online search engine on a large collection of corpora (still in prototype stage but looks promising).
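
The two Lexis-Nexis operators described above ("!" for right truncation and "pre/n" for ordered proximity) can be emulated on local text to see what a query will and will not catch. The sketch below is not Lexis-Nexis code. It assumes that "pre/n" allows at most n words between the two search terms, a reading inferred from the examples "reads her way" and "reading our slow way" for "read! pre/2 way". The function names are hypothetical.

```python
import re


def matches_truncation(pattern, word):
    """'read!' matches any word starting with 'read'; a plain pattern must match exactly."""
    word = word.lower()
    if pattern.endswith("!"):
        return word.startswith(pattern[:-1].lower())
    return word == pattern.lower()


def pre_n(first, second, n, text):
    """Find spans where `first` precedes `second` with at most `n` words in between.

    The "at most n intervening words" reading is an assumption inferred from
    the examples in the text, not taken from Lexis-Nexis documentation.
    """
    words = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, word in enumerate(words):
        if not matches_truncation(first, word):
            continue
        # Look at the next n + 1 words for the second term.
        for j in range(i + 1, min(i + n + 2, len(words))):
            if matches_truncation(second, words[j]):
                hits.append(" ".join(words[i:j + 1]))
                break
    return hits


if __name__ == "__main__":
    sample = "She reads her way through the archive. The reader lost his way."
    print(pre_n("read!", "way", 2, sample))
```

Running the script on the sample sentence returns both "reads her way" and the unwanted "reader lost his way", which illustrates why truncated queries usually need manual filtering afterwards.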