Available corpora

Corpora@Stanford

Overview

This page has replaced an older corpus inventory page as of 04/01/2004. The old page is still accessible if you need it.

Besides the corpora that we own on CD (which you can get from the Corpus TA), many corpora are installed and ready to use on either the AFS space or the corpus computer (CC). Some additional speech corpora may be available in the phonetics lab (see also the tip below). Please contact the Phonetics RA or the Corpus TA if you have questions about speech corpora. This page is not intended to give an overview of online corpora outside of Stanford, although a very small selection is provided at the end of the page. We strongly encourage you to take advantage of the variety of freely accessible online corpora; for links to sites that give an overview of the colorful world of online corpora, please browse our subjectively compiled list of the top 10 info-sources "out there".

This page contains the following sections: recent acquisitions, LDC corpora, non-LDC corpora, and a very small collection of online corpora.

Tip-1: Since this page is not a database you may find it useful to just use your browser search function if you are interested in a specific corpus, or corpora for a specific language, or any other key word (e.g. "syntactically annotated"). Try it!

Tip-2: Interested in prosody? See Florian Jaeger's page of annotated links to prosodically annotated speech corpora.

Tip-3: If you cannot find the corpus you are looking for, see if we can order it! Our department is a member of the LDC (Linguistic Data Consortium), which gives us free access to a lot of corpora. The department may also be able to order other corpora. Simply tell the Corpus TA what you need, but first have a look at the information on "ordering corpora from the LDC", or browse the web to see whether what you need can be found online (and maybe for free). Also keep in mind that our inventory may be outdated, since our resources to maintain this list are limited.

Tip-4: There are a couple of thematic groups of corpora on AFS:

  • IE (Information Extraction):
    /afs/ir/data/linguistic-data/IE
    • Corporate Acquisitions Annotated Reuters Texts:
      /afs/ir/data/linguistic-data/IE/CorpAcq-Reuters-Freitag
    • Kristie Seymore's Information Extraction Data:
      /afs/ir/data/linguistic-data/IE/Kristie-Seymore-IE
    • MUC3-4 (Message Understanding Conference):
      /afs/ir/data/linguistic-data/IE/MUC/MUC3-4
    • MUC-6 (Message Understanding Conference) Text collection, LDC96T10:
      /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC96T10-Muc6
    • MUC-6 (Message Understanding Conference), LDC2003T13:
      /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC2003T13-muc6
    • Mooney Job Data:
      /afs/ir/data/linguistic-data/IE/Mooney-Job-Data
    • Census 1990 Names:
      /afs/ir/data/linguistic-data/IE/census1990names
  • Text Categorization:
    /afs/ir/data/linguistic-data/TextCat
    • 20Newsgroups:
      /afs/ir/data/linguistic-data/TextCat/20Newsgroups
    • DavidLewis (Reuters, TREC-AP):
      /afs/ir/data/linguistic-data/TextCat/DavidLewis
    • Spam Filtering:
      /afs/ir/data/linguistic-data/TextCat/Spam-Filtering
  • TREC (Information Retrieval Text Research Collection):
    /afs/ir/data/linguistic-data/TREC/
  • WSD (Word Sense Disambiguation):
    /afs/ir/data/linguistic-data/WSD
    • DSO Sense-Tagged, LDC97T12:
      /afs/ir/data/linguistic-data/WSD/DSO-Sense-Tagged
    • Leacock's Data:
      /afs/ir/data/linguistic-data/WSD/leacock
    • Pedersen's Data:
      /afs/ir/data/linguistic-data/WSD/pedersen
    • Senseval1:
      /afs/ir/data/linguistic-data/WSD/senseval/senseval1

Recently acquired corpora

We've acquired a fair number of corpora and tools recently. Notably, we now have several new treebanks at Stanford, and we have updated some older corpora to newer versions:

  • Arabic Treebank: Part 1 v3.0
  • Multiple Translation Arabic
  • ACE Time Normalization (TERN) 2004 English Training Data v1.0
  • BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
  • Chinese Treebank 5.0
  • Levantine Arabic QT Training Data Set 3 Speech
  • Levantine Arabic QT Training Data Set 3 Transcripts
  • Fisher English Training Speech Part 1 Transcripts
  • Fisher English Training Speech Part 1 Speech
  • Prague Arabic Dependency Treebank 1.0
  • Switchboard Cellular Part 2 Audio
  • Talkbank Ethology Data: Field Recordings of Vervet Monkey Calls
  • Arabic English Parallel News Part 1
  • Arabic News Translation Text Part 1
  • Santa Barbara Corpus of Spoken American English 3

  • Acquisitions that predate 01/01/2004 are listed on the old corpus inventory page.


[ Recent acquisitions | LDC Corpora | Non-LDC Corpora | Online corpora | top ]

LDC corpora

Below you will find a complete list of all our LDC corpora, ordered by release year with the most recent first. An entry in grey font means that we do not own that corpus (because Stanford wasn't a member of the LDC in that year).

How to read the table below:
The first column gives the location of the corpus, either on AFS or on the Corpus Computer (CC). An empty entry means that the corpus is currently not installed. If an AFS path is given, the corpus is stored on AFS. If the path starts with "CC://", the corpus is installed on the Corpus Computer; replace "CC://" with the following path to get the location of the corpus on CC:
    D:/
The second column lists the number of CDs/DVDs that the corpus is stored on, or "ftp" if the corpus was delivered via ftp/email. The ftp archives are stored on AFS under the following path (all filenames contain the LDC catalog number, which should make it easy to identify the corpus):
    /afs/ir/data/linguistic-data/ldc/LDC-tarfiles/
The third column lists the LDC catalog number. Click on the link to go to the LDC catalog entry for the corpus. Before you use a corpus, please read the copyright and license restrictions given in the catalog entry. Finally, the last column contains the name of the corpus.
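The location rules above can be sketched in a few lines of Python. This is only an illustration, not a tool that exists on AFS or the Corpus Computer: the function names resolve_locations and find_ftp_archives are hypothetical, and the comma-separated handling is an assumption based on rows that list an AFS path and a "CC://" path together (such as English Gigaword).

```python
from pathlib import Path
from typing import List, Tuple

CC_PREFIX = "CC://"
CC_ROOT = "D:/"  # drive root given on this page for the Corpus Computer
FTP_ARCHIVE_DIR = "/afs/ir/data/linguistic-data/ldc/LDC-tarfiles/"


def resolve_locations(cell: str) -> List[Tuple[str, str]]:
    """Turn one Location cell into (host, path) pairs.

    An empty cell yields an empty list, i.e. the corpus is not installed.
    """
    resolved = []
    for raw in cell.split(","):
        location = raw.strip()
        if not location:
            continue
        if location.startswith(CC_PREFIX):
            # Replace the "CC://" prefix with the drive root on the Corpus Computer.
            resolved.append(("CC", CC_ROOT + location[len(CC_PREFIX):]))
        elif location.startswith("/afs/"):
            resolved.append(("AFS", location))
        else:
            # Assumption: anything else is reported as-is rather than guessed at.
            resolved.append(("unknown", location))
    return resolved


def find_ftp_archives(catalog_number: str,
                      archive_dir: str = FTP_ARCHIVE_DIR) -> List[Path]:
    """List archive files whose name contains the given LDC catalog number."""
    directory = Path(archive_dir)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if catalog_number in p.name)


if __name__ == "__main__":
    print(resolve_locations("CC://Arabic Gigaword/"))
    print(find_ftp_archives("LDC2004T14"))
```

Matching on the catalog number as a substring relies only on what this page states about the archive filenames; the actual naming scheme of the files is not assumed.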

How to read the LDC catalog numbers: Each LDC corpus has a unique catalog number. The first three characters are always 'LDC'. The following digits give the year in which the corpus was released (two digits for releases through 1999, four digits from 2000 on). The third part of the catalog number is a single letter representing the corpus type (for example L for Lexicon, S for Speech, or T for Text). The final digits uniquely distinguish that corpus from other corpora of that type. Note that availability of corpora is not necessarily restricted to members of the release year. To see what membership years a certain corpus is available for, click on the catalog number and check the detailed listing for the corpus.
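As a rough illustration of this scheme, the following Python sketch splits a catalog number into its parts. The names parse_catalog_number and CatalogNumber are hypothetical. Two things are assumptions drawn from the entries in the table below rather than from any LDC specification: two-digit years are read as 19xx, and a trailing letter or "-n" part (as in LDC94S14A or LDC96S64-1) is kept as an uninterpreted suffix. Only the three type letters named above are spelled out; any other letter is reported as "other".

```python
import re
from typing import NamedTuple, Optional

# 'LDC' + year (4 or 2 digits) + type letter + serial digits + optional suffix
_CATALOG_RE = re.compile(r"^LDC(\d{4}|\d{2})([A-Z])(\d+)([A-Z]?)(?:-(\d+))?$")

# Only the types named on this page; other letters are not interpreted here.
_TYPE_NAMES = {"L": "Lexicon", "S": "Speech", "T": "Text"}


class CatalogNumber(NamedTuple):
    year: int
    type_code: str
    type_name: str
    serial: int
    suffix: str


def parse_catalog_number(catalog_id: str) -> Optional[CatalogNumber]:
    """Split an LDC catalog number into year, type, serial number and suffix."""
    match = _CATALOG_RE.match(catalog_id.strip())
    if match is None:
        return None
    year_text, type_code, serial, part_letter, part_number = match.groups()
    # Assumption: two-digit years belong to the 1900s (LDC93... to LDC99...).
    year = int(year_text) if len(year_text) == 4 else 1900 + int(year_text)
    suffix = part_letter + (f"-{part_number}" if part_number else "")
    type_name = _TYPE_NAMES.get(type_code, "other")
    return CatalogNumber(year, type_code, type_name, int(serial), suffix)


if __name__ == "__main__":
    for example in ("LDC2005T02", "LDC99S81", "LDC94T4B-1"):
        print(example, parse_catalog_number(example))
```

Returning None for strings that do not fit the pattern keeps the sketch from guessing at catalog numbers that follow some other convention.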

Tip 1: You can also search the LDC Catalog directly by type and source, by year, or by project. You may also use the general catalog search.

Location   Orig. medium   LDC ID   Name of corpus
1DVD  LDC2005T02 Arabic Treebank: Part 1 v3.0
1CD  LDC2005T06 Chinese News Translation Text Part 1
1CD  LDC2005T08 Discourse Graphbank
1CD  LDC2005T09 ACE 2004 Multilingual Training Corpus
1CD  LDC2005T05 Multiple Translation Arabic
1CD  LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v1.0
1CD  LDC2005T03 Levantine Arabic QT Training Data Set 3 Transcripts
1DVD  LDC2005S07 Levantine Arabic QT Training Data Set 3 Speech
/afs/ir/data/linguistic-data/Treebank/LDC2005T01-Chinese-Treebank-5.0 1CD  LDC2005T01 Chinese Treebank 5.0
2DVD  LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
/afs/ir/data/linguistic-data/Buckwalter-Arabic-Morphological-Analyzer-2.0 ftp  LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0
/afs/ir/data/linguistic-data/LDC2004T19-Fisher-Transcripts 1CD  LDC2004T19 Fisher English Training Speech Part 1 Transcripts

7DVD  LDC2004S13 Fisher English Training Speech Part 1 Speech

1CD  LDC2004T23 Prague Arabic Dependency Treebank 1.0

3DVD  LDC2004S07 Switchboard Cellular Part 2 Audio

1DVD  LDC2004S12 Talkbank Ethology Data: Field Recordings of Vervet Monkey Calls

ftp  LDC2004T18 Arabic English Parallel News Part 1

ftp  LDC2004T17 Arabic News Translation Text Part 1

1DVD  LDC2004S10 Santa Barbara Corpus of Spoken American English 3

2DVD  LDC2004S08 MDE RT-03 Training Data Speech

9DVD  LDC2004S09 NIST Meeting Pilot Corpus Speech

1DVD  LDC2004T08 Hong Kong Parallel Text

1DVD  LDC2004T12 MDE RT-03 Training Data Text and Annotations

1CD  LDC2004T16 2001 Communicator Dialogue Act Tagged

1DVD  LDC2004V01 FORM1 Kinematic Gesture

ftp  LDC2004T15 2000 Communicator Dialogue Act Tagged
/afs/ir/data/linguistic-data/Proposition-Bank-1 ftp  LDC2004T14 Proposition Bank I

ftp  LDC2004T07 Multiple-Translation Chinese (MTC) Part 3

ftp  LDC2004T13 NIST Meeting Pilot Corpus Transcripts and Metadata

2 DVD  LDC2004S04 2002 NIST Speaker Recognition Evaluation (SRE)

1 CD  LDC2004T11 Arabic Treebank: Part 3 v.1.0

2 DVD  LDC2004S05 ISL Meeting Corpus Speech Part 1

ftp  LDC2004T10 ISL Meeting Corpus Transcripts Part 1

ftp  LDC2004T01 Czech Broadcast News Transcripts

2 DVD  LDC2004S01 Czech Broadcast News Speech
/afs/ir/data/linguistic-data/Chinese-Treebank ftp  LDC2004T05 Chinese Treebank Version 4.0

9 DVD  LDC2004S02 ICSI Meeting Speech

ftp  LDC2004T04 ICSI Meeting Transcripts

ftp  LDC2004L01 Klex: Finite-State Lexical Transducer for Korean

ftp  LDC2004T03 Morphologically Annotated Korean Text

ftp  LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data

ftp  LDC2003T04 1997 HUB5 Spanish Transcripts

ftp  LDC2003T03 1997 HUB5 German Transcripts

ftp  LDC2003T02 1998 HUB5 English Transcripts

1 DVD  LDC2003S01 2001 Communicator Evaluation
/afs/ir/data/linguistic-data/ldc/LDC2003T01-2001-HUB5-Mandarin-Transcripts ftp  LDC2003T01 2001 HUB5 Mandarin Transcripts

ftp  LDC2003T11 ACE-2 Version 1.0

1 CD  LDC2003T20 ANC First Release
CC://Arabic Gigaword/ 1 DVD  LDC2003T12 Arabic Gigaword
/afs/ir/data/linguistic-data/Arabic-Treebank/Arabic-Treebank-Trans ftp  LDC2003T07 Arabic Treebank: Part 1 - 10K-word English Translation
/afs/ir/data/linguistic-data/Arabic-Treebank/Arabic-Treebank-2.0 ftp  LDC2003T06 Arabic Treebank: Part 1 v 2.0
CC://Chinese Gigaword/ 1 DVD  LDC2003T09 Chinese Gigaword
/afs/ir/data/linguistic-data/English Gigaword/,CC://English Gigaword/ 1 DVD  LDC2003T05 English Gigaword

1 CD  LDC2003V01 FORM2 Kinematic Gesture

1 CD  LDC2003L01 Grassfields Bantu Fieldwork: Dschang Lexicon

1 CD  LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms

3 CD  LDC2003P01 Korean Telephone Conversations Complete Set

ftp  LDC2003L02 Korean Telephone Conversations Lexicon

3 CD  LDC2003S03 Korean Telephone Conversations Speech

ftp  LDC2003T08 Korean Telephone Conversations Transcripts
/afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC2003T13-muc6/ ftp  LDC2003T13 Message Understanding Conference (MUC) 6
/afs/ir/data/linguistic-data/MTA ftp  LDC2003T18 Multiple-Translation Arabic (MTA) Part 1

ftp  LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
/afs/ir/data/linguistic-data/SAID/ ftp  LDC2003T10 SAID

1 DVD  LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews
CC://Santa Barbara II/ 1 DVD  LDC2003S06 Santa Barbara Corpus of Spoken American English Part-II

4 DVD  LDC2003T16 SummBank 1.0

1 CD  LDC2003S05 West Point Russian Speech

1 CD  LDC2002S22 1997 HUB5 Arabic Evaluation

ftp  LDC2002T39 1997 HUB5 Arabic Transcripts

1 CD  LDC2002S24 1997 HUB5 German Evaluation

1 CD  LDC2002S25 1997 HUB5 Spanish Evaluation

1 CD  LDC2002S10 1998 HUB5 English Evaluation

7 CD  LDC2002S56 2000 Communicator Evaluation

1 CD  LDC2002S13 2001 HUB5 English Evaluation

1 CD  LDC2002S12 2001 HUB5 Mandarin Evaluation

1 CD  LDC2002S34 2001 NIST Speaker Recognition Evaluation Corpus

ftp  LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0

1 CD  LDC2002S37 Callhome Egyptian Arabic Speech Supplement

ftp  LDC2002T38 Callhome Egyptian Arabic Transcripts Supplement

ftp  LDC2002L27 Chinese-English Translation Lexicon Version 3.0

5 CD  LDC2002S28 Emotional Prosody Speech and Transcripts

ftp  LDC2002T26 Korean English Treebank Annotations

ftp  LDC2002T01 Multiple-Translation Chinese Corpus

ftp  LDC2002T07 RST Discourse Treebank
CC://Switchboard - Dan's version/, /afs/ir/data/linguistic-data/Switchboard/ (only transcripts) 20 CD  LDC2002S06 Switchboard-2 Phase III Audio
/afs/ir/data/linguistic-data/AQUAINT 2 CD  LDC2002T31 The AQUAINT Corpus of English News Text

6 CD  LDC2002S04 Translanguage English Database (TED) Speech

ftp  LDC2002T03 Translanguage English Database (TED) Transcripts

1 CD  LDC2002S35 Voicemail Corpus Part II

3 CD  LDC2002S02 West Point Arabic Speech Corpus

8 CD  LDC2001S97 2000 NIST Speaker Recognition Evaluation

1 CD  LDC2001T55 Arabic Newswire Part 1

ftp  LDC2001T61 CALLHOME Spanish Dialogue Act Annotation

1 CD  LDC2001T62 Cetempublico
/afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank-2/ ftp  LDC2001T11 Chinese Treebank Version 2.0

1 CD  LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms

ftp  LDC2001T02 Message Understanding Conference (MUC) 7
/afs/ir/data/linguistic-data/PragueDependencyTreebank_v1.0/ 1 CD  LDC2001T10 Prague Dependency Treebank 1.0

3 CD  LDC2001S04 Speech in Noisy Environments (SPINE2) Part 1 Audio

ftp  LDC2001T05 Speech in Noisy Environments (SPINE2) Part 1 Transcripts

2 CD  LDC2001S06 Speech in Noisy Environments (SPINE2) Part 2 Audio

ftp  LDC2001T07 Speech in Noisy Environments (SPINE2) Part 2 Transcripts

3 CD  LDC2001S08 Speech in Noisy Environments (SPINE2) Part 3 Audio

ftp  LDC2001T09 Speech in Noisy Environments (SPINE2) Part 3 Transcripts

8 CD  LDC2001S99 Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio

13 CD  LDC2001S13 Switchboard Cellular Part 1 Audio

3 CD  LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
/afs/ir/data/linguistic-data/ldc/LDC2001T14-Swbd-Cell-1-Trans ftp  LDC2001T14 Switchboard Cellular Part 1 Transcription

ftp  LDC2001T60 Syllable-Final /s/ Lenition

6 CD  LDC2001S93 TDT2 Mandarin Audio Corpus
/afs/ir/data/linguistic-data/TDT2-Multilingual/ 1 CD  LDC2001T57 TDT2 Multilanguage Text Version 4.0

55 CD  LDC2001S94 TDT3 English Audio

13 CD  LDC2001S95 TDT3 Mandarin Audio
/afs/ir/data/linguistic-data/TDT2-Multilingual/ 1 CD  LDC2001T58 TDT3 Multilanguage Text Version 2.0
/afs/ir/data/linguistic-data/LDC2000S88-1999-HUB4-Test 1 CD  LDC2000S88 1999 HUB-4 Broadcast News Evaluation English Test Material
/afs/ir/data/linguistic-data/BLLIP-WSJ/ 2 CD  LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1
/afs/ir/data/linguistic-data/Hansard-Hong-Kong/ 1 CD  LDC2000T50 Hong Kong Hansards Parallel Text
/afs/ir/data/linguistic-data/Hong-Kong-Laws/ ftp  LDC2000T47 Hong Kong Laws Parallel Text
/afs/ir/data/linguistic-data/Hong-Kong-News/ ftp  LDC2000T46 Hong Kong News Parallel Text

1 CD  LDC2000T45 Korean Newswire
/afs/ir/data/linguistic-data/Santa-Barbara/ 3 CD  LDC2000S85 Santa Barbara Corpus of Spoken American English Part-I

4 CD  LDC2000S96 Speech in Noisy Environments (SPINE) Evaluation Audio

ftp  LDC2000T54 Speech in Noisy Environments (SPINE) Evaluation Transcripts
/afs/ir/data/linguistic-data/SPINE/ 4 CD  LDC2000S87 Speech in Noisy Environments (SPINE) Training Audio

ftp  LDC2000T49 Speech in Noisy Environments (SPINE) Training Transcripts

2 CD  LDC2000S92 TDT2 Careful Transcription Audio
/afs/ir/data/linguistic-data/TDT2-Careful/ ftp  LDC2000T44 TDT2 Careful Transcription Text

1 CD  LDC2000T52 TREC Mandarin

1 CD  LDC2000T51 TREC Spanish

ftp  LDC2000T53 Voice of America (VOA) Broadcast News Czech Transcript Corpus

6 CD  LDC2000S89 Voice of America (VOA) Czech Broadcast News Audio

5 CD  LDC99S81 1999 Speaker Recognition Benchmark

6 CD  LDC99S80 1997 Speaker Recognition Benchmark

4 CD  LDC99L23 American English Spoken Lexicon

ftp  LDC99L22 Egyptian Colloquial Arabic Lexicon

1 CD  LDC99T34 Japanese Business News Text Supplement

1 CD  LDC99T40 Portuguese Newswire Text

1 CD  LDC99S78 SUSAS

ftp  LDC99T33 SUSAS Transcripts

1 CD  LDC99T41 Spanish Newswire Text, Volume 2
CC://Switchboard - Dan's version/, /afs/ir/data/linguistic-data/Switchboard/ (only transcripts) 32 CD  LDC99S79 Switchboard-2 Phase II

73 CD  LDC99S84 TDT2 English Audio

10 CD  LDC99S83 Tactical Speaker Identification Speech Corpus (TSID)
/afs/ir/data/linguistic-data/Treebank 1 CD  LDC99T42 Treebank-3

7 CD  LDC99S82 USC Marketplace Broadcast News Speech

ftp  LDC99T36 USC Marketplace Broadcast News Transcripts

18 CD  LDC98S71 1997 English Broadcast News Speech (Hub-4)
/afs/ir/data/linguistic-data/Broadcast-News-Transcripts/ ftp  LDC98T28 1997 English Broadcast News Transcripts (Hub-4)

8 CD  LDC98S73 1997 Mandarin Broadcast News Speech (Hub-4NE)
/afs/ir/data/linguistic-data/ldc/LDC98T24-1997-Mandarin-Broadcast-News-Transcripts ftp  LDC98T24 1997 Mandarin Broadcast News Transcripts (Hub-4NE)

9 CD  LDC98S74 1997 Spanish Broadcast News Speech (Hub-4NE)
/afs/ir/data/linguistic-data/Spanish-Broadcast-News/ ftp  LDC98T29 1997 Spanish Broadcast News Transcripts (Hub-4NE)

6 CD  LDC98S76 1998 Speaker Recognition Benchmark

ftp  LDC98L21 COMLEX English Syntax Lexicon

3 CD  LDC98S67 HTIMIT

2 CD  LDC98S69 Hub-5 Mandarin Telephone Speech Corpus
/afs/ir/data/linguistic-data/ldc/LDC98T26-Hub-5-Mandarin-Transcripts ftp  LDC98T26 Hub-5 Mandarin Transcripts

5 CD  LDC98S70 Hub-5 Spanish Telephone Speech Corpus
/afs/ir/data/linguistic-data/Hub5-Spanish-Transcripts/ ftp  LDC98T27 Hub-5 Spanish Transcripts
/afs/ir/data/linguistic-data/1996-CSR-Hub-4-LM 2CD  LDC98T31 1996 CSR Hub-4 Language Model

2 CD  LDC98T32 JURIS

2 CD  LDC98S68 LLHDB

2 CD  LDC98T30 North American News Text Supplement
CC://Switchboard - Dan's version/, /afs/ir/data/linguistic-data/Switchboard/ (only transcripts) 26 CD  LDC98S75 Switchboard-2 Phase 1
/afs/ir/data/linguistic-data/TDT-Pilot-Study/ ftp  LDC98T25 TDT Pilot Study Corpus

2 CD  LDC98S72 Taiwanese Putonghua Speech and Transcripts
/afs/ir/data/linguistic-data/Voicemail1/ 1 CD  LDC98S77 Voicemail Corpus-Part I

19 CD  LDC97S44 1996 English Broadcast News Speech (Hub-4)
/afs/ir/data/linguistic-data/Broadcast-News-Transcripts/hub4_eng_train_trans/ ftp  LDC97T22 1996 English Broadcast News Transcripts (Hub-4)
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Lexicon/ ftp  LDC97L20 CALLHOME American English Lexicon (PRONLEX)

3 CD  LDC97S42 CALLHOME American English Speech
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Transcripts/ ftp  LDC97T14 CALLHOME American English Transcripts

3 CD  LDC97S45 CALLHOME Egyptian Arabic Speech

ftp  LDC97T19 CALLHOME Egyptian Arabic Transcripts
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Arabic-Lexicon/ ftp  LDC97L19 CALLHOME Egyptian Arabic Lexicon
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Lexicon/ ftp  LDC97L18 CALLHOME German Lexicon

3 CD  LDC97S43 CALLHOME German Speech
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Transcripts/ ftp  LDC97T15 CALLHOME German Transcripts
/afs/ir/data/linguistic-data/WSD/DSO-Sense-Tagged/ ftp  LDC97T12 DSO Corpus of Sense-Tagged English
/afs/ir/data/linguistic-data/Switchboard/Audio-swbd1ph2 23 CD  LDC97S62 SWITCHBOARD-1 Release 2

2 CD  LDC97S63 The CMU Kids Corpus
/afs/ir/data/linguistic-data/Boston-University-Radio/ 4 CD  LDC96S36 Boston University Radio Speech Corpus

3 CD  LDC96S46 CALLFRIEND American English-Non-Southern Dialect

3 CD  LDC96S47 CALLFRIEND American English-Southern Dialect

3 CD  LDC96S48 CALLFRIEND Canadian French

3 CD  LDC96S49 CALLFRIEND Egyptian Arabic

3 CD  LDC96S50 CALLFRIEND Farsi

3 CD  LDC96S51 CALLFRIEND German

3 CD  LDC96S52 CALLFRIEND Hindi

3 CD  LDC96S53 CALLFRIEND Japanese

3 CD  LDC96S54 CALLFRIEND Korean

3 CD  LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect

3 CD  LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect

3 CD  LDC96S57 CALLFRIEND Spanish-Caribbean Dialect

3 CD  LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect

3 CD  LDC96S59 CALLFRIEND Tamil

3 CD  LDC96S60 CALLFRIEND Vietnamese

3 CD  LDC96S61 1996 Speaker Recognition Benchmark
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Japanese-Lexicon/ ftp  LDC96L17 CALLHOME Japanese Lexicon

3 CD  LDC96S37 CALLHOME Japanese Speech

ftp  LDC96T18 CALLHOME Japanese Transcripts
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Lexicon/ ftp  LDC96L15 CALLHOME Mandarin Chinese Lexicon

2 CD  LDC96S34 CALLHOME Mandarin Chinese Speech
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Transcripts/ ftp  LDC96T16 CALLHOME Mandarin Chinese Transcripts
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Lexicon/ ftp  LDC96L16 CALLHOME Spanish Lexicon

2 CD  LDC96S35 CALLHOME Spanish Speech
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Transcripts/ ftp  LDC96T17 CALLHOME Spanish Transcripts
/afs/ir/data/linguistic-data/CELEX/ 1 CD  LDC96L14 CELEX2

4 CD  LDC96S33 CSR-IV Hub 3

3 CD  LDC96S31 CSR-IV Hub 4

1 CD  LDC96S30 CTIMIT

12 CD  LDC96S38 DCIEM/HCRC

1 CD  LDC96S32 FFMTIMIT

1 CD  LDC96S29 Frontiers in Speech Processing 93

1 CD  LDC96S40 Frontiers in Speech Processing 94

6 CD  LDC96S64-1 JEIDA/JCSD-Channel 0 City Names

20 CD  LDC96S64 JEIDA/JCSD-Channel 0 Complete

4 CD  LDC96S64-2 JEIDA/JCSD-Channel 0 Control Words

3 CD  LDC96S64-4 JEIDA/JCSD-Channel 0 Four Digit Sequences

1 CD  LDC96S64-3 JEIDA/JCSD-Channel 0 Isolated Digits

6 CD  LDC96S64-5 JEIDA/JCSD-Channel 0 Mono Syllables

6 CD  LDC96S65-1 JEIDA/JCSD-Channel 1 City Names

20 CD  LDC96S65 JEIDA/JCSD-Channel 1 Complete

4 CD  LDC96S65-2 JEIDA/JCSD-Channel 1 Control Words

3 CD  LDC96S65-4 JEIDA/JCSD-Channel 1 Four Digit Sequences

1 CD  LDC96S65-3 JEIDA/JCSD-Channel 1 Isolated Digits

6 CD  LDC96S65-5 JEIDA/JCSD-Channel 1 Mono Syllables
/afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC96T10-Muc6/ ftp  LDC96T10 Message Understanding Conference (MUC) 6 Additional News Text

2 CD  LDC96S41 VAHA (POLYPHONE II)

3 CD  LDC95S23 CSR-III Speech

4 CD  LDC95T6 CSR-III Text

1 CD  LDC95T11 European Language Newspaper Text
/afs/ir/data/linguistic-data/Hansard-French/ 2 CD  LDC95T20 Hansard French/English
/afs/ir/data/linguistic-data/Japanese-Business-News/ 1 CD  LDC95T8 Japanese Business News Text

1 CD  LDC95S22 KING Speaker Verification

2 CD  LDC95S28 LATINO-40 Spanish Read News

1 CD  LDC95T13 Mandarin Chinese News Text
/afs/ir/data/linguistic-data/North-American-News/ 2 CD  LDC95T21 North American News Text Corpus

3 CD  LDC95S27 PHONEBOOK: NYNEX Isolated Words

1 CD  LDC95T9 Spanish News Text
/afs/ir/data/linguistic-data/TRAINS/ 1 CD  LDC95S25 TRAINS spoken dialog corpus
/afs/ir/data/linguistic-data/Treebank/ 1 CD  LDC95T7 Treebank-2

6 CD  LDC95S24 WSJCAM0 Cambridge Read News
/afs/ir/data/linguistic-data/Air-Traffic-Control/ 8 CD  LDC94S14A Air Traffic Control Complete

2 CD  LDC94S14B Air Traffic Control BOS

3 CD  LDC94S14C Air Traffic Control DCA

3 CD  LDC94S14D Air Traffic Control DFW

4 CD  LDC94S20 BRAMSHILL

34 CD  LDC94S13A CSR-II (WSJ1) Complete

20 CD  LDC94S13C CSR-II (WSJ1) Other

19 CD  LDC94S13B CSR-II (WSJ1) Sennheiser
/afs/ir/data/linguistic-data/ECI-Multilingual/ 1 CD  LDC94T5 ECI Multilingual Text

8 CD  LDC94S21 MACROPHONE

1 CD  LDC94S17 OGI Multilanguage Corpus

1 CD  LDC94S18 OGI Spelled and Spoken Word

2 CD  LDC94S15 SPIDRE

3 CD  LDC94T4A UN Parallel Text (Complete)

1 CD  LDC94T4B-1 UN Parallel Text (English)

1 CD  LDC94T4B-2 UN Parallel Text (French)

1 CD  LDC94T4B-3 UN Parallel Text (Spanish)

1 CD  LDC94S16 YOHO Speaker Verification

1 CD  LDC93S11 Road Rally

6 CD  LDC93S4A ATIS0 Complete

1 CD  LDC93S4B ATIS0 Pilot

1 CD  LDC93S4B-2 ATIS0 Read

4 CD  LDC93S4B-3 ATIS0 SD Read

4 CD  LDC93S5 ATIS2

15 CD  LDC93S6A CSR-I (WSJ0) Complete

9 CD  LDC93S6C CSR-I (WSJ0) Other

9 CD  LDC93S6B CSR-I (WSJ0) Sennheiser
/afs/ir/data/linguistic-data/HCRC-Maptask-Transcripts/ 8 CD  LDC93S12 HCRC Map Task Corpus

2 CD  LDC93S2 NTIMIT

6 CD  LDC93S3A Resource Management Complete Set 2.0

4 CD  LDC93S3B Resource Management RM1 2.0

2 CD  LDC93S3C Resource Management RM2 2.0

1 CD  LDC93S8 SWITCHBOARD Credit Card
/afs/ir/data/linguistic-data/TIDIGITS/ 3 CD  LDC93S10 TIDIGITS
/afs/ir/data/linguistic-data/TIMIT/ 1 CD  LDC93S1 TIMIT Acoustic-Phonetic Continuous Speech Corpus
/afs/ir/data/linguistic-data/Tipster/ 3 CD  LDC93T3A TIPSTER Complete

1 CD  LDC93T3B TIPSTER Volume 1

1 CD  LDC93T3C TIPSTER Volume 2

1 CD  LDC93T3D TIPSTER Volume 3



Non-LDC corpora

The non-LDC corpora are listed below, mostly in alphabetical order. As for the LDC corpora, an AFS path and/or the prefix "CC://" indicates whether a corpus is installed on AFS, on the Corpus Computer (CC), or on both. If the path starts with "CC://", replace "CC://" with the following path to get the location of the corpus on CC:

    D:/

Name | Annotation | Language(s) | Location | Associated tools
John Rylands Univ Corpus of late 18c prose | - | Early Modern English | /afs/ir/data/linguistic-data/Rylands_Univ_Corpus_Late_18c_prose | -
Cornell SMART Archive | - | English | /afs/ir/data/linguistic-data/SMART-Archive | -
Enron Email Corpus | - | English | /afs/ir/data/linguistic-data/Enron-Email-Corpus | -
20Newsgroups | - | English | /afs/ir/data/linguistic-data/TextCat/20Newsgroups/ | -
Aleksova's corpus | - | Bulgarian (spoken) | CC://Bugarian Corpora/Aleksova/ | -
ATIS | Syntax, POS, some argument structure | English | CC://TIGERCorpora/Atis - parsed and tagged (Stanford release)/ | TIGERSearch
Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese | CC://BAScorpora/, /afs/ir/data/linguistic-data/Verbmobil-Dialogs/BAS-VM-annotation/ | -
British National Corpus (BNC) World Edition | - | English | /afs/ir/data/linguistic-data/BNC-World/ | Also searchable with Mark Davies' excellent interface to the BNC, Variation In English Words and Phrases (VIEW)
Brown Corpus | Syntax, POS, some argument structure | English | CC://TIGERCorpora/Brown Corpus - parsed and tagged (Stanford release), /afs/ir/data/linguistic-data/Treebank/tgrep2able/ | TIGERSearch
Census 1990 Names | - | English | /afs/ir/data/linguistic-data/IE/census1990names/ | -
CHRISTINE, Stage I, Release 2 | - | English | /afs/ir/data/linguistic-data/CHRISTINE/ | -
CMU Pronouncing Dictionary | - | English | /afs/ir/data/linguistic-data/CMU-Pronouncing-Dict/ | -
Corporate Acquisitions Annotated Reuters Texts | - | - | /afs/ir/data/linguistic-data/IE/CorpAcq-Reuters-Freitag/ | -
Corpus of Spoken Professional American English | POS | American English (spoken) | CC://CSPA - Corpus of Spoken Professional American English/ | MonoConc
DavidLewis (Reuters, TREC-AP) | - | English | /afs/ir/data/linguistic-data/TextCat/DavidLewis/ | -
Excite log | - | English | /afs/ir/data/linguistic-data/IR/EXCITE/ | -
International Computer Archive of Modern and Medieval English (ICAME) | diachronic corpus | English | /afs/ir/data/linguistic-data/ICAME/ | -
International Corpus of English - The British Component (ICE GB) | - | English | /afs/ir/data/linguistic-data/ICE-GB/, /afs/ir/data/linguistic-data/Treebank/tgrep2able/ | tgrep, tgrep2
IViE | Prosody, phonetic, etc. | British dialects | CC://IViE/ | -
Kristie Seymore's Information Extraction Data | - | English | /afs/ir/data/linguistic-data/IE/Kristie-Seymore-IE/ | -
LUCY, initial release | - | English | /afs/ir/data/linguistic-data/LUCY/ | -
MUC3-4 (Message Understanding Conference) | - | English | /afs/ir/data/linguistic-data/IE/MUC/MUC3-4/ | -
Mooney Job Data | - | English | /afs/ir/data/linguistic-data/IE/Mooney-Job-Data/ | -
PPCME2 [requires membership in a special group] | diachronic corpus | - | /afs/ir/data/linguistic-data/PPCME2/ | -
Proposition Bank (experimental pre-release) | predicate structure enriched treebank | English | /afs/ir/data/linguistic-data/PropBank/ | related tools
NEGRA | Syntax (LFG-based), POS, some argument structure | German | CC://TIGERCorpora/NEGRA-parsed/ | TIGERSearch
Remedia Story Comprehension (use requires special permission) | - | English | /afs/ir/data/linguistic-data/QA/Remedia-Story-Comprehension/ | -
Reuters Corpus | - | English | /afs/ir/data/linguistic-data/Reuters-Corpus/ | -
RNC German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) | CC://RNC - German Radio News Corpus/ | -
Spam Filtering | - | English | /afs/ir/data/linguistic-data/TextCat/Spam-Filtering/ | -
Switchboard Corpus | Syntax, POS, some argument structure | English (spoken) | CC://TIGERCorpora/Switchboard Corpus - parsed and tagged (Stanford release)/, /afs/ir/data/linguistic-data/Treebank/tgrep2able/ | TIGERSearch
Switchboard LINK Project Corpus | Syntax, POS; some arg str, animacy, information status, and coreference | English (spoken) | /afs/ir/data/linguistic-data/Treebank/LINK-swbd/ (requires special permission and understanding of citation requirements in the README file) | tgrep2
SUSANNE Corpus, Release 5 | - | English | /afs/ir/data/linguistic-data/SUSANNE/ | -
TIGER Treebank [Version 1] | Syntax (LFG-based), POS, some argument structure | German | CC://TIGERCorpora/TIGERcorpus 1.0 (July 2003)/ | TIGERSearch
TIGER sample corpora | Syntax, POS, some argument structure | English | CC://TIGERCorpora/CorpusSamplers/ | TIGERSearch
Unified Medical Language System (UMLS) | - | English | /afs/ir/data/linguistic-data/UMLS/ | -
YCOE | Syntax, POS, CAT, lemma | English | CC://TIGERCorpora/YCOE/, /afs/ir/data/linguistic-data/YCOE/ | TIGERSearch
Verbmobil Dialogs | - | German, English, Japanese | /afs/ir/data/linguistic-data/Verbmobil-Dialogs | -
Wall Street Journal | Syntax, POS, some argument structure | English | CC://TIGERCorpora/Wall Street Journal - parsed and tagged (Stanford release)/, /afs/ir/data/linguistic-data/Treebank/tgrep2able/ | TIGERSearch



Corpora on the WWW (a very small collection)

This is only a very small collection of online corpora. Please see the top 10 info-sources page for links to sites with far more information.

  • British National Corpus
      By Beth Levin: You can find information on doing searches in the SARA Manual (http://thetis.bl.uk/CHAP4/). This manual is intended for the full version of the BNC, but it contains information on pattern searching in section 3. With some ingenuity, you can even do searches by lexical category. You only get 50 randomly chosen examples of the pattern you are searching for at a time, so if there are 80-100 or more examples of the pattern, search for it again, since you may get some more examples. (We now have the BNC in AFS space, but we haven't installed the SARA server yet. You can, however, use gsearch with it.)

      Note - You may also find it useful to have the BNC Basic Tagset open in a separate window while doing your first searches in the BNC.

  • COBUILD Corpus
      By Beth Levin: This web site also has material from a whole range of written and spoken sources, used in the development of the Collins COBUILD dictionaries and ESL materials. It allows for some relatively sophisticated searching; click on "query syntax" for details. Particularly nice is "@": read@ searches for all the inflected forms of read! You only get 40 examples at a time, but can partially get around this in the same way as with the BNC. The other shortcoming is that the window of text is very small -- often not even a whole sentence. If you want more text, you have to search for the string of text defining the left or right edge of the example.

      Note - There is also the Cobuild Concordance and Collocations Sampler.
  • Lexis-Nexis Academic Universe
      By Beth Levin: This web site has material from major newspapers, wire services, and television news programs. Although it is not a "balanced" corpus, it has so much text that you can find things you won't find anywhere else (well, maybe through a web search!); for example, outgraze, outdefend, outweed or he reads himself quasi-blind and the fans spin themselves dizzy. The other major drawback is that it is not designed for linguists, so that you cannot take full advantage of what's there and you also need to work around search procedures that were designed for research into current events.

      From the Lexis-Nexis home page choose ``General News''. From ``General News'' choose the ``More Options'' tab; using the ``Basic'' tab will only allow searches of the title and first paragraph of an article! On the ``More Options'' search page, always be sure to click on ``Headline'' and then select ``Full text'' from the popup menu that comes up. You can select from a range of sources (the default is ``Major Newspapers''); similarly, you can select from a range of dates. When you get the list of results, click on ``Expanded List''; this will show you the part of the text with your search pattern for each result.

      There are tips at the bottom of the search page that tell you how to construct searches. A few pointers: you can use ``!'' as a wild card for truncation on the right edge of words. Also helpful is ``pre/n'', for some ``n'', which allows you to search for a word that precedes another within a window of n words; ``read! pre/2 way'' will find patterns such as reads her way or reading our slow way, etc. (But ``read!'' will also find reader.) Another disadvantage: Lexis-Nexis has a large list of stop words -- words that you can't search for -- including just about any determiner, auxiliary, and preposition.
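
      To get a feel for what a query like ``read! pre/2 way'' matches, the behaviour described above can be approximated on a local text with a regular expression. This is a rough sketch, not Lexis-Nexis's own implementation. The function name pre_n is hypothetical, and the window is assumed to allow at most n intervening words, an interpretation chosen so that both examples given above match.

```python
import re

def pre_n(text, prefix, target, n):
    """Approximate the query `prefix! pre/n target` on a local string.

    ASSUMPTION: a word beginning with `prefix` must be followed by
    `target` with at most `n` words in between. Matching is
    case-insensitive and does not cross punctuation.
    """
    pattern = re.compile(
        rf"\b{re.escape(prefix)}\w*"      # truncated word, e.g. read!
        rf"(?:\s+\w+){{0,{n}}}?"          # up to n intervening words
        rf"\s+{re.escape(target)}\b",     # the second search word
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]

if __name__ == "__main__":
    sample = ("She reads her way through the list. "
              "He was reading our slow way home. "
              "The reader found a way.")
    for hit in pre_n(sample, "read", "way", 2):
        print(hit)
```

      Note that the last sentence of the sample is matched as well, which mirrors the caveat that the truncated form also finds reader.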

  • OED new version (slow and requires a graphical browser) or old version (fast and lynx-friendly)
  • DIALOGUE DIVERSITY CORPUS: Version 2.0
      From their website: The DDC gives direct access to a set of dialogue transcripts (13 sources, more than 12 hours of dialogue, all in English). It also gives a set of links and methods for indirect access to hundreds of additional dialogues (principally in English). Many sources provide speech data as well as transcripts. The emphasis is on free or inexpensive access.

      Volume 2.0 presents access to hundreds of dialogues that were not represented in the original release in October 2002. It is more diverse in terms of situations and dynamic patterns. Access to oral history interviews, the Watergate tapes (by several paths), diverse regional varieties of English (both British and international), the just-emerging American National Corpus (ANC), the U. S. Supreme Court, and other originally non-linguistic sources are presented for the first time.

      The dialogues in this corpus occurred in a very diverse collection of interactive situations. Thus it is a data resource for studies of the breadth of coverage of particular dialogue models, and for studies that compare dialogue from different situations.

  • TITUS corpus and search engine [signed license conditions & user agreement need to be faxed to the number stated on the user agreement].
      By Florian Jaeger: TITUS is a large collection of Indo-European text that can be searched with the TITUS search engine. Several types of searches are available. You can search specific texts, restrict the search to specific languages (e.g. Farsi) or language groups (e.g. Avestan, Lithuanian, or simply Old Prussian), or combine these restrictions. Wildcards (e.g. '*') and logical operators (e.g. 'AND' or 'OR') can be used, too. The output is a list of texts that contain matches to your search. The site also contains lexica, links to several other text databases, tutorials, etc. A rich source for anyone working on Indo-European languages.
  • COSMAS II is a German giga corpus with almost 2 billion (!) text words. It is accessible via the COSMAS II Online Client. Unfortunately, all help and information available for this corpus is given in German.
      By Florian Jaeger: COSMAS II contains morphosyntactically annotated parts, speech, formal writings (news texts, novels, etc.), as well as special sociolinguistic corpora. You can load subcorpora (so-called 'archives') and construct searches using a text or a graphical interface, both of which take some getting used to (but it's worth it). The searches have almost regular-expression power: you can search for tagged morphosyntactic information, save your searches, and filter your results. An inflection and derivation operator, '&', is also available.
  • Right on your doorstep you can find the Text@Humanities project of the Human Digital Information Service, a collection of online searchable (literary) texts drawn from American English, Irish, other varieties of English, German, French and Spanish.
  • The W3-Corpora Search Engine is an online search engine on a large collection of corpora (still in prototype stage but looks promising).

