Available corpora (Corpora@Stanford)
This page replaced an older corpus inventory page on 04/01/2004. If for some reason you want to access the old page, that is still possible.
Besides the corpora that we own on CD (which you can get from the Corpus TA), many corpora are installed and ready to use on either the AFS space or the corpus computer (CC). Some additional speech corpora may be available in the phonetics lab (see also the tip below). Please contact the Phonetics RA or the Corpus TA if you have questions about speech corpora. Although this page is not intended to give an overview of online corpora available outside of Stanford, a very small selection is provided here. We strongly encourage you to take advantage of the variety of freely accessible online corpora. For links to sites that give an overview of the colorful world of online corpora, please browse our subjectively compiled list of the top 10 info-sources "out there".
This page contains the following information:
Tip-1: Since this page is not a database you may find it useful to just use your browser search function if you are interested in a specific corpus, or corpora for a specific language, or any other key word (e.g. "syntactically annotated"). Try it!
Tip-2: Interested in prosody? See Florian Jaeger's page of annotated links to prosodically annotated speech corpora.
Tip-3: If you cannot find the corpus you were looking for, see if we can order it! Our department is a member of the LDC (Linguistic Data Consortium), which gives us free access to a lot of corpora. The department may also be able to order other corpora. Simply tell the Corpus TA what you need, but have a look at the information on "ordering corpora from the LDC" first, or browse the web to see whether what you need can be found online (and maybe for free). Also keep in mind that our inventory may be outdated, since our resources to maintain this list are limited.
Tip-4: Thematically grouped corpora on AFS: There are a couple of thematic groups of corpora on AFS:
We've acquired a fair number of corpora and tools recently. Notably, we've now got several new treebanks at Stanford, and we have updated some older corpora to newer versions:
How to read the LDC catalog numbers: Each LDC corpus has a unique catalog number. The first three characters are always 'LDC'. The digits that follow give the year in which the corpus was released: two digits for releases through 1999 (e.g. LDC99T42) and four digits from 2000 on (e.g. LDC2005T02). The third part of the catalog number is a single letter representing the corpus type (L for Lexicon, S for Speech, T for Text). The final digits uniquely distinguish that corpus from other corpora of that type. Note that availability of a corpus is not necessarily restricted to members of its release year. To see which membership years a certain corpus is available for, click on the catalog number and check the detailed listing for the corpus.
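The catalog-number scheme described above can be sketched as a small parser. This is only an illustration, not an official LDC tool: the names `parse_catalog_number`, `CatalogNumber` and `TYPE_NAMES` are made up for this example, the regular expression is inferred from the IDs listed on this page, and mapping two-digit years to 19xx is an assumption. Type letters other than L, S and T are reported as "unknown".

```python
import re
from typing import NamedTuple

# Pattern inferred from the IDs on this page (e.g. LDC2005T02, LDC99S81,
# LDC94T4B-1); it is an assumption, not an official LDC specification.
_CATALOG_RE = re.compile(r"^LDC(\d{4}|\d{2})([A-Z])(\d+)([A-Z]?)(?:-(\d+))?$")

# Only the three type letters named in the text are decoded.
TYPE_NAMES = {"L": "Lexicon", "S": "Speech", "T": "Text"}


class CatalogNumber(NamedTuple):
    year: int         # release year
    type_letter: str  # single letter after the year
    type_name: str    # decoded type, or "unknown"
    serial: str       # digits distinguishing the corpus within its type
    suffix: str       # optional sub-part marker such as "A" or "B-1"


def parse_catalog_number(catalog_id: str) -> CatalogNumber:
    """Split an LDC catalog number into year, type and serial parts."""
    match = _CATALOG_RE.match(catalog_id.strip())
    if match is None:
        raise ValueError(f"not a recognizable LDC catalog number: {catalog_id!r}")
    year_text, letter, serial, part, sub = match.groups()
    year = int(year_text)
    if len(year_text) == 2:
        year += 1900  # assumption: two-digit years all belong to the 1990s
    suffix = part + (f"-{sub}" if sub else "")
    return CatalogNumber(year, letter, TYPE_NAMES.get(letter, "unknown"), serial, suffix)


if __name__ == "__main__":
    for example in ("LDC2005T02", "LDC99S81", "LDC94T4B-1"):
        print(example, parse_catalog_number(example))
```

Sorting a list of catalog numbers by the parsed `year` field is one simple way to group the inventory by release year.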
Tip 1: You can also search the LDC Catalog directly by type and source, by year, or by project. You may also use the general catalog search.
| Location | Orig. | ID | Name of corpus |
|---|---|---|---|
| | 1 DVD | LDC2005T02 | Arabic Treebank: Part 1 v3.0 |
| | 1 CD | LDC2005T06 | Chinese News Translation Text Part 1 |
| | 1 CD | LDC2005T08 | Discourse Graphbank |
| | 1 CD | LDC2005T09 | ACE 2004 Multilingual Training Corpus |
| | 1 CD | LDC2005T05 | Multiple Translation Arabic |
| | 1 CD | LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data v1.0 |
| | 1 CD | LDC2005T03 | Levantine Arabic QT Training Data Set 3 Transcripts |
| | 1 DVD | LDC2005S07 | Levantine Arabic QT Training Data Set 3 Speech |
| /afs/ir/data/linguistic-data/Treebank/LDC2005T01-Chinese-Treebank-5.0 | 1 CD | LDC2005T01 | Chinese Treebank 5.0 |
| | 2 DVD | LDC2005S08 | BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts |
| /afs/ir/data/linguistic-data/Buckwalter-Arabic-Morphological-Analyzer-2.0 | ftp | LDC2004L02 | Buckwalter Arabic Morphological Analyzer Version 2.0 |
| /afs/ir/data/linguistic-data/LDC2004T19-Fisher-Transcripts | 1CD | LDC2004T19 | Fisher English Training Speech Part 1 Transcripts |
| | 7 DVD | LDC2004S13 | Fisher English Training Speech Part 1 Speech |
| | 1 CD | LDC2004T23 | Prague Arabic Dependency Treebank 1.0 |
| | 3 DVD | LDC2004S07 | Switchboard Cellular Part 2 Audio |
| | 1 DVD | LDC2004S12 | Talkbank Ethology Data: Field Recordings of Vervet Monkey Calls |
| | ftp | LDC2004T18 | Arabic English Parallel News Part 1 |
| | ftp | LDC2004T17 | Arabic News Translation Text Part 1 |
| | 1 DVD | LDC2004S10 | Santa Barbara Corpus of Spoken American English 3 |
| | 2 DVD | LDC2004S08 | MDE RT-03 Training Data Speech |
| | 9 DVD | LDC2004S09 | NIST Meeting Pilot Corpus Speech |
| | 1 DVD | LDC2004T08 | Hong Kong Parallel Text |
| | 1 DVD | LDC2004T12 | MDE RT-03 Training Data Text and Annotations |
| | 1 CD | LDC2004T16 | 2001 Communicator Dialogue Act Tagged |
| | 1 DVD | LDC2004V01 | FORM1 Kinematic Gesture |
| | ftp | LDC2004T15 | 2000 Communicator Dialogue Act Tagged |
| /afs/ir/data/linguistic-data/Proposition-Bank-1 | ftp | LDC2004T14 | Proposition Bank I |
| | ftp | LDC2004T07 | Multiple-Translation Chinese (MTC) Part 3 |
| | ftp | LDC2004T13 | NIST Meeting Pilot Corpus Transcripts and Metadata |
| | 2 DVD | LDC2004S04 | 2002 NIST Speaker Recognition Evaluation (SRE) |
| | 1 CD | LDC2004T11 | Arabic Treebank: Part 3 v.1.0 |
| | 2 DVD | LDC2004S05 | ISL Meeting Corpus Speech Part 1 |
| | ftp | LDC2004T10 | ISL Meeting Corpus Transcripts Part 1 |
| | ftp | LDC2004T01 | Czech Broadcast News Transcripts |
| | 2 DVD | LDC2004S01 | Czech Broadcast News Speech |
| /afs/ir/data/linguistic-data/Chinese-Treebank | ftp | LDC2004T05 | Chinese Treebank Version 4.0 |
| | 9 DVD | LDC2004S02 | ICSI Meeting Speech |
| | ftp | LDC2004T04 | ICSI Meeting Transcripts |
| | ftp | LDC2004L01 | Klex: Finite-State Lexical Transducer for Korean |
| | ftp | LDC2004T03 | Morphologically Annotated Korean Text |
| | ftp | LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data |
| | ftp | LDC2003T04 | 1997 HUB5 Spanish Transcripts |
| | ftp | LDC2003T03 | 1997 HUB5 German Transcripts |
| | ftp | LDC2003T02 | 1998 HUB5 English Transcripts |
| | 1 DVD | LDC2003S01 | 2001 Communicator Evaluation |
| /afs/ir/data/linguistic-data/ldc/LDC2003T01-2001-HUB5-Mandarin-Transcripts | ftp | LDC2003T01 | 2001 HUB5 Mandarin Transcripts |
| | ftp | LDC2003T11 | ACE-2 Version 1.0 |
| | 1 CD | LDC2003T20 | ANC First Release |
| CC://Arabic Gigaword/ | 1 DVD | LDC2003T12 | Arabic Gigaword |
| /afs/ir/data/linguistic-data/Arabic-Treebank/Arabic-Treebank-Trans | ftp | LDC2003T07 | Arabic Treebank: Part 1 - 10K-word English Translation |
| /afs/ir/data/linguistic-data/Arabic-Treebank/Arabic-Treebank-2.0 | ftp | LDC2003T06 | Arabic Treebank: Part 1 v 2.0 |
| CC://Chinese Gigaword/ | 1 DVD | LDC2003T09 | Chinese Gigaword |
| /afs/ir/data/linguistic-data/English Gigaword/, CC://English Gigaword/ | 1 DVD | LDC2003T05 | English Gigaword |
| | 1 CD | LDC2003V01 | FORM2 Kinematic Gesture |
| | 1 CD | LDC2003L01 | Grassfields Bantu Fieldwork: Dschang Lexicon |
| | 1 CD | LDC2003S02 | Grassfields Bantu Fieldwork: Dschang Tone Paradigms |
| | 3 CD | LDC2003P01 | Korean Telephone Conversations Complete Set |
| | ftp | LDC2003L02 | Korean Telephone Conversations Lexicon |
| | 3 CD | LDC2003S03 | Korean Telephone Conversations Speech |
| | ftp | LDC2003T08 | Korean Telephone Conversations Transcripts |
| /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC2003T13-muc6/ | ftp | LDC2003T13 | Message Understanding Conference (MUC) 6 |
| /afs/ir/data/linguistic-data/MTA | ftp | LDC2003T18 | Multiple-Translation Arabic (MTA) Part 1 |
| | ftp | LDC2003T17 | Multiple-Translation Chinese (MTC) Part 2 |
| /afs/ir/data/linguistic-data/SAID/ | ftp | LDC2003T10 | SAID |
| | 1 DVD | LDC2003T15 | SLX Corpus of Classic Sociolinguistic Interviews |
| CC://Santa Barbara II/ | 1 DVD | LDC2003S06 | Santa Barbara Corpus of Spoken American English Part-II |
| | 4 DVD | LDC2003T16 | SummBank 1.0 |
| | 1 CD | LDC2003S05 | West Point Russian Speech |
| | 1 CD | LDC2002S22 | 1997 HUB5 Arabic Evaluation |
| | ftp | LDC2002T39 | 1997 HUB5 Arabic Transcripts |
| | 1 CD | LDC2002S24 | 1997 HUB5 German Evaluation |
| | 1 CD | LDC2002S25 | 1997 HUB5 Spanish Evaluation |
| | 1 CD | LDC2002S10 | 1998 HUB5 English Evaluation |
| | 7 CD | LDC2002S56 | 2000 Communicator Evaluation |
| | 1 CD | LDC2002S13 | 2001 HUB5 English Evaluation |
| | 1 CD | LDC2002S12 | 2001 HUB5 Mandarin Evaluation |
| | 1 CD | LDC2002S34 | 2001 NIST Speaker Recognition Evaluation Corpus |
| | ftp | LDC2002L49 | Buckwalter Arabic Morphological Analyzer Version 1.0 |
| | 1 CD | LDC2002S37 | Callhome Egyptian Arabic Speech Supplement |
| | ftp | LDC2002T38 | Callhome Egyptian Arabic Transcripts Supplement |
| | ftp | LDC2002L27 | Chinese-English Translation Lexicon Version 3.0 |
| | 5 CD | LDC2002S28 | Emotional Prosody Speech and Transcripts |
| | ftp | LDC2002T26 | Korean English Treebank Annotations |
| | ftp | LDC2002T01 | Multiple-Translation Chinese Corpus |
| | ftp | LDC2002T07 | RST Discourse Treebank |
| CC://Switchboard - Dan's version/, /afs/ir/data/linguistic-data/Switchboard/ (only transcripts) | 20 CD | LDC2002S06 | Switchboard-2 Phase III Audio |
| /afs/ir/data/linguistic-data/AQUAINT | 2 CD | LDC2002T31 | The AQUAINT Corpus of English News Text |
| | 6 CD | LDC2002S04 | Translanguage English Database (TED) Speech |
| | ftp | LDC2002T03 | Translanguage English Database (TED) Transcripts |
| | 1 CD | LDC2002S35 | Voicemail Corpus Part II |
| | 3 CD | LDC2002S02 | West Point Arabic Speech Corpus |
| | 8 CD | LDC2001S97 | 2000 NIST Speaker Recognition Evaluation |
| | 1 CD | LDC2001T55 | Arabic Newswire Part 1 |
| | ftp | LDC2001T61 | CALLHOME Spanish Dialogue Act Annotation |
| | 1 CD | LDC2001T62 | Cetempublico |
| /afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank-2/ | ftp | LDC2001T11 | Chinese Treebank Version 2.0 |
| | 1 CD | LDC2001S16 | Grassfields Bantu Fieldwork: Ngomba Tone Paradigms |
| | ftp | LDC2001T02 | Message Understanding Conference (MUC) 7 |
| /afs/ir/data/linguistic-data/PragueDependencyTreebank_v1.0/ | 1 CD | LDC2001T10 | Prague Dependency Treebank 1.0 |
| | 3 CD | LDC2001S04 | Speech in Noisy Environments (SPINE2) Part 1 Audio |
| | ftp | LDC2001T05 | Speech in Noisy Environments (SPINE2) Part 1 Transcripts |
| | 2 CD | LDC2001S06 | Speech in Noisy Environments (SPINE2) Part 2 Audio |
| | ftp | LDC2001T07 | Speech in Noisy Environments (SPINE2) Part 2 Transcripts |
| | 3 CD | LDC2001S08 | Speech in Noisy Environments (SPINE2) Part 3 Audio |
| | ftp | LDC2001T09 | Speech in Noisy Environments (SPINE2) Part 3 Transcripts |
| | 8 CD | LDC2001S99 | Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio |
| | 13 CD | LDC2001S13 | Switchboard Cellular Part 1 Audio |
| | 3 CD | LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio |
| /afs/ir/data/linguistic-data/ldc/LDC2001T14-Swbd-Cell-1-Trans | ftp | LDC2001T14 | Switchboard Cellular Part 1 Transcription |
| | ftp | LDC2001T60 | Syllable-Final /s/ Lenition |
| | 6 CD | LDC2001S93 | TDT2 Mandarin Audio Corpus |
| /afs/ir/data/linguistic-data/TDT2-Multilingual/ | 1 CD | LDC2001T57 | TDT2 Multilanguage Text Version 4.0 |
| | 55 CD | LDC2001S94 | TDT3 English Audio |
| | 13 CD | LDC2001S95 | TDT3 Mandarin Audio |
| /afs/ir/data/linguistic-data/TDT2-Multilingual/ | 1 CD | LDC2001T58 | TDT3 Multilanguage Text Version 2.0 |
| /afs/ir/data/linguistic-data/LDC2000S88-1999-HUB4-Test | 1 CD | LDC2000S88 | 1999 HUB-4 Broadcast News Evaluation English Test Material |
| /afs/ir/data/linguistic-data/BLLIP-WSJ/ | 2 CD | LDC2000T43 | BLLIP 1987-89 WSJ Corpus Release 1 |
| /afs/ir/data/linguistic-data/Hansard-Hong-Kong/ | 1 CD | LDC2000T50 | Hong Kong Hansards Parallel Text |
| /afs/ir/data/linguistic-data/Hong-Kong-Laws/ | ftp | LDC2000T47 | Hong Kong Laws Parallel Text |
| /afs/ir/data/linguistic-data/Hong-Kong-News/ | ftp | LDC2000T46 | Hong Kong News Parallel Text |
| | 1 CD | LDC2000T45 | Korean Newswire |
| /afs/ir/data/linguistic-data/Santa-Barbara/ | 3 CD | LDC2000S85 | Santa Barbara Corpus of Spoken American English Part-I |
| | 4 CD | LDC2000S96 | Speech in Noisy Environments (SPINE) Evaluation Audio |
| | ftp | LDC2000T54 | Speech in Noisy Environments (SPINE) Evaluation Transcripts |
| /afs/ir/data/linguistic-data/SPINE/ | 4 CD | LDC2000S87 | Speech in Noisy Environments (SPINE) Training Audio |
| | ftp | LDC2000T49 | Speech in Noisy Environments (SPINE) Training Transcripts |
| | 2 CD | LDC2000S92 | TDT2 Careful Transcription Audio |
| /afs/ir/data/linguistic-data/TDT2-Careful/ | ftp | LDC2000T44 | TDT2 Careful Transcription Text |
| | 1 CD | LDC2000T52 | TREC Mandarin |
| | 1 CD | LDC2000T51 | TREC Spanish |
| | ftp | LDC2000T53 | Voice of America (VOA) Broadcast News Czech Transcript Corpus |
| | 6 CD | LDC2000S89 | Voice of America (VOA) Czech Broadcast News Audio |
| | 5 CD | LDC99S81 | 1999 Speaker Recognition Benchmark |
| | 6 CD | LDC99S80 | 1997 Speaker Recognition Benchmark |
| | 4 CD | LDC99L23 | American English Spoken Lexicon |
| | ftp | LDC99L22 | Egyptian Colloquial Arabic Lexicon |
| | 1 CD | LDC99T34 | Japanese Business News Text Supplement |
| | 1 CD | LDC99T40 | Portuguese Newswire Text |
| | 1 CD | LDC99S78 | SUSAS |
| | ftp | LDC99T33 | SUSAS Transcripts |
| | 1 CD | LDC99T41 | Spanish Newswire Text, Volume 2 |
| CC://Switchboard - Dan's version/, /afs/ir/data/linguistic-data/Switchboard/ (only transcripts) | 32 CD | LDC99S79 | Switchboard-2 Phase II |
| | 73 CD | LDC99S84 | TDT2 English Audio |
| | 10 CD | LDC99S83 | Tactical Speaker Identification Speech Corpus (TSID) |
| /afs/ir/data/linguistic-data/Treebank | 1 CD | LDC99T42 | Treebank-3 |
| | 7 CD | LDC99S82 | USC Marketplace Broadcast News Speech |
| | ftp | LDC99T36 | USC Marketplace Broadcast News Transcripts |
| | 18 CD | LDC98S71 | 1997 English Broadcast News Speech (Hub-4) |
| /afs/ir/data/linguistic-data/Broadcast-News-Transcripts/ | ftp | LDC98T28 | 1997 English Broadcast News Transcripts (Hub-4) |
| | 8 CD | LDC98S73 | 1997 Mandarin Broadcast News Speech (Hub-4NE) |
| /afs/ir/data/linguistic-data/ldc/LDC98T24-1997-Mandarin-Broadcast-News-Transcripts | ftp | LDC98T24 | 1997 Mandarin Broadcast News Transcripts (Hub-4NE) |
| | 9 CD | LDC98S74 | 1997 Spanish Broadcast News Speech (Hub-4NE) |
| /afs/ir/data/linguistic-data/Spanish-Broadcast-News/ | ftp | LDC98T29 | 1997 Spanish Broadcast News Transcripts (Hub-4NE) |
| | 6 CD | LDC98S76 | 1998 Speaker Recognition Benchmark |
| | ftp | LDC98L21 | COMLEX English Syntax Lexicon |
| | 3 CD | LDC98S67 | HTIMIT |
| | 2 CD | LDC98S69 | Hub-5 Mandarin Telephone Speech Corpus |
| /afs/ir/data/linguistic-data/ldc/LDC98T26-Hub-5-Mandarin-Transcripts | ftp | LDC98T26 | Hub-5 Mandarin Transcripts |
| | 5 CD | LDC98S70 | Hub-5 Spanish Telephone Speech Corpus |
| /afs/ir/data/linguistic-data/Hub5-Spanish-Transcripts/ | ftp | LDC98T27 | Hub-5 Spanish Transcripts |
| /afs/ir/data/linguistic-data/1996-CSR-Hub-4-LM | 2 CD | LDC98T31 | 1996 CSR Hub-4 Language Model |
| | 2 CD | LDC98T32 | JURIS |
| | 2 CD | LDC98S68 | LLHDB |
| | 2 CD | LDC98T30 | North American News Text Supplement |
| CC://Switchboard - Dan's version/, /afs/ir/data/linguistic-data/Switchboard/ (only transcripts) | 26 CD | LDC98S75 | Switchboard-2 Phase 1 |
| /afs/ir/data/linguistic-data/TDT-Pilot-Study/ | ftp | LDC98T25 | TDT Pilot Study Corpus |
| | 2 CD | LDC98S72 | Taiwanese Putonghua Speech and Transcripts |
| /afs/ir/data/linguistic-data/Voicemail1/ | 1 CD | LDC98S77 | Voicemail Corpus-Part I |
| | 19 CD | LDC97S44 | 1996 English Broadcast News Speech (Hub-4) |
| /afs/ir/data/linguistic-data/Broadcast-News-Transcripts/hub4_eng_train_trans/ | ftp | LDC97T22 | 1996 English Broadcast News Transcripts (Hub-4) |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Lexicon/ | ftp | LDC97L20 | CALLHOME American English Lexicon (PRONLEX) |
| | 3 CD | LDC97S42 | CALLHOME American English Speech |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Transcripts/ | ftp | LDC97T14 | CALLHOME American English Transcripts |
| | 3 CD | LDC97S45 | CALLHOME Egyptian Arabic Speech |
| | ftp | LDC97T19 | CALLHOME Egyptian Arabic Transcripts |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Arabic-Lexicon/ | ftp | LDC97L19 | CALLHOME Egyptian Arabic Lexicon |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Lexicon/ | ftp | LDC97L18 | CALLHOME German Lexicon |
| | 3 CD | LDC97S43 | CALLHOME German Speech |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Transcripts/ | ftp | LDC97T15 | CALLHOME German Transcripts |
| /afs/ir/data/linguistic-data/WSD/DSO-Sense-Tagged/ | ftp | LDC97T12 | DSO Corpus of Sense-Tagged English |
| /afs/ir/data/linguistic-data/Switchboard/Audio-swbd1ph2 | 23 CD | LDC97S62 | SWITCHBOARD-1 Release 2 |
| | 2 CD | LDC97S63 | The CMU Kids Corpus |
| /afs/ir/data/linguistic-data/Boston-University-Radio/ | 4 CD | LDC96S36 | Boston University Radio Speech Corpus |
| | 3 CD | LDC96S46 | CALLFRIEND American English-Non-Southern Dialect |
| | 3 CD | LDC96S47 | CALLFRIEND American English-Southern Dialect |
| | 3 CD | LDC96S48 | CALLFRIEND Canadian French |
| | 3 CD | LDC96S49 | CALLFRIEND Egyptian Arabic |
| | 3 CD | LDC96S50 | CALLFRIEND Farsi |
| | 3 CD | LDC96S51 | CALLFRIEND German |
| | 3 CD | LDC96S52 | CALLFRIEND Hindi |
| | 3 CD | LDC96S53 | CALLFRIEND Japanese |
| | 3 CD | LDC96S54 | CALLFRIEND Korean |
| | 3 CD | LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect |
| | 3 CD | LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect |
| | 3 CD | LDC96S57 | CALLFRIEND Spanish-Caribbean Dialect |
| | 3 CD | LDC96S58 | CALLFRIEND Spanish-Non-Caribbean Dialect |
| | 3 CD | LDC96S59 | CALLFRIEND Tamil |
| | 3 CD | LDC96S60 | CALLFRIEND Vietnamese |
| | 3 CD | LDC96S61 | 1996 Speaker Recognition Benchmark |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Japanese-Lexicon/ | ftp | LDC96L17 | CALLHOME Japanese Lexicon |
| | 3 CD | LDC96S37 | CALLHOME Japanese Speech |
| | ftp | LDC96T18 | CALLHOME Japanese Transcripts |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Lexicon/ | ftp | LDC96L15 | CALLHOME Mandarin Chinese Lexicon |
| | 2 CD | LDC96S34 | CALLHOME Mandarin Chinese Speech |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Transcripts/ | ftp | LDC96T16 | CALLHOME Mandarin Chinese Transcripts |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Lexicon/ | ftp | LDC96L16 | CALLHOME Spanish Lexicon |
| | 2 CD | LDC96S35 | CALLHOME Spanish Speech |
| /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Transcripts/ | ftp | LDC96T17 | CALLHOME Spanish Transcripts |
| /afs/ir/data/linguistic-data/CELEX/ | 1 CD | LDC96L14 | CELEX2 |
| | 4 CD | LDC96S33 | CSR-IV Hub 3 |
| | 3 CD | LDC96S31 | CSR-IV Hub 4 |
| | 1 CD | LDC96S30 | CTIMIT |
| | 12 CD | LDC96S38 | DCIEM/HCRC |
| | 1 CD | LDC96S32 | FFMTIMIT |
| | 1 CD | LDC96S29 | Frontiers in Speech Processing 93 |
| | 1 CD | LDC96S40 | Frontiers in Speech Processing 94 |
| | 6 CD | LDC96S64-1 | JEIDA/JCSD-Channel 0 City Names |
| | 20 CD | LDC96S64 | JEIDA/JCSD-Channel 0 Complete |
| | 4 CD | LDC96S64-2 | JEIDA/JCSD-Channel 0 Control Words |
| | 3 CD | LDC96S64-4 | JEIDA/JCSD-Channel 0 Four Digit Sequences |
| | 1 CD | LDC96S64-3 | JEIDA/JCSD-Channel 0 Isolated Digits |
| | 6 CD | LDC96S64-5 | JEIDA/JCSD-Channel 0 Mono Syllables |
| | 6 CD | LDC96S65-1 | JEIDA/JCSD-Channel 1 City Names |
| | 20 CD | LDC96S65 | JEIDA/JCSD-Channel 1 Complete |
| | 4 CD | LDC96S65-2 | JEIDA/JCSD-Channel 1 Control Words |
| | 3 CD | LDC96S65-4 | JEIDA/JCSD-Channel 1 Four Digit Sequences |
| | 1 CD | LDC96S65-3 | JEIDA/JCSD-Channel 1 Isolated Digits |
| | 6 CD | LDC96S65-5 | JEIDA/JCSD-Channel 1 Mono Syllables |
| /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC96T10-Muc6/ | ftp | LDC96T10 | Message Understanding Conference (MUC) 6 Additional News Text |
| | 2 CD | LDC96S41 | VAHA (POLYPHONE II) |
| | 3 CD | LDC95S23 | CSR-III Speech |
| | 4 CD | LDC95T6 | CSR-III Text |
| | 1 CD | LDC95T11 | European Language Newspaper Text |
| /afs/ir/data/linguistic-data/Hansard-French/ | 2 CD | LDC95T20 | Hansard French/English |
| /afs/ir/data/linguistic-data/Japanese-Business-News/ | 1 CD | LDC95T8 | Japanese Business News Text |
| | 1 CD | LDC95S22 | KING Speaker Verification |
| | 2 CD | LDC95S28 | LATINO-40 Spanish Read News |
| | 1 CD | LDC95T13 | Mandarin Chinese News Text |
| /afs/ir/data/linguistic-data/North-American-News/ | 2 CD | LDC95T21 | North American News Text Corpus |
| | 3 CD | LDC95S27 | PHONEBOOK: NYNEX Isolated Words |
| | 1 CD | LDC95T9 | Spanish News Text |
| /afs/ir/data/linguistic-data/TRAINS/ | 1 CD | LDC95S25 | TRAINS spoken dialog corpus |
| /afs/ir/data/linguistic-data/Treebank/ | 1 CD | LDC95T7 | Treebank-2 |
| | 6 CD | LDC95S24 | WSJCAM0 Cambridge Read News |
| /afs/ir/data/linguistic-data/Air-Traffic-Control/ | 8 CD | LDC94S14A | Air Traffic Control Complete |
| | 2 CD | LDC94S14B | Air Traffic Control BOS |
| | 3 CD | LDC94S14C | Air Traffic Control DCA |
| | 3 CD | LDC94S14D | Air Traffic Control DFW |
| | 4 CD | LDC94S20 | BRAMSHILL |
| | 34 CD | LDC94S13A | CSR-II (WSJ1) Complete |
| | 20 CD | LDC94S13C | CSR-II (WSJ1) Other |
| | 19 CD | LDC94S13B | CSR-II (WSJ1) Sennheiser |
| /afs/ir/data/linguistic-data/ECI-Multilingual/ | 1 CD | LDC94T5 | ECI Multilingual Text |
| | 8 CD | LDC94S21 | MACROPHONE |
| | 1 CD | LDC94S17 | OGI Multilanguage Corpus |
| | 1 CD | LDC94S18 | OGI Spelled and Spoken Word |
| | 2 CD | LDC94S15 | SPIDRE |
| | 3 CD | LDC94T4A | UN Parallel Text (Complete) |
| | 1 CD | LDC94T4B-1 | UN Parallel Text (English) |
| | 1 CD | LDC94T4B-2 | UN Parallel Text (French) |
| | 1 CD | LDC94T4B-3 | UN Parallel Text (Spanish) |
| | 1 CD | LDC94S16 | YOHO Speaker Verification |
| | 1 CD | LDC93S11 | Road Rally |
| | 6 CD | LDC93S4A | ATIS0 Complete |
| | 1 CD | LDC93S4B | ATIS0 Pilot |
| | 1 CD | LDC93S4B-2 | ATIS0 Read |
| | 4 CD | LDC93S4B-3 | ATIS0 SD Read |
| | 4 CD | LDC93S5 | ATIS2 |
| | 15 CD | LDC93S6A | CSR-I (WSJ0) Complete |
| | 9 CD | LDC93S6C | CSR-I (WSJ0) Other |
| | 9 CD | LDC93S6B | CSR-I (WSJ0) Sennheiser |
| /afs/ir/data/linguistic-data/HCRC-Maptask-Transcripts/ | 8 CD | LDC93S12 | HCRC Map Task Corpus |
| | 2 CD | LDC93S2 | NTIMIT |
| | 6 CD | LDC93S3A | Resource Management Complete Set 2.0 |
| | 4 CD | LDC93S3B | Resource Management RM1 2.0 |
| | 2 CD | LDC93S3C | Resource Management RM2 2.0 |
| | 1 CD | LDC93S8 | SWITCHBOARD Credit Card |
| /afs/ir/data/linguistic-data/TIDIGITS/ | 3 CD | LDC93S10 | TIDIGITS |
| /afs/ir/data/linguistic-data/TIMIT/ | 1 CD | LDC93S1 | TIMIT Acoustic-Phonetic Continuous Speech Corpus |
| /afs/ir/data/linguistic-data/Tipster/ | 3 CD | LDC93T3A | TIPSTER Complete |
| | 1 CD | LDC93T3B | TIPSTER Volume 1 |
| | 1 CD | LDC93T3C | TIPSTER Volume 2 |
| | 1 CD | LDC93T3D | TIPSTER Volume 3 |
The non-LDC corpora are listed alphabetically below. As with the LDC corpora, an AFS directory location and/or the abbreviation "CC://" indicates whether a corpus is installed on AFS, on the Corpus Computer (CC), or on both. If the path starts with "CC://", the corpus is installed on the Corpus Computer. Replace "CC://" with the following path to derive the location of the corpus on CC:
| Name | Annotation | Language(s) | Location | Associated tools |
|---|---|---|---|---|
| John Rylands Univ Corpus of late 18c prose | - | Early Modern English | /afs/ir/data/linguistic-data/Rylands_Univ_Corpus_Late_18c_prose | |
| Cornell SMART Archive | - | English | /afs/ir/data/linguistic-data/SMART-Archive | - |
| Enron Email Corpus | - | English | /afs/ir/data/linguistic-data/Enron-Email-Corpus | - |
| 20Newsgroups | | English | /afs/ir/data/linguistic-data/TextCat/20Newsgroups/ | |
| Aleksova's corpus | - | Bulgarian (spoken) | CC://Bugarian Corpora/Aleksova/ | - |
| ATIS | Syntax, POS, some argument structure | English | CC://TIGERCorpora/Atis - parsed and tagged (Stanford release)/ | TIGERSearch |
| Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese | CC://BAScorpora/, /afs/ir/data/linguistic-data/Verbmobil-Dialogs/BAS-VM-annotation/ | - |
| British National Corpus (BNC) World Edition | | English | /afs/ir/data/linguistic-data/BNC-World/ (also searchable with Mark Davies' excellent interface to the BNC, Variation In English Words and Phrases) | VIEW |
| Brown Corpus | Syntax, POS, some argument structure | English | CC://TIGERCorpora/Brown Corpus - parsed and tagged (Stanford release), /afs/ir/data/linguistic-data/Treebank/tgrep2able/ | TIGERSearch |
| Census 1990 Names | | English | /afs/ir/data/linguistic-data/IE/census1990names/ | |
| CHRISTINE, Stage I, Release 2 | | English | /afs/ir/data/linguistic-data/CHRISTINE/ | |
| CMU Pronouncing Dictionary | | English | /afs/ir/data/linguistic-data/CMU-Pronouncing-Dict/ | |
| Corporate Acquisitions Annotated Reuters Texts | | | /afs/ir/data/linguistic-data/IE/CorpAcq-Reuters-Freitag/ | |
| Corpus of Spoken Professional American English | POS | American English (spoken) | CC://CSPA - Corpus of Spoken Professional American English/ | MonoConc |
| DavidLewis (Reuters, TREC-AP) | | English | /afs/ir/data/linguistic-data/TextCat/DavidLewis/ | |
| Excite log | | English | /afs/ir/data/linguistic-data/IR/EXCITE/ | |
| International Computer Archive of Modern and Medieval English (ICAME) | diachronic corpus | English | /afs/ir/data/linguistic-data/ICAME/ | |
| International Corpus of English - The British Component (ICE GB) | | English | /afs/ir/data/linguistic-data/ICE-GB/, /afs/ir/data/linguistic-data/Treebank/tgrep2able/ | tgrep, tgrep2 |
| IViE | Prosody, phonetic, etc. | British dialects | CC://IViE/ | - |
| Kristie Seymore's Information Extraction Data | | English | /afs/ir/data/linguistic-data/IE/Kristie-Seymore-IE/ | |
| LUCY, initial release | | English | /afs/ir/data/linguistic-data/LUCY/ | |
| MUC3-4 (Message Understanding Conference) | | English | /afs/ir/data/linguistic-data/IE/MUC/MUC3-4/ | |
| Mooney Job Data | | English | /afs/ir/data/linguistic-data/IE/Mooney-Job-Data/ | |
| PPCME2 [requires membership in a special group] | diachronic corpus | | /afs/ir/data/linguistic-data/PPCME2/ | |
| Proposition Bank (experimental pre-release) | predicate structure enriched treebank | English | /afs/ir/data/linguistic-data/PropBank/ | related tools |
| NEGRA | Syntax (LFG-based), POS, some argument structure | German | CC://TIGERCorpora/NEGRA-parsed/ | TIGERSearch |
| Remedia Story Comprehension (use requires special permission) | | English | /afs/ir/data/linguistic-data/QA/Remedia-Story-Comprehension/ | |
| Reuters Corpus | | English | /afs/ir/data/linguistic-data/Reuters-Corpus/ | |
| RNC German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) | CC://RNC - German Radio News Corpus/ | - |
| Spam Filtering | | English | /afs/ir/data/linguistic-data/TextCat/Spam-Filtering/ | |
| Switchboard Corpus | Syntax, POS, some argument structure | English (spoken) | CC://TIGERCorpora/Switchboard Corpus - parsed and tagged (Stanford release)/, /afs/ir/data/linguistic-data/Treebank/tgrep2able/ | TIGERSearch |
| Switchboard LINK Project Corpus | Syntax, POS; some arg str, animacy, information status, and coreference | English (spoken) | /afs/ir/data/linguistic-data/Treebank/LINK-swbd/ (requires special permission and understanding of the citation requirements in the README file) | tgrep2 |
| SUSANNE Corpus, Release 5 | | English | /afs/ir/data/linguistic-data/SUSANNE/ | |
| TIGER Treebank [Version 1] | Syntax (LFG-based), POS, some argument structure | German | CC://TIGERCorpora/TIGERcorpus 1.0 (July 2003)/ | TIGERSearch |
| TIGER sample corpora | Syntax, POS, some argument structure | English | CC://TIGERCorpora/CorpusSamplers/ | TIGERSearch |
| Unified Medical Language System (UMLS) | | English | /afs/ir/data/linguistic-data/UMLS/ | |
| YCOE | Syntax, POS, CAT, lemma | English | CC://TIGERCorpora/YCOE/, /afs/ir/data/linguistic-data/YCOE/ | TIGERSearch |
| Verbmobil Dialogs | | German, English, Japanese | /afs/ir/data/linguistic-data/Verbmobil-Dialogs | |
| Wall Street Journal | Syntax, POS, some argument structure | English | CC://TIGERCorpora/Wall Street Journal - parsed and tagged (Stanford release)/, /afs/ir/data/linguistic-data/Treebank/tgrep2able/ | TIGERSearch |
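The location convention used in the tables above ("CC://" for the Corpus Computer, "/afs/..." for AFS) can be sketched as a small helper. This is a minimal illustration only: the function name `resolve_location` is made up, and `CC_ROOT` is a hypothetical placeholder, since the real path that replaces "CC://" is not given in this text.

```python
# Hypothetical placeholder: the actual path that replaces "CC://" is not
# stated in this text, so substitute the real one before using this.
CC_ROOT = "/path/to/corpus-computer"

CC_SCHEME = "CC://"
AFS_PREFIX = "/afs/"


def resolve_location(location: str, cc_root: str = CC_ROOT) -> tuple[str, str]:
    """Return (host, path) for a single Location entry from the tables.

    host is "CC" for Corpus Computer entries and "AFS" for AFS entries.
    """
    location = location.strip()
    if location.startswith(CC_SCHEME):
        # Swap the "CC://" abbreviation for the (assumed) root directory.
        relative = location[len(CC_SCHEME):]
        return ("CC", cc_root.rstrip("/") + "/" + relative)
    if location.startswith(AFS_PREFIX):
        # AFS entries are already full paths.
        return ("AFS", location)
    raise ValueError(f"unrecognized location: {location!r}")


if __name__ == "__main__":
    print(resolve_location("CC://IViE/"))
    print(resolve_location("/afs/ir/data/linguistic-data/TIMIT/"))
```

Entries that list both a CC and an AFS location need to be split into separate strings before being passed to the helper.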
This is only a very small collection of online corpora; please see the top 10 info-sources page for links to sites with far more information.
From the Lexis-Nexis home page choose "General News". From "General News" choose the "More Options" tab; using the "Basic" tab will only allow searches of the title and first paragraph of an article! On the "More Options" search page, always be sure to click on "Headline" and then select "Full text" from the popup menu that comes up. You can select from a range of sources (the default is "Major Newspapers") and, similarly, from a range of dates. When you get the list of results, click on "Expanded List"; this will show you the part of the text that matches your search pattern for each result.
There are tips at the bottom of the search page that tell you how to construct searches. A few pointers: You can use "!" as a wild card for truncation on the right edges of words. Also helpful is "pre/n", for some "n", which allows you to search for a word that precedes another within a window of n words; "read! pre/2 way" will find patterns such as reads her way or reading our slow way, etc. (But "read!" will also find reader.) Another disadvantage: Lexis-Nexis has a large list of stop words (words that you can't search for), including just about any determiner, auxiliary, and preposition.
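To get a feel for what the "!" truncation and "pre/n" operators match, the following sketch emulates them locally with a regular expression. It is not Lexis-Nexis's implementation: the function names are invented for this example, and the reading of "pre/n" as "at most n intervening words" is an assumption chosen to agree with the examples above.

```python
import re


def term_to_regex(term: str) -> str:
    """Translate one search term; a trailing "!" means right truncation."""
    if term.endswith("!"):
        return r"\b" + re.escape(term[:-1]) + r"\w*"
    return r"\b" + re.escape(term) + r"\b"


def pre_query_to_regex(query: str) -> re.Pattern:
    """Compile a query of the form "<term> pre/<n> <term>".

    Assumption: pre/n allows at most n words between the two terms.
    """
    match = re.fullmatch(r"(\S+)\s+pre/(\d+)\s+(\S+)", query.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    first, n, second = match.group(1), int(match.group(2)), match.group(3)
    # Up to n intervening words, each preceded by non-word characters,
    # then the separator before the second term.
    gap = r"(?:\W+\w+){0,%d}\W+" % n
    return re.compile(term_to_regex(first) + gap + term_to_regex(second), re.IGNORECASE)


if __name__ == "__main__":
    pattern = pre_query_to_regex("read! pre/2 way")
    for text in ("reads her way", "reading our slow way",
                 "the reader found a way", "read the very long slow way"):
        print(f"{text!r}: {'match' if pattern.search(text) else 'no match'}")
```

Running the same pattern over a local text sample before going to Lexis-Nexis can show in advance how many unwanted hits (such as reader) a truncated term is likely to produce.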
Volume 2.0 presents access to hundreds of dialogues that were not represented in the original release in October 2002. It is more diverse in terms of situations and dynamic patterns. Access to oral history interviews, the Watergate tapes (by several paths), diverse regional varieties of English (both British and international), the just-emerging American National Corpus (ANC), the U. S. Supreme Court, and other originally non-linguistic sources are presented for the first time.
The dialogues in this corpus occurred in a very diverse collection of interactive situations. Thus it is a data resource for studies of the breadth of coverage of particular dialogue models, and for studies that compare dialogue from different situations.