This site

::	HOME What? What not?
::	Site map
::	About this site

Available corpora

Corpora@Stanford

Getting started
@Stanford

::	Intro & Overview Where corpora grow and why you like them
::	Playground rules & registration Apply for your visa to the land of corpora
::	Setting up your account Pack your suitcase to the land of corpora

Available resources
@Stanford

::	User support The Corpus TA & our corpora-email-list
::	Corpora [Ordering corpora | Checking out CDs]
::	Corpora-tools & Software [Documents]
::	Corpus-related classes & projects

Beyond Stanford

::	Top 10 info-sources E-resources out there

For the Corpus TA

::	Guidelines & help

Overview

This page replaced an older corpus inventory page on 04/01/2004. The old page is still accessible if you need it.

Besides the corpora that we own on CD (which you can get from the Corpus TA), many corpora are installed and ready to use on either the AFS space or the corpus computer (CC). Some additional speech corpora may be available in the phonetics lab (see also Tip-2 below). Please contact the Phonetics RA or the Corpus TA if you have questions about speech corpora. This page is not intended to give an overview of online corpora available outside Stanford, although a very small selection is provided at the end. We strongly encourage you to take advantage of the variety of freely accessible online corpora: for links to sites that give an overview of the colorful world of online corpora, please browse our subjectively compiled list of the top 10 info-sources "out there".

This page contains the following information:

Recent acquisitions
LDC Corpora
Non-LDC Corpora
Corpora on the WWW

Tip-1: Since this page is not a database, you may find it useful to use your browser's search function to look for a specific corpus, corpora for a specific language, or any other keyword (e.g. "syntactically annotated"). Try it!

Tip-2: Interested in prosody? See Florian Jaeger's page of annotated links to prosodically annotated speech corpora.

Tip-3: If you cannot find the corpus you were looking for, see if we can order it! Our department is a member of the LDC (Linguistic Data Consortium), which gives us free access to a lot of corpora. We may also be able to order other corpora. Simply tell the Corpus TA what you need, but first have a look at the information on "ordering corpora from the LDC", or browse the web to see whether what you need can be found online (and maybe for free). Also keep in mind that our inventory may be outdated, since our resources to maintain this list are limited.

Tip-4: Thematically grouped corpora on AFS. Several thematic groups of corpora are available on AFS:

IE (Information Extraction):
/afs/ir/data/linguistic-data/IE
- Corporate Acquisitions Annotated Reuters Texts:
  /afs/ir/data/linguistic-data/IE/CorpAcq-Reuters-Freitag
- Kristie Seymore's Information Extraction Data:
  /afs/ir/data/linguistic-data/IE/Kristie-Seymore-IE
- MUC3-4 (Message Understanding Conference):
  /afs/ir/data/linguistic-data/IE/MUC/MUC3-4
- MUC-6 (Message Understanding Conference) Text collection, LDC96T10:
  /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC96T10-Muc6
- MUC-6 (Message Understanding Conference), LDC2003T13:
  /afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC2003T13-muc6
- Mooney Job Data:
  /afs/ir/data/linguistic-data/IE/Mooney-Job-Data
- Census 1990 Names:
  /afs/ir/data/linguistic-data/IE/census1990names

Text Categorization:
/afs/ir/data/linguistic-data/TextCat
- 20Newsgroups:
  /afs/ir/data/linguistic-data/TextCat/20Newsgroups
- DavidLewis (Reuters, TREC-AP):
  /afs/ir/data/linguistic-data/TextCat/DavidLewis
- Spam Filtering:
  /afs/ir/data/linguistic-data/TextCat/Spam-Filtering

TREC (Information Retrieval Text Research Collection):
/afs/ir/data/linguistic-data/TREC/
- TREC-1 (=Tipster 1): [each user needs to sign license]
  /afs/ir/data/linguistic-data/TREC/TREC-1
- TREC-2 (=Tipster 2): [each user needs to sign license]
  /afs/ir/data/linguistic-data/TREC/TREC-2
- TREC-3 (=Tipster 3): [each user needs to sign license]
  /afs/ir/data/linguistic-data/TREC/TREC-3
- TREC-4:
  /afs/ir/data/linguistic-data/TREC/TREC-4
- TREC-5:
  /afs/ir/data/linguistic-data/TREC/TREC-5
- OHSUMED-TREC-9:
  /afs/ir/data/linguistic-data/TREC/OHSUMED-TREC-9

WSD (Word Sense Disambiguation):
/afs/ir/data/linguistic-data/WSD
- DSO Sense-Tagged, LDC97T12:
  /afs/ir/data/linguistic-data/WSD/DSO-Sense-Tagged
- Leacock's Data:
  /afs/ir/data/linguistic-data/WSD/leacock
- Pedersen's Data:
  /afs/ir/data/linguistic-data/WSD/pedersen
- Senseval1:
  /afs/ir/data/linguistic-data/WSD/senseval/senseval1

Recently acquired corpora

We've acquired a fair number of corpora and tools recently. Notably, we now have several new treebanks at Stanford, and we have updated some older corpora to newer versions:

Arabic Treebank: Part 1 v3.0
Multiple Translation Arabic
ACE Time Normalization (TERN) 2004 English Training Data v1.0
BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
Chinese Treebank 5.0
Levantine Arabic QT Training Data Set 3 Speech
Levantine Arabic QT Training Data Set 3 Transcripts
Fisher English Training Speech Part 1 Transcripts
Fisher English Training Speech Part 1 Speech
Prague Arabic Dependency Treebank 1.0
Switchboard Cellular Part 2 Audio
Talkbank Ethology Data: Field Recordings of Vervet Monkey Calls
Arabic English Parallel News Part 1
Arabic News Translation Text Part 1
Santa Barbara Corpus of Spoken American English 3

Acquisitions that predate 01/01/2004 are listed on the old corpus inventory page.

[ Recent acquisitions | LDC Corpora | Non-LDC Corpora | Online corpora | top ]

LDC corpora

Below you will find a complete list of all our LDC corpora, ordered chronologically (most recent years first). If an entry is in grey font, we do not own that corpus (because Stanford wasn't a member of the LDC in that year).

How to read the table below:
The first column gives the location of the corpus, either on AFS or on the Corpus Computer (CC). An empty entry means that the corpus is currently not installed. If an AFS path is given, the corpus is stored on AFS. If the path starts with "CC://", the corpus is installed on the Corpus Computer. Replace "CC://" with the following path to derive the location of the corpus on CC:

D:/

The second column lists the number of CDs/DVDs that the corpus is stored on, or indicates that the corpus was delivered via ftp/email. The ftp archives are stored on AFS under the following path (all filenames contain the LDC catalog number, which should make it easy to identify the corpus):

/afs/ir/data/linguistic-data/ldc/LDC-tarfiles/

The third column lists the LDC catalog number. Click on the link to go to the LDC catalog entry for the corpus. Before you use a corpus, please read the copyright and license restrictions given in the catalog entry. Finally, the last column contains the name of the corpus.
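If you handle the location entries from a script, the "CC://" rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names resolve_location, CC_PREFIX and CC_ROOT are made up for this example. The only facts taken from this page are the "CC://" prefix, the D:/ path, and the convention that an empty entry means the corpus is not installed.

```python
from typing import Optional

# Values taken from the explanation above; the constant names are hypothetical.
CC_PREFIX = "CC://"
CC_ROOT = "D:/"


def resolve_location(location: str) -> Optional[str]:
    """Turn a single 'Location' entry into a usable path.

    Returns None for an empty entry (corpus currently not installed),
    swaps the "CC://" prefix for the Corpus Computer path, and leaves
    AFS paths untouched.
    """
    location = location.strip()
    if not location:
        return None
    if location.startswith(CC_PREFIX):
        return CC_ROOT + location[len(CC_PREFIX):]
    return location


if __name__ == "__main__":
    print(resolve_location("CC://Arabic Gigaword/"))
    print(resolve_location("/afs/ir/data/linguistic-data/Treebank"))
    print(resolve_location(""))
```

Note that some table entries list two locations in one cell (one on CC and one on AFS); this sketch handles a single location at a time, so such cells would need to be split first.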

How to read the LDC catalog numbers: Each LDC corpus has a unique catalog number. The first three characters are always 'LDC'. The digits that follow give the year in which the corpus was released (two digits in older catalog numbers, four digits from 2000 on). Next comes a single letter representing the corpus type (e.g. L for Lexicon, S for Speech, T for Text). The final digits uniquely distinguish that corpus from other corpora of that type. Note that availability of corpora is not necessarily restricted to members of the release year. To see which membership years a corpus is available for, click on the catalog number and check the detailed listing for the corpus.
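The structure of a catalog number can also be taken apart programmatically. The following Python sketch is an illustration only, not an official LDC tool: the names CatalogNumber and parse_catalog_number are invented for this example, and expanding two-digit years to 19xx is an assumption based on the table below, where two-digit years run from 93 to 99. Some entries carry a trailing part such as "A" or "-1" (e.g. LDC94S14A, LDC96S64-1); the sketch simply keeps that as a suffix without interpreting it.

```python
import re
from typing import NamedTuple


class CatalogNumber(NamedTuple):
    """Parts of an LDC catalog number (hypothetical helper type)."""
    year: int          # release year, e.g. 2005
    corpus_type: str   # single letter, e.g. "T"
    serial: str        # digits identifying the corpus within its type
    suffix: str        # anything after the serial, e.g. "-1" or "A"


# 'LDC', a four- or two-digit year, one type letter, the serial digits,
# and an optional trailing part.
_PATTERN = re.compile(r"^LDC(\d{4}|\d{2})([A-Z])(\d+)(.*)$")


def parse_catalog_number(catalog_id: str) -> CatalogNumber:
    match = _PATTERN.match(catalog_id.strip())
    if match is None:
        raise ValueError(f"not an LDC catalog number: {catalog_id!r}")
    year_digits, corpus_type, serial, suffix = match.groups()
    year = int(year_digits)
    if len(year_digits) == 2:
        # Assumption: two-digit years belong to the 1900s.
        year += 1900
    return CatalogNumber(year, corpus_type, serial, suffix)


if __name__ == "__main__":
    for example in ("LDC2005T02", "LDC99T42", "LDC96S64-1"):
        print(example, parse_catalog_number(example))
```

Parsing the number this way makes it easy to group a list of corpora by release year or by type, which can help when checking which membership years you need to look up.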

Tip 1: You can also search the LDC Catalog directly by type and source, by year, or by project. You may also use the general catalog search.

Location	Original media	ID	Name of corpus
	1 DVD	LDC2005T02	Arabic Treebank: Part 1 v3.0
	1 CD	LDC2005T06	Chinese News Translation Text Part 1
	1 CD	LDC2005T08	Discourse Graphbank
	1 CD	LDC2005T09	ACE 2004 Multilingual Training Corpus
	1 CD	LDC2005T05	Multiple Translation Arabic
	1 CD	LDC2005T07	ACE Time Normalization (TERN) 2004 English Training Data v1.0
	1 CD	LDC2005T03	Levantine Arabic QT Training Data Set 3 Transcripts
	1 DVD	LDC2005S07	Levantine Arabic QT Training Data Set 3 Speech
/afs/ir/data/linguistic-data/Treebank/LDC2005T01-Chinese-Treebank-5.0	1 CD	LDC2005T01	Chinese Treebank 5.0
	2 DVD	LDC2005S08	BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
/afs/ir/data/linguistic-data/Buckwalter-Arabic-Morphological-Analyzer-2.0	ftp	LDC2004L02	Buckwalter Arabic Morphological Analyzer Version 2.0
/afs/ir/data/linguistic-data/LDC2004T19-Fisher-Transcripts	1 CD	LDC2004T19	Fisher English Training Speech Part 1 Transcripts
	7 DVD	LDC2004S13	Fisher English Training Speech Part 1 Speech
	1 CD	LDC2004T23	Prague Arabic Dependency Treebank 1.0
	3 DVD	LDC2004S07	Switchboard Cellular Part 2 Audio
	1 DVD	LDC2004S12	Talkbank Ethology Data: Field Recordings of Vervet Monkey Calls
	ftp	LDC2004T18	Arabic English Parallel News Part 1
	ftp	LDC2004T17	Arabic News Translation Text Part 1
	1 DVD	LDC2004S10	Santa Barbara Corpus of Spoken American English 3
	2 DVD	LDC2004S08	MDE RT-03 Training Data Speech
	9 DVD	LDC2004S09	NIST Meeting Pilot Corpus Speech
	1 DVD	LDC2004T08	Hong Kong Parallel Text
	1 DVD	LDC2004T12	MDE RT-03 Training Data Text and Annotations
	1 CD	LDC2004T16	2001 Communicator Dialogue Act Tagged
	1 DVD	LDC2004V01	FORM1 Kinematic Gesture
	ftp	LDC2004T15	2000 Communicator Dialogue Act Tagged
/afs/ir/data/linguistic-data/Proposition-Bank-1	ftp	LDC2004T14	Proposition Bank I
	ftp	LDC2004T07	Multiple-Translation Chinese (MTC) Part 3
	ftp	LDC2004T13	NIST Meeting Pilot Corpus Transcripts and Metadata
	2 DVD	LDC2004S04	2002 NIST Speaker Recognition Evaluation (SRE)
	1 CD	LDC2004T11	Arabic Treebank: Part 3 v.1.0
	2 DVD	LDC2004S05	ISL Meeting Corpus Speech Part 1
	ftp	LDC2004T10	ISL Meeting Corpus Transcripts Part 1
	ftp	LDC2004T01	Czech Broadcast News Transcripts
	2 DVD	LDC2004S01	Czech Broadcast News Speech
/afs/ir/data/linguistic-data/Chinese-Treebank	ftp	LDC2004T05	Chinese Treebank Version 4.0
	9 DVD	LDC2004S02	ICSI Meeting Speech
	ftp	LDC2004T04	ICSI Meeting Transcripts
	ftp	LDC2004L01	Klex: Finite-State Lexical Transducer for Korean
	ftp	LDC2004T03	Morphologically Annotated Korean Text
	ftp	LDC2004T09	TIDES Extraction (ACE) 2003 Multilingual Training Data
	ftp	LDC2003T04	1997 HUB5 Spanish Transcripts
	ftp	LDC2003T03	1997 HUB5 German Transcripts
	ftp	LDC2003T02	1998 HUB5 English Transcripts
	1 DVD	LDC2003S01	2001 Communicator Evaluation
/afs/ir/data/linguistic-data/ldc/LDC2003T01-2001-HUB5-Mandarin-Transcripts	ftp	LDC2003T01	2001 HUB5 Mandarin Transcripts
	ftp	LDC2003T11	ACE-2 Version 1.0
	1 CD	LDC2003T20	ANC First Release
CC://Arabic Gigaword/	1 DVD	LDC2003T12	Arabic Gigaword
/afs/ir/data/linguistic-data/Arabic-Treebank/Arabic-Treebank-Trans	ftp	LDC2003T07	Arabic Treebank: Part 1 - 10K-word English Translation
/afs/ir/data/linguistic-data/Arabic-Treebank/Arabic-Treebank-2.0	ftp	LDC2003T06	Arabic Treebank: Part 1 v 2.0
CC://Chinese Gigaword/	1 DVD	LDC2003T09	Chinese Gigaword
/afs/ir/data/linguistic-data/English Gigaword/,CC://English Gigaword/	1 DVD	LDC2003T05	English Gigaword
	1 CD	LDC2003V01	FORM2 Kinematic Gesture
	1 CD	LDC2003L01	Grassfields Bantu Fieldwork: Dschang Lexicon
	1 CD	LDC2003S02	Grassfields Bantu Fieldwork: Dschang Tone Paradigms
	3 CD	LDC2003P01	Korean Telephone Conversations Complete Set
	ftp	LDC2003L02	Korean Telephone Conversations Lexicon
	3 CD	LDC2003S03	Korean Telephone Conversations Speech
	ftp	LDC2003T08	Korean Telephone Conversations Transcripts
/afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC2003T13-muc6/	ftp	LDC2003T13	Message Understanding Conference (MUC) 6
/afs/ir/data/linguistic-data/MTA	ftp	LDC2003T18	Multiple-Translation Arabic (MTA) Part 1
	ftp	LDC2003T17	Multiple-Translation Chinese (MTC) Part 2
/afs/ir/data/linguistic-data/SAID/	ftp	LDC2003T10	SAID
	1 DVD	LDC2003T15	SLX Corpus of Classic Sociolinguistic Interviews
CC://Santa Barbara II/	1 DVD	LDC2003S06	Santa Barbara Corpus of Spoken American English Part-II
	4 DVD	LDC2003T16	SummBank 1.0
	1 CD	LDC2003S05	West Point Russian Speech
	1 CD	LDC2002S22	1997 HUB5 Arabic Evaluation
	ftp	LDC2002T39	1997 HUB5 Arabic Transcripts
	1 CD	LDC2002S24	1997 HUB5 German Evaluation
	1 CD	LDC2002S25	1997 HUB5 Spanish Evaluation
	1 CD	LDC2002S10	1998 HUB5 English Evaluation
	7 CD	LDC2002S56	2000 Communicator Evaluation
	1 CD	LDC2002S13	2001 HUB5 English Evaluation
	1 CD	LDC2002S12	2001 HUB5 Mandarin Evaluation
	1 CD	LDC2002S34	2001 NIST Speaker Recognition Evaluation Corpus
	ftp	LDC2002L49	Buckwalter Arabic Morphological Analyzer Version 1.0
	1 CD	LDC2002S37	Callhome Egyptian Arabic Speech Supplement
	ftp	LDC2002T38	Callhome Egyptian Arabic Transcripts Supplement
	ftp	LDC2002L27	Chinese-English Translation Lexicon Version 3.0
	5 CD	LDC2002S28	Emotional Prosody Speech and Transcripts
	ftp	LDC2002T26	Korean English Treebank Annotations
	ftp	LDC2002T01	Multiple-Translation Chinese Corpus
	ftp	LDC2002T07	RST Discourse Treebank
CC://Switchboard - Dan's version/ /afs/ir/data/linguistic-data/Switchboard/ (only transcripts)	20 CD	LDC2002S06	Switchboard-2 Phase III Audio
/afs/ir/data/linguistic-data/AQUAINT	2 CD	LDC2002T31	The AQUAINT Corpus of English News Text
	6 CD	LDC2002S04	Translanguage English Database (TED) Speech
	ftp	LDC2002T03	Translanguage English Database (TED) Transcripts
	1 CD	LDC2002S35	Voicemail Corpus Part II
	3 CD	LDC2002S02	West Point Arabic Speech Corpus
	8 CD	LDC2001S97	2000 NIST Speaker Recognition Evaluation
	1 CD	LDC2001T55	Arabic Newswire Part 1
	ftp	LDC2001T61	CALLHOME Spanish Dialogue Act Annotation
	1 CD	LDC2001T62	Cetempublico
/afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank-2/	ftp	LDC2001T11	Chinese Treebank Version 2.0
	1 CD	LDC2001S16	Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
	ftp	LDC2001T02	Message Understanding Conference (MUC) 7
/afs/ir/data/linguistic-data/PragueDependencyTreebank_v1.0/	1 CD	LDC2001T10	Prague Dependency Treebank 1.0
	3 CD	LDC2001S04	Speech in Noisy Environments (SPINE2) Part 1 Audio
	ftp	LDC2001T05	Speech in Noisy Environments (SPINE2) Part 1 Transcripts
	2 CD	LDC2001S06	Speech in Noisy Environments (SPINE2) Part 2 Audio
	ftp	LDC2001T07	Speech in Noisy Environments (SPINE2) Part 2 Transcripts
	3 CD	LDC2001S08	Speech in Noisy Environments (SPINE2) Part 3 Audio
	ftp	LDC2001T09	Speech in Noisy Environments (SPINE2) Part 3 Transcripts
	8 CD	LDC2001S99	Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
	13 CD	LDC2001S13	Switchboard Cellular Part 1 Audio
	3 CD	LDC2001S15	Switchboard Cellular Part 1 Transcribed Audio
/afs/ir/data/linguistic-data/ldc/LDC2001T14-Swbd-Cell-1-Trans	ftp	LDC2001T14	Switchboard Cellular Part 1 Transcription
	ftp	LDC2001T60	Syllable-Final /s/ Lenition
	6 CD	LDC2001S93	TDT2 Mandarin Audio Corpus
/afs/ir/data/linguistic-data/TDT2-Multilingual/	1 CD	LDC2001T57	TDT2 Multilanguage Text Version 4.0
	55 CD	LDC2001S94	TDT3 English Audio
	13 CD	LDC2001S95	TDT3 Mandarin Audio
/afs/ir/data/linguistic-data/TDT2-Multilingual/	1 CD	LDC2001T58	TDT3 Multilanguage Text Version 2.0
/afs/ir/data/linguistic-data/LDC2000S88-1999-HUB4-Test	1 CD	LDC2000S88	1999 HUB-4 Broadcast News Evaluation English Test Material
/afs/ir/data/linguistic-data/BLLIP-WSJ/	2 CD	LDC2000T43	BLLIP 1987-89 WSJ Corpus Release 1
/afs/ir/data/linguistic-data/Hansard-Hong-Kong/	1 CD	LDC2000T50	Hong Kong Hansards Parallel Text
/afs/ir/data/linguistic-data/Hong-Kong-Laws/	ftp	LDC2000T47	Hong Kong Laws Parallel Text
/afs/ir/data/linguistic-data/Hong-Kong-News/	ftp	LDC2000T46	Hong Kong News Parallel Text
	1 CD	LDC2000T45	Korean Newswire
/afs/ir/data/linguistic-data/Santa-Barbara/	3 CD	LDC2000S85	Santa Barbara Corpus of Spoken American English Part-I
	4 CD	LDC2000S96	Speech in Noisy Environments (SPINE) Evaluation Audio
	ftp	LDC2000T54	Speech in Noisy Environments (SPINE) Evaluation Transcripts
/afs/ir/data/linguistic-data/SPINE/	4 CD	LDC2000S87	Speech in Noisy Environments (SPINE) Training Audio
	ftp	LDC2000T49	Speech in Noisy Environments (SPINE) Training Transcripts
	2 CD	LDC2000S92	TDT2 Careful Transcription Audio
/afs/ir/data/linguistic-data/TDT2-Careful/	ftp	LDC2000T44	TDT2 Careful Transcription Text
	1 CD	LDC2000T52	TREC Mandarin
	1 CD	LDC2000T51	TREC Spanish
	ftp	LDC2000T53	Voice of America (VOA) Broadcast News Czech Transcript Corpus
	6 CD	LDC2000S89	Voice of America (VOA) Czech Broadcast News Audio
	5 CD	LDC99S81	1999 Speaker Recognition Benchmark
	6 CD	LDC99S80	1997 Speaker Recognition Benchmark
	4 CD	LDC99L23	American English Spoken Lexicon
	ftp	LDC99L22	Egyptian Colloquial Arabic Lexicon
	1 CD	LDC99T34	Japanese Business News Text Supplement
	1 CD	LDC99T40	Portuguese Newswire Text
	1 CD	LDC99S78	SUSAS
	ftp	LDC99T33	SUSAS Transcripts
	1 CD	LDC99T41	Spanish Newswire Text, Volume 2
CC://Switchboard - Dan's version/ /afs/ir/data/linguistic-data/Switchboard/ (only transcripts)	32 CD	LDC99S79	Switchboard-2 Phase II
	73 CD	LDC99S84	TDT2 English Audio
	10 CD	LDC99S83	Tactical Speaker Identification Speech Corpus (TSID)
/afs/ir/data/linguistic-data/Treebank	1 CD	LDC99T42	Treebank-3
	7 CD	LDC99S82	USC Marketplace Broadcast News Speech
	ftp	LDC99T36	USC Marketplace Broadcast News Transcripts
	18 CD	LDC98S71	1997 English Broadcast News Speech (Hub-4)
/afs/ir/data/linguistic-data/Broadcast-News-Transcripts/	ftp	LDC98T28	1997 English Broadcast News Transcripts (Hub-4)
	8 CD	LDC98S73	1997 Mandarin Broadcast News Speech (Hub-4NE)
/afs/ir/data/linguistic-data/ldc/LDC98T24-1997-Mandarin-Broadcast-News-Transcripts	ftp	LDC98T24	1997 Mandarin Broadcast News Transcripts (Hub-4NE)
	9 CD	LDC98S74	1997 Spanish Broadcast News Speech (Hub-4NE)
/afs/ir/data/linguistic-data/Spanish-Broadcast-News/	ftp	LDC98T29	1997 Spanish Broadcast News Transcripts (Hub-4NE)
	6 CD	LDC98S76	1998 Speaker Recognition Benchmark
	ftp	LDC98L21	COMLEX English Syntax Lexicon
	3 CD	LDC98S67	HTIMIT
	2 CD	LDC98S69	Hub-5 Mandarin Telephone Speech Corpus
/afs/ir/data/linguistic-data/ldc/LDC98T26-Hub-5-Mandarin-Transcripts	ftp	LDC98T26	Hub-5 Mandarin Transcripts
	5 CD	LDC98S70	Hub-5 Spanish Telephone Speech Corpus
/afs/ir/data/linguistic-data/Hub5-Spanish-Transcripts/	ftp	LDC98T27	Hub-5 Spanish Transcripts
/afs/ir/data/linguistic-data/1996-CSR-Hub-4-LM	2 CD	LDC98T31	1996 CSR Hub-4 Language Model
	2 CD	LDC98T32	JURIS
	2 CD	LDC98S68	LLHDB
	2 CD	LDC98T30	North American News Text Supplement
CC://Switchboard - Dan's version/ /afs/ir/data/linguistic-data/Switchboard/ (only transcripts)	26 CD	LDC98S75	Switchboard-2 Phase 1
/afs/ir/data/linguistic-data/TDT-Pilot-Study/	ftp	LDC98T25	TDT Pilot Study Corpus
	2 CD	LDC98S72	Taiwanese Putonghua Speech and Transcripts
/afs/ir/data/linguistic-data/Voicemail1/	1 CD	LDC98S77	Voicemail Corpus-Part I
	19 CD	LDC97S44	1996 English Broadcast News Speech (Hub-4)
/afs/ir/data/linguistic-data/Broadcast-News-Transcripts/hub4_eng_train_trans/	ftp	LDC97T22	1996 English Broadcast News Transcripts (Hub-4)
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Lexicon/	ftp	LDC97L20	CALLHOME American English Lexicon (PRONLEX)
	3 CD	LDC97S42	CALLHOME American English Speech
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Transcripts/	ftp	LDC97T14	CALLHOME American English Transcripts
	3 CD	LDC97S45	CALLHOME Egyptian Arabic Speech
	ftp	LDC97T19	CALLHOME Egyptian Arabic Transcripts
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Arabic-Lexicon/	ftp	LDC97L19	CALLHOME Egyptian Arabic Lexicon
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Lexicon/	ftp	LDC97L18	CALLHOME German Lexicon
	3 CD	LDC97S43	CALLHOME German Speech
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Transcripts/	ftp	LDC97T15	CALLHOME German Transcripts
/afs/ir/data/linguistic-data/WSD/DSO-Sense-Tagged/	ftp	LDC97T12	DSO Corpus of Sense-Tagged English
/afs/ir/data/linguistic-data/Switchboard/Audio-swbd1ph2	23 CD	LDC97S62	SWITCHBOARD-1 Release 2
	2 CD	LDC97S63	The CMU Kids Corpus
/afs/ir/data/linguistic-data/Boston-University-Radio/	4 CD	LDC96S36	Boston University Radio Speech Corpus
	3 CD	LDC96S46	CALLFRIEND American English-Non-Southern Dialect
	3 CD	LDC96S47	CALLFRIEND American English-Southern Dialect
	3 CD	LDC96S48	CALLFRIEND Canadian French
	3 CD	LDC96S49	CALLFRIEND Egyptian Arabic
	3 CD	LDC96S50	CALLFRIEND Farsi
	3 CD	LDC96S51	CALLFRIEND German
	3 CD	LDC96S52	CALLFRIEND Hindi
	3 CD	LDC96S53	CALLFRIEND Japanese
	3 CD	LDC96S54	CALLFRIEND Korean
	3 CD	LDC96S55	CALLFRIEND Mandarin Chinese-Mainland Dialect
	3 CD	LDC96S56	CALLFRIEND Mandarin Chinese-Taiwan Dialect
	3 CD	LDC96S57	CALLFRIEND Spanish-Caribbean Dialect
	3 CD	LDC96S58	CALLFRIEND Spanish-Non-Caribbean Dialect
	3 CD	LDC96S59	CALLFRIEND Tamil
	3 CD	LDC96S60	CALLFRIEND Vietnamese
	3 CD	LDC96S61	1996 Speaker Recognition Benchmark
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Japanese-Lexicon/	ftp	LDC96L17	CALLHOME Japanese Lexicon
	3 CD	LDC96S37	CALLHOME Japanese Speech
	ftp	LDC96T18	CALLHOME Japanese Transcripts
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Lexicon/	ftp	LDC96L15	CALLHOME Mandarin Chinese Lexicon
	2 CD	LDC96S34	CALLHOME Mandarin Chinese Speech
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Transcripts/	ftp	LDC96T16	CALLHOME Mandarin Chinese Transcripts
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Lexicon/	ftp	LDC96L16	CALLHOME Spanish Lexicon
	2 CD	LDC96S35	CALLHOME Spanish Speech
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Transcripts/	ftp	LDC96T17	CALLHOME Spanish Transcripts
/afs/ir/data/linguistic-data/CELEX/	1 CD	LDC96L14	CELEX2
	4 CD	LDC96S33	CSR-IV Hub 3
	3 CD	LDC96S31	CSR-IV Hub 4
	1 CD	LDC96S30	CTIMIT
	12 CD	LDC96S38	DCIEM/HCRC
	1 CD	LDC96S32	FFMTIMIT
	1 CD	LDC96S29	Frontiers in Speech Processing 93
	1 CD	LDC96S40	Frontiers in Speech Processing 94
	6 CD	LDC96S64-1	JEIDA/JCSD-Channel 0 City Names
	20 CD	LDC96S64	JEIDA/JCSD-Channel 0 Complete
	4 CD	LDC96S64-2	JEIDA/JCSD-Channel 0 Control Words
	3 CD	LDC96S64-4	JEIDA/JCSD-Channel 0 Four Digit Sequences
	1 CD	LDC96S64-3	JEIDA/JCSD-Channel 0 Isolated Digits
	6 CD	LDC96S64-5	JEIDA/JCSD-Channel 0 Mono Syllables
	6 CD	LDC96S65-1	JEIDA/JCSD-Channel 1 City Names
	20 CD	LDC96S65	JEIDA/JCSD-Channel 1 Complete
	4 CD	LDC96S65-2	JEIDA/JCSD-Channel 1 Control Words
	3 CD	LDC96S65-4	JEIDA/JCSD-Channel 1 Four Digit Sequences
	1 CD	LDC96S65-3	JEIDA/JCSD-Channel 1 Isolated Digits
	6 CD	LDC96S65-5	JEIDA/JCSD-Channel 1 Mono Syllables
/afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC96T10-Muc6/	ftp	LDC96T10	Message Understanding Conference (MUC) 6 Additional News Text
	2 CD	LDC96S41	VAHA (POLYPHONE II)
	3 CD	LDC95S23	CSR-III Speech
	4 CD	LDC95T6	CSR-III Text
	1 CD	LDC95T11	European Language Newspaper Text
/afs/ir/data/linguistic-data/Hansard-French/	2 CD	LDC95T20	Hansard French/English
/afs/ir/data/linguistic-data/Japanese-Business-News/	1 CD	LDC95T8	Japanese Business News Text
	1 CD	LDC95S22	KING Speaker Verification
	2 CD	LDC95S28	LATINO-40 Spanish Read News
	1 CD	LDC95T13	Mandarin Chinese News Text
/afs/ir/data/linguistic-data/North-American-News/	2 CD	LDC95T21	North American News Text Corpus
	3 CD	LDC95S27	PHONEBOOK: NYNEX Isolated Words
	1 CD	LDC95T9	Spanish News Text
/afs/ir/data/linguistic-data/TRAINS/	1 CD	LDC95S25	TRAINS spoken dialog corpus
/afs/ir/data/linguistic-data/Treebank/	1 CD	LDC95T7	Treebank-2
	6 CD	LDC95S24	WSJCAM0 Cambridge Read News
/afs/ir/data/linguistic-data/Air-Traffic-Control/	8 CD	LDC94S14A	Air Traffic Control Complete
	2 CD	LDC94S14B	Air Traffic Control BOS
	3 CD	LDC94S14C	Air Traffic Control DCA
	3 CD	LDC94S14D	Air Traffic Control DFW
	4 CD	LDC94S20	BRAMSHILL
	34 CD	LDC94S13A	CSR-II (WSJ1) Complete
	20 CD	LDC94S13C	CSR-II (WSJ1) Other
	19 CD	LDC94S13B	CSR-II (WSJ1) Sennheiser
/afs/ir/data/linguistic-data/ECI-Multilingual/	1 CD	LDC94T5	ECI Multilingual Text
	8 CD	LDC94S21	MACROPHONE
	1 CD	LDC94S17	OGI Multilanguage Corpus
	1 CD	LDC94S18	OGI Spelled and Spoken Word
	2 CD	LDC94S15	SPIDRE
	3 CD	LDC94T4A	UN Parallel Text (Complete)
	1 CD	LDC94T4B-1	UN Parallel Text (English)
	1 CD	LDC94T4B-2	UN Parallel Text (French)
	1 CD	LDC94T4B-3	UN Parallel Text (Spanish)
	1 CD	LDC94S16	YOHO Speaker Verification
	1 CD	LDC93S11	Road Rally
	6 CD	LDC93S4A	ATIS0 Complete
	1 CD	LDC93S4B	ATIS0 Pilot
	1 CD	LDC93S4B-2	ATIS0 Read
	4 CD	LDC93S4B-3	ATIS0 SD Read
	4 CD	LDC93S5	ATIS2
	15 CD	LDC93S6A	CSR-I (WSJ0) Complete
	9 CD	LDC93S6C	CSR-I (WSJ0) Other
	9 CD	LDC93S6B	CSR-I (WSJ0) Sennheiser
/afs/ir/data/linguistic-data/HCRC-Maptask-Transcripts/	8 CD	LDC93S12	HCRC Map Task Corpus
	2 CD	LDC93S2	NTIMIT
	6 CD	LDC93S3A	Resource Management Complete Set 2.0
	4 CD	LDC93S3B	Resource Management RM1 2.0
	2 CD	LDC93S3C	Resource Management RM2 2.0
	1 CD	LDC93S8	SWITCHBOARD Credit Card
/afs/ir/data/linguistic-data/TIDIGITS/	3 CD	LDC93S10	TIDIGITS
/afs/ir/data/linguistic-data/TIMIT/	1 CD	LDC93S1	TIMIT Acoustic-Phonetic Continuous Speech Corpus
/afs/ir/data/linguistic-data/Tipster/	3 CD	LDC93T3A	TIPSTER Complete
	1 CD	LDC93T3B	TIPSTER Volume 1
	1 CD	LDC93T3C	TIPSTER Volume 2
	1 CD	LDC93T3D	TIPSTER Volume 3


Non-LDC corpora

The non-LDC corpora are listed below in roughly alphabetical order. As with the LDC corpora, an AFS directory path and/or the prefix "CC://" indicates whether a corpus is installed on AFS, on the Corpus Computer (CC), or on both. If the path starts with "CC://", the corpus is installed on the Corpus Computer. Replace "CC://" with the following path to derive the location of the corpus on CC:

D:/

Name	Annotation	Language(s)	Location	Associated tools
John Rylands Univ Corpus of late 18c prose	-	Early Modern English	/afs/ir/data/linguistic-data/Rylands_Univ_Corpus_Late_18c_prose
Cornell SMART Archive	-	English	/afs/ir/data/linguistic-data/SMART-Archive	-
Enron Email Corpus	-	English	/afs/ir/data/linguistic-data/Enron-Email-Corpus	-
20Newsgroups		English	/afs/ir/data/linguistic-data/TextCat/20Newsgroups/
Aleksova's corpus	-	Bulgarian (spoken)	CC://Bugarian Corpora/Aleksova/	-
ATIS	Syntax, POS, some argument structure	English	CC://TIGERCorpora/Atis - parsed and tagged (Stanford release)/	TIGERSearch
Bavarian Archive of Speech Corpora (only annotations)	Prosody, syntax, POS, transcribed	German, English, Japanese	CC://BAScorpora/ /afs/ir/data/linguistic-data/Verbmobil-Dialogs/BAS-VM-annotation/	-
British National Corpus (BNC) World Edition		English	/afs/ir/data/linguistic-data/BNC-World/ Also searchable with Mark Davies' excellent interface to the BNC, Variation In English Words and Phrases	VIEW
Brown Corpus	Syntax, POS, some argument structure	English	CC://TIGERCorpora/Brown Corpus - parsed and tagged (Stanford release) /afs/ir/data/linguistic-data/Treebank/tgrep2able/	TIGERSearch
Census 1990 Names		English	/afs/ir/data/linguistic-data/IE/census1990names/
CHRISTINE, Stage I, Release 2		English	/afs/ir/data/linguistic-data/CHRISTINE/
CMU Pronouncing Dictionary		English	/afs/ir/data/linguistic-data/CMU-Pronouncing-Dict/
Corporate Acquisitions Annotated Reuters Texts			/afs/ir/data/linguistic-data/IE/CorpAcq-Reuters-Freitag/
Corpus of Spoken Professional American English	POS	American English (spoken)	CC://CSPA - Corpus of Spoken Professional American English/	MonoConc
DavidLewis (Reuters, TREC-AP)		English	/afs/ir/data/linguistic-data/TextCat/DavidLewis/
Excite log		English	/afs/ir/data/linguistic-data/IR/EXCITE/
International Computer Archive of Modern and Medieval English (ICAME)	diachronic corpus	English	/afs/ir/data/linguistic-data/ICAME/
International Corpus of English - The British Component (ICE GB)		English	/afs/ir/data/linguistic-data/ICE-GB/ /afs/ir/data/linguistic-data/Treebank/tgrep2able/	tgrep, tgrep2
IViE	Prosody, phonetic, etc.	British dialects	CC://IViE/	-
Kristie Seymore's Information Extraction Data		English	/afs/ir/data/linguistic-data/IE/Kristie-Seymore-IE/
LUCY, initial release		English	/afs/ir/data/linguistic-data/LUCY/
MUC3-4 (Message Understanding Conference)		English	/afs/ir/data/linguistic-data/IE/MUC/MUC3-4/
Mooney Job Data		English	/afs/ir/data/linguistic-data/IE/Mooney-Job-Data/
PPCME2 [requires membership in a special group]	diachronic corpus		/afs/ir/data/linguistic-data/PPCME2/
Proposition Bank (experimental pre-release)	predicate structure enriched treebank	English	/afs/ir/data/linguistic-data/PropBank/	related tools
NEGRA	Syntax (LFG-based), POS, some argument structure	German	CC://TIGERCorpora/NEGRA-parsed/	TIGERSearch
Remedia Story Comprehension: (use requires special permission)		English	/afs/ir/data/linguistic-data/QA/Remedia-Story-Comprehension/
Reuters Corpus		English	/afs/ir/data/linguistic-data/Reuters-Corpus/
RNC German radio news (Nachrichten) corpus	Prosodically annotated & transcribed speech files	German (spoken)	CC://RNC - German Radio News Corpus/	-
Spam Filtering		English	/afs/ir/data/linguistic-data/TextCat/Spam-Filtering/
Switchboard Corpus	Syntax, POS, some argument structure	English (spoken)	CC://TIGERCorpora/Switchboard Corpus - parsed and tagged (Stanford release)/ /afs/ir/data/linguistic-data/Treebank/tgrep2able/	TIGERSearch
Switchboard LINK Project Corpus	Syntax, POS; some arg str, animacy, information status, and coreference	English (spoken)	/afs/ir/data/linguistic-data/Treebank/LINK-swbd/ requires special permission and understanding of citation requirements in the README file	tgrep2
SUSANNE Corpus, Release 5		English	/afs/ir/data/linguistic-data/SUSANNE/
TIGER Treebank [Version 1]	Syntax (LFG-based), POS, some argument structure	German	CC://TIGERCorpora/TIGERcorpus 1.0 (July 2003)/	TIGERSearch
TIGER sample corpora	Syntax, POS, some argument structure	English	CC://TIGERCorpora/CorpusSamplers/	TIGERSearch
Unified Medical Language System (UMLS)		English	/afs/ir/data/linguistic-data/UMLS/
YCOE	Syntax, POS, CAT, lemma	English	CC://TIGERCorpora/YCOE/ /afs/ir/data/linguistic-data/YCOE/	TIGERSearch
Verbmobil Dialogs		German, English, Japanese	/afs/ir/data/linguistic-data/Verbmobil-Dialogs
Wall Street Journal	Syntax, POS, some argument structure	English	CC://TIGERCorpora/Wall Street Journal - parsed and tagged (Stanford release)/ /afs/ir/data/linguistic-data/Treebank/tgrep2able/	TIGERSearch


Corpora on the WWW (a very small collection)

This is only a very small collection of online corpora. Please see the top 10 info-sources page for links to sites with far more information.

British National Corpus
COBUILD Corpus
Lexis-Nexis Academic Universe
OED new version (slow and requires a graphical browser) or old version (fast and lynx-friendly)
DIALOGUE DIVERSITY CORPUS: Version 2.0
TITUS corpus and search engine [signed license conditions & user agreement need to be faxed to the number stated on the user agreement].
COSMAS II is a German giga corpus with almost 2 billion (!) text words. It is accessible via the COSMAS II Online Client. Unfortunately, all help and information available for this corpus is given in German.
Right on our doorstep is the Text@Humanities project of the Human Digital Information Service, a collection of online searchable (literary) texts drawn from American English, Irish, other varieties of English, German, French and Spanish.
The W3-Corpora Search Engine is an online search engine on a large collection of corpora (still in prototype stage but looks promising).

