Overview
Besides the corpora that we own on CD (which you can get from the Corpus TA),
many corpora are installed and ready to use on either
the AFS space or the corpus computer (CC). Some additional
speech corpora may be available in the phonetics lab (see also the tip below). Please contact the Phonetics RA or the Corpus TA if
you have questions about speech corpora. Although
this page is not intended to give an overview of online corpora available outside of Stanford, it does provide a very small
selection. We strongly encourage you to take advantage of the
variety of freely accessible online corpora. For links to sites that give an overview
of the colorful world of online corpora, please browse our subjectively compiled list of
the top 10 info-sources "out there".
This page has four main parts:
In addition, you will often find the most recently acquired corpora summarized at the
top of this page.
Tip-1: Since this page is not a database, you may find it useful to use your browser's search function if you are
interested in a specific corpus, corpora for a specific language, or any other keyword (e.g. "syntactically
annotated"). Try it!
Tip-2: Interested in prosody? See Florian Jaeger's page of annotated links to
prosodically annotated speech
corpora.
Tip-3: If you cannot find the corpus you are looking for, see if we can order it! Our department is a
member of the LDC (Linguistic Data Consortium), which gives us free access to a lot of corpora. We may also be able to order other corpora.
Simply tell the corpus TA what you need, but have a look at the information on
"ordering corpora from the LDC" first, or browse the web to see whether what you
need can be found online (and maybe for free). Also keep in mind that our inventory may be outdated, since our resources
for maintaining this list are limited.
Recently acquired corpora (as of 02/04/04)
We've acquired a fair number of corpora and tools recently. Notably,
we've now got several new treebanks at Stanford and we've updated some older corpora to
newer versions:
Winter 2004
- ICSI Meeting transcripts [AFS] - information not entered below yet
- Arabic Gigaword [DVD | CC]
- Chinese Gigaword [DVD | CC]
- The IViE Corpus (English Intonation in the British Isles) [CC]
- Bavarian Speech Archive (BAS) annotation of VerbMobil 1 & 2 data [AFS]
- Proposition Bank [AFS]
- SLX Corpus of Classic Sociolinguistic Interviews [DVD]
- Santa Barbara Corpus of Spoken American English Part-II [DVD | CC]
- ECI Multilingual Text [AFS | CD]
- English Gigaword [DVD | CC]
- UN Parallel Text (Complete) [CD]
- The AQUAINT Corpus of English News Text [CD]
Fall 2003
- Topic Detection and Tracking (TDT3) Multilanguage Text 2 [AFS]
- LUCY, initial release [AFS]
- SUSANNE Corpus, Release 5 [AFS]
- CHRISTINE, Stage I, Release 2 [AFS]
- The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) [AFS]
- Corpus of Spoken Professional American English (tagged & untagged) [CC]
Summer 2003
- TIGER release 1.0 [CC]
- Penn Chinese Treebank [AFS | CC]
- Penn Arabic Treebank [AFS | CC]
- NEGRA treebank (German) [AFS | CC]
- TIGER corpus (German) [AFS | CC]
- Prague Dependency Bank (Czech) [AFS]
Corpora on AFS space (as of 01/16/2004)
- Air Traffic Control Corpus - Transcripts only,
LDC94S14A:
/afs/ir/data/linguistic-data/Air-Traffic-Control
- Arabic Treebank, LDC2003T06:
/afs/ir/data/linguistic-data/Arabic-Treebank
- Bavarian Speech Archive (BAS) annotation of VerbMobil 1 & 2 data BAS website
/afs/ir/data/linguistic-data/Verbmobil-Dialogs/BAS-VM-annotation/
- BLLIP-WSJ (Brown Laboratory for Linguistic Information Processing)
LDC2000T43:
/afs/ir/data/linguistic-data/BLLIP-WSJ
- Boston University Radio Speech Corpus, LDC96S36:
/afs/ir/data/linguistic-data/Boston-University-Radio
- BNC World Edition (license conditions and installation of SARA software being studied)
/afs/ir/data/linguistic-data/BNC-world
- Broadcast News Transcripts (CSR-VI), LDC98T28:
/afs/ir/data/linguistic-data/Broadcast-News-Transcripts
- CALLHOME:
/afs/ir/data/linguistic-data/CALLHOME
- CALLHOME American English Lexicon, LDC97L20:
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Lexicon
- CALLHOME American English Transcripts, LDC97T14:
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-English-Transcripts
- CALLHOME Egyptian Arabic Lexicon, LDC97L19:
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Arabic-Lexicon
- CALLHOME German Lexicon LDC97L18:
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Lexicon
- CALLHOME German Transcripts LDC97T15:
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-German-Transcripts
- CALLHOME Japanese Lexicon LDC96L17:
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Japanese-Lexicon
- CALLHOME Mandarin Chinese Lexicon LDC96L15:
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Lexicon
- CALLHOME Spanish Lexicon, LDC96L16:
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Lexicon
- CALLHOME Spanish Transcripts, LDC96T17:
/afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Spanish-Transcripts
- CELEX 2, LDC96L14:
[special license condition: one license per research group]
/afs/ir/data/linguistic-data/CELEX
- Chinese Treebanks
- Chinese Treebank (1), LDC2000T48:
/afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank
- Chinese Treebank 2, LDC2001T11:
/afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank-2
- Chinese Treebank 3, LDC2003E06:
/afs/ir/data/linguistic-data/Chinese-Treebank/Chinese-Treebank-3
- CHRISTINE, Stage I, Release 2 CHRISTINE project
/afs/ir/data/linguistic-data/CHRISTINE/
- CMU Pronouncing Dictionary:
/afs/ir/data/linguistic-data/CMU-Pronouncing-Dict
- ECI Multilingual Text LDC94T5
/afs/ir/data/linguistic-data/ECI-Multilingual
- EXCITE:
/afs/ir/data/linguistic-data/IR/EXCITE
- Hansard French/English, LDC95T20:
/afs/ir/data/linguistic-data/Hansard-French
- HCRC Maptask, LDC93S12:
/afs/ir/data/linguistic-data/HCRC-Maptask-Transcripts
- Hong Kong Hansards Parallel Text, LDC2000T50:
/afs/ir/data/linguistic-data/Hansard-Hong-Kong
- Hong Kong Laws, LDC2000T47:
/afs/ir/data/linguistic-data/Hong-Kong-Laws
- Hong Kong News, LDC2000T46:
/afs/ir/data/linguistic-data/Hong-Kong-News
- Hub-5 Spanish Transcripts, LDC98T27:
/afs/ir/data/linguistic-data/Hub5-Spanish-Transcripts
- ICAME:
/afs/ir/data/linguistic-data/ICAME
- ICE-GB (International Corpus of English - The British Component):
/afs/ir/data/linguistic-data/ICE-GB
(If you want to borrow the CD to install the search software on your Windows PC let me know. It doesn't work for Macs or Unix computers.)
- IE (Information Extraction):
/afs/ir/data/linguistic-data/IE
- Corporate Acquisitions Annotated Reuters Texts:
/afs/ir/data/linguistic-data/IE/CorpAcq-Reuters-Freitag
- Kristie Seymore's Information Extraction Data:
/afs/ir/data/linguistic-data/IE/Kristie-Seymore-IE
- MUC3-4 (Message Understanding Conference):
/afs/ir/data/linguistic-data/IE/MUC/MUC3-4
- MUC-6 (Message Understanding Conference) Text collection, LDC96T10:
/afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC96T10-Muc6
- MUC-6 (Message Understanding Conference), LDC2003T13:
/afs/ir/data/linguistic-data/IE/MUC/MUC6/LDC2003T13-muc6
- Mooney Job Data:
/afs/ir/data/linguistic-data/IE/Mooney-Job-Data
- Census 1990 Names:
/afs/ir/data/linguistic-data/IE/census1990names
- Japanese Business News, LDC95T8:
/afs/ir/data/linguistic-data/Japanese-Business-News
- LUCY, initial release (copyright free version) LUCY project
/afs/ir/data/linguistic-data/LUCY
- North American News Text Corpus, LDC95T21:
/afs/ir/data/linguistic-data/North-American-News
- PPCME2 PPCME2 website [requires membership in a special group]:
/afs/ir/data/linguistic-data/TREC/PPCME2
- Prague Dependency Bank (Czech) LDC2001T10
/afs/ir/data/linguistic-data/PragueDependencyTreebank_v1.0
- Proposition Bank (experimental pre-release) Proposition Bank website (predicate structure enriched treebank) [related tools]
/afs/ir/data/linguistic-data/PropBank
- Remedia Story Comprehension: (use requires special permission)
/afs/ir/data/linguistic-data/QA/Remedia-Story-Comprehension
- Reuters Corpus
/afs/ir/data/linguistic-data/Reuters-Corpus
- SAID (A Syntactically Annotated Idiom Dataset), LDC2003T10
/afs/ir/data/linguistic-data/SAID
- Santa Barbara Corpus of Spoken American English, LDC2000S85:
/afs/ir/data/linguistic-data/Santa-Barbara
- Spanish Broadcast News, LDC98T29:
/afs/ir/data/linguistic-data/Spanish-Broadcast-News
- SPINE, Speech in Noisy Environments, LDC2000S87 and LDC2000T49:
/afs/ir/data/linguistic-data/SPINE
- SUSANNE Corpus, Release 5 SUSANNE project
/afs/ir/data/linguistic-data/SUSANNE
- Switchboard Transcripts, LDC93S7-T:
/afs/ir/data/linguistic-data/Switchboard/Switchboard-Transcripts
- TDT Pilot Study, LDC98T25 [special user agreement]:
/afs/ir/data/linguistic-data/TDT-Pilot-Study
- TDT2 Careful Transcription, LDC2000T44:
/afs/ir/data/linguistic-data/TDT2-Careful
- TDT2 Multilanguage Text 4, LDC2001T57
/afs/ir/data/linguistic-data/TDT2-Multilingual
- TDT3 Multilanguage Text 2, LDC2001T58
/afs/ir/data/linguistic-data/TDT2-Multilingual
- Text Categorization:
/afs/ir/data/linguistic-data/TextCat
- 20Newsgroups:
/afs/ir/data/linguistic-data/TextCat/20Newsgroups
- DavidLewis (Reuters, TREC-AP):
/afs/ir/data/linguistic-data/TextCat/DavidLewis
- Spam Filtering:
/afs/ir/data/linguistic-data/TextCat/Spam-Filtering
- TIDIGITS, LDC93S10:
/afs/ir/data/linguistic-data/TIDIGITS
- TIMIT, LDC93S1:
/afs/ir/data/linguistic-data/TIMIT
- Tipster Complete, LDC93T3A [each user needs to sign license]:
/afs/ir/data/linguistic-data/Tipster
- TRAINS, LDC95S25:
/afs/ir/data/linguistic-data/TRAINS
- TREC (Information Retrieval Text Research Collection):
/afs/ir/data/linguistic-data/TREC/
- Treebank Release 2 and 3, LDC95T7 and LDC99T42:
/afs/ir/data/linguistic-data/Treebank
- UMLS (Unified Medical Language System):
/afs/ir/data/linguistic-data/UMLS
- Verbmobil Dialogs:
/afs/ir/data/linguistic-data/Verbmobil-Dialogs
- Voicemail Corpus Part 1, Speech and Transcripts, LDC98S77:
/afs/ir/data/linguistic-data/Voicemail1
- WSD (Word Sense Disambiguation):
/afs/ir/data/linguistic-data/WSD
- DSO Sense-Tagged, LDC97T12:
/afs/ir/data/linguistic-data/WSD/DSO-Sense-Tagged
- Leacock's Data:
/afs/ir/data/linguistic-data/WSD/leacock
- Pedersen's Data:
/afs/ir/data/linguistic-data/WSD/pedersen
- Senseval1:
/afs/ir/data/linguistic-data/WSD/senseval/senseval1
- York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), YCOE project
[special user agreement]
/afs/ir/data/linguistic-data/YCOE
Corpora on the Corpus Computer
In addition to the corpora on AFS, a couple of corpora are only stored on the corpus computer.
All corpora are stored on the D-partition of the corpus computer. This section will undergo further revision and more
details about the available corpora will be added soon:
| Name | Annotation | Language(s) | Format | Associated tools |
| Aleksova's corpus | - | Bulgarian (spoken) | Winword files | - |
| Arabic Gigaword | | Arabic | | |
| ATIS | Syntax, POS, some argument structure | English | TIGER XML, MRG | TIGERSearch |
| Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese | BAS format | - |
| Brown Corpus | Syntax, POS, some argument structure | English | TIGER XML, MRG | TIGERSearch |
| Chinese Gigaword | | Chinese | | |
| Chinese Treebank | Syntax, POS, some argument structure | Chinese | TIGER XML, MRG | TIGERSearch |
| Corpus of Spoken Professional American English | POS | American English (spoken) | SGML-tagged, plain text | MonoConc |
| English Gigaword | | English | | |
| IMS German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) | ToBI annotation | - |
| IViE | Prosody, phonetic, etc. | British dialects | - | - |
| NEGRA | Syntax (LFG-based), POS, some argument structure | German | TIGER XML, NEGRA format | TIGERSearch |
| Santa Barbara Corpus of Spoken American English Part-II | speech, intonation, transcribed | English | text, CHAT-format | TIGERSearch |
| Switchboard Corpus | Syntax, POS, some argument structure | English (spoken) | TIGER XML, MRG | TIGERSearch |
| TIGER Treebank [Version 1] | Syntax (LFG-based), POS, some argument structure | German | TIGER XML, NEGRA format | TIGERSearch |
| TIGER sample corpora | Syntax, POS, some argument structure | English | TIGER XML, MRG | TIGERSearch |
| YCOE | Syntax, POS, CAT, lemma | German | TIGER XML, NEGRA format | TIGERSearch |
| Wallstreet Journal | Syntax, POS, some argument structure | English | TIGER XML, MRG | TIGERSearch |
Corpora only available on CD, DVD, or as packed archive on AFS (as of 02/04/2004)
You can check out these CDs from us or ask the
corpus TA to install their content on the corpus computer or AFS.
- ACL/DCL, Association For Computational Linguistics Data Collection Initiative,
CD-ROM 1, LDC93T1, 1991, 1 disc
- The AQUAINT Corpus of English News Text, LDC2002T31, 2 CDs
- Arabic Gigaword LDC2003T12, 1 DVD
- ATCO Complete, LDC94S14A:
ATCO, Air Traffic Control Corpus, Dallas Fort Worth (DFW), NIST Speech
Discs 16-1.1, 16-2.1, 16-3.1, 1994, NIST/LDC, 3 discs
ATCO, Air Traffic Control Corpus, Logan International (BOS), NIST Speech
Discs 16-4.1, 16-5.1, 1994, NIST/LDC, 2 discs
ATCO, Air Traffic Control Corpus, Washington National (DCA), NIST Speech
Discs 16-6.1, 16-7.1, 16-8.1, 1994, NIST/LDC, 3 discs
- ATIS0 Complete, LDC93S4A:
ATIS0, Air Travel Information System, Spontaneous Speech Pilot Corpus and
Relational Database, NIST Speech Disc 5-1.1, NTIS PB91-505354, DARPA,
1990, 1 disc
ATIS0, Air Travel Information System, Read Versions of Spontaneous Data,
NIST Speech Disc 5-2.1, NTIS PB91-505362, DARPA, 1990, 1 disc
ATIS0, Air Travel Information System, Speaker-Dependent Training Data,
NIST Speech Discs 5-3.1, 5-4.1, 5-5.1, 5-6.1, NTIS PB91-505370, DARPA,
1991, 4 discs
- ATIS2, Air Travel Information System, Multi-Site Speech Collection, NIST
Speech Discs 12-1.1 to 12-4.1,
LDC93S5, 1990, 4 discs
- BLLIP-WSJ (Brown Laboratory for Linguistic Information Processing), LDC2000T43, 2 CDs
- Boston U. Radio Speech Corpus, LDC96S36, 4 discs
- British National Corpus Sampler, 1999
- CALLFRIEND American English Non Southern Dialect, 60 Telephone Conversations, LDC96S46, 3 discs
- CALLFRIEND American English Southern Dialect, 60 Telephone Conversations,
LDC96S47, 3 discs
- CALLFRIEND Japanese, LDC96S53, 3 discs
- CALLFRIEND Hindi, LDC96S52, 3 discs
- CALLFRIEND Tamil, LDC96S59, 3 discs
- CALLHOME American English, 120 Telephone Conversations,
LDC97S42, 3 discs
- CALLHOME German, 100 Telephone Conversations, LDC97S43, 3 discs
- CALLHOME Japanese, LDC96S37, 3 discs
- CELEX, The CELEX Lexical Database, Release 2 (Dutch Version 3.1, English
Version 2.5, German Version 2.5), LDC/Centre for Lexical Information
Max Planck Institute for Psycholinguistics Nijmegen,
LDC96L14, 1995, 1 disc [special user agreement]
- Chinese Gigaword LDC2003T09, 1 DVD
- CSR-II (WSJ1) Complete, LDC94S13A:
WSJ1, Continuous Speech Recognition Corpus, NIST/LDC, 1993, 34 discs
- CTIMIT, Cellular Telephone Acoustic-Phonetic Continuous Speech Corpus,
LDC96S30, 1995, 1 disc
- DCIEM/HCRC, LDC96S38, 12 parts
- ECI Multilingual Text, LDC94T5, 1 CD
- English Gigaword, LDC2003T05, 1 DVD
- FFMTIMIT, Acoustic-Phonetic Continuous Speech Corpus Secondary (Far
Field) Microphone Recordings, NIST Speech Disc 21-1.1, NTIS Order No.
PB95-504569, LDC96S32, 1 disc
- Hansard French/English, LDC95T20, 1 disc
- HCRC Map Task Corpus, Discs 1-4 of 8, Human Communication Research
Centre, University of Edinburgh,
LDC93S12, 1992, 8 discs
- Hong Kong Hansards Parallel Text, LDC2000T50, 1 disc
- ICE-GB (International Corpus of English, British Component), 1 disc
- Japanese Business News Text, LDC95T8, 1 disc
- JURIS, Justice Retrieval and Inquiry System, LDC98T32, 2 discs
- NTIMIT, Telephone Network Acoustic-Phonetic Continuous Speech Corpus, NIST
Speech Discs 10-1.1/10-2.1, NTIS Order No. PB92-502087,
LDC93S2, 1992, 2 discs
- RM1, Resource Management, Continuous Speech Database:
Speaker-Dependent Training Data, NIST Corpus 2-1.1 and 2-2.1, 1989,
NTIS Order No. PB89-226666, DARPA, 2 discs
Speaker Independent Training Data, NIST Disc 2-3.1, NTIS Order No.
PB90-500539, 1989, DARPA, 1 disc
Development Test and Evaluation Test Data and Scoring Software, NIST
Speech Disc 2-4.2, 1992, DARPA, 1 disc
LDC93S3B
- RM1, Resource Management, Continuous Speech Database, Isolated - and
Spelled - Word Data, NIST Speech Disc 2-5.1, 1996, DARPA, LDC96S39, 1 disc (2 copies)
- RM2, Extended Resource Management, Continuous Speech Speaker-Dependent Corpus
(RM2), NIST Speech Discs 3-1.2 and 3-2.2, NTIS Order No. PB90-501776,
LDC93S3C,
1990, 2 discs
- Santa Barbara Corpus of Spoken American English, LDC2000S85, 3 discs
- Santa Barbara Corpus of Spoken American English Part-II, LDC2003S06, 1 DVD
- SLX Corpus of Classic Sociolinguistic Interviews, LDC2003T15, 1 DVD
- SPIDRE, Speaker Identification Research Corpus, NIST speech discs 18-1.1
and 18-2.1, 1994, LDC94S15, 2 discs
- SPINE, Speech in Noisy Environments, LDC2000S87, 4 discs
- Switchboard Corpus, Recorded Telephone Conversations, NIST, 26 discs,
1992, obsolete
- Switchboard Corpus, Excerpts, Credit Card Conversations, NIST Speech Disc
8-1.2, LDC93S8,
1992, 1 disc
- Switchboard-1 Release 2, LDC97S62, 23 parts
- The Penn Treebank Project, Preliminary Release 0.5, 1992, LDC, 1 disc, obsolete
- The Penn Treebank Project, Release 2, 1995, LDC95T7, 1 disc
- The Penn Treebank Project, Release 3, LDC99T42, 1 disc
- The Prague Dependency Bank 1.0 (Czech), LDC2001T10, 1 disc
- Topic Detection and Tracking (TDT2), LDC2000S92, 2 discs [special user agreement]
- Topic Detection and Tracking (TDT3) Multilanguage Text 2, LDC2001T58, 1 disc
- TIDIGITS, Studio Quality Speaker-Independent Connected-Digit Corpus, NIST
Speech Discs 4-1, 4-2, 4-3, NTIS PB-91-506592,
LDC93S10, 1991
- TIMIT, Acoustic-Phonetic Continuous Speech Corpus, NIST Speech Disc 1-1.1,
NTIS Order No. PB91-505065, LDC93S1, 1990, 1 disc
- Tipster Complete, LDC93T3A, 3 discs [special user agreement]
- TRAINS Spoken Dialog Corpus, LDC95S25
- TREC (Text Research Collection) Vol. 4, 1 disc
- TREC (Text Research Collection) Vol. 5, 1 disc
- UN Parallel Text (Complete), LDC94T4A, 3 CDs
- VAHA, Voice Across Hispanic America, LDC96S41, 2 discs
- Voicemail Corpus Part 1, Speech and Transcripts, LDC98S77, 1 disc
- WSJ0, Continuous Speech Recognition Corpus, NIST, LDC93S6A, 1993, 15 discs
- 1997 Broadcast News Speech Corpus (CSR-VI: Hub 4), LDC98S71, 1997, 18 discs
- 1998 Speaker Recognition Evaluation, NIST/LDC,
LDC98S76, 1998, 6 discs
Corpora only available in archive form (as of
02/04/2004)
Some corpora are distributed via ftp by the LDC. Thus we don't
have any CDs for them and if we have downloaded them but not yet installed them,
they are listed here. The archives are stored on AFS under
/afs/ir/data/linguistic-data/ldc/LDC-tarfiles/ if not mentioned otherwise, and
all filenames contain the LDC catalogue number which should make the
identification of the corpus unproblematic.
Note that the just-mentioned directory
contains several other files which are archives of corpora that are already installed under AFS
and therefore not currently listed in this section.
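Because the archive file names contain the LDC catalogue number, a short script can check whether a given corpus has been downloaded. The following is only a minimal sketch: the function name `find_archives` is hypothetical, and it assumes you have read access to the directory named above and that the catalogue number appears literally (in upper or lower case) in the file name.

```python
import re
from pathlib import Path

# Directory named in the text above; adjust if the archives live elsewhere.
ARCHIVE_DIR = Path("/afs/ir/data/linguistic-data/ldc/LDC-tarfiles")


def find_archives(catalogue_number, archive_dir=ARCHIVE_DIR):
    """Return the sorted names of files whose name contains the catalogue number.

    Assumption: the number appears literally in the file name; the match is
    case-insensitive. A missing or unreadable directory yields an empty list.
    """
    archive_dir = Path(archive_dir)
    if not archive_dir.is_dir():
        return []
    pattern = re.compile(re.escape(catalogue_number), re.IGNORECASE)
    return sorted(
        entry.name
        for entry in archive_dir.iterdir()
        if entry.is_file() and pattern.search(entry.name)
    )


if __name__ == "__main__":
    # Example: look for an archive of the English Gigaword corpus.
    for name in find_archives("LDC2003T05"):
        print(name)
```

An empty result only means that no matching archive was found in that directory; the corpus may still be installed elsewhere on AFS or be available on CD.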
Corpora on the WWW (a very small collection)
This is only a very small collection of online corpora; please see the top 10 info-sources
page for links to sites with far more information.
- British National Corpus
By Beth Levin:
You can find information on doing searches by looking in the SARA
Manual (http://thetis.bl.uk/CHAP4/). This
manual is intended for the full version of the BNC, but it contains
information on pattern searching in section 3. With some ingenuity,
you can even do searches by lexical category. Since you only get 50
randomly chosen examples of the pattern you are searching for at a
time, if there are 80-100 or more examples of this pattern, search for
it again since you may get some more examples. (We now have the BNC in
AFS space, but we haven't installed the SARA server yet. But you can use
gsearch with it.)
Note - You may also find it useful to have
the BNC Basic Tagset open in a separate
window while doing your first searches in the BNC.
- COBUILD Corpus
By Beth Levin:
This web site also has material from a whole range of written and
spoken sources, used in the development of the Collins COBUILD
dictionaries and ESL materials. It allows for some relatively
sophisticated searching; click on ``query syntax'' for details.
Particularly nice is ``@'': read@ searches for all the
inflected forms of read! You only get 40 examples at a time,
but can partially get around this in the same way as with the BNC.
The other shortcoming is that the window of text is very small --
often not even a whole sentence. If you want more text, you have to
search for the string of text defining the left or right edge of the
example.
Note - There is also the
Cobuild Concordance and Collocations Sampler.
- Lexis-Nexis Academic Universe
By Beth Levin:
This web site has material from major newspapers, wire services, and
television news programs. Although it is not a ``balanced'' corpus,
it has so much text that you can find things you won't find anywhere
else (well, maybe through a web search!); for example, outgraze,
outdefend, outweed or he reads himself quasi-blind and
the fans spin themselves dizzy. The other major drawback is
that it is not designed for linguists, so that you cannot take full
advantage of what's there and you also need to work around search
procedures that were designed for research into current events.
From the Lexis-Nexis home page choose ``General News''. From
``General News'' choose the ``More Options'' tab; using the ``Basic''
tab will only allow searches of the title and first paragraph of an
article! On the ``More Options'' search page, always be sure to click
on ``Headline'' and then select ``Full text'' from the popup menu that
comes up. You can select from a range of sources; the default is
``Major Newspapers''; similarly, you can also select from a range of
dates. When you get the list of results, click on ``Expanded List'';
this will show you the part of the text with your search pattern for
the results.
There are tips at the bottom of the search page that tell you how to
construct searches. A few pointers. You can use ``!'' as a wild card
for truncation on the right edges of words. Also helpful is
``pre/n'', for some ``n'', which allows you to search for a word that
precedes another within a window of n words; ``read! pre/2 way'' will
find patterns such as reads her way or reading our slow
way, etc. (But ``read!'' will also find reader.) Another
disadvantage: Lexis-Nexis has a large list of stop words -- words
that you can't search for -- including just about any determiner,
auxiliary, and preposition.
- OED new version (slow and requires a graphical browser)
or old version (fast and lynx-friendly)
- DIALOGUE DIVERSITY CORPUS: Version 2.0
From their website:
The DDC gives direct access to a set of dialogue transcripts (13 sources, more than 12 hours of dialogue, all in English.). It also gives a set of links and methods for indirect access to hundreds of additional dialogues (principally in English.) Many sources provide speech data as well as transcripts. The emphasis is on free or inexpensive access.
Volume 2.0 presents access to hundreds of dialogues that were not represented in the original release in October 2002. It is more diverse in terms of situations and dynamic patterns. Access to oral history interviews, the Watergate tapes (by several paths), diverse regional varieties of English (both British and international), the just-emerging American National Corpus (ANC), the U. S. Supreme Court, and other originally non-linguistic sources are presented for the first time.
The dialogues in this corpus occurred in a very diverse collection of interactive situations. Thus it is a data resource for studies of the breadth of coverage of particular dialogue models, and for studies that compare dialogue from different situations.
- TITUS corpus and search
engine [signed license conditions &
user agreement need to be faxed to the number stated on the user agreement].
By Florian Jaeger:
TITUS is a large collection of Indo-European text that can be searched with the TITUS search engine.
Several
types of searches are available. You can search specific texts, restrict the search to specific languages
(e.g. Farsi) or language groups (e.g. Avestan, Lithuanian, or simply Old Prussian) or combinations of these
restrictions. Wildcards (e.g. '*') or logical operators (e.g. 'AND' or 'OR') can be used, too. The output is a
list of texts that contain matches to your search. The site also contains lexica, links to several other text
databases, tutorials, etc. A rich source for anyone working on Indo-European languages.
- COSMAS II is a German giga corpus with almost 2 billion (!)
text words. It is accessible via the COSMAS II Online Client. Unfortunately, all help and information
available for this corpus is given in German.
By Florian Jaeger: COSMAS II contains morphosyntactically annotated parts, speech, formal writing (news texts, novels, etc.), as
well as special sociolinguistic corpora. You can load subcorpora (so-called 'archives') and construct searches using a text or a graphical
interface, both of which take some getting used to (but it's worth it). The searches have almost regular-expression power: you can search
for tagged morphosyntactic information, save your searches, and filter your results. An inflection and derivation operator, '&', is also available.
- Right in front of your door you can find the
Text@Humanities
project of the Human Digital Information Service, a collection of online searchable (literary) texts drawn from American
English, Irish, other varieties of English, German, French and Spanish.
- The W3-Corpora Search Engine is
an online search engine on a large collection of corpora (still in prototype stage but looks promising).
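The ``!'' truncation and ``pre/n'' operators described in the Lexis-Nexis entry above can be approximated offline with a regular expression, for example to filter text you have already downloaded. This is only a sketch and not how Lexis-Nexis itself works: the function name `proximity_regex` is hypothetical, and it assumes that ``pre/n'' allows at most n words between the two search terms, which is consistent with the examples given in that entry.

```python
import re


def proximity_regex(stem, target, n):
    """Approximate the query '<stem>! pre/<n> <target>' as a regular expression.

    Assumptions: '!' matches any run of letters after the stem, and 'pre/n'
    allows at most n words between the two search terms.
    """
    word = r"[A-Za-z'-]+"
    return re.compile(
        rf"\b{re.escape(stem)}[A-Za-z]*(?:\s+{word}){{0,{n}}}?\s+{re.escape(target)}\b",
        re.IGNORECASE,
    )


pattern = proximity_regex("read", "way", 2)
examples = [
    "She reads her way through the archive.",
    "They were reading our slow way of working.",
    "A reader found a way.",
]
for text in examples:
    match = pattern.search(text)
    print(text, "->", match.group(0) if match else None)
```

As with the real ``read!'' query, the third example matches through reader, so results still need to be checked by hand.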