Corpora@Stanford

First-time users

You must register before using our corpus resources. Click here for instructions.

You might find it useful to start with a basic introduction to the corpora we have, how to get to them, and how to get started searching them.

What corpora do we have?

Stanford has been a subscribing member of the Linguistic Data Consortium (LDC) for several years, so we have most of the corpora the LDC has released. You can use the LDC Catalog, which has a very useful search facility, to find corpora suited to your interests. If you find a corpus that you want to use and we don't have it, we can probably order it.

For full details of all the corpora we provide access to (including LDC corpora, other licensed corpora, and freely available data sets), see our corpus inventory.

Where are the corpora?

The corpus inventory lists the location of each corpus that we own. Each corpus is stored in one of three places:

on Stanford's AFS filesystem,
on CD/DVD in the Chair's Office, by the Linguistics department administrative offices,
or on LDC Online, available for download (this applies to some older corpora; the corpus TA can download these for you).

How do I use the corpora?

Corpora stored on the Stanford network are located at /afs/ir/data/linguistic-data on AFS. If you aren't familiar with accessing AFS space, read the instructions at IT services.
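Once AFS is accessible from your machine, a quick way to see what is available is to list the top-level directories under the corpus path. The following is a minimal sketch in Python, assuming that AFS is mounted at /afs, that your account has read access, and that each corpus lives in its own subdirectory (the actual layout may differ; the corpus inventory is the authoritative source). The function name list_corpora is our own, chosen for illustration.

```python
from pathlib import Path

# Location given above; assumes AFS is mounted at /afs on this machine.
CORPUS_ROOT = Path("/afs/ir/data/linguistic-data")


def list_corpora(root=CORPUS_ROOT):
    """Return the sorted names of the subdirectories directly under root.

    Returns an empty list if root is missing or unreadable, for example
    when AFS is not mounted or you have not registered for access.
    """
    root = Path(root)
    try:
        return sorted(p.name for p in root.iterdir() if p.is_dir())
    except OSError:
        return []


if __name__ == "__main__":
    names = list_corpora()
    if names:
        for name in names:
            print(name)
    else:
        print(f"Nothing readable under {CORPUS_ROOT}; is AFS mounted?")
```

If the script reports nothing readable, check first that you have registered and that you can reach AFS at all before assuming a corpus is missing.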

We maintain a list of useful tools and utilities which will help you use the data.
