What corpora do we have?

Stanford has been a subscribing member of the Linguistic Data Consortium (LDC) for several years, so we have most of the corpora the LDC has released. You can use the LDC Catalog, which has a very useful search facility, to find corpora suited to your interests. If you find a corpus that you want to use and we don't have it, we can probably order it.

For full details of all the corpora we provide access to (including LDC corpora, other licensed corpora, and freely available data sets), see our corpus inventory.

Where are the corpora?

The corpus inventory lists the location of each corpus that we own. The corpora are all stored either:

How do I use the corpora?

Corpora stored on the Stanford network are located at /afs/ir/data/linguistic-data on AFS. If you aren't familiar with accessing AFS space, read the instructions at IT services.
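Once AFS is mounted and you are authenticated, a quick way to see what is available is to list the directory above. The following is a minimal Python sketch, not an official tool: the function name list_corpora is hypothetical, and it assumes each corpus lives in its own subdirectory directly under that path, which may not match the actual layout. The corpus inventory remains the authoritative listing.

```python
from pathlib import Path

# Path taken from the text above; it only resolves on a machine with AFS mounted.
CORPUS_ROOT = Path("/afs/ir/data/linguistic-data")


def list_corpora(root=CORPUS_ROOT):
    """Return the sorted names of the subdirectories of root.

    Returns an empty list if root does not exist or is not a directory.
    Assumption: one subdirectory per corpus (hypothetical layout).
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    names = list_corpora()
    if names:
        print("\n".join(names))
    else:
        print(f"{CORPUS_ROOT} not found; check that AFS is mounted.")
```

If the script reports that the path was not found, the usual cause is that AFS is not mounted on the machine you are using.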

We maintain a list of useful tools and utilities which will help you use the data.