For the Corpus TA

Overview

Information for future Corpus TAs will be posted here. If you are a new Corpus TA, or are thinking about becoming one, make sure that you read the information provided below.

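Currently, the following topics are covered on this page:

  • Managing user access to AFS and CC
  • Keeping track of the CDs
  • Ordering, storage, & installation of new corpora and software
  • Detailed instructions for storing ftp and email LDC corpora on AFS
  • Detailed instructions for installations on AFS
  • Creating and managing groups for special access restrictions on AFS
  • Switching to a new Corpus TA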

Managing user access to AFS and CC

The Corpus TA manages access to AFS and CC. For AFS there are several groups set up (see also Switching to a new Corpus TA below). The general group allowing access to all corpora on AFS that do not require special user agreements is corpora-general. Before giving AFS or CC access to anyone, be sure that the new user understands the license and copyright conditions (i.e. point the user to this site for information). Currently the standard procedure when adding a user to AFS is:

  1. Before anything else, make sure that the user actually has a Leland account (i.e. access to AFS). This is not the same as having a SUNET-ID! Many users aren't aware of this themselves, and the easiest way to find out is to attempt to change to the user's home directory:
    cd ~USER-SUNET-ID
    The Corpus TA cannot set up AFS accounts. He/She can only give access to the Linguistics Department's AFS directory to users who already have an AFS/Leland account.
  2. Log into AFS and type:
    pts adduser SUNET-ID-NEWUSER SUNET-ID-OF-CORPUS_TA:corpora-general
    For more info, see "man pts". The following command will tell you who is currently a member of a certain group:
    pts membership SUNET-ID-OF-CORPUS_TA:CORPORA-GROUP
  3. In the file:
    /afs/ir/dept/linguistics/WWW/corpora/admin/corpora-general-group
    add a line (in chronological order):
    SUNET-ID-NEWUSER - NAME (DEPT/ADVISOR etc)
  4. Save the e-mail message with the request to the mail file:
    /afs/ir/dept/linguistics/WWW/corpora/admin/requests-from-users
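
The steps above can be sketched as a small dry-run helper. This is only an illustration: the function name add_user_plan and the example IDs are made up, and the script prints the commands and the log entry instead of running pts or editing any file.

```shell
#!/bin/sh
# Dry-run sketch: prints what would be done when adding a user to
# corpora-general. Nothing is executed against AFS.
# add_user_plan is a hypothetical helper name, not an existing tool.

add_user_plan() {
    new_user=$1   # SUNET-ID of the new user
    ta=$2         # SUNET-ID of the Corpus TA (owner of the group)
    note=$3       # name and department/advisor for the log line

    # Step 1: check that the user has an AFS home directory
    echo "cd ~$new_user"
    # Step 2: add the user to the general corpora group
    echo "pts adduser $new_user $ta:corpora-general"
    # Step 3: entry to add to the corpora-general-group log file
    echo "log entry: $new_user - $note"
    # Step 4 (saving the request e-mail) is done by hand in the mail client.
}

# Example call with made-up IDs
add_user_plan jdoe tauser "Jane Doe (Linguistics)"
```

Printing the commands first makes it easy to review them before pasting them into an AFS session.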

The CC currently has only one shared user account for all users. This may have to be changed but as long as it isn't, simply give a new user the password.

Keeping track of the CDs

Well, in a nutshell, do so! Each CD costs money. Replacing a lost LDC corpus CD costs $100. As of 11/22/2003, all CDs that were formerly located at CSLI have been moved to the Linguistics Department's chair office. All future acquisitions should be collected at the same location. Occasionally somebody will want to check out some CDs. Make sure that you have that person sign the necessary user agreements (if any) and note the checkout in the blue folder that is located on the same shelf as the CDs.

Ordering, storage, & installation of new corpora and software

The Corpus TA is responsible for the installation of new corpora (on AFS and CC). There are a few guidelines that are worth considering:

  • In the past, new LDC corpora were ordered by Chris Manning. The plan for the future is that this will be done by the Corpus TA. For non-free, non-LDC corpora a solution has yet to be found.
  • Newly arrived corpora and software should be registered on the webpage since this is (so far) our only log of our inventory. It is important that this log is up-to-date since otherwise it is impossible to keep track of what we own and what not.
  • We decided to store all original CDs, etc. in the chair's office - ergo: ... bring them there =).
  • As long as CC is not accessible via the network, corpora that are of interest to a wider community of corpora users at Stanford should be (also) stored under AFS. The same holds for new software if it runs under UNIX.
  • CDs should be placed online in their original, unadulterated form, except for uncompressing files. That wasn't always done in the early days, and it's just a lot easier for others to start from the original and do what they want with it than to try to undo what someone else did.
  • After the installation, the corpus community should be informed about the new acquisitions. Use the corpus email list and update the webpage section on 'new acquisitions'. It seems like a good idea to - every once in a while - send an email to the whole department (rather than the list) containing all new tools and corpora.

Detailed instructions for storing ftp and email LDC corpora on AFS

For small corpora (that we do not own a CD of), please keep the original tar-archives for backup at:

    /afs/ir/data/linguistic-data/mnt/mnt21/LDC-tarfiles

Detailed instructions for installations on AFS

To install a new corpus/new software:

  1. To see which partition has a sufficient amount of free space (see note below for details), check the disk usage logfile (be aware that it may be out of date) at:
    /afs/ir/dept/linguistics/WWW/corpora/admin/disk-usage
  2. Go to that partition, e.g. (note that currently all directories for software, i.e. /lib, /src, /bin, and /doc, as well as /download, are located on /mnt16):
    cd /afs/ir/data/linguistic-data/mnt/mnt25
  3. Make a new directory there.
  4. If the corpus needs special access restrictions, make the directory readable for the relevant group before copying the corpus (see Creating and managing groups for special access restrictions on AFS for details). Note that in the default case GROUP-NAME will be 'corpora-general', the group for all files without special access restrictions:
    fs sa DIRNAME SUNET-ID-OF-CORPUS-TA:GROUP-NAME read
    If the new corpus/tool has special access restrictions (i.e. if it should not be accessible to the default group 'corpora-general'), you will have to specify this explicitly (since apparently the default is that files are readable for users in 'corpora-general'):
    fsr sa DIRNAME SUNET-ID-OF-CORPUS-TA:corpora-general none
    Note that if you want to do this after a corpus has already been copied, you have to use a recursive command, which differs slightly:
    fsr sa -dir TARGET-DIR -acl SUNET-ID-OF-CORPUS-TA:GROUP-NAME read -m
  5. Put the corpus in the new directory.
  6. If the corpus is from the LDC, make a link to it from
    /afs/ir/data/linguistic-data/ldc/
    using the naming scheme the other corpora there follow, i.e. starting with the LDC number, etc. To make a link, use the 'ln' command (since it allows cross-partition links, unlike 'link'):
    ln -s TARGET LINK_NAME
  7. Make a link to the corpus in /afs/ir/data/linguistic-data using a simple & descriptive name. If the corpus is from the LDC, link to the ldc directory, otherwise straight to the mount partition. If it's part of one of the thematic collections (like "TextCat"), make a link from within the relevant subdirectory.
  8. Modify /afs/ir/dept/linguistics/WWW/corpora/admin/disk-usage to reflect which corpus was added on which partition and how much free space remains. To determine how much space is taken you can type (where 'mntXX' is a placeholder for the target partition that you are interested in):
    du -ks /afs/ir/data/linguistic-data/mnt/mntXX
    This will give you an estimate of the space used, in kilobytes. Each partition has 2GB. Subtract the used space from the partition size to get the free space.
  9. Update the corpus page (this is the only way we keep track of what is installed where). Put a link in the 'Recent Acquisition' section and also add the corpus description and location to the page (e.g. under 'Corpora on AFS'). Update the 'Date of last change'. You may also consider sending an email to the corpora@csli email list (basically follow the instructions under Ordering, storage, & installation of new corpora and software).
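
As a rough illustration of the directory, ACL and link steps above, the following dry-run sketch prints the commands for one corpus. The function name install_plan, the example directory and link names, and the use of mkdir are assumptions made for illustration; the fs and ln commands are the ones given in the list. The sketch covers only the default case where the directory is readable for 'corpora-general'.

```shell
#!/bin/sh
# Dry-run sketch of the directory, ACL and link steps: prints commands only.
# install_plan is a hypothetical helper name; nothing is executed on AFS.

BASE=/afs/ir/data/linguistic-data

install_plan() {
    part=$1       # target partition, e.g. mnt25
    dir=$2        # name of the new directory on that partition
    ta=$3         # SUNET-ID of the Corpus TA
    ldc_name=$4   # link name under ldc/ (empty string for non-LDC corpora)
    short=$5      # simple, descriptive link name

    target=$BASE/mnt/$part/$dir
    echo "mkdir $target"
    echo "fs sa $target $ta:corpora-general read"
    echo "# now copy the corpus into $target"
    if [ -n "$ldc_name" ]; then
        # LDC corpus: link from ldc/, then link the short name to the ldc entry
        echo "ln -s $target $BASE/ldc/$ldc_name"
        echo "ln -s $BASE/ldc/$ldc_name $BASE/$short"
    else
        # Non-LDC corpus: link the short name straight to the partition
        echo "ln -s $target $BASE/$short"
    fi
}

# Example calls with made-up names
install_plan mnt25 example-corpus tauser "" example
install_plan mnt25 example-ldc tauser LDC-NUMBER-example example-ldc
```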

NOTE: For small corpora (that we do not own a CD of), please keep the original tar-archives for backup (see Detailed instructions for storing ftp and email LDC corpora on AFS).
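
The free-space arithmetic for the disk-usage log can be sketched as follows. It assumes that "2GB" means 2 x 1024 x 1024 kilobytes, which this page does not state explicitly; free_kb is a made-up helper name and the input figure is invented.

```shell
#!/bin/sh
# Sketch: remaining space on a partition, given the kilobytes reported
# by "du -ks". Assumes 2 GB = 2 * 1024 * 1024 KB (an assumption).

PARTITION_KB=$((2 * 1024 * 1024))

free_kb() {
    used_kb=$1
    echo $(($PARTITION_KB - $used_kb))
}

# Example with an invented usage figure; in practice, take the first
# field of the output of:
#   du -ks /afs/ir/data/linguistic-data/mnt/mntXX
free_kb 1500000   # prints 597152
```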

Creating and managing groups for special access restrictions on AFS

There are some corpora on AFS with special license requirements. To restrict user access to those corpora you can restrict access to the directory in which the relevant corpus is stored to members of a certain group. Several groups are already in use (some are listed below under Switching to a new Corpus TA). To create a new group type:

    pts creategroup -name SUNET-ID-OF-CORPUS_TA:GROUP-NAME -owner SUNET-ID-OF-CORPUS_TA

More generally you can use 'pts' with all its subcommands (e.g. adduser, delete user or group, check members of a group, etc.) to manage the groups you created. This is a necessary part of several typical tasks (e.g. Managing user access to AFS and CC).
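
A few of the pts subcommands that typically come up for a restricted group are sketched below as a dry run. The helper name group_plan and the example group and user names are hypothetical; the script only prints the commands. See "man pts" for the full list of subcommands.

```shell
#!/bin/sh
# Dry-run sketch: prints typical pts commands for a restricted group.
# group_plan is a hypothetical helper name; nothing is executed on AFS.

group_plan() {
    ta=$1      # SUNET-ID of the Corpus TA
    group=$2   # group name without the owner prefix
    user=$3    # SUNET-ID of a user to add and later remove

    echo "pts creategroup -name $ta:$group -owner $ta"
    echo "pts adduser $user $ta:$group"
    echo "pts membership $ta:$group"
    echo "pts removeuser $user $ta:$group"
}

# Example call with made-up names
group_plan tauser corpora-example jdoe
```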

Switching to a new Corpus TA

Aside from the obvious tasks that go along with switching to a new Corpus TA at the end of the year, there are a couple of specific things that have to be done:

  • Change ownership of the groups on AFS:
    pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-general SUNET-ID-OF-NEW-CORPUS_TA
    pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-tipster SUNET-ID-OF-NEW-CORPUS_TA
    pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-tdtpilot SUNET-ID-OF-NEW-CORPUS_TA
    pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-ppcme2 SUNET-ID-OF-NEW-CORPUS_TA
    pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-celex SUNET-ID-OF-NEW-CORPUS_TA

    (plus any new groups that may have been added later).
  • Switch to or add the new Corpus TA as an owner/moderator of the corpora@csli email list. To administer the corpora@csli.stanford.edu email list (e.g. switch owners), see the help pages.
  • The new Corpus TA will need the passwords for the CC. At least once a year they should be changed, so this is a good opportunity (but inform the department about it).
  • Get administrator permission for the corpora website (on the department's WWW site):
      fsr -m sa -dir /afs/ir/dept/linguistics/WWW/corpora -acl SUNET-ID-OF-NEW-CORPUS_TA all
  • Give administrator permissions for all files on AFS to the new Corpus TA. It may be a good idea for the old Corpus TA to keep the same permissions for some time to be able to help out here and there. To change ownership of all AFS directories under /afs/ir/data/linguistic-data/, log into AFS and type:
    fsr -m sa -dir /afs/ir/data/linguistic-data -acl SUNET-ID-OF-NEW-CORPUS_TA all
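
The AFS part of the handover can be sketched as a dry run that prints the commands listed above for a given pair of IDs. The function name handover_plan and the example IDs are made up; the group list is the one on this page and would need extending for any groups added later.

```shell
#!/bin/sh
# Dry-run sketch: prints the AFS handover commands from the list above.
# handover_plan is a hypothetical helper name; nothing is executed on AFS.

handover_plan() {
    old=$1   # SUNET-ID of the old Corpus TA
    new=$2   # SUNET-ID of the new Corpus TA

    # Groups listed on this page; add any groups created later.
    for g in corpora-general corpora-tipster corpora-tdtpilot \
             corpora-ppcme2 corpora-celex; do
        echo "pts chown $old:$g $new"
    done

    # Administrator permissions on the website and the corpus data
    echo "fsr -m sa -dir /afs/ir/dept/linguistics/WWW/corpora -acl $new all"
    echo "fsr -m sa -dir /afs/ir/data/linguistic-data -acl $new all"
}

# Example call with made-up IDs
handover_plan oldta newta
```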