Overview
Information for future Corpus TAs will be posted here. If you are a new Corpus TA
or you are thinking about becoming the Corpus TA, make sure that you read the information
provided below.
Currently, the following topics are covered on this page:
Managing user access to AFS and CC
The Corpus TA manages access to AFS and CC. For AFS there are several groups set up (see also
Switching to a new Corpus TA below). The general group allowing access to
all corpora on AFS that do not require special user agreements is corpora-general.
Before giving AFS or CC access to anyone, be sure that the new user understands the license
and copyright conditions (i.e. point the user to this site for information). Currently
the standard procedure when adding a user to AFS is:
- Before anything else, make sure that the user actually has a Leland account (i.e. access to AFS).
This is not the same as having a SUNET-ID! Many users aren't aware of this themselves, and
the easiest way to find out is to attempt to change to the user's account:
The Corpus TA cannot set up AFS accounts. He or she can only give access to the Linguistics
Department's AFS directory to users who already have an AFS/Leland account.
- log into AFS and type:
pts adduser SUNET-ID-NEWUSER SUNET-ID-OF-CORPUS_TA:corpora-general
- in the file:
/afs/ir/dept/linguistics/WWW/corpora/admin/corpora-general-group
add a line (in chronological order)
SUNET-ID-NEWUSER - NAME (DEPT/ADVISOR etc)
For more info, see "man pts". The following command will tell you who is currently a
member of a certain group.
pts membership SUNET-ID-OF-CORPUS_TA:CORPORA-GROUP
- save the e-mail message with the request to the mail file:
/afs/ir/dept/linguistics/WWW/corpora/admin/requests-from-users
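The steps for adding a user could be wrapped in a small shell helper. What follows is only a sketch under stated assumptions: the function name add_corpus_user, the DRY_RUN switch, and the example IDs are hypothetical and are not part of any existing tooling. By default it only prints what it would do. With DRY_RUN=0 it runs the pts command shown above and appends the log line to the end of the given file, which matches chronological order only if entries are always added as they arrive.

```shell
#!/bin/sh
# Hypothetical helper (not an existing script): add a user to corpora-general
# and record the addition in the group log file.
# Usage: add_corpus_user NEWUSER CORPUS_TA "NAME (DEPT/ADVISOR etc)" LOGFILE
add_corpus_user() {
    # Log line in the format used by the corpora-general-group file.
    acu_entry="$1 - $3"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: show what would happen without touching anything.
        echo "would run: pts adduser $1 $2:corpora-general"
        echo "would log: $acu_entry"
        return 0
    fi

    # Real run: add the user first, and only log if that succeeded.
    pts adduser "$1" "$2:corpora-general" || return 1
    printf '%s\n' "$acu_entry" >> "$4"
}
```

A trial call might look like: add_corpus_user jdoe corpusta "Jane Doe (Linguistics)" /afs/ir/dept/linguistics/WWW/corpora/admin/corpora-general-group, where jdoe and corpusta are made-up IDs.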
The CC currently has only one shared user account for all users. This may have to be changed
but as long as it isn't, simply give a new user the password.
Keeping track of the CDs
Well, in a nutshell, do so! Each CD costs money. Replacing a lost LDC corpus CD costs
$100. As of 11/22/2003, all CDs that were formerly located at CSLI have been moved to
the Linguistics Department's chair office. All future acquisitions should be collected at the
same location. Occasionally somebody will want to check out some CDs - make sure that you have
that person sign the necessary user agreements (if any) and note the check out in the blue folder
that is located on the same shelf as the CDs.
Ordering, storage, & installation of new corpora and software
The Corpus TA is responsible for the installation of new corpora (on AFS and CC). There are a few
guidelines that are worth considering:
- In the past, new LDC corpora were ordered by Chris Manning. The plan for the future
is that this will be done by the Corpus TA. For non-free, non-LDC corpora a solution has
yet to be found.
- Newly arrived corpora and software should be registered on the webpage since this is
(so far) our only log of our inventory. It is important that this log is up-to-date, since
otherwise it is impossible to keep track of what we own and what we do not.
- We decided to store all original CDs, etc. in the chair's office - ergo: ... bring
them there =).
- As long as CC is not accessible via the network, corpora that are of interest
to a wider community of corpora users at Stanford should be (also) stored under AFS. The
same holds for new software if it runs under UNIX.
- CDs should be placed online in their original, unadulterated form, apart from
uncompressing files. That wasn't always done in the early days, and it's just a lot
easier for others to start from the original and do to it what they want to do than
to try to undo what someone else did.
- After the installation, the corpus community should be informed about the new
acquisitions. Use the corpus email list and update the webpage section on 'new acquisitions'.
It seems like a good idea to - every once in a while - send an email to the whole department
(rather than the list) containing all new tools and corpora.
Detailed instructions for storing ftp and email LDC corpora on AFS
For small corpora (that we do not own a CD of), please keep the original tar-archives
for backup at:
/afs/ir/data/linguistic-data/mnt/mnt21/LDC-tarfiles
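Storing an archive can be sketched as a copy followed by a byte-for-byte check. This is a minimal sketch, not an established procedure: the function name backup_tarfile is hypothetical, and the refusal to overwrite an existing archive is an assumption about what is wanted.

```shell
#!/bin/sh
# Hypothetical helper: copy an original tar archive into a backup directory
# and verify the copy before reporting success.
# Usage: backup_tarfile ARCHIVE BACKUP_DIR
backup_tarfile() {
    bt_src=$1
    bt_dest=$2/$(basename "$bt_src")

    # Assumption: never overwrite an archive that is already stored.
    if [ -e "$bt_dest" ]; then
        echo "already exists: $bt_dest" >&2
        return 1
    fi

    # -p keeps the original timestamps and mode where possible.
    cp -p "$bt_src" "$bt_dest" || return 1

    # cmp -s is silent and returns non-zero if the two files differ.
    cmp -s "$bt_src" "$bt_dest" || return 1

    # Print where the archive ended up.
    echo "$bt_dest"
}
```

For example, backup_tarfile ARCHIVE.tar /afs/ir/data/linguistic-data/mnt/mnt21/LDC-tarfiles, with ARCHIVE.tar standing in for the downloaded file.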
Detailed instructions for installations on AFS
To install a new corpus/new software:
- To see which partition has a sufficient amount of free space (see note below
for details), check the disk usage logfile (be aware that it may be out of date) at:
/afs/ir/dept/linguistics/WWW/corpora/admin/disk-usage
- Go to that partition, e.g. (note that currently all directories for software - /lib, /src, /bin, and
/doc, as well as /download - are located on /mnt16):
cd /afs/ir/data/linguistic-data/mnt/mnt25
- Make a new directory there.
- If the corpus needs special access restrictions, make the directory readable for the relevant group
before copying the corpus (see Creating and managing groups for special access restrictions
on AFS for details). Note that in the default case GROUP-NAME will be 'corpora-general', the group for all files
without special access restrictions:
fs sa DIRNAME SUNET-ID-OF-CORPUS-TA:GROUP-NAME read
If the new corpus/tool has special access restrictions (i.e. if it should not be accessible to the default group
'corpora-general'), you will have to specify this explicitly, since apparently the default is that files are readable
for users in 'corpora-general':
fsr sa DIRNAME SUNET-ID-OF-CORPUS-TA:corpora-general none
Note that if you want to do this after a corpus has already been copied, you have to use a recursive command,
which differs slightly:
fsr sa -dir TARGET-DIR -acl SUNET-ID-OF-CORPUS-TA:GROUP-NAME read -m
- Put the corpus in the new directory.
- If the corpus is from the LDC, make a link to it from
/afs/ir/data/linguistic-data/ldc/
using the naming scheme the other corpora there follow, i.e. starting with the LDC number, etc. To make
a link use the 'ln' command (since it allows cross-partition links, unlike 'link'):
- Make a link to the corpus in /afs/ir/data/linguistic-data using a simple & descriptive name.
If the corpus is from the LDC, link to the ldc directory, otherwise straight to the mount partition.
If it's part of one of the thematic collections (like "TextCat"), make a link from within the relevant
subdirectory.
- Modify /afs/ir/dept/linguistics/WWW/corpora/admin/disk-usage to reflect which corpus got added on
what partition and how much free space remains. To determine how much space is taken
you can type (where 'mntXX' is a placeholder for the target partition that you are interested in):
du -ks /afs/ir/data/linguistic-data/mnt/mntXX
This will give you an estimate of the space used, in kilobytes.
Each partition has 2GB. Subtract the used space from the available
space.
- Update the corpus page (this is the only way we keep track of what is installed where). Put a link
in the 'Recent Acquisition' section and also add the corpus description and location to the page (e.g.
under 'Corpora on AFS'). Update the 'Date of last change'. You may also consider sending an email to the
corpora@csli email list (basically follow the instructions under Ordering, storage, &
installation of new corpora and software).
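The free-space arithmetic from the disk-usage step can be sketched in shell. The 2GB partition size comes from the text above; treating 1GB as 1024 * 1024 kilobytes is an assumption, and the function names free_kb and used_kb are hypothetical.

```shell
#!/bin/sh
# Partition size stated on this page: 2GB, expressed in kilobytes.
# Assumption: 1GB = 1024 * 1024 KB.
PARTITION_KB=$((2 * 1024 * 1024))

# Print the remaining kilobytes, given the used kilobytes.
# Usage: free_kb USED_KB
free_kb() {
    echo $((PARTITION_KB - $1))
}

# Print the used kilobytes of a directory (first field of the du -ks output).
# Usage: used_kb DIRECTORY
used_kb() {
    du -ks "$1" | awk '{ print $1 }'
}
```

For a partition on which du -ks reports 1500000, free_kb 1500000 prints 597152. The two helpers can be combined as free_kb "$(used_kb /afs/ir/data/linguistic-data/mnt/mntXX)".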
NOTE: For small corpora (that we do not own a CD of), please keep the original tar-archives
for backup (see Detailed instructions for storing ftp and email LDC corpora on AFS).
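The two linking steps in the installation list can be sketched as follows. This assumes that symbolic links (ln -s) are what is intended, since hard links cannot span partitions; the function name link_corpus and the checks it performs are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create a symbolic link NAME inside LINK_DIR that
# points at TARGET. Symbolic links are an assumption; they are the kind of
# link that can cross partitions.
# Usage: link_corpus TARGET LINK_DIR NAME
link_corpus() {
    lc_target=$1
    lc_link=$2/$3

    # The target must be an existing directory (a link to one also counts).
    if [ ! -d "$lc_target" ]; then
        echo "no such directory: $lc_target" >&2
        return 1
    fi

    # Do not silently replace an existing entry, including a dangling link.
    if [ -e "$lc_link" ] || [ -L "$lc_link" ]; then
        echo "already exists: $lc_link" >&2
        return 1
    fi

    ln -s "$lc_target" "$lc_link"
}
```

For an LDC corpus this would be called twice: once with the directory on the mount partition as TARGET and /afs/ir/data/linguistic-data/ldc as LINK_DIR, and once with the new ldc entry as TARGET and /afs/ir/data/linguistic-data as LINK_DIR.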
Creating and managing groups for special access restrictions on AFS
There are some corpora on AFS with special license requirements. To
restrict user access to those corpora you can restrict access to the directory in which the relevant corpus is
stored to members of a certain group. Several groups are already in use (some are listed below under
Switching to a new Corpus TA). To create a new group type:
pts creategroup -name SUNET-ID-OF-CORPUS_TA:GROUP-NAME -owner SUNET-ID-OF-CORPUS_TA
More generally you can use 'pts' with all its subcommands (e.g. adduser, delete user or group, check members of a
group, etc.) to manage the groups you created. This is a necessary part of several typical tasks (e.g.
Managing user access to AFS and CC).
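Setting up a restricted corpus combines the group creation above with the two access commands from the installation instructions. The sketch below only prints those commands so that they can be reviewed before being run; the function name print_restrict_cmds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the commands from this page that
# create a group and restrict one corpus directory to it.
# Usage: print_restrict_cmds CORPUS_TA GROUP_NAME DIRNAME
print_restrict_cmds() {
    # Create the new group, owned by the Corpus TA.
    echo "pts creategroup -name $1:$2 -owner $1"
    # Make the directory readable for the new group.
    echo "fs sa $3 $1:$2 read"
    # Remove the default access of corpora-general.
    echo "fsr sa $3 $1:corpora-general none"
}
```

For example, print_restrict_cmds corpusta corpora-newcorpus NEWDIR prints three lines, where corpusta, corpora-newcorpus, and NEWDIR are made-up values.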
Switching to a new Corpus TA
Aside from the obvious tasks that go along with switching to a new Corpus TA at the end of the year,
there are a couple of specific things that have to be done:
- Change ownership of the groups on AFS:
pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-general SUNET-ID-OF-NEW-CORPUS_TA
pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-tipster SUNET-ID-OF-NEW-CORPUS_TA
pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-tdtpilot SUNET-ID-OF-NEW-CORPUS_TA
pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-ppcme2 SUNET-ID-OF-NEW-CORPUS_TA
pts chown SUNET-ID-OF-OLD-CORPUS_TA:corpora-celex SUNET-ID-OF-NEW-CORPUS_TA
(plus any new groups that may have been added later).
- Switch to or add the new Corpus TA as an owner/moderator of the corpora@csli email list.
To admin the corpora@csli.stanford.edu email-list (e.g. switch owners), see the
help pages.
- The new Corpus TA will need the passwords for the CC. At least once a year they should
be changed, so this is a good opportunity (but inform the department about it).
- Get administrator permission for the corpora website (on the department's WWW site):
fsr -m sa -dir /afs/ir/dept/linguistics/WWW/corpora -acl SUNET-ID-OF-NEW-CORPUS_TA all
- Give administrator permissions for all files on AFS to the new Corpus TA. It may be a good idea
for the old Corpus TA to keep the same permissions for some time to be able to help out here and there.
To grant these permissions on all AFS directories under /afs/ir/data/linguistic-data/, log into
AFS and type:
fsr -m sa -dir /afs/ir/data/linguistic-data -acl SUNET-ID-OF-NEW-CORPUS_TA all
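The group handover in the first step lends itself to a loop. The sketch below prints one pts chown command per group listed on this page rather than running them; the variable CORPUS_GROUPS and the function name print_handover_cmds are hypothetical, and the group list has to be extended by hand if groups were added later.

```shell
#!/bin/sh
# Groups listed on this page; extend with any groups added later.
CORPUS_GROUPS="corpora-general corpora-tipster corpora-tdtpilot corpora-ppcme2 corpora-celex"

# Hypothetical helper: print the pts chown command for every group.
# Usage: print_handover_cmds OLD_CORPUS_TA NEW_CORPUS_TA
print_handover_cmds() {
    for ph_group in $CORPUS_GROUPS; do
        echo "pts chown $1:$ph_group $2"
    done
}
```

Calling print_handover_cmds oldta newta (both made-up IDs) prints five lines, which can be checked and then run one by one.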