User Guide
Contact SRCC staff for support at: srcc-support@stanford.edu, or post questions and concerns to the community discussion list at: farmshare-discuss@lists.stanford.edu.
Connecting
Log into rice.stanford.edu. Authentication is by SUNet ID and password (or GSSAPI), and two-step authentication is required. A suggested configuration for OpenSSH and recommendations for two popular SSH clients for Windows can be found in Advanced Connection Options.
Storage
FarmShare is not approved for use with high-risk data, including protected health information and personally identifiable information.
Home
Home directories are served (via NFS 4) from a dedicated file server, and per-user quota is currently 48 GB. Users may exceed this soft limit for up to 7 days, up to a hard limit of 64 GB.
AFS
AFS is accessible from rice systems only. A link to each user's AFS home directory, ~/afs-home, is provided as a convenience, but should only be used to access files in the legacy environment and for transferring data. It should not be used as a working directory when submitting batch jobs, as AFS is not accessible from compute nodes. Please note that a valid Kerberos ticket and an AFS token are required to access locations in AFS; run kinit && aklog to re-authenticate if you have trouble accessing any AFS directory.
The default per-user quota for AFS home directories is 5 GB, but you may have additional quota due to your enrollment in certain courses, and you can request additional quota (up to 20 GB total) with faculty sponsorship. AFS is backed up every night, and backups are kept for 30 days. The most recent snapshot of your AFS home directory is available in the .backup subdirectory, and you can request recovery from older backups by submitting a HelpSU ticket.
Scratch
Scratch storage is available in /farmshare/user_data, and each user is provided with a personal scratch directory, /farmshare/user_data/$USER. The total volume size is currently 126 TB; quotas are not currently enforced, but files not modified in the last 90 days are regularly purged, without warning. The scratch volume is not backed up, and is not suitable for long-term storage, but can be used as working storage for batch jobs, and as a short-term staging area for data waiting to be archived to permanent storage.
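As an illustration, a job might stage its data in scratch and copy results back to permanent storage when it finishes; a minimal sketch, with placeholder project and file names:

# Create a working directory in scratch (names are illustrative)
mkdir -p /farmshare/user_data/$USER/myproject
cp ~/input.dat /farmshare/user_data/$USER/myproject/
# ... run the job against the scratch copy ...
# Copy results back to home (or archive storage) before the 90-day purge
cp /farmshare/user_data/$USER/myproject/results.dat ~/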
Temp
Local /tmp storage is available on most nodes, but size varies from node to node. On rice systems, /tmp is 512 GB, with a per-user quota of 128 GB. Users may exceed this soft limit for up to 7 days, up to a hard limit of 192 GB, and space is regularly reclaimed from files older than 7 days.
File Transfer
Using SSH
FarmShare supports any file-transfer method that uses SSH as a transport, including standard tools like scp, sftp, and rsync on Linux and macOS systems, and SFTP clients like Fetch for macOS and SecureFX for Windows. Because two-step authentication is required, you may need to enable keep-alive in your preferred SFTP client to avoid repeated authentications. For Fetch, in the Preferences dialog, select General → FTP compatibility → Keep connections alive; for SecureFX, in Global Options, select File Transfer → Options → Advanced → Options → Keep connections alive.
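For example, from a Linux or macOS machine you could copy files to your FarmShare home directory with scp or rsync (sunetid and the file names are placeholders):

scp results.tar.gz sunetid@rice.stanford.edu:~/
rsync -av myproject/ sunetid@rice.stanford.edu:~/myproject/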
You can also use FUSE and SSHFS to mount your FarmShare home and scratch directories. Most Linux distributions provide a standard sshfs package. On macOS you can use Homebrew to install the osxfuse and sshfs packages, or download FUSE and SSHFS installers from the FUSE for macOS project. Support for this option on Windows typically requires commercial software (like ExpanDrive).
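A typical SSHFS invocation might look like the following (the local mount point and sunetid are placeholders; leaving the remote path empty mounts your remote home directory):

mkdir -p ~/farmshare
sshfs sunetid@rice.stanford.edu: ~/farmshare
# ... work with the files locally ...
fusermount -u ~/farmshare    # on Linux; use umount ~/farmshare on macOS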
Using AFS
You can use the native OpenAFS client to access files in AFS, including your AFS home directory. Most Linux distributions provide standard openafs packages. The University provides installers for the macOS and Windows clients.
You can also use WebAFS to transfer files between your computer and locations in AFS using a web browser.
Installed Software
FarmShare systems run Ubuntu 16.04 LTS, and most software is sourced from standard repositories. Additional software, including licensed software, is organized using environment modules and can be accessed using the module command. Please note that commercial software is licensed for use on FarmShare only in coursework, and not in research. Users can also build and/or install their own software in their home directories, either manually or using a local package manager. FarmShare supports running software packaged as Singularity containers.
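For example, to list the available modules and load one (the openmpi module is mentioned under MPI Jobs below; other module names will vary):

module avail
module load openmpi
module list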
Running Jobs
FarmShare uses Slurm for job management. Full documentation is available from the vendor, and detailed usage information is provided in the man pages for the srun, sbatch, squeue, scancel, sinfo, and scontrol commands.
Jobs are scheduled according to a priority which depends on a number of factors, including how long a job has been waiting, its size, and a fair-share value that tracks recent per-user utilization of cluster resources. Lower-priority jobs, and jobs requiring access to resources not currently available, may wait some time before starting to run. The scheduler may reserve resources so that pending jobs can start; while it will try to backfill these resources with smaller, shorter jobs (even those at lower priorities), this behavior can sometimes cause nodes to appear idle even when there are jobs ready to run. You can use squeue --start to get an estimate of when pending jobs will start.
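For example, to see estimated start times for your own pending jobs:

squeue --user=$USER --start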
Interactive Jobs
Interactive sessions that require resources in excess of the limits on the login nodes, exclusive access to resources, or access to a feature not available on the login nodes (e.g., a GPU) can be run on a compute node by submitting an interactive job:
srun --pty --qos=interactive $SHELL -l
Interactive jobs receive a modest priority boost compared to batch jobs, but when contention for resources is high interactive jobs may wait a long time before starting. Each user is allowed one interactive job, which may run for at most one day.
Batch Jobs
The sbatch command is used to submit a batch job, and takes a batch script as an argument. Options are used to request specific resources (including runtime), and can be provided either on the command line or, using a special syntax, in the script file itself. sbatch can also be used to submit many similar jobs, each perhaps varying in only one or two parameters, in a single invocation using the --array option; each job in an array has access to environment variables identifying its rank.
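A minimal batch script might look like the following (the job name, resource requests, and command are placeholders):

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Each array element sees its own rank in SLURM_ARRAY_TASK_ID
echo "Task ${SLURM_ARRAY_TASK_ID:-0} running on $(hostname)"

Saving this as example.sbatch (a placeholder name) and submitting it with --array would run several copies, each with a different rank:

sbatch --array=1-10 example.sbatch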
MPI Jobs
OpenMPI is installed, both as a package and (in a more recent version) as a module (openmpi). Intel MPI is also installed, as part of the Intel Parallel Studio module (intel). Because security concerns restrict allowed authentication methods, SSH cannot be used to launch MPI tasks; use srun instead.
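A sketch of an MPI batch script, assuming the openmpi module and a program named my_mpi_prog (a placeholder):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

module load openmpi
# Launch MPI ranks with srun rather than mpirun over SSH
srun ./my_mpi_prog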
Default Allocations
Default allocations vary by partition and quality-of-service, but in general a job will have access to 1 physical core (2 threads) and 8 GB of memory, and may run for up to 2 hours by default; interactive jobs may run for up to 1 hour by default. The default allocation on the bigmem partition is 1 core (2 threads) and 48 GB of memory.
If your job needs more resources than are provided by default, or access to a special feature (like large memory or a GPU), you must run on the appropriate partition (or quality-of-service) and request those resources explicitly. Common sbatch options include --partition, --qos, --cpus-per-task, --mem, --mem-per-cpu, --gres, and --time.
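For example, to request 4 cores, 16 GB of memory, and 8 hours of runtime for a batch job (the values and script name are illustrative):

sbatch --cpus-per-task=4 --mem=16G --time=08:00:00 example.sbatch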
Limits
Maximum runtime is 2 days unless jobs are scheduled using the long quality-of-service, which has a 7-day maximum runtime; interactive jobs have a maximum runtime of 1 day.
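For example, to submit a job that needs to run for up to 5 days (an illustrative value within the 7-day limit):

sbatch --qos=long --time=5-00:00:00 example.sbatch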
The gpu quality-of-service has a minimum GPU requirement (1), so you must request access to a GPU explicitly when submitting a job.
sbatch --partition=gpu --qos=gpu --gres=gpu:1
The bigmem quality-of-service has a minimum memory requirement; you must request at least 96 GB when submitting a job.
sbatch --partition=bigmem --qos=bigmem --mem=96G
Monitoring your Jobs
You can use the squeue and sacct commands to monitor the current state of the scheduler and of your jobs. The sprio command can provide some information on how priority was determined for particular jobs, and the sshare command on how current fair-share was calculated. Use the scontrol and sacctmgr commands to examine the configuration of hosts, partitions, and qualities-of-service.
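For example, to list your queued and running jobs, summarize recent job records from accounting, and check priority and fair-share information (output columns will vary):

squeue --user=$USER
sacct --user=$USER --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
sprio --user=$USER
sshare --users=$USER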