Test Cases
From FarmShare
This page collects "test cases" that users or sysadmins can run to verify the functionality of the barleys.
TC1: submit a job from a corn via /mnt/glusterfs
- cd /mnt/glusterfs/your_sunetid
- echo "hostname" | qsub -cwd
- qstat # check job status
- Check that the stderr output file is empty and the stdout output file contains the hostname of the machine that the job ran on
This test verifies that the shared filesystem is available and the job submission process works as expected.
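The final check above can be scripted. A minimal sketch, assuming the default STDIN.o&lt;jobid&gt;/STDIN.e&lt;jobid&gt; output names qsub uses for jobs piped in on stdin; it is demonstrated against mock output files so it runs without a live queue:

```shell
# Sketch: verify that an SGE job's stderr file is empty and its stdout
# file is non-empty. The STDIN.o12345/STDIN.e12345 names below are mock
# files standing in for real job output.
check_job_output() {
    out="$1"; err="$2"
    if [ -s "$err" ]; then
        echo "FAIL: stderr file is not empty"
    elif [ -s "$out" ]; then
        echo "OK: stdout contains: $(cat "$out")"
    else
        echo "FAIL: stdout file is empty"
    fi
}

# Mock job output, as if the job ran on barley06:
echo "barley06.stanford.edu" > STDIN.o12345
: > STDIN.e12345
check_job_output STDIN.o12345 STDIN.e12345   # OK: stdout contains: barley06.stanford.edu
```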
TC2: submit a job from a corn with AUKS support
- kinit / klist -f # check that your ticket is forwardable and renewable
- aklog / tokens # check that you have an AFS token
- echo "hostname" | qsub
- qstat # check job status
- Check that the stderr output file is empty and the stdout output file contains the hostname of the machine that the job ran on
This test verifies that AUKS handles the Kerberos/AFS tickets/tokens correctly.
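The klist flag check can be expressed as a quick pattern match. This is a sketch against a mocked "Flags:" line (the exact klist -f output format varies by Kerberos implementation); on a corn you would feed it the real klist -f output instead:

```shell
# Sketch: confirm the ticket flags include F (forwardable) and R (renewable).
# The flags_line value is a mock; in real use it would come from:
#   klist -f | grep Flags
flags_line='Flags: FRIA'
case "$flags_line" in
    *F*R*) echo "OK: ticket is forwardable and renewable" ;;
    *)     echo "FAIL: missing F or R flag" ;;
esac
```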
TC3: check memory tracking
R script:
$ cat R8GB.R
x <- array(1:1073741824, dim=c(1024,1024,1024))
x <- gaussian()
Sys.sleep(60)
submit script:
$ cat r_test.script
#!/bin/bash
# use the current directory
#$ -cwd
# mail this address
#$ -M chekh@stanford.edu
# send mail on begin, end, suspend
#$ -m bes
# get rid of spurious messages about tty/terminal types
#$ -S /bin/sh
R --vanilla --no-save < R8GB.R
- submit this job with 'qsub r_test.script' (with AUKS or not)
- check that you get an e-mail
- check that the e-mail correctly reports ~8GB maxvmem (our current version does not; this is a known bug, and we need to upgrade)
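For reference, the expected ~8GB figure follows from the array size, assuming R ends up storing the 1024^3 elements as 8-byte doubles:

```shell
# Why the job should peak near 8 GB: the array has 1024^3 elements, and at
# 8 bytes per element (double storage assumed) that is exactly 8 GiB.
elements=$((1024 * 1024 * 1024))
bytes=$((elements * 8))
gib=$((bytes / 1024 / 1024 / 1024))
echo "$elements elements -> $bytes bytes -> ${gib} GiB"   # 1073741824 elements -> 8589934592 bytes -> 8 GiB
```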
TC4: check time tracking
submit a job like
echo "sleep 3600" | qsub -cwd -m bes -M chekh@stanford.edu
Check that the job completion mail says 1hr of walltime elapsed
submit a job like
echo "sleep 3600" | qsub -cwd -m bes -M chekh@stanford.edu -l h_rt=72:00:00
Check that the job went into long.q
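The h_rt value is hours:minutes:seconds of requested runtime. A small helper to convert it to seconds makes it easy to compare against queue limits (the exact long.q threshold is not shown here):

```shell
# Convert an SGE h_rt value (HH:MM:SS) to seconds.
# The 10# prefix forces base-10 so leading zeros are not read as octal.
hrt_to_seconds() {
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

hrt_to_seconds 72:00:00   # 259200
hrt_to_seconds 01:00:00   # 3600
```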
TC5: check maximum job numbers
We currently have:
max_u_jobs 100
max_jobs 3000
for i in `seq -w 01 200`; do echo "sleep 600" | qsub -cwd ; done
This attempts to submit 200 jobs; only 100 of them should be accepted, and the rest should be rejected with an error like
Unable to run job: job rejected: Only 100 jobs are allowed per user (current job count: 100).
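The expected accounting can be sketched without a live queue by mocking the max_u_jobs=100 limit; the real test of course requires submitting through qsub:

```shell
# Mock of the TC5 accounting: a stand-in for qsub that enforces
# max_u_jobs=100, so 200 submissions split into 100 accepted / 100 rejected.
MAX_U_JOBS=100
current=0 accepted=0 rejected=0
for i in $(seq -w 01 200); do
    if [ "$current" -lt "$MAX_U_JOBS" ]; then
        current=$((current + 1))
        accepted=$((accepted + 1))
    else
        rejected=$((rejected + 1))
    fi
done
echo "accepted=$accepted rejected=$rejected"   # accepted=100 rejected=100
```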
TC6: check maximum number of tasks
We have the task settings set to defaults, see 'man 5 sge_conf' for more info:
$ qconf -sconf | grep aj
max_aj_instances 2000
max_aj_tasks 75000
Submit script:
$ cat task_test.script
#!/bin/bash
#$ -cwd
#$ -N task_test
#printenv |grep TASK
echo $JOB_NAME $SGE_TASK_ID on $HOSTNAME >> /mnt/glusterfs/chekh/test_cases/task_test_output
Check that you can submit 5 tasks: 'qsub -t 1-5 task_test.script'. Then check that you can submit 5000 tasks running at most 10 at a time: 'qsub -t 1-5000 -tc 10 task_test.script'. In each case, check that the output file contains what you expect.
Note: the documentation for our current version of SGE says "$TASK_ID", but the variable actually set is "$SGE_TASK_ID".
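The behavior of task_test.script can be mocked locally: fake the variables SGE sets for each array task, write to a local file instead of /mnt/glusterfs, and confirm the line count matches the number of tasks:

```shell
# Mock of task_test.script for 5 array tasks. In a real job, JOB_NAME,
# SGE_TASK_ID, and HOSTNAME are set by SGE; here we fake them and append
# to a local file instead of the GlusterFS path.
outfile=task_test_output
: > "$outfile"
JOB_NAME=task_test
HOSTNAME=${HOSTNAME:-localhost}
for SGE_TASK_ID in 1 2 3 4 5; do
    echo "$JOB_NAME $SGE_TASK_ID on $HOSTNAME" >> "$outfile"
done
wc -l < "$outfile"   # 5
```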
TC7: check disk throughput performance - local
Use the bonnie++ executable and run it on local disk with a submit script like this:
#!/bin/bash
#$ -m bes
#$ -M chekh@stanford.edu
#$ -cwd
BONNIE=/mnt/glusterfs/chekh/bonnie/bonnie++
echo $TMPDIR
$BONNIE -d $TMPDIR
Check that the performance numbers in the output roughly match these (my job ran in 2h40m):
$ cat bonnie.script.o24714
/tmp/24714.1.main.q
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
barley06.st 193760M   583  98 76227  22 40798  14  3514  97 96284  15 103.7  41
Latency             30420us   18681ms    1842ms   18715us     480ms     489ms
Version  1.96       ------Sequential Create------ --------Random Create--------
barley06.stanford.e -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  3487   4 +++++ +++ 22471  23 29117  37 +++++ +++ +++++ +++
Latency             13181us    1138us     552us    1048us      72us     483us
1.96,1.96,barley06.stanford.edu,1,1323906617,193760M,,583,98,76227,22,40798,14,3514,97,96284,15,103.7,41,16,,,,,3487,4,+++++,+++,22471,23,29117,37,+++++,+++,+++++,+++,30420us,18681ms,1842ms,18715us,480ms,489ms,13181us,1138us,552us,1048us,72us,483us
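The last line of the report is bonnie++'s machine-readable CSV. A quick way to pull out the two headline throughput numbers, assuming the standard 1.96 field order (field 10 = sequential block write, field 16 = sequential block read, both in K/sec):

```shell
# Extract sequential block write/read throughput from a bonnie++ 1.96 CSV
# line. The csv value is the first 19 fields of the output above; the
# trailing latency/create fields are trimmed for brevity.
csv='1.96,1.96,barley06.stanford.edu,1,1323906617,193760M,,583,98,76227,22,40798,14,3514,97,96284,15,103.7,41'
echo "$csv" | awk -F, '{ printf "block write: %s K/sec, block read: %s K/sec\n", $10, $16 }'
# block write: 76227 K/sec, block read: 96284 K/sec
```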
TC8: check disk throughput performance - gluster
Run bonnie++ against the GlusterFS filesystem with a submit script like this:
#!/bin/bash
#$ -m bes
#$ -M chekh@stanford.edu
#$ -cwd
BONNIE=/mnt/glusterfs/chekh/bonnie/bonnie++
DIR=/mnt/glusterfs/chekh/bonnie/temp
echo $DIR
$BONNIE -d $DIR
Your results should be something like this (mine ran for 3h44m):
/mnt/glusterfs/chekh/bonnie/temp
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
barley18.st 193760M    10  30 40498  21 27087  20  3481  96 291925 49 244.9  11
Latency              1131ms     925ms    2834ms   27065us     976ms     101ms
Version  1.96       ------Sequential Create------ --------Random Create--------
barley18.stanford.e -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   432   8  2945  16   554   7   450   7  1544   9   500   7
Latency               814ms    4220us   13577us   12176us    7681us     654ms
1.96,1.96,barley18.stanford.edu,1,1323897313,193760M,,10,30,40498,21,27087,20,3481,96,291925,49,244.9,11,16,,,,,432,8,2945,16,554,7,450,7,1544,9,500,7,1131ms,925ms,2834ms,27065us,976ms,101ms,814ms,4220us,13577us,12176us,7681us,654ms