Test Cases
From FarmShare
Latest revision as of 12:02, 15 June 2017
This page collects "test cases" that users or sysadmins can run to verify the functionality of the barley compute nodes.
TC1: submit a job from a corn via /mnt/glusterfs
- cd /mnt/glusterfs/your_sunetid
- echo "hostname" | qsub -cwd
- qstat # check job status
- Check that the stderr output file is empty and the stdout output file contains the hostname of the machine that the job ran on
This test verifies that the shared filesystem is available and the job submission process works as expected.
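The final check can be scripted; a rough sketch (check_job_output is a name of ours, not a grid engine tool, and it assumes the default STDIN.o&lt;jobid&gt;/STDIN.e&lt;jobid&gt; output file naming):

```shell
# Sketch of the TC1 output check: stderr must be empty, stdout must hold
# exactly one non-empty line (the hostname of the execution host).
# check_job_output is a hypothetical helper, not part of grid engine.
check_job_output() {
  out=$1; err=$2
  [ ! -s "$err" ] || { echo "FAIL: stderr is not empty"; return 1; }
  [ "$(grep -c . "$out")" -eq 1 ] || { echo "FAIL: unexpected stdout"; return 1; }
  echo "OK: job ran on $(cat "$out")"
}
# e.g. check_job_output STDIN.o12345 STDIN.e12345
```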
TC2: submit a job from a corn with AUKS support
- kinit / klist -f # check that your ticket is forwardable and renewable
- aklog / tokens # check that you have an AFS token
- echo "hostname" | qsub
- qstat # check job status
- Check that the stderr output file is empty and the stdout output file contains the hostname of the machine that the job ran on
This test verifies that AUKS handles the Kerberos/AFS tickets/tokens correctly.
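Step 1 can be automated by parsing the klist flags; a sketch, assuming the MIT klist output format where a ticket's entry carries a "Flags: ..." field (the helper name is ours):

```shell
# Hypothetical helper: read `klist -f` output on stdin and confirm the
# ticket flags include F (forwardable) and R (renewable).
check_ticket_flags() {
  flags=$(sed -n 's/.*Flags: \([A-Za-z]*\).*/\1/p' | head -1)
  case "$flags" in *F*) ;; *) echo "not forwardable"; return 1 ;; esac
  case "$flags" in *R*) ;; *) echo "not renewable"; return 1 ;; esac
  echo "forwardable and renewable"
}
# e.g. klist -f | check_ticket_flags
```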
TC3: check memory tracking
R script:
$ cat R8GB.R
x <- array(1:1073741824, dim=c(1024,1024,1024))
x <- gaussian();
Sys.sleep(60)
submit script:
$ cat r_test.script
#!/bin/bash
# use the current directory
#$ -cwd
# mail this address
#$ -M chekh@stanford.edu
# send mail on begin, end, suspend
#$ -m bes
# get rid of spurious messages about tty/terminal types
#$ -S /bin/sh
R --vanilla --no-save < R8GB.R
- submit this job with 'qsub r_test.script' (with AUKS or not)
- check that you get an e-mail
- check that the e-mail correctly reports ~8GB maxvmem (our current version does not; this is a known bug, and we need to upgrade)
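For reference, the ~8GB figure is just arithmetic on the array size; a back-of-the-envelope sketch (the factor-of-two peak is our assumption about R materializing a temporary copy while building the array, not something we've profiled):

```shell
# 1024^3 elements of 4-byte R integers = 4 GiB for the array itself;
# assume R also holds a temporary of the same size while building it,
# giving a peak of roughly 8 GiB (the expected maxvmem figure).
elements=$((1024 * 1024 * 1024))
array_gib=$((elements * 4 / 1024 / 1024 / 1024))
peak_gib=$((array_gib * 2))
echo "array ${array_gib} GiB, expected peak ~${peak_gib} GiB"
```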
TC4: check time tracking
submit a job like
echo "sleep 3600" | qsub -cwd -m bes -M chekh@stanford.edu
Check that the job completion mail reports about one hour of wallclock time
submit a job like
echo "sleep 3600" | qsub -cwd -m bes -M chekh@stanford.edu -l h_rt=72:00:00
Check that the job went into long.q
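To sanity-check the h_rt routing, converting the limit to seconds makes the comparison against a queue's time limit explicit; a small sketch (hrt_to_seconds is our name, and the 10# prefix stops bash from reading leading zeros as octal):

```shell
# Hypothetical helper: convert an h_rt value (HH:MM:SS) to seconds.
hrt_to_seconds() {
  IFS=: read -r h m s <<EOF
$1
EOF
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
# e.g. hrt_to_seconds 72:00:00
```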
TC5: check maximum job numbers
We currently have:
max_u_jobs           480
max_jobs             3000
for i in `seq -w 01 200`; do echo "sleep 600" | qsub -cwd ; done
That will submit 200 jobs; once the per-user limit (max_u_jobs) is reached, further submissions are rejected with an error like this one, captured when the limit was 100:
Unable to run job: job rejected: Only 100 jobs are allowed per user (current job count: 100).
TC6: check maximum number of tasks
We have the task settings set to defaults, see 'man 5 sge_conf' for more info:
$ qconf -sconf |grep aj
max_aj_instances             2000
max_aj_tasks                 75000
Submit script:
$ cat task_test.script
#!/bin/bash
#$ -cwd
#$ -N task_test
#printenv |grep TASK
echo $JOB_NAME $SGE_TASK_ID on $HOSTNAME >> /mnt/glusterfs/chekh/test_cases/task_test_output
You can check that you can submit 5 tasks: 'qsub -t 1-5 task_test.script'. Or check that you can submit 5000 tasks and run 10 at a time: 'qsub -t 1-5000 -tc 10 task_test.script'. Then check that the output file contains what you expect.
Note that our current version of SGE documents the variable as "$TASK_ID", but the job environment actually provides "$SGE_TASK_ID".
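Checking the output file can also be scripted; a sketch assuming the format produced by task_test.script above, one "$JOB_NAME $SGE_TASK_ID on $HOSTNAME" line per task (the helper name is ours):

```shell
# Hypothetical helper: verify that the task output file has a line for
# every task id from 1 to ntasks.
check_task_output() {
  file=$1; ntasks=$2
  for i in $(seq 1 "$ntasks"); do
    grep -q " $i on " "$file" || { echo "missing task $i"; return 1; }
  done
  echo "all $ntasks tasks present"
}
# e.g. check_task_output /mnt/glusterfs/chekh/test_cases/task_test_output 5
```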
TC7: check disk throughput performance - local
Use the bonnie++ executable and run it on local disk with a submit script like this:
#!/bin/bash
#$ -m bes
#$ -M chekh@stanford.edu
#$ -cwd
BONNIE=/mnt/glusterfs/chekh/bonnie/bonnie++
echo $TMPDIR
$BONNIE -d $TMPDIR
Check that the performance numbers in the output roughly match these, my job ran 2h40m:
$ cat bonnie.script.o24714
/tmp/24714.1.main.q
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
barley06.st 193760M   583  98 76227  22 40798  14  3514  97 96284  15 103.7  41
Latency             30420us   18681ms    1842ms   18715us     480ms     489ms
Version  1.96       ------Sequential Create------ --------Random Create--------
barley06.stanford.e -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  3487   4 +++++ +++ 22471  23 29117  37 +++++ +++ +++++ +++
Latency             13181us    1138us     552us    1048us      72us     483us
1.96,1.96,barley06.stanford.edu,1,1323906617,193760M,,583,98,76227,22,40798,14,3514,97,96284,15,103.7,41,16,,,,,3487,4,+++++,+++,22471,23,29117,37,+++++,+++,+++++,+++,30420us,18681ms,1842ms,18715us,480ms,489ms,13181us,1138us,552us,1048us,72us,483us
TC8: check disk throughput performance - glusterfs
Run the same bonnie++ test against the glusterfs filesystem with a submit script like this:
#!/bin/bash
#$ -m bes
#$ -M chekh@stanford.edu
#$ -cwd
BONNIE=/mnt/glusterfs/chekh/bonnie/bonnie++
DIR=/mnt/glusterfs/chekh/bonnie/temp
echo $DIR
$BONNIE -d $DIR
Your results should be something like this (mine ran for 3h44m):
/mnt/glusterfs/chekh/bonnie/temp
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
barley18.st 193760M    10  30 40498  21 27087  20  3481  96 291925 49 244.9  11
Latency              1131ms     925ms    2834ms   27065us     976ms     101ms
Version  1.96       ------Sequential Create------ --------Random Create--------
barley18.stanford.e -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   432   8  2945  16   554   7   450   7  1544   9   500   7
Latency              814ms    4220us    13577us   12176us    7681us     654ms
1.96,1.96,barley18.stanford.edu,1,1323897313,193760M,,10,30,40498,21,27087,20,3481,96,291925,49,244.9,11,16,,,,,432,8,2945,16,554,7,450,7,1544,9,500,7,1131ms,925ms,2834ms,27065us,976ms,101ms,814ms,4220us,13577us,12176us,7681us,654ms
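The last line of each report is bonnie++'s machine-readable CSV, which is easier to compare across runs than the table. A sketch pulling the sequential block write/read rates out of it; the field positions (10 and 16) are read off the 1.96-format lines above and should be double-checked against your own output:

```shell
# Extract sequential block write (field 10) and block read (field 16),
# both in K/sec, from a bonnie++ 1.96 CSV line.
csv='1.96,1.96,barley06.stanford.edu,1,1323906617,193760M,,583,98,76227,22,40798,14,3514,97,96284,15,103.7,41'
block_write=$(echo "$csv" | cut -d, -f10)
block_read=$(echo "$csv" | cut -d, -f16)
echo "block write ${block_write} K/sec, block read ${block_read} K/sec"
```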
TC9: check maximum running jobs limit
In grid engine, you can set a limit on how many jobs a single user can have in state 'r'. It's the 'maxujobs' setting in the scheduler config, which you can see with 'qconf -ssconf'.
If you submit a batch of jobs while the maximum number are already running, and scheduling info is turned on, you should see that the waiting jobs have a message like:
job dropped because of user limitations
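Spotting the message can be scripted; a sketch that scans scheduling info (e.g. the output of 'qstat -j <jobid>') on stdin (the helper name is ours):

```shell
# Hypothetical helper: report whether a pending job is being held back
# by the per-user running-job limit.
why_waiting() {
  if grep -q 'job dropped because of user limitations'; then
    echo "held by per-user running-job limit"
  else
    echo "no user-limit message found"
  fi
}
# e.g. qstat -j 12345 | why_waiting
```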
TC10: check mail to users
The qmaster sends mail based on its mailer config:
$ qconf -sconf |grep mail
mailer                       /usr/bin/mail
administrator_mail           its-sa-rc-reports@lists.stanford.edu
And that e-mail address should get notices about any failed jobs.
[chekh@corn10.stanford.edu] /farmshare/user_data/chekh [0]
$ kinit
Password for chekh@stanford.edu:
[chekh@corn10.stanford.edu] /farmshare/user_data/chekh [0]
$ echo "hostname" | qsub -cwd -m bea -M chekh@stanford.edu -l mem_free=1G
Your job 1430581 ("STDIN") has been submitted
You'll get an e-mail like this about the beginning of the job:
Job 1430581 (STDIN) Started
 User       = chekh
 Queue      = trusty.q
 Host       = barley01.stanford.edu
 Start Time = 06/15/2017 11:59:00
And one about the end:
Job 1430581 (STDIN) Complete
 User             = chekh
 Queue            = trusty.q@barley01.stanford.edu
 Host             = barley01.stanford.edu
 Start Time       = 06/15/2017 11:59:01
 End Time         = 06/15/2017 11:59:01
 User Time        = 00:00:00
 System Time      = 00:00:00
 Wallclock Time   = 00:00:00
 CPU              = 00:00:00
 Max vmem         = NA
 Exit Status      = 0
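The completion mail can also be checked mechanically for the exit status; a sketch (mail_exit_status is our name, and it assumes the "Exit Status = N" line format shown above):

```shell
# Hypothetical helper: read a completion mail on stdin and report whether
# the job exited cleanly, based on its "Exit Status" line.
mail_exit_status() {
  status=$(sed -n 's/.*Exit Status[[:space:]]*=[[:space:]]*//p' | head -1)
  if [ "$status" = "0" ]; then
    echo "job succeeded"
  else
    echo "job failed with status ${status:-unknown}"
  fi
}
# e.g. mail_exit_status < completion_mail.txt
```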