Test Cases

From FarmShare

Revision as of 11:02, 15 June 2017 by Chekh

This page collects test cases that users or sysadmins can run to verify the functionality of the barleys (the compute nodes).


TC1: submit a job from a corn via /mnt/glusterfs

  1. cd /mnt/glusterfs/your_sunetid
  2. echo "hostname" | qsub -cwd
  3. qstat # check job status
  4. Check that the stderr output file is empty and the stdout output file contains the hostname of the machine that the job ran on

This test verifies that the shared filesystem is available and the job submission process works as expected.
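The check in step 4 can be scripted. This is a sketch; the file names passed in are hypothetical examples (SGE names the output files <jobname>.o<jobid> and <jobname>.e<jobid>):

```shell
# Sketch of the step-4 check: the stderr file must be empty, and the
# stdout file should contain a single hostname-like line.
check_job_output() {
  local out="$1" err="$2"
  if [ -s "$err" ]; then
    echo "FAIL: stderr file is not empty"
    return 1
  fi
  if grep -q -E '^[A-Za-z0-9.-]+$' "$out"; then
    echo "PASS: stdout contains a hostname"
  else
    echo "FAIL: stdout does not look like a hostname"
    return 1
  fi
}

# e.g.: check_job_output STDIN.o1234 STDIN.e1234   (job ID is hypothetical)
```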

TC2: submit a job from a corn with AUKS support

  1. kinit / klist -f # check that your ticket is forwardable and renewable
  2. aklog / tokens # check that you have an AFS token
  3. echo "hostname" | qsub
  4. qstat # check job status
  5. Check that the stderr output file is empty and the stdout output file contains the hostname of the machine that the job ran on

This test verifies that AUKS handles the Kerberos/AFS tickets/tokens correctly.
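Step 1 can also be checked mechanically by looking for the F (forwardable) and R (renewable) letters among the ticket flags that 'klist -f' prints. A sketch; the sample "Flags:" line is an assumption standing in for real klist output:

```shell
# Sketch: inspect a "Flags:" line from klist -f for the F (forwardable)
# and R (renewable) flag letters.
has_auks_flags() {
  flags="${1#Flags: }"          # strip the label, keep the letters
  case "$flags" in *F*) : ;; *) echo "not forwardable"; return 1 ;; esac
  case "$flags" in *R*) : ;; *) echo "not renewable"; return 1 ;; esac
  echo "ticket is forwardable and renewable"
}

has_auks_flags "Flags: FRIA"   # sample flags line, not real klist output
```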

TC3: check memory tracking

R script:

$ cat R8GB.R 
x <- array(1:1073741824, dim=c(1024,1024,1024)) 
x <- gaussian(); 
Sys.sleep(60)

submit script:

$ cat r_test.script
#!/bin/bash

# use the current directory
#$ -cwd
# mail this address
#$ -M chekh@stanford.edu
# send mail on begin, end, suspend
#$ -m bes
# get rid of spurious messages about tty/terminal types
#$ -S /bin/sh

R --vanilla --no-save < R8GB.R 

  1. submit this job with 'qsub r_test.script' (with AUKS or not)
  2. check that you get an e-mail
  3. check that the e-mail correctly reports ~8GB maxvmem (our current version does not report this correctly; it's a known bug, and we need to upgrade)
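Independent of the e-mail, maxvmem can be read back from SGE's accounting with 'qacct -j <jobid>' once the job finishes. A minimal parsing sketch; the sample line mimics qacct output and its value is illustrative:

```shell
# Sketch: pull the maxvmem value out of a qacct-style "maxvmem ..." line.
parse_maxvmem() {
  echo "$1" | awk '/maxvmem/ {print $2}'
}

parse_maxvmem "maxvmem      8.004G"   # sample value; prints 8.004G
```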

TC4: check time tracking

submit a job like

 echo "sleep 3600" | qsub -cwd -m bes -M chekh@stanford.edu

Check that the job-completion mail reports roughly one hour of wallclock time.


submit a job like

 echo "sleep 3600" | qsub -cwd -m bes -M chekh@stanford.edu -l h_rt=72:00:00

Check that the job went into long.q
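For the first job above, the reported wallclock time can be checked against the 3600-second sleep with a little slack. A sketch; the sample time is a hypothetical value taken from a completion mail:

```shell
# Sketch: convert an HH:MM:SS wallclock value to seconds and compare it
# against the expected 3600s sleep, allowing some scheduling overhead.
hms_to_seconds() {
  echo "$1" | awk -F: '{print $1 * 3600 + $2 * 60 + $3}'
}

expected=3600
elapsed=$(hms_to_seconds "01:00:05")   # hypothetical value from the mail
if [ "$elapsed" -ge "$expected" ] && [ "$elapsed" -le $((expected + 300)) ]; then
  echo "walltime looks right: ${elapsed}s"
fi
```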

TC5: check maximum job numbers

We currently have:

 max_u_jobs                   480
 max_jobs                     3000

Submit a batch of 200 short jobs:

 for i in `seq -w 01 200`; do echo "sleep 600" | qsub -cwd ; done

That will submit 200 jobs. Once you hit the per-user job limit, the remaining submissions should be rejected with an error like:

 Unable to run job: job rejected: Only 100 jobs are allowed per user (current job count: 100).
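The expected accept/reject pattern can be illustrated with a stub function standing in for the real qsub (this is a mock, not the actual command; the limit of 100 matches the error quoted above — substitute the per-user limit your cluster actually enforces):

```shell
# Illustration only: a stub "qsub" that rejects jobs past a per-user limit,
# to show the accept/reject tally you should observe.
LIMIT=100
count=0
accepted=0
rejected=0

fake_qsub() {
  if [ "$count" -lt "$LIMIT" ]; then
    count=$((count + 1))
    return 0
  fi
  echo "Unable to run job: job rejected: Only $LIMIT jobs are allowed per user (current job count: $LIMIT)." >&2
  return 1
}

for i in $(seq -w 01 200); do
  if fake_qsub; then accepted=$((accepted + 1)); else rejected=$((rejected + 1)); fi
done
echo "accepted=$accepted rejected=$rejected"   # prints accepted=100 rejected=100
```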

TC6: check maximum number of tasks

The array-job task settings are at their defaults; see 'man 5 sge_conf' for more info:

$ qconf -sconf |grep aj
max_aj_instances             2000
max_aj_tasks                 75000

Submit script:

$ cat task_test.script 
#!/bin/bash

#$ -cwd
#$ -N task_test

#printenv |grep TASK

echo $JOB_NAME $SGE_TASK_ID on $HOSTNAME >> /mnt/glusterfs/chekh/test_cases/task_test_output

You can check that you can submit 5 tasks: 'qsub -t 1-5 task_test.script'. Or you can submit 5000 tasks and run at most 10 at a time: 'qsub -t 1-5000 -tc 10 task_test.script'. In either case, check that the output file contains what you expect.

Note that the documentation for our current version of SGE says "$TASK_ID", but the variable that is actually set is "$SGE_TASK_ID".
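The output-file check can be scripted. A sketch, assuming the output lines have the 'task_test <id> on <host>' shape produced by the script above:

```shell
# Sketch: verify the array-job output file has one line per task and that
# every task ID shows up.
check_task_output() {
  local file="$1" ntasks="$2"
  if [ "$(wc -l < "$file")" -ne "$ntasks" ]; then
    echo "wrong line count"; return 1
  fi
  for t in $(seq 1 "$ntasks"); do
    if ! grep -q "task_test $t on " "$file"; then
      echo "missing task $t"; return 1
    fi
  done
  echo "all $ntasks tasks accounted for"
}

# e.g.: check_task_output /mnt/glusterfs/chekh/test_cases/task_test_output 5
```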

TC7: check disk throughput performance - local

Use the bonnie++ executable and run it on local disk with a submit script like this:

#!/bin/bash

#$ -m bes
#$ -M chekh@stanford.edu
#$ -cwd

BONNIE=/mnt/glusterfs/chekh/bonnie/bonnie++

echo $TMPDIR

$BONNIE -d $TMPDIR

Check that the performance numbers in the output roughly match these (this job took 2h40m to run):

$ cat bonnie.script.o24714 
/tmp/24714.1.main.q
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
barley06.st 193760M   583  98 76227  22 40798  14  3514  97 96284  15 103.7  41
Latency             30420us   18681ms    1842ms   18715us     480ms     489ms
Version  1.96       ------Sequential Create------ --------Random Create--------
barley06.stanford.e -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  3487   4 +++++ +++ 22471  23 29117  37 +++++ +++ +++++ +++
Latency             13181us    1138us     552us    1048us      72us     483us
1.96,1.96,barley06.stanford.edu,1,1323906617,193760M,,583,98,76227,22,40798,14,3514,97,96284,15,103.7,41,16,,,,,3487,4,+++++,+++,22471,23,29117,37,+++++,+++,+++++,+++,30420us,18681ms,1842ms,18715us,480ms,489ms,13181us,1138us,552us,1048us,72us,483us
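The final CSV line is the easiest part of the bonnie++ output to compare programmatically. In this 1.96 CSV layout, field 10 is sequential block write and field 16 is sequential block read, both in K/sec (field positions inferred from the sample above; verify them against your bonnie++ version):

```shell
# Sketch: extract block write/read throughput from the bonnie++ CSV line
# (this is the CSV line from the sample run above).
csv='1.96,1.96,barley06.stanford.edu,1,1323906617,193760M,,583,98,76227,22,40798,14,3514,97,96284,15,103.7,41,16,,,,,3487,4,+++++,+++,22471,23,29117,37,+++++,+++,+++++,+++,30420us,18681ms,1842ms,18715us,480ms,489ms,13181us,1138us,552us,1048us,72us,483us'
echo "$csv" | awk -F, '{printf "block write: %s K/sec, block read: %s K/sec\n", $10, $16}'
```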

TC8: check disk throughput performance - shared fs

#!/bin/bash

#$ -m bes
#$ -M chekh@stanford.edu
#$ -cwd

BONNIE=/mnt/glusterfs/chekh/bonnie/bonnie++

DIR=/mnt/glusterfs/chekh/bonnie/temp

echo $DIR

$BONNIE -d $DIR

Your results should be something like this (mine ran for 3h44m):

/mnt/glusterfs/chekh/bonnie/temp
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
barley18.st 193760M    10  30 40498  21 27087  20  3481  96 291925  49 244.9  11
Latency              1131ms     925ms    2834ms   27065us     976ms     101ms
Version  1.96       ------Sequential Create------ --------Random Create--------
barley18.stanford.e -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   432   8  2945  16   554   7   450   7  1544   9   500   7
Latency               814ms    4220us   13577us   12176us    7681us     654ms
1.96,1.96,barley18.stanford.edu,1,1323897313,193760M,,10,30,40498,21,27087,20,3481,96,291925,49,244.9,11,16,,,,,432,8,2945,16,554,7,450,7,1544,9,500,7,1131ms,925ms,2834ms,27065us,976ms,101ms,814ms,4220us,13577us,12176us,7681us,654ms

TC9: check maximum running jobs limit

In grid engine, you can set a limit on how many jobs a single user can have in state 'r'. It's the 'maxujobs' setting in the scheduler config, which you can see with 'qconf -ssconf'.

If you submit a batch of jobs while the maximum number are already running, and scheduling info is turned on, you should see that the waiting jobs have a message like:

                            job dropped because of user limitations
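A quick way to confirm that a waiting job is held back by the user limit is to grep the scheduling info from 'qstat -j <jobid>'; the sample text below stands in for real qstat output:

```shell
# Sketch: look for the user-limit drop reason in scheduling info.
sched_info="scheduling info: job dropped because of user limitations"   # sample, not real qstat output
if echo "$sched_info" | grep -q "job dropped because of user limitations"; then
  echo "job is waiting because of the per-user limit"
fi
```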

TC10: check mail to users

The qmaster sends mail based on its mailer config:

$ qconf -sconf |grep mail
mailer                       /usr/bin/mail
administrator_mail           its-sa-rc-reports@lists.stanford.edu

And that e-mail address should get notices about any failed jobs.

  [chekh@corn10.stanford.edu] /farmshare/user_data/chekh [0] 
$ kinit
Password for chekh@stanford.edu: 
[chekh@corn10.stanford.edu] /farmshare/user_data/chekh [0] 
$ echo "hostname" | qsub -cwd -m bea -M chekh@stanford.edu -l mem_free=1G
Your job 1430581 ("STDIN") has been submitted

You'll get an e-mail like this about the beginning of the job:

Job 1430581 (STDIN) Started
 User       = chekh
 Queue      = trusty.q
 Host       = barley01.stanford.edu
 Start Time = 06/15/2017 11:59:00

And one about the end:

Job 1430581 (STDIN) Complete
 User             = chekh
 Queue            = trusty.q@barley01.stanford.edu
 Host             = barley01.stanford.edu
 Start Time       = 06/15/2017 11:59:01
 End Time         = 06/15/2017 11:59:01
 User Time        = 00:00:00
 System Time      = 00:00:00
 Wallclock Time   = 00:00:00
 CPU              = 00:00:00
 Max vmem         = NA
 Exit Status      = 0
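The completion mail can also be checked mechanically. A sketch that pulls the Exit Status out of the sample mail above:

```shell
# Sketch: extract Exit Status from a job-completion mail.
mail='Job 1430581 (STDIN) Complete
 User             = chekh
 Exit Status      = 0'
status=$(echo "$mail" | awk -F'= ' '/Exit Status/ {print $2}')
if [ "$status" = "0" ]; then
  echo "job exited cleanly"
fi
```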