Why isn't my job running?

From FarmShare

Revision as of 11:06, 12 April 2012 by Chekh (Talk | contribs)

Why is my job still in the queue and not running?

The resource manager cannot find a queue instance on any execution host with enough free resources to run your job.

But there are available processors!

What are you looking at? The output of 'qstat -g c' shows only "slots". Slots typically map to CPU cores, although we oversubscribe a bit on the barleys (28 slots on a 24-core machine). A job can also be waiting on resources other than slots, and those may not be available.
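As a minimal sketch of the slot check described above, the helper below wraps 'qstat -g c' so that it fails with a clear message on a machine without Grid Engine. The function name show_slot_summary is made up for this example.

```shell
# Print the per-cluster-queue slot summary.
# Slots are only one resource; free slots here do not guarantee a job can start.
# show_slot_summary is a hypothetical helper name, not a Grid Engine command.
show_slot_summary() {
    if command -v qstat >/dev/null 2>&1; then
        qstat -g c
    else
        echo "qstat not found; run this on a host with Grid Engine installed" >&2
        return 1
    fi
}
```

Run show_slot_summary on a login node and compare the free slot counts against what your job requests.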

Checking memory resources

We currently configure a mem_free "complex attribute" for the execution hosts. Grid Engine (GE) tracks this attribute as a consumable and also tracks the actual mem_free (free physical RAM) on each host. Each job requests 4GB of mem_free by default. So a host can have many free slots while its memory is already fully used. Use 'qstat -f -F' and look for the mem_free attributes of each host.
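The full 'qstat -f -F' listing is long, so a small filter can help. The sketch below assumes the usual Grid Engine layout: each queue instance line starts with a name of the form queue@host, and each resource line under it looks like hc:mem_free=VALUE (consumable) or hl:mem_free=VALUE (measured load value). The function name list_mem_free is made up for this example, and the layout on a given installation may differ.

```shell
# Read 'qstat -F mem_free' output on stdin and print one line per
# mem_free value, prefixed with the queue instance it belongs to.
# list_mem_free is a hypothetical helper name; the input layout is an assumption.
list_mem_free() {
    awk '
        # A queue instance line: first field looks like queue@host.
        $1 ~ /@/ { instance = $1; next }
        # A resource line: two-letter prefix, then mem_free=VALUE.
        $1 ~ /^[a-z][a-zA-Z]:mem_free=/ { print instance, $1 }
    '
}
```

A typical call would be 'qstat -F mem_free | list_mem_free'. A host whose consumable mem_free is below what your job requests cannot take the job, even if it shows free slots.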

Asking for help

This stuff is not so straightforward, so if you need a specific explanation, file a ticket and include your job id. It also helps to include the output of 'qstat -f -j JOBID'.
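To make the ticket easier to write, the output can be saved to a file first. This is only a sketch: the function name collect_job_info and the output file name are made up for this example.

```shell
# Save the output of 'qstat -f -j JOBID' to a file for attaching to a ticket.
# collect_job_info and the job-JOBID-info.txt name are hypothetical.
collect_job_info() {
    jobid="$1"
    if [ -z "$jobid" ]; then
        echo "usage: collect_job_info JOBID" >&2
        return 2
    fi
    outfile="job-${jobid}-info.txt"
    # Capture stderr too, so error messages from qstat end up in the file.
    qstat -f -j "$jobid" > "$outfile" 2>&1
    # Print the file name so the caller knows what to attach.
    echo "$outfile"
}
```

For example, 'collect_job_info 12345' writes job-12345-info.txt in the current directory and prints that file name.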
