Cluster compute nodes are inherently not guaranteed resources. Many things can cause downtime or unavailability of cluster nodes, typically combined with a loss of the corresponding processes and results. An ideal workflow manager is aware of such error conditions and, ideally, can cope with them.
For these conditions, the workflow manager may need additional checks to judge whether a job is still behaving "well". E.g. if a job is still reported as running after the maximum allowed runtime for its queue, a ping check of the machine and an ssh check of the running processes might be required.
Example:
#> qstat
job-ID  prior    name        user    state  submit/start at      queue                           slots  ja-task-ID
-------------------------------------------------------------------------------------------------------------------
7989183 0.11790  Cluster_Mo  mx_nas  r      08/27/2012 12:10:56  fs_long@gopt-131.cluster.bc2.c  1
7998792 0.11790  Cluster_Mo  mx_nas  r      08/27/2012 14:07:40  fs_short@gopt-79.cluster.bc2.c  1
7998808 0.11790  Cluster_Mo  mx_nas  r      08/27/2012 14:07:56  fs_short@gopt-79.cluster.bc2.c  1

~> ssh gopt-79
#> ps aux|grep mx_nas
root      20722  0.0  0.0  67260  3076  ?      Ss  08:50  0:00  sshd: mx_nas [priv]
mx_nas    20724  0.0  0.0  67260  1604  ?      S   08:50  0:00  sshd: mx_nas@pts/0
mx_nas    20725  0.6  0.0  73136  2204  pts/0  Ss  08:50  0:00  -tcsh
mx_nas    20837  0.0  0.0  69848  1052  pts/0  R+  08:50  0:00  ps aux
mx_nas    20838  0.0  0.0  61172  676   pts/0  S+  08:50  0:00  grep mx_nas
Possible check steps:
1. Ping the node on which the job is reported to be running.
2. ssh to the node and check whether the job's processes are still alive.
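A minimal sketch of these two checks in Python, assuming password-less ssh from the workflow manager host to the compute nodes; the node name and process pattern used in the usage comment are only illustrative:

import subprocess

def node_reachable(node, timeout_s=5):
    """Ping check: True if the node answers a single ICMP echo request."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), node],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def job_processes_alive(node, pattern):
    """ssh check: True if a process matching `pattern` is running on the node."""
    # pgrep -f exits with 0 if at least one matching process exists.
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", node,
         "pgrep", "-f", pattern],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# Example with the node and job name from the output above:
# if node_reachable("gopt-79") and not job_processes_alive("gopt-79", "Cluster_Mo"):
#     # node is up, but the job's processes are gone -> flag the job as lost
#     pass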
Behaviour of the different DRMSs:
DRMAA in SGE
If the job ID is no longer found in the qstat list of batch jobs, DRMAA throws an InvalidJobException reporting that the specified job ID does not exist.
It seems that qacct is the only utility that will continue to acknowledge a job ID after the job has finished.
It seems that there is no way to access the information available in qacct through DRMAA (see also DRMAA Internals for more information).
The current implementation in iBRAIN2 assumes that, if the job ID is not found, the job already finished successfully (no error).
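A minimal sketch of this logic with the Python DRMAA binding (drmaa-python); that iBRAIN2 uses this particular binding is an assumption here, and the qacct fallback is only hinted at in a comment:

import drmaa
from drmaa.errors import InvalidJobException

def job_state(session, job_id):
    """Return the DRMAA state of a job, or DONE if SGE no longer knows the ID."""
    try:
        return session.jobStatus(job_id)
    except InvalidJobException:
        # The job ID is gone from qstat; following the current iBRAIN2
        # assumption, we treat it as finished successfully.  A stricter check
        # would have to shell out to `qacct -j <job_id>`, since that
        # information is not reachable through DRMAA.
        return drmaa.JobState.DONE

# Usage:
# session = drmaa.Session()
# session.initialize()
# print(job_state(session, "7989183"))
# session.exit()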
LSF
In this case LSF returns an error message on stderr, like "Job <231231> is not found". This can happen either because the job was submitted only a few seconds ago and the LSF system has not spread the information yet, or because the job finished more than 1 hour before the status query and LSF has already forgotten it.
The current implementation handles this case by checking whether the job was submitted more than 1 hour ago: if so, the job is considered finished; otherwise its status is kept as pending.
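A minimal sketch of this decision rule; the one-hour threshold follows the description above, while the function and parameter names are illustrative:

from datetime import datetime, timedelta

FORGET_AFTER = timedelta(hours=1)  # LSF drops finished jobs from bjobs after ~1 h

def status_for_unknown_job(submit_time, now=None):
    """Decide how to treat a 'Job <id> is not found' answer from bjobs."""
    now = now or datetime.now()
    if now - submit_time > FORGET_AFTER:
        # Submitted long ago: bjobs has forgotten it, so assume it finished.
        return "finished"
    # Submitted only moments ago: LSF may not have propagated the job yet.
    return "pending"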
After submitting a job, the workflow manager keeps track of the cluster job ID returned by the queue manager. This allows the job to be monitored and controlled.
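On LSF, for instance, the job ID can be taken from the confirmation line that bsub prints ("Job <id> is submitted to queue <...>"). A minimal sketch; the submitted command in the usage comment is just a placeholder:

import re
import subprocess

def submit_lsf_job(command):
    """Submit `command` with bsub and return the cluster job ID as a string."""
    out = subprocess.run(
        ["bsub"] + command,
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"Job <(\d+)> is submitted", out)
    if match is None:
        raise RuntimeError("could not parse job ID from bsub output: " + out)
    return match.group(1)

# job_id = submit_lsf_job(["./run_analysis.sh"])   # placeholder command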
In the case of the LSF platform, the current iBRAIN2 implementation checks the status of a job by calling bjobs with the job ID. This results in a lot of status queries and a high load on the database of the LSF platform, which was criticized by the administrators.
Since the bjobs -a command returns the status of all jobs of the given user, one possible solution would be to cache this information for a short time (e.g. 30 seconds) and answer the per-job queries from this cache.
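A minimal sketch of such a cache, assuming the default tabular bjobs output (JOBID as the first column, STAT as the third); the 30-second lifetime follows the suggestion above:

import subprocess
import time

CACHE_LIFETIME_S = 30
_cache = {"fetched_at": 0.0, "status_by_id": {}}

def _refresh_cache():
    """Run `bjobs -a` once and parse job ID -> status from its tabular output."""
    out = subprocess.run(["bjobs", "-a"], capture_output=True, text=True).stdout
    status_by_id = {}
    for line in out.splitlines()[1:]:            # skip the header line
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit():
            status_by_id[fields[0]] = fields[2]  # JOBID -> STAT (e.g. RUN, DONE, EXIT)
    _cache["fetched_at"] = time.time()
    _cache["status_by_id"] = status_by_id

def job_status(job_id):
    """Per-job query answered from the cache; bjobs is called at most every 30 s."""
    if time.time() - _cache["fetched_at"] > CACHE_LIFETIME_S:
        _refresh_cache()
    return _cache["status_by_id"].get(str(job_id))  # None if LSF no longer knows the job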
If a job fails on a particular node, submit it to a different node the next time. SGE supports targeting jobs to a particular node within a queue, for example:
qsub -q fs_short@gopt-79 .........
More information in BC2.
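A minimal sketch of a resubmission that avoids the failing host by pinning the job to another queue instance with qsub -q <queue>@<host>; the host list and job script in the usage comment are placeholders:

import subprocess

def resubmit_avoiding(job_script, queue, failed_host, candidate_hosts):
    """Resubmit `job_script`, targeting the first candidate host that is not the failed one."""
    for host in candidate_hosts:
        if host != failed_host:
            out = subprocess.run(
                ["qsub", "-q", "%s@%s" % (queue, host), job_script],
                capture_output=True, text=True, check=True,
            ).stdout
            return out  # qsub's confirmation, e.g. 'Your job ... has been submitted'
    raise RuntimeError("no alternative host available")

# resubmit_avoiding("job.sh", "fs_short", failed_host="gopt-79",
#                   candidate_hosts=["gopt-79", "gopt-80", "gopt-81"])  # placeholders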
Currently, the workflow manager would need implementations for the following job schedulers:
Some history
Current commercial distributions:
Current open source distributions: