Cluster compute nodes are inherently not guaranteed resources. Many things can cause downtime or unavailability of cluster nodes, typically combined with a loss of the corresponding processes and results. An ideal workflow manager is aware of such error conditions and, ideally, can cope with them.
For these conditions, the workflow manager may need additional checks to judge whether a job is still behaving "well". E.g. if a job is still reported as running after the maximum allowed runtime for its queue, a ping check of the machine and an ssh check of the running processes might be required.
Example:
#> qstat
job-ID  prior    name        user    state  submit/start at      queue                           slots  ja-task-ID
-------------------------------------------------------------------------------------------------------------------
7989183 0.11790  Cluster_Mo  mx_nas  r      08/27/2012 12:10:56  fs_long@gopt-131.cluster.bc2.c  1
7998792 0.11790  Cluster_Mo  mx_nas  r      08/27/2012 14:07:40  fs_short@gopt-79.cluster.bc2.c  1
7998808 0.11790  Cluster_Mo  mx_nas  r      08/27/2012 14:07:56  fs_short@gopt-79.cluster.bc2.c  1

~> ssh gopt-79
#> ps aux|grep mx_nas
root      20722  0.0  0.0  67260  3076  ?      Ss  08:50  0:00  sshd: mx_nas [priv]
mx_nas    20724  0.0  0.0  67260  1604  ?      S   08:50  0:00  sshd: mx_nas@pts/0
mx_nas    20725  0.6  0.0  73136  2204  pts/0  Ss  08:50  0:00  -tcsh
mx_nas    20837  0.0  0.0  69848  1052  pts/0  R+  08:50  0:00  ps aux
mx_nas    20838  0.0  0.0  61172  676   pts/0  S+  08:50  0:00  grep mx_nas
Possible check steps:
1. Ping the node on which the job is reported to be running.
2. ssh to the node and check whether the job's processes are still alive.
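A minimal sketch of these two checks in Python, assuming password-less ssh from the workflow manager host to the compute nodes; the node name and process pattern used in the usage comment are only illustrative:

import subprocess

def node_reachable(node, timeout_s=5):
    """Ping check: True if the node answers a single ICMP echo request."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), node],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def job_processes_alive(node, pattern):
    """ssh check: True if a process matching `pattern` is running on the node."""
    # pgrep -f exits with 0 if at least one matching process exists.
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", node,
         "pgrep", "-f", pattern],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# Example with the node and job name from the output above:
# if node_reachable("gopt-79") and not job_processes_alive("gopt-79", "Cluster_Mo"):
#     # node is up, but the job's processes are gone -> flag the job as lost
#     pass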
Behaviour of the different DRMSs:
DRMAA in SGE
If the job ID is no longer found in the qstat list of batch jobs, DRMAA throws an InvalidJobException reporting that the specified job ID does not exist.
It seems that qacct is the only utility that will continue to acknowledge a job ID after the job has finished.
It seems that there is no way to access the information available in qacct through DRMAA (see also DRMAA Internals for more information).
The current implementation in iBRAIN2 assumes that, if the job ID is not found, the job already finished successfully (no error).
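A minimal sketch of this logic with the Python DRMAA binding (drmaa-python); that iBRAIN2 uses this particular binding is an assumption here, and the qacct fallback is only hinted at in a comment:

import drmaa
from drmaa.errors import InvalidJobException

def job_state(session, job_id):
    """Return the DRMAA state of a job, or DONE if SGE no longer knows the ID."""
    try:
        return session.jobStatus(job_id)
    except InvalidJobException:
        # The job ID is gone from qstat; following the current iBRAIN2
        # assumption, we treat it as finished successfully.  A stricter check
        # would have to shell out to `qacct -j <job_id>`, since that
        # information is not reachable through DRMAA.
        return drmaa.JobState.DONE

# Usage:
# session = drmaa.Session()
# session.initialize()
# print(job_state(session, "7989183"))
# session.exit()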
LSF
In this case LSF returns an error message on stderr, like "Job <231231> is not found". This can happen either because the job was submitted only a few seconds ago and the LSF system has not spread the information yet, or because the job finished more than 1 hour before the status query and LSF has already forgotten it.
The current implementation handles this case by checking whether the job was submitted more than 1 hour ago: if so, the job is considered finished; otherwise its status is kept as pending.
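A minimal sketch of this decision rule; the one-hour threshold follows the description above, while the function and parameter names are illustrative:

from datetime import datetime, timedelta

FORGET_AFTER = timedelta(hours=1)  # LSF drops finished jobs from bjobs after ~1 h

def status_for_unknown_job(submit_time, now=None):
    """Decide how to treat a 'Job <id> is not found' answer from bjobs."""
    now = now or datetime.now()
    if now - submit_time > FORGET_AFTER:
        # Submitted long ago: bjobs has forgotten it, so assume it finished.
        return "finished"
    # Submitted only moments ago: LSF may not have propagated the job yet.
    return "pending"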
After submitting a job, the workflow manager keeps track of the cluster job ID returned by the queue manager. This allows the job to be monitored and controlled.
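On LSF, for instance, the job ID can be taken from the confirmation line that bsub prints ("Job <id> is submitted to queue <...>"). A minimal sketch; the submitted command in the usage comment is just a placeholder:

import re
import subprocess

def submit_lsf_job(command):
    """Submit `command` with bsub and return the cluster job ID as a string."""
    out = subprocess.run(
        ["bsub"] + command,
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"Job <(\d+)> is submitted", out)
    if match is None:
        raise RuntimeError("could not parse job ID from bsub output: " + out)
    return match.group(1)

# job_id = submit_lsf_job(["./run_analysis.sh"])   # placeholder command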
In the case of the LSF platform, the current iBRAIN2 implementation checks the status of a job by calling bjobs with the job ID. This results in a lot of status queries and a high load on the database of the LSF platform, which was criticized by the administrators.
Since the bjobs -a command returns the status of all jobs of the given user, one possible solution would be to cache this information for a short time (e.g. 30 seconds) and answer the per-job queries from this cache.
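A minimal sketch of such a cache, assuming the default tabular bjobs output (JOBID as the first column, STAT as the third); the 30-second lifetime follows the suggestion above:

import subprocess
import time

CACHE_LIFETIME_S = 30
_cache = {"fetched_at": 0.0, "status_by_id": {}}

def _refresh_cache():
    """Run `bjobs -a` once and parse job ID -> status from its tabular output."""
    out = subprocess.run(["bjobs", "-a"], capture_output=True, text=True).stdout
    status_by_id = {}
    for line in out.splitlines()[1:]:            # skip the header line
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit():
            status_by_id[fields[0]] = fields[2]  # JOBID -> STAT (e.g. RUN, DONE, EXIT)
    _cache["fetched_at"] = time.time()
    _cache["status_by_id"] = status_by_id

def job_status(job_id):
    """Per-job query answered from the cache; bjobs is called at most every 30 s."""
    if time.time() - _cache["fetched_at"] > CACHE_LIFETIME_S:
        _refresh_cache()
    return _cache["status_by_id"].get(str(job_id))  # None if LSF no longer knows the job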
If a job fails on a particular node, submit it to a different node the next time. SGE supports targeting jobs to a particular node within a queue, for example:
qsub -q fs_short@gopt-79 .........
More information in BC2.
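A minimal sketch of a resubmission that avoids the failing host by pinning the job to another queue instance with qsub -q <queue>@<host>; the host list and job script in the usage comment are placeholders:

import subprocess

def resubmit_avoiding(job_script, queue, failed_host, candidate_hosts):
    """Resubmit `job_script`, targeting the first candidate host that is not the failed one."""
    for host in candidate_hosts:
        if host != failed_host:
            out = subprocess.run(
                ["qsub", "-q", "%s@%s" % (queue, host), job_script],
                capture_output=True, text=True, check=True,
            ).stdout
            return out  # qsub's confirmation, e.g. 'Your job ... has been submitted'
    raise RuntimeError("no alternative host available")

# resubmit_avoiding("job.sh", "fs_short", failed_host="gopt-79",
#                   candidate_hosts=["gopt-79", "gopt-80", "gopt-81"])  # placeholders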
Currently, the workflow manager would need implementations for the following job schedulers:
Some history
Current commercial distributions:
Current open source distributions: