====== Cluster Job Management and Visualization ======

===== Cluster Status Inspection =====

Cluster compute nodes are intrinsically not guaranteed resources. Many reasons can lead to downtime or unavailability of cluster nodes, typically combined with a loss of the corresponding processes and results. An ideal workflow manager needs to be aware of certain error conditions and ideally should be able to cope with them, for example:

  * SGE reports a job still in "running" state, even though it terminated
  * SGE reports a job still in "running" state, even though the machine is unavailable

For these conditions, additional checks might be required before the workflow manager can judge whether a job is still behaving correctly. For example, if a job is reported as running beyond the maximum allowed runtime of its queue, a ping check of the machine and an SSH check of the running processes might be required.

==== Observed difficult RMS (Resource Management System) behaviours ====

=== The RMS reports that a job is still running but in fact it is not ===

**Example:**

<code>
#> qstat
job-ID  prior    name        user    state  submit/start at      queue                           slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
7989183 0.11790  Cluster_Mo  mx_nas  r      08/27/2012 12:10:56  fs_long@gopt-131.cluster.bc2.c  1
7998792 0.11790  Cluster_Mo  mx_nas  r      08/27/2012 14:07:40  fs_short@gopt-79.cluster.bc2.c  1
7998808 0.11790  Cluster_Mo  mx_nas  r      08/27/2012 14:07:56  fs_short@gopt-79.cluster.bc2.c  1

~> ssh gopt-79
#> ps aux|grep mx_nas
root     20722  0.0  0.0  67260  3076 ?      Ss  08:50  0:00  sshd: mx_nas [priv]
mx_nas   20724  0.0  0.0  67260  1604 ?      S   08:50  0:00  sshd: mx_nas@pts/0
mx_nas   20725  0.6  0.0  73136  2204 pts/0  Ss  08:50  0:00  -tcsh
mx_nas   20837  0.0  0.0  69848  1052 pts/0  R+  08:50  0:00  ps aux
mx_nas   20838  0.0  0.0  61172   676 pts/0  S+  08:50  0:00  grep mx_nas
</code>

**Possible check steps:**

  - Wait until the RMS-reported running time exceeds the maximum allowed runtime for the queue to which the job was submitted. E.g. a job submitted to the short queue "fs_short" should not be running for longer than 30 minutes.
  - If the RMS still reports the job as running, access the machine through SSH and check whether the process is still running by using the ps command.

=== The RMS reports that a job is still running but the machine is unavailable ===

**Possible check steps:**

  - Wait until the RMS-reported running time exceeds the maximum allowed runtime for the queue to which the job was submitted. E.g. a job submitted to the short queue "fs_short" should not be running for longer than 30 minutes.
  - If the RMS still reports the job as running, check that the machine is still reachable by using the ping command.

=== The RMS reports nothing about a submitted job ===

Behaviour of the different RMSs:

**DRMAA in SGE**\\
If the job ID is no longer found in the qstat list of batch jobs, DRMAA throws an InvalidJobException reporting that the specified job ID does not exist. It seems that qacct is the only utility that continues to acknowledge a job ID after the job has finished, and that there is no way to access the information available in qacct through DRMAA (see also DRMAA Internals for more information).\\
The current implementation in iBRAIN2 assumes that, if the job ID is not found, the job already finished successfully (no error).

**LSF**\\
In this case LSF prints an error message on stderr, like "Job <231231> is not found". This can happen either because the job was submitted only a few seconds ago and LSF has not yet spread the information, or because the job finished more than one hour before the status query and LSF has already forgotten it. The current implementation checks whether the job was submitted more than one hour ago: if so, the job is considered finished; otherwise the job keeps its pending status.
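A minimal sketch of how these two "unknown job" behaviours could be handled, assuming the DRMAA 1.0 Java bindings (org.ggf.drmaa) linked in the Documentation section below; the class, enum, and method names are hypothetical and not part of iBRAIN2:

<code java>
import org.ggf.drmaa.DrmaaException;
import org.ggf.drmaa.InvalidJobException;
import org.ggf.drmaa.Session;

// Hypothetical helper illustrating the checks described above.
public class UnknownJobStatusPolicy {

    public enum Verdict { FINISHED, STILL_PENDING, RUNNING_OR_QUEUED }

    /** SGE via DRMAA: an unknown job ID raises InvalidJobException,
        which the current iBRAIN2 behaviour maps to "finished". */
    public static Verdict checkSgeJob(Session session, String jobId) throws DrmaaException {
        try {
            int status = session.getJobProgramStatus(jobId);
            return (status == Session.DONE || status == Session.FAILED)
                    ? Verdict.FINISHED : Verdict.RUNNING_OR_QUEUED;
        } catch (InvalidJobException e) {
            // qstat no longer knows the job and DRMAA cannot reach qacct,
            // so assume the job already finished.
            return Verdict.FINISHED;
        }
    }

    /** LSF: bjobs prints "Job <id> is not found" on stderr. Decide by job age. */
    public static Verdict checkLsfJob(boolean jobNotFoundOnStderr, long submitTimeMillis) {
        final long oneHour = 60L * 60L * 1000L;
        if (!jobNotFoundOnStderr) {
            return Verdict.RUNNING_OR_QUEUED;
        }
        // Submitted more than one hour ago: LSF has already forgotten the finished job.
        // Otherwise the submission may simply not have propagated yet.
        return (System.currentTimeMillis() - submitTimeMillis > oneHour)
                ? Verdict.FINISHED : Verdict.STILL_PENDING;
    }
}
</code>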
===== Job Submission and Controlling For Cluster Jobs =====

After submitting jobs, the workflow manager keeps track of the cluster job ID provided by the queue manager. This allows the job to be monitored and controlled.

==== Cluster-friendly status queries ====

In case of the LSF platform, the current iBRAIN2 implementation checks the status of each job by calling bjobs with the job ID. This results in a lot of status queries and a high load on the database of the LSF platform, which was criticized by the administrators. Since the bjobs -a command returns the status of all jobs of the given user, one possible solution would be to cache this information for a short time (e.g. 30 seconds) and serve the per-job queries from that cache, as sketched below.
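A minimal sketch of such a cache in Java; the class name is hypothetical, and the parsing assumes the default bjobs -a column layout (JOBID, USER, STAT, ...):

<code java>
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache around the output of "bjobs -a"; not part of iBRAIN2.
public class LsfStatusCache {

    private static final long MAX_AGE_MILLIS = 30_000; // refresh at most every ~30 s

    private Map<String, String> statusByJobId = new HashMap<>();
    private long lastRefresh = 0;

    /** Returns the cached LSF status (e.g. RUN, PEND, DONE, EXIT) for one job,
        refreshing the whole per-user listing only when the cache is stale. */
    public synchronized String getStatus(String jobId) throws Exception {
        if (System.currentTimeMillis() - lastRefresh > MAX_AGE_MILLIS) {
            refresh();
        }
        return statusByJobId.get(jobId); // null if LSF no longer reports the job
    }

    private void refresh() throws Exception {
        Map<String, String> fresh = new HashMap<>();
        Process p = new ProcessBuilder("bjobs", "-a").start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine(); // skip the header line
            while ((line = r.readLine()) != null) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length > 2) {
                    fresh.put(fields[0], fields[2]); // JOBID and STAT columns
                }
            }
        }
        p.waitFor();
        statusByJobId = fresh;
        lastRefresh = System.currentTimeMillis();
    }
}
</code>

With a 30-second lifetime the listing is refreshed at most twice per minute, regardless of how many jobs are being polled, which keeps the load on the LSF database low.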
==== Recovering Cluster Jobs ====

  * Check the status of the non-terminated jobs.
  * If a job terminated successfully, check that the expected outputs were produced by using the output validation checks in the workflow.xml.
  * If the job is still running, continue polling the task status until completion is reached.
  * If the job terminated in error, resubmit the job (the maximum number of allowed resubmissions is set in the system.config file).
  * If the maximum number of resubmissions is reached, proceed with the current mechanism of raising a warning and continuing workflow execution, or raising an error and stopping execution (also specified in the workflow.xml).

==== Possible Resubmission Strategy ====

If a job fails on a particular node, submit it the next time to another one. SGE supports targeting jobs to a particular node within a queue, for example:

<code>
qsub -q fs_short@gopt-79 .........
</code>

More information in [[http://www.bc2.unibas.ch/BC2/manuals/user/SGE/sge_guide.html|BC2]].

==== Terminating Cluster Jobs ====

  * Check which jobs have not terminated.
  * Terminate running jobs and remove non-started jobs from the queue.
  * Also kill the processes which have been started by the job.

==== Cluster Queue Status View ====

  * Visualize the number of running workflow manager jobs
  * Visualize the number of queued workflow manager jobs
  * Visualize the total number of jobs running on the cluster
  * Visualize the estimated wait time

===== Documentation =====

==== Used queue managers ====

Currently the workflow manager needs to support the following job schedulers:

  * SGE (through DRMAA), currently installed version in bc2: GE 6.1u6, with DRMAA version 1.0 and DRMS GE 6.1u6
  * Platform LSF

==== DRMAA ====

[[http://www.drmaa.org/index.php|DRMAA Website]]\\
[[http://gridscheduler.sourceforge.net/javadocs/|DRMAA Java API documentation]]\\
[[https://blogs.oracle.com/templedf/entry/drmaa_internals1|DRMAA Internals]]

==== Sun Grid Engine 6.1 ====

[[http://docs.oracle.com/cd/E19957-01/820-0700/enterprisenotes-1/index.html|Sun N1 Grid Engine 6.1]]\\
{{:beewm:devel:820-0699.pdf|Sun N1 Grid Engine 6.1 User's Guide}}\\
[[http://www.bioteam.net/wp-content/uploads/2008/02/sge-quick-reference_v3_a4.pdf|Grid Engine 6.1 Quick Reference]]

==== From Sun Grid Engine (SGE) to Oracle Grid Engine ====

**Some history**

  * [[http://www.softpanorama.org/Admin/Job_schedulers/grid_engine.shtml|Oracle Grid Engine on Server Farms]]

**Current commercial distributions:**

  * [[http://www.univa.com/products/grid-engine|Univa Grid Engine]]
  * [[http://www.oracle.com/us/products/tools/oracle-grid-engine-075549.html|Oracle Grid Engine]], previously known as Sun Grid Engine (SGE)

**Current open source distributions:**

  * [[http://gridscheduler.sourceforge.net/|Open Grid Scheduler/Grid Engine]]
    * [[http://gridscheduler.sourceforge.net/howto/howto.html|HOWTOs]]
    * [[http://gridscheduler.sourceforge.net/htmlman/manuals.html|man pages]]
  * [[http://docs.oracle.com/cd/E24901_01/index.htm|Oracle Grid Engine / Sun Grid Engine Documentation (at oracle.com)]]
  * [[http://arc.liv.ac.uk/SGE/|Son of Grid Engine]]
    * [[http://arc.liv.ac.uk/SGE/howto/howto.html|HOWTOs]]
    * [[http://arc.liv.ac.uk/SGE/htmlman/manuals.html|man pages]]
    * [[http://arc.liv.ac.uk/SGE/javadocs/|javadocs]]