Make it easy to recover from failed storage, failed processing, …
Requires two things:
intelligent automatic recovery
user interface for easy recovery options
Automatic recovery rules:
Recovering running and crashed cluster jobs and processing after a daemon crash
Moving cluster jobs that were killed by the cluster platform because they didn't finish in time into a longer queue and resubmitting them
Requesting more memory and resubmitting cluster jobs that failed because they did not have enough memory
Recovering after a network failure (the ssh connection used for remote submission or the file system mounts may be broken)
Recovering after the disk fills up: if there's no more space on the scratch space or in other folders, the daemon should stop the operation and try to run a clean-up; if the clean-up isn't sufficient, it should report to the administrators
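The resubmission rules above could be sketched as a small decision function. This is only an illustration under assumptions: the queue names, the retry limit, the memory cap, the doubling of memory, and all class and function names are hypothetical and not part of the existing daemon.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Failure(Enum):
    TIME_LIMIT = "time_limit"        # killed by the cluster platform before finishing
    OUT_OF_MEMORY = "out_of_memory"  # requested memory was too small
    NETWORK = "network"              # broken ssh connection or file system mount
    DISK_FULL = "disk_full"          # no space left on scratch or other folders


@dataclass(frozen=True)
class Job:
    name: str
    queue: str
    memory_mb: int
    attempts: int = 0


# Hypothetical values: queues ordered from shortest to longest run time.
QUEUES = ["short", "medium", "long"]
MAX_ATTEMPTS = 3
MAX_MEMORY_MB = 64_000


def plan_recovery(job: Job, failure: Failure) -> Optional[Job]:
    """Return the job to resubmit, or None if an administrator must step in."""
    if job.attempts >= MAX_ATTEMPTS:
        return None
    retry = replace(job, attempts=job.attempts + 1)
    if failure is Failure.TIME_LIMIT:
        position = QUEUES.index(job.queue)
        if position + 1 >= len(QUEUES):
            return None  # already in the longest queue
        return replace(retry, queue=QUEUES[position + 1])
    if failure is Failure.OUT_OF_MEMORY:
        if job.memory_mb >= MAX_MEMORY_MB:
            return None  # cannot request more memory than the cap
        # Assumption: doubling the request is a reasonable next step.
        return replace(retry, memory_mb=min(job.memory_mb * 2, MAX_MEMORY_MB))
    if failure is Failure.NETWORK:
        return retry  # same resources; resubmit once the connection is back
    return None  # DISK_FULL needs a clean-up first, not a plain resubmit
```

Returning None is used here as the signal that automatic recovery has given up, so the case can be reported to the administrators or handled through the manual options.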
Manual recovery options:
Stopping/Canceling processing and cluster jobs
Resubmitting cancelled or failed jobs
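One way the user interface could decide which manual options to offer is a small table of allowed state transitions. The sketch below is an assumption for illustration only: the state names, the action names, and both functions are hypothetical and do not describe existing code.

```python
from enum import Enum
from typing import List


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    FAILED = "failed"
    CANCELLED = "cancelled"
    DONE = "done"


# Hypothetical table: which manual action is allowed in which state,
# and the state the job moves to afterwards.
MANUAL_ACTIONS = {
    "cancel": {State.QUEUED: State.CANCELLED, State.RUNNING: State.CANCELLED},
    "resubmit": {State.FAILED: State.QUEUED, State.CANCELLED: State.QUEUED},
}


def available_actions(state: State) -> List[str]:
    """List the manual actions the interface should offer for a job."""
    return sorted(
        action
        for action, transitions in MANUAL_ACTIONS.items()
        if state in transitions
    )


def apply_action(state: State, action: str) -> State:
    """Return the new state, or raise if the action is not allowed."""
    transitions = MANUAL_ACTIONS.get(action, {})
    if state not in transitions:
        raise ValueError(f"cannot {action} a job that is {state.value}")
    return transitions[state]
```

Keeping the rules in one table means the interface and the daemon can share the same definition of what may be cancelled or resubmitted.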
beewm/devel/easy_recovery_tasks.txt · Last modified: 2013/07/24 18:15 by epujadas