Easy Recovery Tasks

Make it easy to recover from failed storage, failed processing, …

This requires two things:

  1. intelligent automatic recovery
  2. user interface for easy recovery options

Automatic recovery rules:

  • Recovering running and crashed cluster jobs and processing after a daemon crash
  • Moving cluster jobs that were killed by the cluster platform because they didn't finish in time into a longer queue and resubmitting them
  • Requesting more memory and resubmitting cluster jobs that failed because they ran out of memory
  • Recovering after a network failure (the SSH connection used for remote submission, or file system mounts, may be broken)
  • Recovering after the disk fills up: if there is no more space on the scratch space or in other folders, the daemon should stop the operation and try to run a cleanup; if the cleanup isn't sufficient, it should report to the administrators
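
The automatic recovery rules above could be sketched as a single decision function. This is only an illustration, not the daemon's actual code: the names `Failure`, `Job`, `QUEUES`, `plan_recovery`, the action strings, the queue names, and the choice to double the memory request are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class Failure(Enum):
    """Failure causes named in the automatic recovery rules."""
    DAEMON_CRASH = auto()
    TIME_LIMIT = auto()
    OUT_OF_MEMORY = auto()
    NETWORK = auto()
    DISK_FULL = auto()


@dataclass(frozen=True)
class Job:
    name: str
    queue: str
    memory_mb: int


# Hypothetical queue ladder, shortest first (real queue names will differ).
QUEUES = ["short", "medium", "long"]


def plan_recovery(job, failure, cleanup_freed_enough=False):
    """Return (action, job), where job carries any adjusted resources."""
    if failure is Failure.DAEMON_CRASH:
        # Re-attach to jobs that were running when the daemon crashed.
        return "reattach", job
    if failure is Failure.TIME_LIMIT:
        # Move to the next longer queue, if one exists, and resubmit.
        idx = QUEUES.index(job.queue)
        if idx + 1 < len(QUEUES):
            return "resubmit", replace(job, queue=QUEUES[idx + 1])
        return "report_to_admin", job
    if failure is Failure.OUT_OF_MEMORY:
        # Assumption: doubling the request; the source only says "more memory".
        return "resubmit", replace(job, memory_mb=job.memory_mb * 2)
    if failure is Failure.NETWORK:
        # Re-establish the SSH connection or mounts, then retry.
        return "reconnect_and_retry", job
    if failure is Failure.DISK_FULL:
        # The operation is stopped and a cleanup is run before this decision.
        if cleanup_freed_enough:
            return "resume", job
        return "report_to_admin", job
    raise ValueError(f"unknown failure: {failure!r}")
```

For example, `plan_recovery(Job("align", "short", 2000), Failure.TIME_LIMIT)` yields a resubmission in the `medium` queue, while a job already in the longest queue is reported to the administrators instead.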

Manual recovery options:

  • Stopping/canceling processing or cluster jobs
  • Resubmitting canceled or failed jobs
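
The manual options above amount to two user-triggered state transitions. A minimal sketch, assuming a hypothetical `State` enum and `cancel`/`resubmit` helpers (none of these names come from the existing code base):

```python
from enum import Enum


class State(Enum):
    """Hypothetical job states used only for this illustration."""
    QUEUED = "queued"
    RUNNING = "running"
    CANCELED = "canceled"
    FAILED = "failed"
    FINISHED = "finished"


def cancel(state):
    """Stop/cancel a job; only queued or running jobs can be canceled."""
    if state in (State.QUEUED, State.RUNNING):
        return State.CANCELED
    raise ValueError(f"cannot cancel a job in state {state.value}")


def resubmit(state):
    """Resubmit a job; only canceled or failed jobs can be resubmitted."""
    if state in (State.CANCELED, State.FAILED):
        return State.QUEUED
    raise ValueError(f"cannot resubmit a job in state {state.value}")
```

A user interface could use such checks to enable or disable the cancel and resubmit buttons for each job.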
beewm/devel/easy_recovery_tasks.txt · Last modified: 2013/07/24 18:15 by epujadas