
Datasets

Use existing storage datasets as input for processing

Concept

Allow the user to comfortably select a set of datasets, out of all available datasets, as input for a workflow. Typical examples include:

  1. all RAW image datasets from the Bartonella screen (from 12 different experiments and two spaces)
  2. all RAW image datasets imaged with microscope XM1
  3. all Thresholded Infection Scoring datasets from 2011 (from all experiments and spaces)

Notes

This feature request might not be of high priority if the following feature, “Deduce which results need to be computed automatically”, exists.

Use single dataset(s) as input for processing (for example CP pipeline)

Concept

  1. When initially parsing the workflow, check which modules are required to execute.
  2. If all the following conditions are met, a module might not be required to execute:
    1. The result(s) of this module exist already, in a new enough version
    2. None of the inputs for this module would be changed by this workflow
  3. In contrast, the module must execute if one of the following conditions is met:
    1. The results of this module do not exist in storage, in a new enough version
    2. At least one of the inputs for this module is changed by this workflow
    3. The module version given in the workflow is sufficiently newer than the version used to produce the existing results

If the module does not need to execute, the workflow manager must check whether the outputs of the module are required by subsequent processing steps. All results that are required by subsequent steps must be made available on the cluster, in a way that hides the fact that the module did not execute, i.e. downstream steps should not be able to tell whether the results were freshly computed or fetched from storage.
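A minimal sketch of this decision logic (the data structures and field names below are made up for illustration and are not the iBRAIN2 module interface):

  from dataclasses import dataclass, field

  @dataclass
  class StoredResult:
      module_version: int      # version of the module that produced the stored result

  @dataclass
  class Module:
      name: str
      version: int
      inputs: list = field(default_factory=list)   # names of the input datasets

  def must_execute(module, storage, changed_inputs):
      """storage: dict module name -> StoredResult; changed_inputs: dataset names (re)created by this workflow."""
      result = storage.get(module.name)
      if result is None:
          return True                               # results do not exist in storage
      if any(i in changed_inputs for i in module.inputs):
          return True                               # an input is changed by this workflow
      if module.version > result.module_version:
          return True                               # module is newer than the existing results
      return False

  # Example: results exist, module version unchanged, no inputs change -> skip execution,
  # but the stored results must still be staged for downstream steps.
  storage = {"illumination_correction": StoredResult(module_version=3)}
  module = Module("illumination_correction", version=3, inputs=["raw_images"])
  print(must_execute(module, storage, changed_inputs=set()))   # False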

Possible implementations

This probably would not be too hard to implement in the current module structure of iBRAIN2. The logic could be included in the first state handler of each module, using the openBIS Java API, which provides methods to query the datasets belonging to projects/experiments.

The bigger problem here would be defining the proper equivalence relation between datasets.
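As an illustration only, an equivalence check could compare metadata such as dataset type, parent experiment and generating parameters; which fields actually matter is exactly the open question (the field names below are hypothetical):

  def equivalent(ds_a, ds_b):
      # hypothetical criteria: same type, same experiment, produced with the same parameters
      return (ds_a["type"] == ds_b["type"]
              and ds_a["experiment"] == ds_b["experiment"]
              and ds_a["parameters"] == ds_b["parameters"])

  a = {"type": "RAW_IMAGES", "experiment": "/SPACE/PROJECT/EXP1", "parameters": {"binning": 2}}
  b = {"type": "RAW_IMAGES", "experiment": "/SPACE/PROJECT/EXP1", "parameters": {"binning": 2}}
  print(equivalent(a, b))   # True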

Tight Acquisition Integration

See: Tight Acquisition Integration

Make use of the time during acquisition. Some steps should be performed per image and can run as soon as the image is available, in particular checking the images and preprocessing them. The process running these steps could also monitor the microscope and send warnings to the right person if something goes wrong (for example, if the microscope stops acquiring).
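A rough sketch of such a per-image process (the directory, the timeout and the three helper functions are placeholders, not existing code):

  import os, time

  INCOMING = "/path/to/incoming"     # hypothetical directory the microscope writes to
  TIMEOUT = 15 * 60                  # warn if no new image arrives for 15 minutes

  def check_image(path):
      pass                           # placeholder: e.g. verify the file is readable and not truncated

  def preprocess_image(path):
      pass                           # placeholder: e.g. compute thumbnails, shading estimate

  def notify(message):
      print(message)                 # placeholder: e.g. email the responsible person

  def watch_acquisition():
      seen = set()
      last_image = time.time()
      while True:
          for name in sorted(os.listdir(INCOMING)):
              if name.endswith(".tif") and name not in seen:
                  seen.add(name)
                  last_image = time.time()
                  check_image(os.path.join(INCOMING, name))
                  preprocess_image(os.path.join(INCOMING, name))
          if time.time() - last_image > TIMEOUT:
              notify("The microscope seems to have stopped acquiring")
              last_image = time.time()   # avoid sending the same warning repeatedly
          time.sleep(30)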

Allow for "cheap" registration or no registration of datasets

In the current iBRAIN2 concept, datasets must first be registered with iBRAIN2 in order to be used for any processing. This makes it hard to work with datasets that come from sources unknown to iBRAIN2. A possible use case is the registration of metadata alongside images in openBIS: we could register in openBIS a compound dataset consisting of images, TIFF metadata, small thumbnails, big thumbnails, a quality assessment, and a shading correction model. However, these datasets would be unknown to iBRAIN2.

Ideally, iBRAIN2 could learn about these datasets “cheaply” for example by automatic openBIS inspection.

Possible implementations

If we remove the dataset-specific information from the database and use the GUI to query the datasets, dataset registration is no longer needed. For automatic processing, we could also use dataset queries to check whether there is new data in a previously configured project.
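A sketch of such registration-free discovery, where query_datasets() stands in for the actual openBIS dataset query (it is a placeholder, not an existing API call):

  import time

  def query_datasets(project):
      return []                      # placeholder for an openBIS query returning dataset codes

  def watch_project(project, poll_seconds=600):
      processed = set()
      while True:
          for code in query_datasets(project):
              if code not in processed:
                  processed.add(code)
                  print("new dataset:", code)   # here: trigger automatic processing
          time.sleep(poll_seconds)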

Recover, remove or finalize failed workflows

Accumulate a list of processings that ended in an error condition. For each such processing, allow the user to do one of the following (see the sketch after this list):

  1. re-try processing where the old processing failed
  2. remove (or hide) the processing and all results produced by this process
  3. finalize the processing: skip all remaining steps and assign the completed steps a success status
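A sketch of the three actions; the processing object and its fields are hypothetical and only illustrate the intended state changes:

  def retry(processing):
      # restart at the step where the old processing failed
      processing.current_step = processing.failed_step
      processing.status = "queued"

  def remove(processing, storage):
      # remove (or hide) the processing and all results it produced
      for result in processing.results:
          storage.hide(result)
      processing.status = "removed"

  def finalize(processing):
      # skip all remaining steps; already completed steps keep a success status
      for step in processing.remaining_steps:
          step.status = "skipped"
      processing.status = "finished"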

Easy status maintenance

  1. hide unneeded or bad workflows
  2. hide unneeded or bad processings

Automatic resolution of parallelism (run parallel whatever possible)

Trigger processing when data becomes available

Staging, Resource Handling and Cluster Integration

Clever handling of resource bottlenecks (e.g. never stage two plates at once, stage smaller datasets first, ...)

See: Resource Optimization

Typical resource bottlenecks, and rules for handling them cleverly, are (see the sketch after this list):

  1. network transfer of data
    1. staging of smaller datasets before larger datasets (log10-scale)
    2. do not stage more than N datasets at once (for N = 2 or 3)
  2. hard disk reading/writing
    1. do not move more than N datasets at once (for N = 2 or 3)
  3. openBIS data import
  4. cluster queuing system
  5. cluster CPU time

Allow manual prioritization

See: Prioritization

  1. Allow setting initial priority for each processing
  2. Allow dynamic changing of priority for each processing

Prioritize processing by Resource Bottlenecks

See: Prioritization

  1. Prefer processings for which more datasets are already staged
  2. Prefer processings for which smaller amounts of data need to be staged
  3. Prefer processings after which larger amounts of data can be deleted
  4. Prefer processings of workflows with fewer remaining steps
  5. Prefer processings which are quicker
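These heuristics, together with the manual priority from the previous section, could be combined into a single score; the weights below are arbitrary placeholders:

  def priority_score(p):
      # p is a dict describing one processing; all fields are hypothetical
      return (10.0 * p["manual_priority"]            # user-assigned priority
              + 2.0 * p["fraction_already_staged"]   # prefer mostly staged data
              - 1.0 * p["gigabytes_to_stage"]        # prefer small staging volume
              + 1.0 * p["gigabytes_freed_after"]     # prefer freeing scratch space
              - 0.5 * p["remaining_steps"]           # prefer nearly finished workflows
              - 0.1 * p["estimated_hours"])          # prefer quick processings

  def next_processing(queue):
      return max(queue, key=priority_score)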

Stage repeatedly used data only once

See: Reusable Datasets in Cluster

Maintain information about the pool of datasets on the cluster.

Free cluster scratch space immediately after use

See: Reusable Datasets in Cluster

Delete data from the pool of datasets on the cluster as soon as the dataset is not required by subsequent modules anymore.
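Both requirements (stage repeatedly used data only once, delete it as soon as nobody needs it anymore) can be covered by a reference-counted pool; a minimal sketch, with staging and deletion left as placeholders:

  class DatasetPool:
      def __init__(self):
          self.refcount = {}          # dataset code -> number of pending consumers

      def request(self, code):
          if code not in self.refcount:
              self.refcount[code] = 0
              self.stage(code)        # first consumer triggers staging
          self.refcount[code] += 1

      def release(self, code):
          self.refcount[code] -= 1
          if self.refcount[code] == 0:
              self.delete(code)       # last consumer finished -> free scratch space
              del self.refcount[code]

      def stage(self, code):
          pass                        # placeholder: copy from storage to cluster scratch

      def delete(self, code):
          pass                        # placeholder: remove from cluster scratch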

During Startup Recover or Kill Running Cluster Jobs

See: Cluster Job Management and Visualization

Make sure that when the workflow manager starts, it either kills all running cluster jobs or is able to recover them (which of the two it should do is still open).

Stop running processes (include killing of jobs)

See: Cluster Job Management and Visualization

Allow stopping of jobs, including forced stop where cluster processes are killed.

Optimize number of jobs for Cluster Queuing System

See: Cluster Queue Optimization

Avoid overloading the cluster queuing system by starting unnecessarily large numbers of cluster jobs. Adjust the number of running and queued jobs dynamically, based on the current cluster load.
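A sketch of such throttling; the three query/submit functions are placeholders for the queuing system's actual commands:

  MAX_PENDING = 200

  def cluster_load_fraction():
      return 0.0                      # placeholder: 0.0 (idle) .. 1.0 (full)

  def own_jobs_in_queue():
      return 0                        # placeholder: count our running + queued jobs

  def submit(job):
      pass                            # placeholder: submit one job to the queuing system

  def allowed_jobs():
      # lower the limit when the cluster is busy
      return int(MAX_PENDING * (1.0 - cluster_load_fraction()))

  def submit_pending(pending_jobs):
      while pending_jobs and own_jobs_in_queue() < allowed_jobs():
          submit(pending_jobs.pop(0))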

Monitoring and Maintenance of cluster resources

Manually triggered maintenance tasks for the cluster:

  1. remove old temporary files owned by the cluster job user from cluster nodes /tmp
  2. clean up resources left behind by crashed jobs: crash dump files, etc.
  3. warn about endlessly running jobs

Robustness and Code Design

Automatic Restart or Email notification in case of daemon crash

See: Watchdog and User Notification

A watchdog process could monitor the current status and responsiveness of the daemon. In case of crash or error, an email notification should be sent.

Note

A corresponding watchdog based on a shell script already exists.
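For illustration, such a watchdog could work roughly as follows (the heartbeat path, check interval and restart command are placeholders; this is not the existing shell script):

  import os, subprocess, time

  HEARTBEAT = "/var/run/workflow-manager/heartbeat"   # hypothetical file touched regularly by the daemon
  MAX_AGE = 300                                       # seconds

  def send_alert_mail(message):
      print(message)                                  # placeholder: send an email notification

  def daemon_alive():
      try:
          return time.time() - os.path.getmtime(HEARTBEAT) < MAX_AGE
      except OSError:
          return False                                # heartbeat file missing

  while True:
      if not daemon_alive():
          send_alert_mail("workflow manager daemon is unresponsive")
          subprocess.call(["/etc/init.d/workflow-manager", "restart"])   # hypothetical restart command
      time.sleep(60)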

Email notification for all possible error conditions

Concept

Email notification should be robust to exceptions and crashes, and should work based on logging classes. Newly added code should easily benefit from email notification without programming overhead.

Possible implementation

  1. email notification handled by a standalone thread or process based on logfile parsing.
  2. email notification based on the logging library
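As an illustration of option 2, assuming the daemon were written in Python, the standard logging library can mail every error without extra effort in the calling code (host, addresses and the dataset code below are placeholders):

  import logging
  import logging.handlers

  mail_handler = logging.handlers.SMTPHandler(
      mailhost="smtp.example.org",
      fromaddr="workflow-manager@example.org",
      toaddrs=["admin@example.org"],
      subject="[workflow manager] error")
  mail_handler.setLevel(logging.ERROR)
  logging.getLogger().addHandler(mail_handler)

  # any newly added code simply logs; errors are mailed automatically
  logging.getLogger(__name__).error("staging of dataset %s failed", "example-dataset-code")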

Data validation and checking

Allow execution of sanity checking code on datasets to validate the data status.

Data management, and (automatic/easy) synchronization with storage

Avoid Database duplication

See: Combining Processing Information with Storage Information

Many properties of datasets can and should be annotated on the datasets in the storage. For these properties, the authority should be the storage and not the workflow manager. The workflow manager may cache some or all of these properties for performance, but it should allow the properties to be gathered or updated from the storage when required.

Possible implementations

  1. The workflow manager fetches most dataset properties from the storage on demand
  2. The workflow manager caches properties and updates the cache on a regular basis
  3. The workflow manager caches properties and updates the cache on manual request
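The three options can coexist; a minimal sketch of a cache that fetches on demand, expires entries after a while and supports manual refresh (fetch_from_storage() is a placeholder for the real storage query):

  import time

  def fetch_from_storage(code):
      return {}                                 # placeholder: query openBIS for the properties

  class PropertyCache:
      def __init__(self, max_age_seconds=3600):
          self.max_age = max_age_seconds
          self.cache = {}                       # dataset code -> (fetch time, properties)

      def get(self, code):
          entry = self.cache.get(code)
          if entry is None or time.time() - entry[0] > self.max_age:
              self.refresh(code)                # missing or stale -> ask the storage
          return self.cache[code][1]

      def refresh(self, code):
          # manual refresh always goes back to the storage, which is the authority
          self.cache[code] = (time.time(), fetch_from_storage(code))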

Easy data maintenance: remove duplicate results, recover failed storage, ...

See: Split into User Interface and Backend Daemon

This follows from tight integration with the storage. Data maintenance is easy on the storage backend (openBIS or the file system). Therefore, no separate data maintenance is required in the workflow manager, because it fetches data information from the storage backend.

Resolve separation of workflows and module configuration

See: Workflow Specification Syntax

Modules are partially configured in:

  1. the workflow
  2. the module pipeline (if one exists)
  3. module settings files and models (if they exist)
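One way to resolve the separation is to merge the three sources into one effective configuration with an explicit precedence; the order chosen below (workflow overrides pipeline, pipeline overrides settings files) is only an assumption:

  def effective_config(settings_file_cfg, pipeline_cfg, workflow_cfg):
      cfg = dict(settings_file_cfg)    # lowest precedence: module settings files and models
      cfg.update(pipeline_cfg or {})   # then the module pipeline
      cfg.update(workflow_cfg or {})   # highest precedence: the workflow
      return cfg

  print(effective_config({"threshold": 0.5, "model": "default.mat"},
                         {"threshold": 0.6},
                         {"model": "screen_specific.mat"}))
  # -> {'threshold': 0.6, 'model': 'screen_specific.mat'}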

Simple workflow description

Clean (chroot) Cluster environment for modules

See: CHROOT-like Cluster Environment

  1. Executed modules should not be able to overwrite installation files
  2. Executed modules should not overwrite output files from previous executions or other modules

Condition-based branching (e.g. fall-back for errors), and generated/adjusted module configuration

Flexible code design (easy to extend, easy to make larger modifications)

Run a workflow on a Linux desktop computer

Run a workflow on a Windows desktop computer
