Allow the user to comfortably select a set of datasets as input for a workflow from the set of all available datasets. Typical examples could include:
This feature request might not be of high priority if the following feature, “Deduce which results need to be computed automatically”, exists.
This is specifically a request for the user interface. When starting a specific workflow on a number of plates, the UI must additionally allow picking certain “fixed” datasets from storage that will also be made available on the cluster. Examples could be a CP pipeline, a shading model, an object classification model, etc. Such input files have the status of workflow settings, i.e. a shading model is a setting in the same way as a parameter of a module.
Starting a workflow with additional datasets that act more like parameters would probably already be possible now. Since workflow chaining works on group datasets, the staging of these datasets would work.
Building a GUI would not be a problem either: we have the means to query openBIS and display the query results (see iPortal). The open question is how to organize this kind of data and how to display it in the GUI.
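Purely as an illustration, a minimal Java sketch of how the start-workflow dialog could be populated with candidate “fixed” datasets. OpenBisClient, DataSetInfo, and listDataSets are hypothetical placeholder names, not actual openBIS API calls; a real implementation would reuse whatever query facility iPortal already has.

    import java.util.List;

    /** Hypothetical facade over the openBIS query API (placeholder, not a real openBIS interface). */
    interface OpenBisClient {
        /** Lists all datasets of the given type (e.g. "SHADING_MODEL") below a project or experiment. */
        List<DataSetInfo> listDataSets(String projectOrExperimentCode, String dataSetType);
    }

    /** Minimal description of one selectable dataset, as shown in the picker. */
    class DataSetInfo {
        final String code;  // openBIS dataset code
        final String type;  // e.g. "CP_PIPELINE", "SHADING_MODEL", "CLASSIFICATION_MODEL"

        DataSetInfo(String code, String type) {
            this.code = code;
            this.type = type;
        }
    }

    /** Populates the "fixed input datasets" section of the start-workflow dialog. */
    class FixedDataSetPicker {
        private final OpenBisClient openBis;

        FixedDataSetPicker(OpenBisClient openBis) {
            this.openBis = openBis;
        }

        /** Candidates of one kind (pipeline, shading model, ...) offered for selection. */
        List<DataSetInfo> candidates(String projectCode, String dataSetType) {
            return openBis.listDataSets(projectCode, dataSetType);
        }
    }

The codes of the selected datasets would then be stored among the workflow settings, next to the ordinary module parameters.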
If a module does not need to execute, the workflow manager must check whether the outputs of that module are required by subsequent processing steps. All results that are required by subsequent processing steps must be made available on the cluster. The results should be made available in a way that hides the fact that the module did not execute, i.e. subsequent steps should not need to care whether the module actually ran or was skipped.
This would probably not be too hard to implement in the current module structure of iBRAIN2. The logic can be included in the first state handler of the module, using the openBIS Java API, which provides methods to query the datasets belonging to projects/experiments.
The bigger problem here would be defining the proper equivalence relation between datasets.
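A rough sketch, in Java, of how such a check could sit in a module's first state handler. DataSetQuery, DataSetDescription, and the equivalence test are assumptions and stubs, since neither the exact openBIS query calls nor the equivalence relation are fixed yet.

    import java.util.List;
    import java.util.Map;

    /** Sketch: decide in a module's first state handler whether the module actually has to run. */
    class ModuleSkipCheck {

        /** Hypothetical query; in reality a call to the openBIS Java API (datasets of an experiment). */
        interface DataSetQuery {
            List<DataSetDescription> dataSetsOfType(String experimentCode, String dataSetType);
        }

        /** Minimal local description of a dataset, enough for the equivalence stub below. */
        static class DataSetDescription {
            String code;
            String type;
            Map<String, String> properties;
        }

        private final DataSetQuery query;

        ModuleSkipCheck(DataSetQuery query) {
            this.query = query;
        }

        /** True if an equivalent output already exists; the module is then skipped and the dataset staged. */
        boolean outputAlreadyAvailable(String experimentCode, DataSetDescription expected) {
            for (DataSetDescription candidate : query.dataSetsOfType(experimentCode, expected.type)) {
                if (isEquivalent(candidate, expected)) {
                    return true;
                }
            }
            return false;
        }

        /** Stub for the still-to-be-defined equivalence relation between datasets. */
        private boolean isEquivalent(DataSetDescription a, DataSetDescription b) {
            return a.type.equals(b.type) && a.properties.equals(b.properties);  // assumption only
        }
    }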
Make use of the acquisition time. Some steps should be performed per image and can be done right after the image becomes available: checking the images and preprocessing them. The process that performs these steps would also be able to monitor the machinery of the microscope and send warnings to the right person if something happens (e.g. the microscope stops acquiring).
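A very small sketch of the per-image watcher this implies, assuming only that new images appear as files in a known acquisition directory; all names and the 30-minute threshold are illustrative.

    import java.io.File;

    /** Sketch: trigger per-image steps as images arrive and warn when acquisition stalls. */
    class AcquisitionWatcher {
        private static final long STALL_LIMIT_MS = 30 * 60 * 1000;  // assumption: warn after 30 min without a new image

        private long lastImageSeen = System.currentTimeMillis();

        /** Called periodically (e.g. every minute) by a scheduler. */
        void poll(File acquisitionDir) {
            File[] images = acquisitionDir.listFiles((dir, name) -> name.endsWith(".tif"));
            if (images != null) {
                for (File image : images) {
                    if (image.lastModified() > lastImageSeen) {
                        lastImageSeen = image.lastModified();
                        checkAndPreprocess(image);  // per-image steps, right after availability
                    }
                }
            }
            if (System.currentTimeMillis() - lastImageSeen > STALL_LIMIT_MS) {
                // a real implementation would avoid repeating the same warning every poll
                warnResponsiblePerson("No new image for 30 minutes - has the microscope stopped acquiring?");
            }
        }

        private void checkAndPreprocess(File image) { /* image check + preprocessing */ }

        private void warnResponsiblePerson(String message) { /* e.g. email notification */ }
    }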
In the current iBRAIN2 concept, datasets must first be registered with iBRAIN2 in order to be used for any processing. This makes it hard to work with datasets that come from sources unknown to iBRAIN2. A possible use case is the registration of metadata alongside images in openBIS: we could register in openBIS a compound dataset consisting of (images, TIFF metadata, small thumbnails, big thumbnails, quality assessment, and shading correction model). However, these datasets would be unknown to iBRAIN2.
Ideally, iBRAIN2 could learn about these datasets “cheaply”, for example by automatic openBIS inspection.
If we remove the dataset-specific information from the database and use the GUI to query the datasets, we no longer need the dataset registration. For automatic processing we could likewise use dataset queries to check whether new data has appeared in a previously configured project.
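A sketch of such a query-based check for new data; DataSetQuery and the registration-date filter are assumptions about how “new” would be detected, not actual API calls.

    import java.util.Date;
    import java.util.List;

    /** Sketch: detect new data in a watched project purely via dataset queries (no registration step). */
    class NewDataPoller {

        /** Hypothetical query method; a real implementation would call the openBIS API. */
        interface DataSetQuery {
            List<String> dataSetCodesRegisteredAfter(String projectCode, Date since);
        }

        private final DataSetQuery query;
        private Date lastCheck = new Date(0);

        NewDataPoller(DataSetQuery query) {
            this.query = query;
        }

        /** Called periodically; returns the codes of datasets that appeared since the last check. */
        List<String> findNewDataSets(String projectCode) {
            Date now = new Date();
            List<String> fresh = query.dataSetCodesRegisteredAfter(projectCode, lastCheck);
            lastCheck = now;
            return fresh;
        }
    }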
Accumulate a list of processings that ended in an error condition. For each such job, allow to:
Typical resource bottlenecks, and rules for handling them cleverly, are:
Maintain information about the pool of datasets on the cluster.
Delete a dataset from the pool on the cluster as soon as it is no longer required by any subsequent module.
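One possible bookkeeping scheme, sketched as simple reference counting; how deletion is actually carried out on the cluster file system is left open.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch: reference-counted pool of datasets staged on the cluster. */
    class ClusterDataSetPool {
        private final Map<String, Integer> consumersLeft = new HashMap<>();

        /** Register a staged dataset together with the number of modules that still need it. */
        void add(String dataSetCode, int pendingConsumers) {
            consumersLeft.put(dataSetCode, pendingConsumers);
        }

        /** Called when a module has finished reading a dataset; deletes it once nobody needs it. */
        void release(String dataSetCode) {
            Integer left = consumersLeft.get(dataSetCode);
            if (left == null) {
                return;
            }
            if (left <= 1) {
                consumersLeft.remove(dataSetCode);
                deleteFromCluster(dataSetCode);
            } else {
                consumersLeft.put(dataSetCode, left - 1);
            }
        }

        private void deleteFromCluster(String dataSetCode) { /* remove the staged copy */ }
    }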
Make sure that when the workflow manager starts, it either kills all running cluster jobs or recovers them (which of the two is still an open question).
Allow stopping of jobs, including forced stop where cluster processes are killed.
Avoid overloading the cluster queuing system by starting unnecessarily large numbers of cluster jobs. Maintain the list of running and queued jobs dynamically, based on the current cluster load.
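A sketch of a simple submission throttle; the job cap and the ClusterQueue interface (standing in for however the queue load is obtained, e.g. qstat-like information) are assumptions.

    import java.util.ArrayDeque;
    import java.util.Queue;

    /** Sketch: keep the number of running + queued cluster jobs below a configurable cap. */
    class JobThrottle {

        /** Hypothetical view of the cluster queuing system. */
        interface ClusterQueue {
            int runningAndQueuedJobs();
            void submit(String jobScript);
        }

        private final ClusterQueue cluster;
        private final int maxJobs;                       // assumed configurable limit
        private final Queue<String> pending = new ArrayDeque<>();

        JobThrottle(ClusterQueue cluster, int maxJobs) {
            this.cluster = cluster;
            this.maxJobs = maxJobs;
        }

        /** Jobs are buffered here instead of being submitted immediately. */
        void enqueue(String jobScript) {
            pending.add(jobScript);
        }

        /** Called periodically: submits only as many jobs as the current load allows. */
        void drain() {
            int freeSlots = maxJobs - cluster.runningAndQueuedJobs();
            while (freeSlots > 0 && !pending.isEmpty()) {
                cluster.submit(pending.poll());
                freeSlots--;
            }
        }
    }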
Manually triggered maintenance tasks for the cluster:
A watchdog process could monitor the current status and responsiveness of the daemon. In case of crash or error, an email notification should be sent.
A corresponding watchdog based on a shell script already exists.
Email notification should be robust to exceptions and crashes, and should be implemented on top of the logging classes. Newly added code should benefit from email notification without any extra programming overhead.
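One way to get this “for free” for newly added code is to hook the notification into the logging framework itself, so that any severe log record triggers an email. A minimal sketch with java.util.logging; the sendMail part is a placeholder that could delegate to javax.mail or to the existing watchdog script.

    import java.util.logging.Handler;
    import java.util.logging.Level;
    import java.util.logging.LogRecord;
    import java.util.logging.Logger;

    /** Sketch: every SEVERE log record anywhere in the code triggers an email notification. */
    class EmailLogHandler extends Handler {

        @Override
        public void publish(LogRecord record) {
            if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
                sendMail("iBRAIN2 error: " + record.getLoggerName(), record.getMessage());
            }
        }

        @Override public void flush() { }
        @Override public void close() { }

        private void sendMail(String subject, String body) {
            // placeholder: delegate to javax.mail or the existing watchdog script
        }
    }

    class NotificationSetup {
        /** Install once at startup; afterwards plain logger.severe(...) calls are enough. */
        static void install() {
            Logger.getLogger("").addHandler(new EmailLogHandler());
        }
    }

With such a handler installed once at startup, module authors only need an ordinary logger.severe(...) call; no extra notification code is required.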
Allow execution of sanity checking code on datasets to validate the data status.
Many properties of datasets can and should be annotated on the datasets in the storage. For these properties, the authority should be the storage and not the workflow manager. The workflow manager may cache some or all of these properties for performance, but it should allow gathering or updating the properties from the storage when required.
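A sketch of the intended caching behaviour; the Storage interface is a stand-in for openBIS or the file system, and the cache can always be refreshed so the storage remains the authority.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch: dataset properties cached for speed, with the storage remaining the authority. */
    class DataSetPropertyCache {

        /** Hypothetical property source backed by openBIS or the file system. */
        interface Storage {
            Map<String, String> loadProperties(String dataSetCode);
        }

        private final Storage storage;
        private final Map<String, Map<String, String>> cache = new HashMap<>();

        DataSetPropertyCache(Storage storage) {
            this.storage = storage;
        }

        /** Returns cached properties, fetching them from the storage on first access. */
        Map<String, String> properties(String dataSetCode) {
            return cache.computeIfAbsent(dataSetCode, storage::loadProperties);
        }

        /** Drops the cached copy so the next access re-reads the authoritative storage. */
        void refresh(String dataSetCode) {
            cache.remove(dataSetCode);
        }
    }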
This follows from the tight integration with the storage. Data maintenance is easy on the storage backend (openBIS or file system). Therefore, data maintenance is not required in the workflow manager, because it uses the storage backend for fetching data information.
Modules are partially configured in: