
Combining Processing and Storage Information

:!: This documentation needs to be updated to describe the current implementation

Motivation

To avoid database duplication, we should store dataset-related information only in the storage (openBIS) and processing-related information only in the database of the workflow manager. To present meaningful views, we must be able to display both kinds of information together.

Possible solution

openBIS has an API for querying dataset-related information. The workflow manager should likewise provide an API (like iBRAIN's RESTful API) for querying all processing-related information. We then only need to store keys on both sides, and the GUI can fetch and display the necessary details from either system.

Keys for connecting the databases (a GUI-side join using these keys is sketched after the list):

  • DatasetId: the workflow manager must in any case know which data a workflow was started on, so this identifier is enough to query all the necessary dataset information from openBIS.
  • ProcessId: if we store the id of the processing that generated a given dataset as metadata of that dataset, we can query all the information about that processing from the workflow manager.
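
To make the idea concrete, here is a minimal sketch of the GUI-side join, assuming a hypothetical REST facade on both sides (both base URLs, the endpoint paths, and the PROCESS_ID property name are made up for illustration; neither API layout is defined yet):

  import requests

  # Hypothetical base URLs -- neither endpoint layout exists yet.
  WFM_API = "http://wfm.example.org/api"          # workflow manager RESTful API
  OPENBIS_API = "http://openbis.example.org/api"  # thin facade over openBIS

  def processings_with_dataset_details():
      """Join processing rows (workflow manager) with dataset metadata (openBIS).

      Only the two keys travel between the systems: the processing row carries
      the input DatasetId, and the dataset carries the ProcessId that made it.
      """
      for proc in requests.get(f"{WFM_API}/processings").json():
          # DatasetId key: look up the dataset the workflow was started on.
          dataset = requests.get(
              f"{OPENBIS_API}/datasets/{proc['inputDatasetId']}").json()
          yield {**proc, "dataset": dataset}

  def processing_of_dataset(dataset):
      """ProcessId key: from a dataset's metadata back to its generating run."""
      process_id = dataset["properties"]["PROCESS_ID"]
      return requests.get(f"{WFM_API}/processings/{process_id}").json()

The point is that only the two keys cross the system boundary; everything else is fetched on demand from the side that owns it.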

Where to store the information

Database of the workflow manager

Here we should store all the information needed for scheduling the processing steps (sketched as data structures after the list):

  • Workflow definitions: Name, Id, Description, Author
    • Module definitions: Module name, Module version, Parameters
  • Processing data: Workflow Id, Input dataset Ids, Status, Priority
    • Module processing information: Module Id, Status
    • Cluster job properties: Id, Status
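
As a sketch only, the entities above could be modelled like this (field names and types are illustrative, not a final schema):

  from dataclasses import dataclass, field
  from enum import Enum
  from typing import Dict, List

  class Status(Enum):
      PENDING = "pending"
      RUNNING = "running"
      FINISHED = "finished"
      FAILED = "failed"

  @dataclass
  class ModuleDefinition:
      name: str
      version: str
      parameters: Dict[str, str] = field(default_factory=dict)

  @dataclass
  class WorkflowDefinition:
      workflow_id: int
      name: str
      description: str
      author: str
      modules: List[ModuleDefinition] = field(default_factory=list)

  @dataclass
  class ClusterJob:
      job_id: str                     # identifier assigned by the scheduler
      status: Status

  @dataclass
  class ModuleProcessing:
      module_id: int
      status: Status
      jobs: List[ClusterJob] = field(default_factory=list)

  @dataclass
  class Processing:
      workflow_id: int
      input_dataset_ids: List[str]    # DatasetId keys into openBIS
      status: Status
      priority: int
      module_runs: List[ModuleProcessing] = field(default_factory=list)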

Storage (openBIS)

Here we should store all dataset-related information; an example of the metadata attached to a derived dataset follows the list.

  • Organization of the data: Spaces/Projects/Experiments
  • Metadata of the datasets: Parent dataset, Dataset type, Workflow, Module, Versions, Parameters, Processing logs (attached to the dataset or stored as a separate dataset)
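
For illustration only, the metadata registered with a derived dataset could look like the following; the property names are hypothetical and would have to match the property types actually configured in openBIS:

  # Hypothetical openBIS property keys for a dataset produced by a processing.
  derived_dataset_properties = {
      "PARENT_DATASET": "20130725103000123-45",   # code of the input dataset
      "DATASET_TYPE":   "SEGMENTATION_RESULT",
      "PROCESS_ID":     "1042",                   # key back into the workflow manager DB
      "WORKFLOW":       "cell-segmentation",
      "MODULE":         "segment-nuclei",
      "MODULE_VERSION": "1.3.2",
      "PARAMETERS":     '{"threshold": 0.42, "min_size": 20}',  # serialized as JSON
  }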

On the cluster

Here we can store or access information that shows the status of the system or of the running processes (a monitoring sketch follows the list).

  • System status: Running cluster jobs, Size of the dataset cache/work directory
  • Processing status: Logs of the running processes
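
A minimal monitoring sketch, assuming an SGE/PBS-style scheduler whose qstat output lists one job per line (the parsing is approximate, and the user name is a placeholder):

  import os
  import subprocess

  def running_cluster_jobs(user="beewm"):
      """Count the user's cluster jobs by parsing 'qstat -u <user>' output."""
      out = subprocess.run(["qstat", "-u", user], capture_output=True,
                           text=True, check=True).stdout
      # Skip header/separator lines; data rows start with a numeric job id.
      return sum(1 for ln in out.splitlines()
                 if ln.strip() and ln.strip()[0].isdigit())

  def directory_size_bytes(path):
      """Total size of the dataset cache / work directory."""
      total = 0
      for root, _dirs, files in os.walk(path):
          for name in files:
              total += os.path.getsize(os.path.join(root, name))
      return total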

Considerations

  • Speed/Caching: this mainly depends on which information we want to display together. For example, if we only display the list of processings with a link to the dataset details, querying those details on request will probably be fast enough; but if we want to display a complex list with mixed information, assembling it on the fly may take too long. A minimal per-item cache is sketched after this list.
  • Data duplication: we may have to duplicate workflow definition data. The workflow manager needs the definition to run the processing, but the same data is also important on the storage side to describe the generating process (and we cannot trust the workflow manager as long-term storage). One possible solution is to store the workflow description in the storage as a dataset as well; the workflow manager would then fetch the description whenever a new processing is submitted. This probably does not remove the duplication, though: we need the module/version/parameter information per dataset, and that is also stored in the workflow, while fetching and parsing the workflow description every time the workflow manager checks for existing results would be inefficient.
  • Storing processing logs in the storage: on the one hand it is a logical decision to store logs with the results, since they can be important for understanding those results. However, logs are also important for debugging, and when an error occurs the processing usually stops and no dataset is stored. So the logs of a running or failed processing must be made available on the interface through a different path than the logs of successfully finished processings. A sketch of this routing also follows the list.
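
For the Speed/Caching point, a minimal per-item cache, assuming dataset details are fetched by key through a caller-supplied fetch function (the TTL value and the lack of eviction are illustrative only):

  import time

  class TTLCache:
      """Minimal time-based cache for dataset detail lookups."""

      def __init__(self, fetch, ttl_seconds=60):
          self.fetch = fetch          # e.g. a function calling the openBIS API
          self.ttl = ttl_seconds
          self._entries = {}          # key -> (timestamp, value)

      def get(self, key):
          now = time.time()
          hit = self._entries.get(key)
          if hit is not None and now - hit[0] < self.ttl:
              return hit[1]           # still fresh: no remote call needed
          value = self.fetch(key)     # expired or missing: fetch and remember
          self._entries[key] = (now, value)
          return value

Usage would be something like dataset_details = TTLCache(fetch=fetch_dataset_from_openbis); whether such per-item caching suffices depends, as noted above, on what the GUI displays together.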
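
For the log consideration, a sketch of the routing, reusing the hypothetical openBIS facade from above and assuming the live log sits in a process.log file inside the cluster work directory (all names and endpoints are placeholders):

  import os
  import requests

  OPENBIS_API = "http://openbis.example.org/api"  # hypothetical facade, as above

  def processing_logs(proc):
      """Return the logs of a processing from the appropriate source."""
      if proc["status"] in ("running", "failed"):
          # No result dataset (yet, or ever): read the live cluster-side log.
          with open(os.path.join(proc["workdir"], "process.log")) as fh:
              return fh.read()
      # Finished successfully: the log was stored with the result dataset.
      return requests.get(
          f"{OPENBIS_API}/datasets/{proc['resultDatasetId']}/log").text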