Workflow Template Specification

<workflow name="String (required)" author="String (required)" cleanup="TRUE|FALSE (optional, defaults to TRUE)">
    <hosts>
        <run_on>CLUSTER_HOST|LOCAL_HOST (optional, defaults to CLUSTER_HOST)</run_on>
    </hosts>
    <input>
        <datasets>
            <dataset name="String (required)" id="String (storage ID, required)" type="String (storage dataset type, required)" stage="TRUE|FALSE (optional, defaults to TRUE)" />
            ....
        </datasets>
    </input>
    <modules>
        <module name="String (required)" version="Regex pattern (required)" class="String (optional, defaults to ch.systemsx.bee.workflowmanager.module.MockModule)" required_runtime_minutes="Integer (optional)" required_memory_mb="Integer (optional)" >
            <params (optional)>
                <param name="indexbuilder_dataset|indexbuilder_regex|indexes_per_job|indexes_start (required)" value="String (required)" />
                ....
            </params>
            <executable>
                <path>String with optional variables (required)</path>
                <args (optional)>
                    <arg name="String (optional)" type="STRING|PATH (required)" value="String|dataset:String (required)" selector="String (optional)" />
                        ....
                </args>
            </executable>
            <output>
                <datasets (optional)>
                    <dataset name="String (required for datasets not to store, no default)" type="String (required for datasets to store, no default)" store="TRUE|FALSE (optional, defaults to FALSE)" relevant="TRUE|FALSE (optional, defaults to TRUE)">
                        <files (required only for datasets to store, otherwise ignored) in_dir="String (optional, defaults to the root of the module's work directory)" regex="Regex pattern (required)" />
                    </dataset>
                    .....
                </datasets>
                <validations (optional) level="TASK|MODULE (required)">
                    <validation mode="COUNT|SIZE|CONTENT (required)" sub_dir="String (optional, defaults to the root of the module's work directory)" regex="Regex pattern (optional, no default)" content_regex="Regex pattern (optional, no default)" comparator="EQUAL|GREATER|LESS|NOT_EQUAL|GREATER_OR_EQUAL|LESS_OR_EQUAL (required)" target_value="Integer (required)" fail_status="VALIDATION_ERROR|VALIDATION_WARNING (required)" fail_message="String (optional, no default)" />
                    ......
                </validations>
            </output>
        </module>
        ....
    </modules>
</workflow>

Workflow Template Example

<workflow name="bee_workflow" author="bee_author">
    <hosts>
        <run_on>CLUSTER_HOST</run_on>
    </hosts>
    <input>
        <datasets>
            <dataset name="RawImages" id="0bCDME-BE01" type="HCS_IMAGE_RAW" stage="true" />
            <dataset name="ComputeShadingCorrectionAverageImageSettings" id="12345" type="HCS_ARGUMENTS" stage="true" />
            <dataset name="MergeShadingCorrectionAverageImageSettings" id="1234567890" type="HCS_ARGUMENTS" stage="true" />
        </datasets>
    </input> 
    <modules>
        <module name="ComputeShadingCorrectionAvgImg" version="1.*.*" class="ch.systemsx.resit.ibrain2.module.generic.CommandLineOnClusterModule">
            <params>
                <param name="indexbuilder_dataset" value="RawImages" />
                <param name="indexbuilder_regex" value=".*_cDAPI_.*.(tif|jp2)$" />
                <param name="indexes_per_job" value="1000" />
                <param name="indexes_start" value="1" />
            </params>
            <executable>
                <path>${config.extras_dir}/ShadingCorrectionAverageImage_${module.version}/ComputeShadingCorrectionAvgImg.command</path>
                <args>
                    <!-- First argument is the input directory, resolved from the input-section above  -->
                    <arg type="path" value="dataset:RawImages" />
                    <!-- Second argument is the settings file dataset, resolved from the input-section above -->
                    <arg type="path" value="dataset:ComputeShadingCorrectionAverageImageSettings" />
                    <!-- Third argument is the start index -->
                    <arg type="string" value="${bee_indexer.start_index}" />
                    <!-- Fourth argument is the end index -->
                    <arg type="string" value="${bee_indexer.end_index}" />
                    <!-- Fifth argument is the output directory, will create the dataset from the output-section below -->
                    <arg type="path" value="dataset:ComputeShadingCorrectionAverageImageResults" />
                </args>
            </executable>
            <output>
                <datasets>
                    <dataset name="ShadingCorrectedImages" type="HCS_CHANNEL_SHADINGCORR_IMAGES" store="true">
                        <files in_dir="" regex=".*\.mat" />
                        <files in_dir="" regex=".*\.png$" />
                    </dataset>
                    <dataset type="CLUSTER_JOB_LOGS" store="true">
                        <files in_dir="" regex=".*\.stderr" />
                        <files in_dir="" regex=".*\.stdout" />
                    </dataset>
                    <dataset name="ComputeShadingCorrectionAverageImageResults" store="false" relevant="false" />
                </datasets>               
                <validations level="task">
                    <validation mode="count" sub_dir="" regex="${task.log_stdout}" comparator="equal" target_value="1" fail_status="VALIDATION_ERROR" fail_message="Missing stdout log file" />
                    <validation mode="count" sub_dir="" regex="${task.log_stderr}" comparator="equal" target_value="1" fail_status="VALIDATION_ERROR" fail_message="Missing stderr log file" />
                    <validation mode="content" sub_dir="" regex="${task.log_stdout}" content_regex="finished successfully" comparator="equal" target_value="1" fail_status="VALIDATION_ERROR" fail_message="Missing success message in stdout log file" />
                    <validation mode="content" sub_dir="" regex="${task.log_stdout}" content_regex="Warning:" comparator="equal" target_value="0" fail_status="VALIDATION_WARNING" fail_message="Detected warning message in stdout log file" />
                    <validation mode="content" sub_dir="" regex="${task.log_stderr}" content_regex="Warning:" comparator="equal" target_value="0" fail_status="VALIDATION_WARNING" fail_message="Detected warning message in stderr log file" />
                    <validation mode="content" sub_dir="" regex="${task.log_stdout}" content_regex="Error:" comparator="equal" target_value="0" fail_status="VALIDATION_ERROR" fail_message="Detected error message in stdout log file" />
                    <validation mode="content" sub_dir="" regex="${task.log_stderr}" content_regex="Error:" comparator="equal" target_value="0" fail_status="VALIDATION_ERROR" fail_message="Detected error message in stderr log file" />
                    <validation mode="content" sub_dir="" regex="${task.log_stdout}" content_regex="\?\?\?" comparator="equal" target_value="0" fail_status="VALIDATION_ERROR" fail_message="Detected matlab error message in stdout log file" />
                    <validation mode="content" sub_dir="" regex="${task.log_stderr}" content_regex="\?\?\?" comparator="equal" target_value="0" fail_status="VALIDATION_ERROR" fail_message="Detected matlab error message in stderr log file" />
                    <validation mode="size" sub_dir="" regex="${task.log_stderr}" comparator="equal" target_value="0" fail_status="VALIDATION_WARNING" fail_message="Detected content in stderr log file" />
                </validations>
                <validations level="module">
                    <validation mode="count" sub_dir="" regex="ComputeShadingCorrectionAverageImageResults" comparator="equal" target_value="1" fail_status="VALIDATION_ERROR" fail_message="Missing output directory" />
                    <validation mode="count" sub_dir="ComputeShadingCorrectionAverageImageResults" regex=".*\.mat" comparator="greater" target_value="0" fail_status="VALIDATION_WARNING" fail_message="Missing output file(s)" />
                </validations>
            </output>
        </module>
        <module name="MergeShadingCorrectionAverageImage" version="2.*.*" class="ch.systemsx.bee.workflowmanager.module.ClusterModule">
            <executable>
                <path>${config.extras_dir}/ShadingCorrectionAverageImage/ShadingCorrectionAverageImage_v${module.version}/MergeShadingCorrectionAverageImage.command</path>
                <args>
                    <!-- First argument is the input directory, resolved from the output-section above -->
                    <arg type="path" value="dataset:ComputeShadingCorrectionAverageImageResults" />
                    <!-- Second argument is the settings file dataset, resolved from the input-section above -->
                    <arg type="path" value="dataset:MergeShadingCorrectionAverageImageSettings" />
                    <!-- Third argument is the output directory -->
                    <arg type="path" value="moduledir" selector="MergeShadingCorrectionAverageImageResults" />
                </args>
            </executable>
            <output>
                <datasets>
                    <dataset type="CLUSTER_JOB_LOGS" store="true">
                        <files in_dir="" regex=".*\.stderr" />
                        <files in_dir="" regex=".*\.stdout" />
                    </dataset>
                </datasets>
            </output>
        </module>
        <module>
        ...
        </module>
        ...
    </modules>
</workflow>

Definitions

<workflow>

<workflow name="..." ... >

Name of the workflow.

<workflow author="..." ... >

Author of the workflow.

<workflow cleanup="..." ... >

Specifies whether the process directories are to be deleted from the scratch space after successful execution of the workflow.
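
For example, to keep the process directories on scratch for later inspection, cleanup can be disabled. A minimal sketch, reusing the name and author from the example above:

<workflow name="bee_workflow" author="bee_author" cleanup="FALSE">
    ...
</workflow>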

<input> <datasets>

This element specifies which datasets from storage are needed by at least one of the modules forming the workflow.

When a process is started by submitting a particular workflow together with one or several input datasets, it is checked that:

  • each input dataset, identified by its storage ID, exists in storage
  • each input dataset in storage has the same type as specified in the <input><datasets><dataset … /> element
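
A minimal sketch of such a declaration, using the first input dataset of the example above:

<input>
    <datasets>
        <dataset name="RawImages" id="0bCDME-BE01" type="HCS_IMAGE_RAW" stage="true" />
    </datasets>
</input>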

<dataset name="..." ... >

Required. It will be used to identify the dataset when it is needed by a module. Therefore, the dataset name must be unique in the whole workflow description.

<dataset id="..." ... >

Required. Corresponds to the ID necessary for the storage to localize this dataset.

<dataset type="..." ... >

Required. Corresponds to the dataset type provided by the storage. It is used for validation purposes.

<dataset stage="..." ... >

Optional, defaults to true.
If true, the dataset will be staged to the processing scratch space.
If false, the dataset content will not be copied to scratch; only an empty directory with the expected name will be created there.

:!: This feature is not implemented. The application behaves as if this value were set to true. Whether the feature is needed is to be discussed, since the directory of a dataset could be specified as metadata or as a variable.

<module name="..." ... >

The module names (“name” attribute) in a workflow description file must be unique.

<executable> <path>

The path to the executable is resolved using the ${module.version} variable. The exact resolution algorithm is yet to be described.
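
For illustration, assume the installed module version resolves to 1.2.3 (a hypothetical value). The path from the example above:

<path>${config.extras_dir}/ShadingCorrectionAverageImage_${module.version}/ComputeShadingCorrectionAvgImg.command</path>

could then resolve to:

<path>/import/bc2/home/resit/mx_nas/stage/bee/extras/ShadingCorrectionAverageImage_1.2.3/ComputeShadingCorrectionAvgImg.command</path>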

<executable> <args> <arg type="..." value="..." selector="..." />

type="PATH"

This argument will be resolved to:

  1. the path to a dataset: to provide the directory of a dataset the value should have the dataset: prefix and then the name of the referenced dataset. E.g. value=“dataset:hcs_plate”
  2. a subdirectory of the module's work directory: the value has to be moduledir. E.g.: value=“moduledir”

selector=".."

It's possible to select a file or directory in the referenced directory by using the selector attribute.
Possible values for the selector=“..” attribute depend on the value of the value=“..” attribute:

value=“dataset:some_input_dataset” selector=“some_regex|some_name”
Refers to an input dataset defined in the <input> section of the workflow.xml, i.e. a dataset coming from storage. The path to this input dataset must exist. That path is searched for a file or directory matching the given regex or name. Attention!! If there is no such file/directory, or if there is more than one, an error is raised.

value=“dataset:some_output_dataset” selector=“some_name”
Refers to an output dataset produced by a previously executed module or by the current module. The module producing this output dataset must define it in the workflow.xml and provide it with a name. The path to this dataset is always /path_to_moduledir/output_dataset_name. The argument is resolved to a subdirectory or file named some_name within that path, i.e. /path_to_moduledir/output_dataset_name/some_name.

value=“moduledir” selector=“some_directory_name”
A directory with that name is created under the work directory of the current module, i.e. /path_to_moduledir/some_directory_name.

Examples

<arg type=“path” value=“dataset:DataRefactoring” />
Resolved to the absolute path of a dataset called DataRefactoring. This dataset can be either:
- an input dataset: it comes from storage, and its path points to the cache on the execution host; or
- an output dataset: its path will be a subdirectory, called DataRefactoring, in the work directory of the module which produced this dataset.

<arg type=“path” value=“dataset:Cpv1BatchFile” selector=“Batch_data.mat” />
Resolved to the absolute path of the file Batch_data.mat. In the case of an input dataset, exactly one such file must exist in the dataset directory.

<arg type=“path” value=“moduledir” selector=“FeatureZScoring” />
Creates a path by appending FeatureZScoring to the path of the work directory of the current module.

<output> <datasets>

This element is used to specify new datasets created by execution of modules.

Those datasets are specified for two purposes:

  • Storage of results: All or some of the files/directories produced by the module need to be sent to the store
  • Input for a later module: All or some of the files/directories produced by the module are necessary input for some module executed after the current one.

An output dataset can serve either one of these purposes, or both.
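
A sketch of an output dataset serving both purposes, based on the first module of the example above: it is sent to storage (store=“true”, with a type) and can also be referenced by later modules via its name:

<output>
    <datasets>
        <dataset name="ShadingCorrectedImages" type="HCS_CHANNEL_SHADINGCORR_IMAGES" store="true">
            <files in_dir="" regex=".*\.png$" />
        </dataset>
    </datasets>
</output>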

<dataset name="..." ... >

Required for output datasets of purpose 2 (input for later modules). The name will be used by modules executed later to identify this output dataset.

<dataset type="..." ... >

Required for output datasets of purpose 1 (to be stored). That's the dataset type specified by the storage provider.

<dataset store="..." ... >

Optional, defaults to false.
Signals whether the dataset needs to be sent to storage. For output datasets of purpose 1 it must therefore be set to true.

<dataset relevant="..." ... >

Optional, defaults to true.
Signals whether the dataset is relevant in case of re-processing of the workflow.
If true, the availability of this dataset will be considered required.
If false, the dataset will not be recreated in case it is not available.
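
For example, the intermediate results dataset of the first module in the example above is neither stored nor recreated on re-processing:

<dataset name="ComputeShadingCorrectionAverageImageResults" store="false" relevant="false" />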

<files in_dir="..." regex="..." />

This element is required only for output datasets of purpose 1, i.e. those that need to be sent to storage. If the output dataset is not to be stored, any <files … /> element(s) will be ignored.

This element selects the files and/or directories to store (see the sketch after this list). All the “files” elements defined in an output dataset will be evaluated. A file or directory is selected if it is:

  • located in the “in_dir” subdirectory of the module's work directory, or directly in the work directory if the “in_dir” attribute is empty; and
  • matching the regular expression specified in the “regex” attribute.
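
For instance, a sketch selecting PNG plots from a subdirectory plus all MAT files from the work directory itself. The directory name plots and the dataset type are hypothetical values:

<dataset name="Plots" type="HCS_ANALYSIS_PLOTS" store="true">
    <!-- "plots" and HCS_ANALYSIS_PLOTS are hypothetical values for illustration -->
    <files in_dir="plots" regex=".*\.png$" />
    <files in_dir="" regex=".*\.mat$" />
</dataset>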

<output> <validations ...>

Validation levels

MODULE: validation will be performed on all files located in the specified sub_dir of the module's work directory and matching the specified regex.

TASK: validation will be performed on the .stdout and/or .stderr files produced by each task executed by the module.
To specify whether the validation applies to the .stdout or the .stderr file, the regex attribute is used as follows: regex=“${task.log_stdout}” or regex=“${task.log_stderr}”.
The sub_dir attribute has no meaning in task-level validations (only in module-level ones).

Validation modes

content: The regex attribute should match exactly one file, and that file should be a text file. The content_regex attribute must be defined. The validation is performed by grepping the content of the file for the specified content_regex pattern; the number of matches is compared against target_value using the given comparator.

size: The regex attribute should match exactly one file. The validation is done by checking the size of this file.

count: The validation is done by counting the number of objects (files and/or directories) whose names match the specified regex.
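
For illustration, one validation per mode, taken from the example workflow above (a module-level count check plus task-level content and size checks):

<validations level="module">
    <validation mode="count" sub_dir="ComputeShadingCorrectionAverageImageResults" regex=".*\.mat" comparator="greater" target_value="0" fail_status="VALIDATION_WARNING" fail_message="Missing output file(s)" />
</validations>
<validations level="task">
    <validation mode="content" sub_dir="" regex="${task.log_stdout}" content_regex="finished successfully" comparator="equal" target_value="1" fail_status="VALIDATION_ERROR" fail_message="Missing success message in stdout log file" />
    <validation mode="size" sub_dir="" regex="${task.log_stderr}" comparator="equal" target_value="0" fail_status="VALIDATION_WARNING" fail_message="Detected content in stderr log file" />
</validations>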

Resolving Workflow Templates

Bee supports the use of variables in its workflow description file. Variables are written with the following syntax: ${variable_type.variable_name}.

Variable types

config

The currently supported variable of this type is:

  • ${config.extras_dir}

The value corresponds to the property extras.dir defined in the system.config file.
When the workflow template is resolved, this variable is substituted with the actual value used, to provide traceability.
Example:
The template snippet:

<path>${config.extras_dir}/ShadingCorrectionAverageImage/ComputeShadingCorrectionAvgImg.command</path>

would be resolved into:

<path>/import/bc2/home/resit/mx_nas/stage/bee/extras/ShadingCorrectionAverageImage/ComputeShadingCorrectionAvgImg.command</path>

task

The currently supported variables of this type are:

  • ${task.log_stdout}
  • ${task.log_stderr}

These variables correspond to the files to which the standard output and standard error streams of a cluster job (or task) are directed.
The extensions of these files are defined to be .stdout and .stderr.
The variables can be used in task-level validations to specify the files to validate.
They will not be resolved in the workflow template, since in the case of modules running in parallel one set of such files is produced per task.
If there is interest in examining or keeping these files, they can be sent to storage by selecting them in the module's work directory via their file extension (.stdout, .stderr).
Example:

<output>
    <datasets>
        <dataset type="CLUSTER_JOB_LOGS" store="true">
            <files in_dir="" regex=".*\.stderr" />
            <files in_dir="" regex=".*\.stdout" />
        </dataset>
    </datasets>
    <validations level="task">
        <validation mode="count" sub_dir="" regex="${task.log_stdout}" comparator="equal" target_value="1" fail_status="VALIDATION_ERROR" fail_message="Missing stdout log file" />
        <validation mode="content" sub_dir="" regex="${task.log_stdout}" content_regex="finished successfully" comparator="equal" target_value="1" fail_status="VALIDATION_ERROR" fail_message="Missing success message in stdout log file" />
    </validations>
</output>

The resolved workflow should be stored together with the stored results (as it is in the current iBrain2).
