Creating and running the first workflow

The description of a processing, a “workflow”, is defined in an XML file. Instead of providing a full reference, this section shows the basics through an example.

The example processing

Let's take a fairly simple processing. We have a dataset, a bunch of files, and we would like to run some long-running analysis on them; when this is finished, we would like to post-process the results and extract the summary information we are interested in. So we have two executables (a short sketch of chaining them by hand follows the list):

  • analysis INPUT_DIR OUTPUT_DIR PARAMETERS: analysis takes a directory with the input files and produces another directory with the results; it also has some parameters which tune the analysis procedure.
  • postprocess INPUT_DIR OUTPUT_DIR: postprocess takes an input directory which contains the results generated by analysis and produces its result files in the directory specified in the second argument (for the sake of simplicity, this OUTPUT_DIR will in the end contain a summary.txt and a summary.jpg file).
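
To make the roles of the two tools concrete, here is a minimal Python sketch of running them by hand in the order the workflow will automate. The directory names are placeholders chosen for this illustration; only the command-line contract (analysis INPUT_DIR OUTPUT_DIR PARAMETERS, postprocess INPUT_DIR OUTPUT_DIR) comes from the description above.

# Minimal sketch of chaining the two tools by hand; the directory names are
# placeholders for this illustration, not something the workflow manager requires.
import subprocess

input_dir = "/data/inputData"            # directory containing the input files
analysis_dir = "/work/analysisResults"   # output of the analysis step
final_dir = "/work/finalResults"         # will contain summary.txt and summary.jpg

# analysis INPUT_DIR OUTPUT_DIR PARAMETERS
subprocess.check_call(["/PATH_TO_THE_TOOL/analysis", input_dir, analysis_dir, "--fast"])

# postprocess INPUT_DIR OUTPUT_DIR
subprocess.check_call(["/PATH_TO_THE_TOOL/postprocess", analysis_dir, final_dir])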

Define the workflow XML

<workflow name="my_analysis" author="Joe" email="joe@emailprovider.com">
  <input>
    <datasets>
      <dataset name="inputData" id="${api.datasetId}" type="MY_DATASETS" stage="true" />
    </datasets>
  </input>
  <modules>
    <!-- Module 1: Running the analysis -->
    <module name="analysis" version="2.1.0">
      <executable>
        <path>/PATH_TO_THE_TOOL/analysis</path>
        <args>
          <arg type="path" value="dataset:inputData"/>
          <arg type="path" value="dataset:analysisResults"/>
          <arg type="string" value="${api.analysisParameters}"/>
        </args>
      </executable>
      <output>
        <datasets>
          <dataset name="analysisResults" store="false" relevant="true" />
        </datasets>
      </output>
    </module>
    <!-- Module 2: Running the postprocessing -->
    <module name="postprocessing" version="0.1.0">
      <executable>
        <path>/PATH_TO_THE_TOOL/postprocess</path>
        <args>
          <arg type="path" value="dataset:analysisResults"/>
          <arg type="path" value="dataset:finalResults"/>
        </args>
      </executable>
      <output>
        <datasets>
          <dataset name="finalResults" type="MY_RESULTS" store="true" relevant="true">
            <files in_dir="" regex="summary\.txt" />
            <files in_dir="" regex="summary\.jpg" />
            <files in_dir="" regex=".*\.stderr" />
            <files in_dir="" regex=".*\.stdout" />
          </dataset>
        </datasets>
        <validations level="module">
          <validation mode="count" sub_dir="finalResults" regex="summary\.txt" comparator="equal" target_value="1"
            fail_status="validation_error" fail_message="Missing summary txt file" />
          <validation mode="count" sub_dir="finalResults" regex="summary\.jpg" comparator="equal" target_value="1"
            fail_status="validation_error" fail_message="Missing summary jpg file" />
        </validations>
      </output>
    </module>
  </modules>
</workflow>

First we define the header of the workflow with an arbitrary name, the name of the author, and an optional email address. If this email is provided, a notification will be sent to this address when the processing has finished.

Then comes the input dataset definition, which describes the input data we would like to analyse. In our case this means only one input dataset, which will contain the input files of the analysis tool. We give the dataset the name inputData, and for the id we provide a variable instead of a concrete id; this way we can choose the dataset when we submit the workflow and analyse many datasets with the same workflow. We also have to provide the type of the dataset, which is used by the storage.

After specifying the workflow input we describe the processing steps, which are called modules in bee. We would like to run two tools one after the other, therefore we will have two module definitions.

The first module defines the execution of the analysis tool. After providing a name and version number we define the executable and its parameters: we give the full path of the executable, which has to be accessible on the computing resource, followed by its arguments. According to the example definition we have three arguments: the first is the path of the input data, the second is where the results will be placed, and the third holds the analysis parameters (here we use the variable notation again, so that this processing can be run with different parameters). In the output section we define the datasets which are generated by the module and are important for us. In this case this is analysisResults; since this dataset is referenced among the tool arguments, the workflow manager will create a directory with this name in the work directory and pass its path to the tool, so the tool can write its results into it.

The second module runs the postprocess tool. The definition of the executable is similar to the previous case, and then we define the output dataset of the module. These outputs, the summary.txt and summary.jpg files together with the stdout and stderr logs of the tool execution, are the important things for us, which is why we set the store flag to true. This means that the workflow manager will copy these files back to the storage. We also show a small example of the validation feature, which can perform different validations on the results and filter out erroneous executions. In this case we only check whether the two summary files were really produced by the tool.

Workflow submission

So our analysis workflow is ready and a huge set of data is waiting in the storage, so let's start the processing.

Currently there are two ways to submit a workflow through the RESTful interface of the workflow manager:

  1. We can register the workflow XML in the storage and then provide the dataset id of the registered workflow file together with the variable assignment.
  2. We can submit the content of the XML file together with the variable assignment.

As the dataset registration is storage specific, we show an example of the second option. There are many ways to make RESTful requests: command line tools like curl or graphical tools like the Poster Firefox plugin. Here we use a Python script which is provided with the source of the workflow manager; it can be found under the bee-workflowmanager/src/test/python directory and is called submit_xml.py.

You can submit the workflow on the dataset with id 123 and with the --fast analysis parameter with the following command:

./submit_xml.py localhost:9999 workflow.xml "{'api.datasetId' : '123', 'api.analysisParameters' :'--fast'}"

where the workflow manager is running at the localhost:9999 address and the XML definition of the workflow is stored in the workflow.xml file.

After the successful invocation of this command, a process id is returned by the script. This can later be used as a reference to this processing.
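
If you prefer to script the submission yourself rather than use submit_xml.py, the same kind of request can be sent with a few lines of Python. Note that this page does not document the exact REST endpoint or payload format, so the URL path and field names in this sketch are assumptions for illustration; look at submit_xml.py to see the request it really sends.

# Sketch of a direct submission to the workflow manager's RESTful interface.
# The endpoint path and payload field names below are assumptions made for this
# illustration only; check submit_xml.py for the request it actually builds.
import requests

with open("workflow.xml") as f:
    workflow_xml = f.read()

variables = "{'api.datasetId' : '123', 'api.analysisParameters' : '--fast'}"

response = requests.post(
    "http://localhost:9999/submit",                      # hypothetical endpoint path
    data={"xml": workflow_xml, "variables": variables},  # hypothetical field names
)
print(response.text)  # on success the returned process id references this processing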
