To reduce unnecessary processing and save resources and time, we would like to execute only those steps that have not already produced a compatible result (see Dataset Equivalence Classes).
With the expected results declared in the workflow definition, this approach is straightforward to implement: we simply work backwards from the expected results:
checkList = { expected results }
modulesToExecute = { }
while checkList is not empty do
    dataSet = checkList.popHead()
    if dataSet is an input dataset
        continue
    if modulesToExecute contains a module which produces dataSet
        continue
    if there is a compatible dataset in storage
        continue
    module = the module from the workflow which produces dataSet
    modulesToExecute.add( module )
    checkList.pushBack( module.getInputDatasets() )
done
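The backward selection above can be sketched in Java. This is a minimal illustration, not the actual implementation: it assumes each module produces exactly one dataset, so a module can be identified by the name of its output; selectModules and all data structures here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BackwardSelection {

    // producerInputs: hypothetical workflow description, mapping each output
    // dataset name to the input dataset names of the module that produces it.
    static Set<String> selectModules(Map<String, List<String>> producerInputs,
                                     Set<String> inputDatasets,
                                     Set<String> compatibleInStorage,
                                     List<String> expectedResults) {
        Deque<String> checkList = new ArrayDeque<>(expectedResults);
        Set<String> modulesToExecute = new LinkedHashSet<>(); // module == its output dataset
        while (!checkList.isEmpty()) {
            String dataSet = checkList.removeFirst();
            if (inputDatasets.contains(dataSet)) continue;       // provided as workflow input
            if (modulesToExecute.contains(dataSet)) continue;    // producer already scheduled
            if (compatibleInStorage.contains(dataSet)) continue; // equivalent result cached
            modulesToExecute.add(dataSet);                       // schedule the producer
            checkList.addAll(producerInputs.get(dataSet));       // and check its inputs too
        }
        return modulesToExecute;
    }

    public static void main(String[] args) {
        // Workflow a -> b -> c, plus an unused branch a -> x.
        Map<String, List<String>> producers = Map.of(
                "b", List.of("a"),
                "c", List.of("b"),
                "x", List.of("a"));
        Set<String> result = selectModules(producers,
                Set.of("a"),    // input datasets
                Set.of("b"),    // "b" already has a compatible copy in storage
                List.of("c"));  // expected result
        System.out.println(result); // only the producer of "c" must run
    }
}
```

Note how the unused branch producing "x" is never visited, which is exactly the behaviour described in the note below.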
Note: This algorithm selects only the modules that are needed to compute the requested results. So if the workflow description contains branches/modules that produce datasets neither listed in the results section nor needed by another module, those modules will never be executed.
In the top-down approach, by contrast, we start from the modules that depend only on the input datasets, check whether their results must be recomputed, and then continue with the modules that consume these datasets.
dataSetsReady = { input dataSets }
modules = { all the modules from the workflow }
modulesToExecute = { }
while modules is not empty do
    module = a module from modules which depends only on datasets from dataSetsReady
    modules.remove( module )
    dataSetsReady.add( module.getOutputs() )
    if there is no compatible dataset in the storage for the outputs of module
        modulesToExecute.add( module )
done
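The top-down selection can be sketched the same way. Again this is an illustrative, simplified model: each module is assumed to produce a single dataset named after the module itself, and all names here are hypothetical, not the WorkflowStarter API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TopDownSelection {

    // moduleInputs: hypothetical workflow description, mapping each module name
    // to its input dataset names; a module's output dataset shares its name.
    static Set<String> selectModules(Map<String, List<String>> moduleInputs,
                                     Set<String> inputDatasets,
                                     Set<String> compatibleInStorage) {
        Set<String> dataSetsReady = new LinkedHashSet<>(inputDatasets);
        List<String> modules = new ArrayList<>(moduleInputs.keySet());
        Set<String> modulesToExecute = new LinkedHashSet<>();
        while (!modules.isEmpty()) {
            // Pick a module whose inputs are all ready (fails on a cyclic workflow).
            String module = modules.stream()
                    .filter(m -> dataSetsReady.containsAll(moduleInputs.get(m)))
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException("cycle in workflow"));
            modules.remove(module);
            dataSetsReady.add(module);                   // its output is now available
            if (!compatibleInStorage.contains(module)) { // no equivalent cached result
                modulesToExecute.add(module);
            }
        }
        return modulesToExecute;
    }

    public static void main(String[] args) {
        // Workflow a -> b -> c, plus a branch a -> x.
        Map<String, List<String>> modules = Map.of(
                "b", List.of("a"),
                "c", List.of("b"),
                "x", List.of("a"));
        Set<String> result = selectModules(modules, Set.of("a"), Set.of("b"));
        System.out.println(result); // "c" and "x" must run; "b" is cached
    }
}
```

Unlike the backward algorithm, this variant visits every module of the workflow, so the unused branch producing "x" is also scheduled.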
The top-down approach is the one implemented. See the following two methods in the WorkflowStarter class:
setOutputEquivalences(Set<Module> modules, WorkflowConfig workflowConfig, DatasetEquivalenceChecker datasetEquivalenceChecker)
setParentsComplete(List<Module> moduleGraph)