Dataset Equivalence Classes

The workflow manager should avoid unnecessary computations on the cluster. To decide whether a dataset needs to be computed, we need a way to compare two datasets. For this purpose, we define dataset equivalence classes.

Is Computation Required?

We do not want to (re-)compute a result if, for at least one dataset D in the storage, all of the following conditions are met:

  1. the dataset type D.Type is the same as the requested type
  2. the dataset version D.Version matches the requested version
  3. the parameters of the module execution are identical
  4. the input datasets of the module execution are the same
  5. conditions 3 and 4 must recursively also be true for all input datasets

We want to define an equivalence test that calls two datasets “equivalent” if and only if all of the above conditions are met. We do not want to (re-)compute a result if an equivalent result already exists.
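The conditions above can be sketched as a recursive comparison. The following Python is only an illustration under stated assumptions, not the workflow manager's actual code: the Dataset record, its field names, and the use of plain equality for the version check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record of how a dataset was produced."""
    type: str
    version: str
    parameters: Tuple[Tuple[str, str], ...] = ()
    inputs: Tuple["Dataset", ...] = ()
    raw_id: Optional[str] = None  # set only for RAW datasets (assumption)


def equivalent(a: Dataset, b: Dataset) -> bool:
    """Check the listed conditions for two datasets, recursing into inputs."""
    if a.raw_id is not None or b.raw_id is not None:
        # A RAW dataset can only be equivalent to itself.
        return a.raw_id == b.raw_id
    return (
        a.type == b.type                      # condition 1
        and a.version == b.version            # condition 2 (assumes "matching" means equal)
        and a.parameters == b.parameters      # condition 3
        and len(a.inputs) == len(b.inputs)    # condition 4 ...
        and all(equivalent(x, y) for x, y in zip(a.inputs, b.inputs))  # ... and 5
    )


def computation_required(requested: Dataset, storage: Iterable[Dataset]) -> bool:
    """Computation is needed only if no stored dataset is equivalent."""
    return not any(equivalent(requested, d) for d in storage)
```

In this sketch the inputs are compared position by position, so it assumes that the order of a module's inputs is significant.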

Note: some parameters might not be relevant for the comparison. Could they, for example, be flagged in the workflow?

Definition of Dataset Annotation

To simplify the equivalence test, we define a dataset annotation. We first define the annotation required for the test, and later show that this annotation is sufficient to perform it.

First, we require two definitions. For a RAW dataset, the annotation T is defined as the storage dataset ID

T=<dataset ID>

For a derived dataset, the annotation T is defined as

<T of input 1>,...,<T of input p>,<module identifier><module version>,<argument 1>,...,<argument q>

This annotation can be created based on the workflow and the input datasets, all of which are known during workflow execution.
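As an illustration, the two definitions could be turned into a pair of small helper functions. This is a minimal sketch: the function names and the example values are hypothetical, and it assumes that no field contains a comma. The module identifier and version are kept adjacent, following the definition as written, since the source does not specify a separator between them.

```python
from typing import Sequence


def raw_annotation(dataset_id: str) -> str:
    """T for a RAW dataset is just its storage dataset ID."""
    return dataset_id


def derived_annotation(
    input_annotations: Sequence[str],
    module_identifier: str,
    module_version: str,
    arguments: Sequence[str],
) -> str:
    """T for a derived dataset: input annotations, module, then arguments."""
    fields = list(input_annotations)
    # Identifier and version are written back to back, as in the definition.
    fields.append(module_identifier + module_version)
    fields.extend(arguments)
    return ",".join(fields)


# Example with made-up identifiers: a RAW dataset, a dataset derived from it,
# and a second dataset derived from the first derived one.
t_raw = raw_annotation("ds-1")
t_aligned = derived_annotation([t_raw], "align", "1.2", ["k=3"])
t_stats = derived_annotation([t_aligned], "stats", "2.0", [])
```

Because each annotation embeds the annotations of its inputs, the annotation of a dataset deep in a workflow carries its complete derivation history.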

Definition of the Equivalence Class

Two datasets are equivalent if their annotations T are identical apart from the version numbers, and the version numbers within T match (see version numbers for a discussion of “matching”).
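A sketch of this test, assuming the annotation is kept as a flat sequence of fields rather than a single string, so that version numbers can be told apart from the other fields. The field layout, the function names, and the default exact_match rule are assumptions; the actual notion of “matching” is defined on the version numbers page and would be passed in as versions_match.

```python
from typing import Callable, Sequence, Tuple, Union

# A field is either a plain string (dataset ID or argument) or a
# (module identifier, module version) pair. This layout is an assumption.
Field = Union[str, Tuple[str, str]]


def exact_match(v1: str, v2: str) -> bool:
    """Placeholder rule: versions match only if they are equal."""
    return v1 == v2


def annotations_equivalent(
    t1: Sequence[Field],
    t2: Sequence[Field],
    versions_match: Callable[[str, str], bool] = exact_match,
) -> bool:
    """Compare two annotations field by field."""
    if len(t1) != len(t2):
        return False
    for f1, f2 in zip(t1, t2):
        if isinstance(f1, tuple) and isinstance(f2, tuple):
            (id1, v1), (id2, v2) = f1, f2
            # Module identifiers must be identical; versions only need to match.
            if id1 != id2 or not versions_match(v1, v2):
                return False
        elif f1 != f2:
            # Dataset IDs and arguments must be identical.
            return False
    return True
```

No recursion is needed here, because the annotations of the input datasets are already contained in T as a prefix.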

Proof that this Annotation is Sufficient

RAW datasets

A raw dataset is equivalent only to itself. This follows directly from the definition, since its annotation T is the dataset ID.

Derived datasets that are based exclusively on RAW datasets

A derived dataset that is based exclusively on RAW datasets should be equivalent to another dataset if all the following conditions are met:

  1. the identifier of the module that produced the dataset is the same
  2. the version number of the module that produced the dataset is matching
  3. the parameters of the module that produced the dataset are the same
  4. the RAW input datasets of the module that produced the dataset are the same

These are exactly the conditions that are checked by comparing the annotation T of two datasets, using the equivalence test above.

Derived datasets that are based on other derived datasets

A derived dataset that is based on other derived datasets should be equivalent to another dataset if all the following conditions are met:

  1. all conditions 1-4 from above
  2. the same conditions must be true for each of the derived input datasets

Condition 2 is recursively checked by prefixing the annotation T with the annotations T' of the derived input datasets.

Implementation details

The information required to build up the equivalence class is stored in JSON format as a dataset property called bee.dataset.equivalence.

For raw data, the information is the dataset ID itself or, when dataset filtering is enabled, the dataset ID together with the concatenated form of the regular expressions from the filter. This information is not stored directly in the storage under the bee.dataset.equivalence property.

For derived datasets, the name and version of the module that generated the dataset are stored, together with the equivalence information of the module's input datasets, recursively.

To find possibly equivalent datasets, we query the storage for the child datasets of one of the workflow input datasets.
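The exact JSON layout is not given here, so the following is only one possible shape, sketched under assumptions. The keys "dataset", "filter", "module", "version" and "inputs", the helper names, and the way the filter expressions are concatenated are all hypothetical; only the property name is taken from the text.

```python
import json
from typing import Iterable, Sequence

PROPERTY_NAME = "bee.dataset.equivalence"  # property name from the text


def raw_equivalence(dataset_id: str, filter_patterns: Sequence[str] = ()) -> dict:
    """Equivalence information for raw data (hypothetical keys)."""
    info = {"dataset": dataset_id}
    if filter_patterns:
        # Assumption: the regular expressions are joined without a separator.
        info["filter"] = "".join(filter_patterns)
    return info


def derived_equivalence(module: str, version: str, input_infos: Iterable[dict]) -> dict:
    """Module name and version, plus the inputs' information, nested recursively."""
    return {"module": module, "version": version, "inputs": list(input_infos)}


def to_property(info: dict) -> str:
    """Serialize with sorted keys so that equal information yields equal text."""
    return json.dumps(info, sort_keys=True)


raw = raw_equivalence("ds-1", ["a.*", "b.*"])
aligned = derived_equivalence("align", "1.2", [raw])
stats = derived_equivalence("stats", "2.0", [aligned])
property_value = to_property(stats)
```

Nesting the input information inside each derived entry mirrors the recursive structure of the annotation T.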
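The lookup could be sketched as follows, with an in-memory dictionary standing in for the storage. The dictionary layout, the "id" and "properties" keys, and the function name are hypothetical and do not reflect the real storage API. For simplicity the comparison uses exact equality of the parsed JSON and ignores version matching.

```python
import json
from typing import Dict, List, Optional

# Hypothetical stand-in for the storage: parent dataset ID -> child datasets.
Children = Dict[str, List[dict]]


def find_equivalent(children_by_parent: Children, input_id: str, wanted: dict) -> Optional[str]:
    """Return the ID of a child of input_id whose stored information equals wanted."""
    for child in children_by_parent.get(input_id, []):
        stored = child.get("properties", {}).get("bee.dataset.equivalence")
        if stored is not None and json.loads(stored) == wanted:
            return child["id"]
    return None
```

Restricting the search to the children of one input dataset keeps the number of candidates small, since an equivalent dataset must have been derived from the same inputs.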

beewm/devel/dataset_equivalence_classes.txt · Last modified: 2014/12/05 16:39 by behullar