====== Dataset Equivalence Classes ======
The workflow manager should avoid unnecessary computations on the
cluster. To decide whether computation of a dataset is required,
a way to compare two datasets is needed. To achieve such a
comparison, we define dataset equivalence classes.
===== Is Computation Required? =====
We do not want to (re-)compute a result if, for at least one
dataset D in the storage, all of the following conditions are met:
- the dataset type D.Type is the same as the requested type
- the dataset version D.Version matches the requested version
- the parameters of the module execution are identical
- the input datasets of the module execution are the same
- conditions 3 and 4 must recursively also be true for all input datasets
We want to define an equivalence test that calls two datasets
"equivalent" if and only if all of the above conditions are met.
We do not want to (re-)compute a result if an equivalent result
already exists.
Note: some parameters might not be relevant for the comparison.
They could, for example, be flagged as such in the workflow?
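The decision logic above can be sketched in Python. This is a minimal sketch; the ''Dataset'' fields and the ''versions_match'' helper are illustrative assumptions, not the workflow manager's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    # Hypothetical fields mirroring conditions 1-4 above.
    type: str
    version: str
    module_params: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)  # inputs of the producing module

def versions_match(a, b):
    # Stand-in for the actual "matching" rule for version numbers.
    return a == b

def equivalent(d, requested):
    """True iff conditions 1-4 hold, recursively for all input datasets."""
    return (d.type == requested.type
            and versions_match(d.version, requested.version)
            and d.module_params == requested.module_params
            and len(d.inputs) == len(requested.inputs)
            and all(equivalent(a, b)
                    for a, b in zip(d.inputs, requested.inputs)))

def computation_required(storage, requested):
    # (Re-)compute only if no stored dataset is equivalent to the request.
    return not any(equivalent(d, requested) for d in storage)
```

A direct recursive comparison like this works, but it must walk the full input tree for every stored candidate; the annotation defined below avoids that.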
===== Definition of Dataset Annotation =====
In order to simplify the test for equivalence, we can define a
dataset annotation that aids in testing for equivalence. We first
define the required annotation for the equivalence test, and later
show that this annotation is sufficient to perform the test.
First, we require two definitions.

For a RAW dataset, the annotation T is defined as the storage dataset ID:

T = <DatasetID>

For a derived dataset, the annotation T is defined as the annotations of its input datasets, followed by the identifier, version, and parameters of the module that produced it:

T = <T_1>, ..., <T_n>, <ModuleID>, <ModuleVersion>, <Param_1>, ..., <Param_m>
This annotation can be created based on the workflow and the input
datasets, all of which are known during workflow execution.
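Building the annotation could be sketched as follows. The dictionary keys are hypothetical; only the structure of T follows the definition above:

```python
def annotation(ds):
    """Build the annotation T recursively.

    For a RAW dataset, T is the storage dataset ID. For a derived
    dataset, T is the input annotations followed by module ID,
    module version, and parameters. Field names are illustrative.
    """
    if ds["raw"]:
        return (ds["dataset_id"],)
    inputs = tuple(annotation(i) for i in ds["inputs"])
    params = tuple(sorted(ds["params"].items()))
    return inputs + (ds["module_id"], ds["module_version"], params)
```

Sorting the parameters makes the annotation independent of the order in which parameters were given.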
===== Definition of the Equivalence Class =====
Two datasets are equivalent, if the annotation T is identical, and if
the version numbers within T are matching (see version numbers for a
discussion of "matching").
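A structure-aware comparison of two annotations might look like this. It assumes a derived annotation of the form (input annotations ..., module ID, module version, parameters) and a RAW annotation consisting of just the dataset ID; ''versions_match'' is a stand-in for the actual version matching rule:

```python
def versions_match(a, b):
    # Stand-in for the "matching" rule discussed under version numbers;
    # strict equality is used here for illustration.
    return a == b

def annotations_equivalent(t1, t2):
    """Compare two annotations: identical components, with version
    numbers compared via versions_match instead of equality."""
    if len(t1) != len(t2):
        return False
    if len(t1) == 1:  # RAW annotation: just the dataset ID
        return t1 == t2
    *in1, mod1, ver1, par1 = t1
    *in2, mod2, ver2, par2 = t2
    return (mod1 == mod2
            and versions_match(ver1, ver2)
            and par1 == par2
            and all(annotations_equivalent(a, b) for a, b in zip(in1, in2)))
```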
===== Proof that this Annotation is Sufficient =====
==== RAW datasets ====
A raw dataset is equivalent only to itself. This is obvious from the definition.
==== Derived datasets that are based exclusively on RAW datasets ====
A derived dataset that is based exclusively on RAW datasets should be
equivalent to another dataset if all of the following conditions are met:
- the identifier of the module that produced the dataset is the same
- the version number of the module that produced the dataset is matching
- the parameters of the module that produced the dataset are the same
- the RAW input datasets of the module that produced the dataset are the same
These are exactly the conditions that are checked by comparing the
annotation T of the two datasets using the above equivalence test.
==== Derived datasets that are based on other derived datasets ====
A derived dataset that is based on other derived datasets should be
equivalent to another dataset if all of the following conditions are met:
- all conditions 1-4 from above
- the same conditions must be true for each of the derived input datasets
Condition 2 is recursively checked by prefixing the annotation T with
the annotations T' of the derived input datasets.
===== Implementation Details =====
The information required to build up the equivalence class is stored in JSON format as a dataset property called bee.dataset.equivalence.
For raw data, the information is the dataset ID itself, or, when dataset filtering is enabled, the dataset ID together with the concatenated regular expressions from the filter. For raw data, this information is not stored directly in the storage under the bee.dataset.equivalence property.
For derived datasets, the name and version of the module that generated the dataset are stored together with the equivalence information of the module's input datasets, recursively.
To find possible equivalent datasets, we query the storage for the child datasets of one of the workflow input datasets.
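As an illustration, building the JSON value stored under bee.dataset.equivalence could be sketched as follows. All key names inside the JSON, and the use of "|" to concatenate the filter regular expressions, are assumptions; only the property name and the recursive structure come from the description above:

```python
import json

def equivalence_info(ds):
    """Build the value stored under bee.dataset.equivalence.

    Raw data: the dataset ID, plus the concatenated filter regexes
    when filtering is enabled. Derived data: module name and version
    plus the equivalence info of each input, recursively.
    """
    if ds.get("raw"):
        info = {"dataset_id": ds["dataset_id"]}
        if ds.get("filter_regexes"):
            info["filter"] = "|".join(ds["filter_regexes"])  # assumed concatenation
        return info
    return {
        "module": ds["module_name"],
        "version": ds["module_version"],
        "inputs": [equivalence_info(i) for i in ds["inputs"]],
    }

raw = {"raw": True, "dataset_id": "DS1"}
derived = {"raw": False, "module_name": "calibrate",
           "module_version": "1.0", "inputs": [raw]}
print(json.dumps(equivalence_info(derived)))
```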