Dataset Equivalence Classes

The workflow manager should avoid unnecessary computations on the cluster. To decide whether the computation of a dataset is required, we need a way to compare two datasets. For this purpose, we define dataset equivalence classes.

Is Computation Required?

We do not want to (re-)compute a result if, for at least one dataset D in the storage, all of the following conditions are met:

1. the dataset type D.Type is the same as the requested type
2. the dataset version D.Version matches the requested version
3. the parameters of the module execution are identical
4. the input datasets of the module execution are the same
5. conditions 3 and 4 must recursively also be true for all input datasets

We want to define an equivalence test that calls two datasets “equivalent” if and only if all of the above conditions are met. We do not want to (re-)compute a result if an equivalent result already exists.

Note: some parameters might not be relevant for the comparison. They could, for example, be flagged as such in the workflow?

Definition of Dataset Annotation:

In order to simplify the test for equivalence, we can define a dataset annotation that aids in testing for equivalence. We first define the required annotation for the equivalence test, and later show that this annotation is sufficient to perform the test.

First, we require two definitions. For a RAW dataset, the annotation T is defined as the storage dataset ID:

T=<dataset ID>

For a derived dataset, the annotation T is defined as

<T of input 1>,...,<T of input p>,<module identifier><module version>,<argument 1>,...,<argument q>

This annotation can be created based on the workflow and the input datasets, all of which are known during workflow execution.
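The two definitions can be sketched in a few lines of Python. This is only an illustration: the function names, the dataset IDs and the module names are hypothetical, and the sketch follows the template literally, writing the module identifier and the module version back to back and joining everything else with commas.

```python
from typing import Sequence


def raw_annotation(dataset_id: str) -> str:
    """Annotation T of a RAW dataset: the storage dataset ID."""
    return dataset_id


def derived_annotation(
    input_annotations: Sequence[str],
    module_id: str,
    module_version: str,
    arguments: Sequence[str],
) -> str:
    """Annotation T of a derived dataset, following the template above.

    The annotations of the inputs come first, then the module identifier
    and version (written back to back, as in the template), then the
    arguments of the module execution.
    """
    parts = list(input_annotations)
    parts.append(f"{module_id}{module_version}")
    parts.extend(arguments)
    return ",".join(parts)


# Hypothetical example: two raw datasets feed a "segment" module,
# whose result feeds a "classify" module.
t_segment = derived_annotation(
    [raw_annotation("ds-1"), raw_annotation("ds-2")],
    "segment", "1.0", ["--sigma=2"],
)
t_classify = derived_annotation([t_segment], "classify", "2.1", ["--k=3"])
# t_segment  == "ds-1,ds-2,segment1.0,--sigma=2"
# t_classify == "ds-1,ds-2,segment1.0,--sigma=2,classify2.1,--k=3"
```

Because the annotation of every input is embedded in the annotation of the result, the full history of a dataset is contained in its own T.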

Definition of the Equivalence Class

Two datasets are equivalent if their annotations T are identical and the version numbers within T match (see version numbers for a discussion of “matching”).
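A minimal sketch of such a test is given below. It assumes that the annotation is kept in a structured form (the `Raw` and `Derived` classes are hypothetical) so that the version numbers can be compared separately from the rest of T. The rule for “matching” is passed in as a function; `exact_match` and `same_major` are illustrative assumptions only, not the rule defined on the version numbers page.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union


@dataclass(frozen=True)
class Raw:
    """Annotation of a RAW dataset (hypothetical structured form)."""
    dataset_id: str


@dataclass(frozen=True)
class Derived:
    """Annotation of a derived dataset (hypothetical structured form)."""
    inputs: Tuple["Annotation", ...]
    module_id: str
    module_version: str
    arguments: Tuple[str, ...]


Annotation = Union[Raw, Derived]


def exact_match(v1: str, v2: str) -> bool:
    """Assumed rule: versions match only if they are identical."""
    return v1 == v2


def same_major(v1: str, v2: str) -> bool:
    """Assumed alternative rule: versions match if the major part agrees."""
    return v1.split(".")[0] == v2.split(".")[0]


def equivalent(
    a: Annotation,
    b: Annotation,
    versions_match: Callable[[str, str], bool] = exact_match,
) -> bool:
    """Compare two annotations, applying versions_match to version numbers."""
    if isinstance(a, Raw) or isinstance(b, Raw):
        # A raw dataset is equivalent only to itself.
        return isinstance(a, Raw) and isinstance(b, Raw) and a.dataset_id == b.dataset_id
    return (
        a.module_id == b.module_id
        and versions_match(a.module_version, b.module_version)
        and a.arguments == b.arguments
        and len(a.inputs) == len(b.inputs)
        and all(equivalent(x, y, versions_match) for x, y in zip(a.inputs, b.inputs))
    )


old = Derived((Raw("ds-1"),), "segment", "1.0", ("--sigma=2",))
new = Derived((Raw("ds-1"),), "segment", "1.3", ("--sigma=2",))
```

With these hypothetical values, `equivalent(old, new)` is false under the exact rule, while `equivalent(old, new, same_major)` is true, which shows why the meaning of “matching” has to be fixed before the equivalence classes are well defined.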

Proof that this Annotation is Sufficient

RAW datasets

A raw dataset is equivalent only to itself. This follows directly from the definition, since its annotation T is its own storage dataset ID.

Derived datasets that are based exclusively on RAW datasets

A derived dataset that is based exclusively on RAW datasets should be equivalent to another dataset if all of the following conditions are met:

1. the identifier of the module that produced the dataset is the same
2. the version number of the module that produced the dataset is matching
3. the parameters of the module that produced the dataset are the same
4. the RAW input datasets of the module that produced the dataset are the same

These are exactly the conditions that are checked by comparing the annotation T of two datasets using the above equivalence test.

Derived datasets that are based on other derived datasets

A derived dataset that is based on other derived datasets should be equivalent to another dataset if all of the following conditions are met:

1. all conditions 1-4 from above
2. the same conditions must be true for each of the derived input datasets

Condition 2 is recursively checked by prefixing the annotation T with the annotations T' of the derived input datasets.

Implementation details

The information required to build up the equivalence class is stored in JSON format as a dataset property called bee.dataset.equivalence.

For raw data, the information is the dataset id itself or, when dataset filtering is enabled, the dataset id and the concatenated form of the regular expressions from the filter. This information is not stored directly in the storage under the bee.dataset.equivalence property.

For derived datasets, the name and the version of the module that generated the dataset are stored, together with the equivalence information of the module's input datasets, recursively.
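As an illustration, the nested information could be serialized as shown below. Only the property name bee.dataset.equivalence is taken from the text; the JSON field names, the helper functions and the plain concatenation of the filter expressions are assumptions made for this sketch, and the `arguments` field is included because the annotation definition requires the arguments.

```python
import json
from typing import Any, Dict, Sequence

PROPERTY_KEY = "bee.dataset.equivalence"  # property name taken from the text


def raw_info(dataset_id: str, filter_patterns: Sequence[str] = ()) -> Dict[str, Any]:
    """Equivalence information of a raw dataset (field names are assumed)."""
    info: Dict[str, Any] = {"datasetId": dataset_id}
    if filter_patterns:
        # Assumption: "concatenated form" means joining without a separator.
        info["filter"] = "".join(filter_patterns)
    return info


def derived_info(
    module: str,
    version: str,
    arguments: Sequence[str],
    inputs: Sequence[Dict[str, Any]],
) -> Dict[str, Any]:
    """Equivalence information of a derived dataset, nesting its inputs."""
    return {
        "module": module,
        "version": version,
        "arguments": list(arguments),
        "inputs": list(inputs),
    }


def to_property(info: Dict[str, Any]) -> Dict[str, str]:
    """Serialize the information as the value of the dataset property."""
    # sort_keys gives a stable text form for the same information.
    return {PROPERTY_KEY: json.dumps(info, sort_keys=True)}


prop = to_property(
    derived_info(
        "segment", "1.0", ["--sigma=2"],
        [raw_info("ds-1", ["^plate1", r"\.tif$"])],
    )
)
```

Decoding `prop["bee.dataset.equivalence"]` with `json.loads` yields the nested structure again, so the information of the inputs can be read back without further storage queries.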

To find possibly equivalent datasets, we query the storage for the child datasets of one of the workflow input datasets.
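The lookup can be sketched as follows. The storage is replaced here by two plain dictionaries, which is an assumption made to keep the example self-contained; the function name and the dataset IDs are hypothetical, and version “matching” is simplified to an exact comparison of the annotation strings.

```python
from typing import Mapping, Optional, Sequence


def find_equivalent(
    requested_annotation: str,
    input_ids: Sequence[str],
    children_of: Mapping[str, Sequence[str]],
    annotation_of: Mapping[str, str],
) -> Optional[str]:
    """Return the ID of a stored dataset equivalent to the request, if any.

    children_of stands in for the storage query that lists the child
    datasets of a dataset; annotation_of stands in for reading the
    equivalence information of a stored dataset.
    """
    if not input_ids:
        return None
    # Only the children of one workflow input are inspected as candidates.
    for candidate in children_of.get(input_ids[0], ()):
        if annotation_of.get(candidate) == requested_annotation:
            return candidate
    return None


# Hypothetical storage content.
children_of = {"ds-1": ["d-10", "d-11"]}
annotation_of = {
    "d-10": "ds-1,segment1.0,--sigma=2",
    "d-11": "ds-1,segment1.0,--sigma=3",
}

hit = find_equivalent("ds-1,segment1.0,--sigma=3", ["ds-1"], children_of, annotation_of)
miss = find_equivalent("ds-1,segment1.0,--sigma=4", ["ds-1"], children_of, annotation_of)
# hit is "d-11", so no computation is required; miss is None, so the
# module has to be executed.
```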
