Reusable Datasets in Cluster: Cache

Current Implementation

Dataset caching procedure

  1. Request for caching a dataset
  2. Ask CacheSizeChecker
    CacheSizeChecker: Checks whether the remaining free space in the cache is enough to allocate files of at least the specified required size; if so, it returns 0, otherwise it returns the amount of space still needed, in kilobytes.
  3. If there is not enough space to allocate the dataset, ask the EvictionPolicy to select enough unneeded datasets in the cache, if possible. Otherwise, wait until it is possible to remove datasets from the cache.
    EvictionPolicy: returns a list of datasets to evict from the cache. If the cached datasets available for eviction cannot cover the needed free space, an empty list is returned.
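The procedure above could be sketched as follows. The names CacheSizeChecker and EvictionPolicy come from the text, but the method signatures, the Optional-based retry contract, and the DatasetCacher class are assumptions made for illustration:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical interfaces; only the names come from the text above.
interface CacheSizeChecker {
    /** Returns 0 if the free space suffices, otherwise the missing space in KB. */
    long checkSpace(long requiredKb);
}

interface EvictionPolicy {
    /** Returns datasets to evict, or an empty list if the space cannot be covered. */
    List<String> selectForEviction(long spaceNeededKb);
}

public class DatasetCacher {
    private final CacheSizeChecker checker;
    private final EvictionPolicy policy;

    public DatasetCacher(CacheSizeChecker checker, EvictionPolicy policy) {
        this.checker = checker;
        this.policy = policy;
    }

    /**
     * One attempt at steps 2-3 of the procedure: an empty list means there is
     * already enough space, a non-empty list names the datasets to evict first,
     * and Optional.empty() means the caller must wait and retry (step 3's
     * "wait until it is possible to remove datasets from the cache").
     */
    public Optional<List<String>> tryPrepareCaching(long datasetSizeKb) {
        long missingKb = checker.checkSpace(datasetSizeKb);          // step 2
        if (missingKb == 0) {
            return Optional.of(List.of());                           // enough free space
        }
        List<String> victims = policy.selectForEviction(missingKb);  // step 3
        return victims.isEmpty() ? Optional.empty() : Optional.of(victims);
    }
}
```

A caller would loop on tryPrepareCaching, sleeping between attempts for as long as it returns Optional.empty().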

CacheSizeChecker Implementations


It calculates the amount of space (in kilobytes) that must be freed in the cache in order to allocate the specified required space while still keeping the free space at least as large as the system-specified highwater mark.

The amount of available free space in the cache is calculated by taking into account the quota set in the system configuration file (see property storage.clustercache.size.max).
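A minimal sketch of this calculation, assuming the quota, used space, required space, and highwater mark are all expressed in kilobytes; the class and method names are invented, and only the property name storage.clustercache.size.max comes from the text:

```java
// Hypothetical sketch of the highwater-mark calculation described above.
public class QuotaSizeChecker {
    /**
     * Returns the amount of space (KB) that must be freed so that, after
     * allocating requiredKb, the free space stays at or above highwaterKb.
     * A result of 0 means no eviction is needed.
     */
    public static long spaceToFree(long quotaKb, long usedKb,
                                   long requiredKb, long highwaterKb) {
        long freeKb = quotaKb - usedKb;  // quota from storage.clustercache.size.max
        return Math.max(0, requiredKb + highwaterKb - freeKb);
    }
}
```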


Null implementation which assumes that the cache always has enough free space.

EvictionPolicy Implementations


Eviction policy which selects based on:

  • the dataset's frequency of use in the scheduled modules
  • the optimal eviction size

The policy first selects the datasets that will be used by the fewest modules in the future and, among those, the smaller ones. E.g.:

  1. Dataset E: will be used by 0 modules in the future, 1GB
  2. Dataset D: will be used by 0 modules in the future, 100GB
  3. Dataset C: will be used by 1 module in the future
  4. Dataset B: will be used by 4 modules in the future
  5. Dataset A: used by a currently running module: never evict!

This policy will select datasets for eviction in this sort order.

Regarding how many datasets to evict: the policy selects datasets until the optimal eviction size is reached or, if that size cannot be fully reached, at least until the free space needed in the cache is covered. If the needed free space cannot be covered, no datasets are selected for eviction.
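The selection described above could be sketched like this; the class names, fields, and the exact stopping rule when the optimal size cannot be reached are assumptions:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical cache entry; all names and fields are assumptions.
class CachedDataset {
    final String id;
    final long sizeKb;
    final int futureUses;  // how many scheduled modules will use it
    final boolean inUse;   // used by a currently running module: never evict

    CachedDataset(String id, long sizeKb, int futureUses, boolean inUse) {
        this.id = id;
        this.sizeKb = sizeKb;
        this.futureUses = futureUses;
        this.inUse = inUse;
    }
}

public class UsageEvictionPolicy {
    /**
     * Selects datasets in the sort order described above (fewest future uses
     * first, then smallest size) until optimalKb would be freed. Returns an
     * empty list if even evicting every candidate cannot cover neededKb.
     */
    public static List<String> select(List<CachedDataset> cached,
                                      long neededKb, long optimalKb) {
        List<CachedDataset> candidates = new ArrayList<>();
        for (CachedDataset d : cached) {
            if (!d.inUse) candidates.add(d);  // never evict datasets in use
        }
        candidates.sort(Comparator
                .comparingInt((CachedDataset d) -> d.futureUses)
                .thenComparingLong(d -> d.sizeKb));

        List<String> victims = new ArrayList<>();
        long freedKb = 0;
        for (CachedDataset d : candidates) {
            if (freedKb >= optimalKb) break;  // optimal eviction size reached
            victims.add(d.id);
            freedKb += d.sizeKb;
        }
        return freedKb >= neededKb ? victims : List.of();
    }
}
```

With the example datasets above and an optimal eviction size of 1 GB, only dataset E would be selected.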

IBTWO-519 : It raises the issue of how to know whether a cached dataset is being used and, consequently, cannot be removed.

Analysis: Using Ehcache to implement a reusable dataset cache

Ehcache supports two usage patterns, write-through and write-behind, which could be used to implement the dataset cache in the scratch space. See

For that, a CacheWriter class and a CacheWriterFactory need to be implemented. Both have been implemented, together with a JUnit test, as a proof of concept, and stored in the SVN:

The test behaves as expected: once a dataset is added to the cache with its ID as the key, the dataset files are copied to the cluster scratch space.
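As a plain illustration of the write-through behaviour the test exercises (this is not the proof of concept stored in the SVN, and it deliberately avoids the actual Ehcache CacheWriter API; names and paths are invented):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical write-through step: on a cache put, copy the dataset
// file into the scratch space under its dataset ID.
public class WriteThroughSketch {
    /** Copies a dataset file into the scratch space, keyed by dataset ID. */
    public static Path writeThrough(String datasetId, Path datasetFile,
                                    Path scratchDir) throws IOException {
        Files.createDirectories(scratchDir);
        Path target = scratchDir.resolve(datasetId);
        return Files.copy(datasetFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Self-contained demo in a temp directory; true if the copy round-trips. */
    public static boolean demo() {
        try {
            Path tmp = Files.createTempDirectory("cache-poc");
            Path src = Files.writeString(tmp.resolve("data.csv"), "1,2,3");
            Path copied = writeThrough("ds-42", src, tmp.resolve("scratch"));
            return Files.readString(copied).equals("1,2,3");
        } catch (IOException e) {
            return false;
        }
    }
}
```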

Other implementations which would be necessary are:

It does not seem that using this framework would add much value, except for the writing configuration. See:

beewm/devel/cluster_cache.txt · Last modified: 2016/05/17 16:17 (external edit)