Reusable Datasets in Cluster: Cache

Current Implementation

Dataset caching procedure

  1. Request for caching a dataset
  2. Ask CacheSizeChecker
    CacheSizeChecker: Checks whether the remaining free space in the cache is enough to allocate files of at least the specified required size; if so, it returns 0, otherwise it returns the amount of space still needed, in kilobytes.
  3. If there is not enough space to allocate the dataset, ask the EvictionPolicy to select enough unneeded datasets in the cache, if possible. Otherwise, wait until it is possible to remove datasets from the cache.
    EvictionPolicy: returns a list of datasets to evict from the cache. If the cached datasets available for eviction cannot cover the needed free space, an empty list is returned.
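The procedure above could be sketched as follows. The names CacheSizeChecker and EvictionPolicy come from the text, but the method signatures, the Optional-based retry contract, and the DatasetCacher class are assumptions made for illustration:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical interfaces; only the names come from the text above.
interface CacheSizeChecker {
    /** Returns 0 if the free space suffices, otherwise the missing space in KB. */
    long checkSpace(long requiredKb);
}

interface EvictionPolicy {
    /** Returns datasets to evict, or an empty list if the space cannot be covered. */
    List<String> selectForEviction(long spaceNeededKb);
}

public class DatasetCacher {
    private final CacheSizeChecker checker;
    private final EvictionPolicy policy;

    public DatasetCacher(CacheSizeChecker checker, EvictionPolicy policy) {
        this.checker = checker;
        this.policy = policy;
    }

    /**
     * One attempt at steps 2-3 of the procedure: an empty list means there is
     * already enough space, a non-empty list names the datasets to evict first,
     * and Optional.empty() means the caller must wait and retry (step 3's
     * "wait until it is possible to remove datasets from the cache").
     */
    public Optional<List<String>> tryPrepareCaching(long datasetSizeKb) {
        long missingKb = checker.checkSpace(datasetSizeKb);          // step 2
        if (missingKb == 0) {
            return Optional.of(List.of());                           // enough free space
        }
        List<String> victims = policy.selectForEviction(missingKb);  // step 3
        return victims.isEmpty() ? Optional.empty() : Optional.of(victims);
    }
}
```

A caller would loop on tryPrepareCaching, sleeping between attempts for as long as it returns Optional.empty().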

CacheSizeChecker Implementations


It calculates the amount of space (in kilobytes) that must be freed in the cache in order to allocate the specified required space while still keeping the free space at least as large as the system-specified highwater mark.

The amount of available free space in the cache is calculated by taking into account the quota set in the system configuration file (see property storage.clustercache.size.max).
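A minimal sketch of this calculation, assuming the quota, used space, required space, and highwater mark are all expressed in kilobytes; the class and method names are invented, and only the property name storage.clustercache.size.max comes from the text:

```java
// Hypothetical sketch of the highwater-mark calculation described above.
public class QuotaSizeChecker {
    /**
     * Returns the amount of space (KB) that must be freed so that, after
     * allocating requiredKb, the free space stays at or above highwaterKb.
     * A result of 0 means no eviction is needed.
     */
    public static long spaceToFree(long quotaKb, long usedKb,
                                   long requiredKb, long highwaterKb) {
        long freeKb = quotaKb - usedKb;  // quota from storage.clustercache.size.max
        return Math.max(0, requiredKb + highwaterKb - freeKb);
    }
}
```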


Null implementation which assumes that the cache always has enough free space.

EvictionPolicy Implementations


Eviction policy which selects based on:

  • the dataset's frequency of use in the scheduled modules
  • the optimal eviction size

The policy first selects the datasets that will be used by the fewest modules in the future and, among those, the smaller ones. E.g.:

  1. Dataset E: will be used by 0 modules in the future, 1GB
  2. Dataset D: will be used by 0 modules in the future, 100GB
  3. Dataset C: will be used by 1 module in the future
  4. Dataset B: will be used by 4 modules in the future
  5. Dataset A: used by a currently running module: never evict!

This policy will select datasets for eviction in this sort order.

Regarding how many datasets to evict: the policy selects datasets until the optimal eviction size is reached or, if that size cannot be fully reached, at least until the free space needed in the cache is covered. If the needed free space cannot be covered, no datasets are selected for eviction.
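The selection described above could be sketched like this; the class names, fields, and the exact stopping rule when the optimal size cannot be reached are assumptions:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical cache entry; all names and fields are assumptions.
class CachedDataset {
    final String id;
    final long sizeKb;
    final int futureUses;  // how many scheduled modules will use it
    final boolean inUse;   // used by a currently running module: never evict

    CachedDataset(String id, long sizeKb, int futureUses, boolean inUse) {
        this.id = id;
        this.sizeKb = sizeKb;
        this.futureUses = futureUses;
        this.inUse = inUse;
    }
}

public class UsageEvictionPolicy {
    /**
     * Selects datasets in the sort order described above (fewest future uses
     * first, then smallest size) until optimalKb would be freed. Returns an
     * empty list if even evicting every candidate cannot cover neededKb.
     */
    public static List<String> select(List<CachedDataset> cached,
                                      long neededKb, long optimalKb) {
        List<CachedDataset> candidates = new ArrayList<>();
        for (CachedDataset d : cached) {
            if (!d.inUse) candidates.add(d);  // never evict datasets in use
        }
        candidates.sort(Comparator
                .comparingInt((CachedDataset d) -> d.futureUses)
                .thenComparingLong(d -> d.sizeKb));

        List<String> victims = new ArrayList<>();
        long freedKb = 0;
        for (CachedDataset d : candidates) {
            if (freedKb >= optimalKb) break;  // optimal eviction size reached
            victims.add(d.id);
            freedKb += d.sizeKb;
        }
        return freedKb >= neededKb ? victims : List.of();
    }
}
```

With the example datasets above and an optimal eviction size of 1 GB, only dataset E would be selected.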

IBTWO-519 : It raises the issue of how to know whether a cached dataset is being used and, consequently, cannot be removed.

Analysis: Using Ehcache to implement a reusable dataset cache

Ehcache supports two usage patterns, write-through and write-behind, which could be used to implement the dataset cache in the scratch space. See

For that, a CacheWriter class and a CacheWriterFactory need to be implemented. Both have been implemented, together with a JUnit test, as a proof of concept, and stored in the SVN:

The test behaves as expected: once a dataset is added to the cache with its ID as the key, the dataset files are copied to the cluster scratch space.
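As a plain illustration of the write-through behaviour the test exercises (this is not the proof of concept stored in the SVN, and it deliberately avoids the actual Ehcache CacheWriter API; names and paths are invented):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical write-through step: on a cache put, copy the dataset
// file into the scratch space under its dataset ID.
public class WriteThroughSketch {
    /** Copies a dataset file into the scratch space, keyed by dataset ID. */
    public static Path writeThrough(String datasetId, Path datasetFile,
                                    Path scratchDir) throws IOException {
        Files.createDirectories(scratchDir);
        Path target = scratchDir.resolve(datasetId);
        return Files.copy(datasetFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Self-contained demo in a temp directory; true if the copy round-trips. */
    public static boolean demo() {
        try {
            Path tmp = Files.createTempDirectory("cache-poc");
            Path src = Files.writeString(tmp.resolve("data.csv"), "1,2,3");
            Path copied = writeThrough("ds-42", src, tmp.resolve("scratch"));
            return Files.readString(copied).equals("1,2,3");
        } catch (IOException e) {
            return false;
        }
    }
}
```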

Other implementations which would be necessary are:

It does not seem that using this framework would add much value, except for the writing configuration. See:

beewm/devel/cluster_cache.txt · Last modified: 2016/05/17 16:17 (external edit)