====== Reusable Datasets in Cluster: Cache ======

===== Current Implementation =====

==== Dataset caching procedure ====

  - Request for caching a dataset
  - Ask ''CacheSizeChecker'' \\ ''CacheSizeChecker'': Checks whether the free space remaining in the cache is enough to allocate files of at least the specified required size. If it is, it returns 0; otherwise it returns the __amount of required space in kilobytes__.
  - If there is not enough space to allocate the required dataset, ask ''EvictionPolicy'' to select enough cached datasets that are no longer needed, if possible. Otherwise, wait until it is possible to remove datasets from the cache. \\ ''EvictionPolicy'': returns a __list of datasets to evict__ from the cache. If the cached datasets available for eviction do not cover the needed free space, an __empty list__ is returned.

==== CacheSizeChecker Implementations ====

=== HighwatermarkCacheSizeChecker ===

It calculates the amount of space (in kilobytes) that must be freed in the cache in order to allocate the specified required space and still keep the free space at least as large as the system-specified [[:beewm:devel:system_configuration|highwater mark]]. The amount of free space available in the cache is calculated by taking into account the quota set in the system configuration file (see property [[:beewm:devel:system_configuration|storage.clustercache.size.max]]).

=== NullCacheSizeChecker ===

Null implementation which assumes that the cache always has enough free space.

==== EvictionPolicy Implementations ====

=== ModuleStateEvictionPolicy ===

Eviction policy which selects based on:
  * the dataset's frequency of use in the scheduled modules
  * the optimal eviction size

The policy first selects the datasets that will be used by the fewest modules in the future and, among those, the ones with the smaller size.
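The ordering described above can be sketched as a comparator. This is only an illustration, not the actual ''ModuleStateEvictionPolicy'' code: the class ''CachedDataset'', its fields and ''EvictionOrderSketch'' are hypothetical names introduced here, and the sketch covers only the sort order, not the number of datasets selected.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical value type; the real dataset class is not shown on this page.
final class CachedDataset {
    final String id;
    final int futureModuleUses; // scheduled modules that will still use the dataset
    final long sizeInKb;

    CachedDataset(String id, int futureModuleUses, long sizeInKb) {
        this.id = id;
        this.futureModuleUses = futureModuleUses;
        this.sizeInKb = sizeInKb;
    }
}

// Hypothetical helper illustrating the sort order only.
final class EvictionOrderSketch {
    // Least future use first; among equally used datasets, smaller ones first.
    static final Comparator<CachedDataset> EVICTION_ORDER =
            Comparator.comparingInt((CachedDataset d) -> d.futureModuleUses)
                      .thenComparingLong(d -> d.sizeInKb);

    // Returns a sorted copy and leaves the input list untouched.
    static List<CachedDataset> inEvictionOrder(List<CachedDataset> candidates) {
        List<CachedDataset> sorted = new ArrayList<>(candidates);
        sorted.sort(EVICTION_ORDER);
        return sorted;
    }
}
```

Datasets used by a currently running module are assumed to be filtered out before sorting, since they must never be evicted.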
E.g.:
  - Dataset E: will be used by 0 modules in the future, 1GB
  - Dataset D: will be used by 0 modules in the future, 100GB
  - Dataset C: will be used by 1 module in the future
  - Dataset B: will be used by 4 modules in the future
  - Dataset A: used by a currently running module: never evict!

This policy will select datasets for eviction in this sort order. As for the number of datasets to evict, it selects datasets until the optimal eviction size is reached or, if that size cannot be reached completely, at least until the free space needed in the cache is covered. If the needed free space cannot be covered, no datasets are selected for eviction.

===== Related JIRA issues =====

[[https://jira.biozentrum.unibas.ch/browse/IBTWO-519|IBTWO-519]]: raises the question of how to know whether a cached dataset is in use and, consequently, cannot be removed.

===== Analysis: Using Ehcache to implement a reusable dataset cache =====

Ehcache supports two usage patterns, ''write-through'' and ''write-behind'', which could be used to implement the dataset cache in the scratch space. See [[http://ehcache.org/documentation/apis/write-through-caching]].

For that, a ''CacheWriter'' and a ''CacheWriterFactory'' need to be implemented. Both classes have been implemented as a proof of concept, together with a JUnit test, and are stored in SVN:
  * [[http://svn.code.sf.net/p/screeningbee/code/trunk/interfaces/src/main/java/ch/systemsx/bee/interfaces/ehcache/|main]]
  * [[http://svn.code.sf.net/p/screeningbee/code/trunk/interfaces/src/test/java/ch/systemsx/bee/interfaces/ehcache/|test]]

The test behaves as expected: once a dataset is added to the cache with its ID as key, the dataset files are copied to the cluster scratch space.

Other implementations which would be necessary are:
  * an eviction policy. See [[http://ehcache.org/documentation/apis/cache-eviction-algorithms#plugging-in-your-own-eviction-algorithm]].
  * identify when the cache is full.
It does not seem possible to define the cache as full based on the size on disk of the copied datasets. This would require an event listener that is notified every time a new dataset is put in the cache and then checks the remaining disk space. See [[http://ehcache.org/documentation/apis/cache-event-listeners]].

It does not seem that we would gain much added value from using this framework, apart from the configuration of cache writing. See: [[http://ehcache.org/documentation/apis/write-through-caching#configuration]].
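The disk-space check mentioned above, which an event listener would run each time a dataset is put in the cache, could look roughly like the following. This is a sketch kept independent of Ehcache so that it stays self-contained: ''ScratchSpaceMonitor'', ''forDirectory'' and the ''minFreeBytes'' threshold are hypothetical names, and the wiring into an Ehcache listener (presumably from its put notification) is left out.

```java
import java.io.File;
import java.util.function.LongSupplier;

// Hypothetical helper; not part of the proof of concept stored in SVN.
final class ScratchSpaceMonitor {
    private final LongSupplier usableBytes; // source of the current free space
    private final long minFreeBytes;        // assumed threshold below which the cache counts as full

    ScratchSpaceMonitor(LongSupplier usableBytes, long minFreeBytes) {
        this.usableBytes = usableBytes;
        this.minFreeBytes = minFreeBytes;
    }

    // Reads the free space of the file system that holds the scratch directory.
    static ScratchSpaceMonitor forDirectory(File scratchDir, long minFreeBytes) {
        return new ScratchSpaceMonitor(scratchDir::getUsableSpace, minFreeBytes);
    }

    // Intended to be called after each dataset has been put in the cache.
    boolean isCacheFull() {
        return usableBytes.getAsLong() < minFreeBytes;
    }
}
```

Passing the free-space source as a ''LongSupplier'' keeps the check testable without touching a real file system. Note that this measures the free space of the file system, not a configured quota.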