Friday, 31 October 2014

Soft-Index File Store

Recently, Infinispan got a new local file-based cache store, called Soft-Index File Store. Why have we created just another cache store, what problems is it solving, what are its limitations and how is it designed?

Single File Store is a well performing cache store, but it stores all keys in-memory; that limits the number of keys you can store. File fragmentation could be even more of an issue: if you store larger and larger values (that happens quite a lot, as users e.g. add stuff into their shopping carts), the space is not reused and instead the entry is appended at the end of the file. The space (now empty) is reused only if you write entry that can fit there. Also, even if you remove all entries from the cache, the file won't shrink, and neither won't be de-fragmented.

LevelDB uses quite well performing Google's library written in native code. The major drawback is the native code - if LevelDB has a bug that ends in segfault, whole JVM crashes, bringing you application server down.

Our new Soft Index File Store is pure Java implementation that tries to get around Single File Store's drawbacks by implementing a variant of B+ tree that is cached in-memory using Java's soft references - here's where the name Soft Index File Store comes from. This B+ tree (called Index) is offloaded on filesystem to single file: in fact, this has theoretically similar problems with fragmentation as Single File Store - but in practice it shouldn't cause such problems. This index file does not need to be persisted - it is purged and rebuilt when the cache store restarts, its purpose is only offloading.

The data that should be persisted are stored in a set of files that are written in append-only way - that means that if you store this on conventional magnetic disk, it does not have to seek when writing a burst of entries. It is not stored in a single file but in a set of files. When any of these files drops below 50% of usage (the entries are marked as removed or overwritten), the file starts being collected, moving live entries into another file and in the end removing the old file from disk.

Most of the in-memory structures in Soft Index File Store are bounded, therefore you don't have to be afraid of OOMEs. You can also configure the limits for concurrently open files as well (so that you don't run out of file descriptors).

How to configure SIFS

The configuration is no different from regular cache store:

Implementation details

The Index does not use single file, in fact it can be split into multiple segments. That's because the algorithm updating this B+ tree is designed as single writer - multiple readers, but that could make the writer thread (called 'Index Updater') the bottleneck. Therefore, you can set how many segments should the Index be split into (according to keys' hashCode()).

Each node in the Index stores 'prefix' of all keys (or rather the serialized forms) used in the node in order to reduce the space required for the node. This comes with the assumption that the prefixes are often similar (e.g. when you use key "user000001" and "user000002"). If you can change how the keys are serialized, it is encouraged to move the changing part of the key to the end of the serialized data.

The data are written by single thread as well, the 'Log Appender'. There's no reason to let threads that access the cache store compete over file-system - Log Appender queues the write results, writes them into the file and wakes up the waiting thread. There are 2 possibly unnecessary context-switches, but in the original design we wanted to allow the write request to return only after the data have been fsynced. By batching the writes, Log Appender allows this as a configuration option - then you can be sure that the data are already on disk when the call returns.

When the entry is modified, the Index needs to be updated. The request is sent to Index Updater via bounded queue and the newest entry location is stored in Temporary Table until this is stored in the Index. The updated nodes are eventually offloaded onto disk in this way.

Known limitations

Size of a node in the Index is limited, by default it is 4096 bytes, though it can be configured. This size also limits the key length (or rather the length of the serialized form): you can't use keys longer than size of the node - 15 bytes. Moreover, the key length is stored as 'short', limiting it to 32767 bytes. There's no way how you can use longer keys - SIFS throws an exception when the key is longer after serialization.

When entries are stored with expiration, SIFS does not discover the a file is full of expired entries and the compaction of old data files may not be started ever (method AdvancedStore.purgeExpired() is not implemented). This can lead to excessive file-system space usage.

Future work

What we need to do know is to benchmark SIFS in many configurations and set the optimal values as defaults. However, we run mostly synthetic benchmarks - and that's where you can help. Let's play with Soft Index File Store a bit and tell us what configuration works best for you!

For storing large keys, building the B+ tree of hashCodes could perform better that storing the whole keys, though it would need additional handling for collisions. Tell us what keys do you use, please!

Currently, each index update needs to be eventually stored, and that means one or more writes into the file-system even when this is not necessary. In the future, we might try to use phantom references instead of soft references to write the Index only when it needs to release some memory. However, this requires a lot of further work, so test SIFS today and let us now how do you like it!


  1. Quite interesting. Is it available independently or it is distributed only with Infinispan ?

  2. It is part of Infinispan, though I guess you could extract it