Kamikaze is a utility package wrapping set implementations on sorted integer arrays. Search indexes, graph algorithms and certain sparse matrix representations tend to make heavy use of sorted integer arrays.
For example, in search engines, for each term t, the index, or called inverted index, contains an inverted list, which is typically a sequence of sorted integer document IDs (and other information which can also be considered as sequences of integers). Thus, inverted index compression techniques are concerned with compressing sequences of sorted integers.
A graph is often implemented as adanjency lists. In many cases, each list can be easily organized as a sorted integer array. For example, for the social graphs in large-scale social networks like Linkedin or Facebook, each list is, for a particular member, a sequence of all his friends (represented as integer member IDs). The performance of many algorithms on such graphs is thus greatly affected by the efficiency of various operations on such lists. For example, in order to find all common friends of two members, we need to find all intersected member IDs of their friend lists.
A matrix can be considered as an alternative implementation of a graph especially when most nodes are directly connected with each other. However, when the matrix is sparse (which is very common for the first or second degree friends in social graphs), it is more efficient to first transfer it into the adancency lists and then do various operations on the resulting lists.
In the above applications (large scale search engines or social networks), we often need to process a huge amount of data (arrays of integers) within milliseconds. The data often need to be compressed to be hold in main memory. Due to compression, the disk traffic and the network traffic are also greatly reduced since much less amount of data need to be communicated. We also need to be able to decompress the data very efficiently to maximize, for example, the query throughput of search engines. To achieve these goals, large search engines have been trying a lot of methods. For example, Lucene uses variable-byte coding (please refer to Managing Gigabytes for various inverted index compression methods) to compress indexes. Google also uses variable-byte coding to encode part of its indexes a long time ago and has switched to other compression methods lately (In my opinion, their new method is a variation of PForDelta which is also implemented in Kamikaze and optimized in Kamikaze version 3.0.0). Therefore, we can see that it is very important to build Kamikaze on top of a good compression method that can achieve both the small compressed size and fast decompression speed.
Kamikaze implements PForDelta compression algorithm (or called P4Delta) which was recently studied and has been shown by paper[1] and paper[2] to be able to achieve the best trade-off of the compression ratio and decompression speed for inverted index of search engines. Many other techniques for inverted index compression have been studied in the literature; see Managing Gigabytes for a survey and paper[2] and paper[3] and for very recent work, especially the detailed performance comparison between most of those techniques and PForDelta. Unfortunately, Lucene does not support PForDelta now although paper[2] and paper[3] have shown that PForDelta can achieve much better performance than variable-byte coding in terms of both compressed size and decompression speed.
Kamikaze builds an platform on top of PForDelta to perform efficient set operations and inverted list compression/decompression. Kamikaze Version 3.0.0 inherits the architecture of the first two versions and supports the same APIs. In Version 3.0.0., the PForDelta algorithm is highly optimized such that the performance of compression/decompression and the corresponding set operations are improved significantly.
In Linkedin, Kamikaze has been used in the distributed graph team and search team. We are also looking forward to contributing to the Lucene community with Kamikaze, especially the optimized PForDelta compression algorithm.