Kamikaze is a utility package wrapping set implementations on document lists. It is designed for serving compact set representation and fast set operations. The architecture of Kamikaze (which is similar for Version 1.0.0, Verison 2.0.0 and Version 3.0.0) is shown in the following figure.
Architecture in Version 3.0.0:In particular,
- On the lowest level of Kamikaze, three different classes are supported to represent the document lists: PForDeltaWithBase (which is an variant of PForDelta and can support compact data compression and fast decompression), IntArray (which represent each docId as a 32-bit integer), OpenBitSet (which represent all docIds using the OpenBitSet). One of them will be chosen to represent the data and perform the basic operations on the data, for example, representing (compressing) the docIds on the lists and retrieving (decompressing) them. PForDelta is always perferred than the other two methods except when the docIds on the lists are extremely densely distributed (that is, the list length is comparable to the value of the maximal docId) or the lists are extremly short (for example, tens of docIds).
- On the middle level, the correponding doc sets (inverted lists) are built on top of the above three data representation classess. These doc sets support the basic doc set operations, for example, adding a document, iterating the compressed set and finding a particular docId in the compressed set, etc.
- On the top level, the basic query processing operations are supported based on the set operations in the middle level. For example, PForDeltaAndDocIdSet supports the operation of finding all intersected docIds of multiple inverted lists (doc sets), that is, the AND operation in a Document-at-a-time (DAAT) manner.
PForDelta in Version 3.0.0
HistoryPForDelta is a compression method that was originally proposed by S. Heman. It was not originally designed to compress inverted indexes of search engines. It was first used in 2008 by WestLab (Web Exploration and Search Technology Lab at Polytechnic Institute of New York University) to compress inverted indexes of search engines. Recently, WestLab proposed several other optimizations on PForDelta and claimed that the optimized PForDelta can achieve a better tradeoff between the compressed size and the decompression speed than other inverted index compression methods, for example, Rice coding, Variable-byte coding, Gamma coding, Interpolative coding, etc.The beauty of PForDelta is that it supports extremely fast decompression while also achieving a small compressed size.
Basic IdeaThe basic idea of PForDelta is as follows: in order to compress a block of k numbers, say, 256 numbers, it first determines a value b such that most of the 256 values to be encoded (say, 90%) are less than 2^b and thus fit into a fixed bit field of b bit s each. The remaining values, called exceptions (please note this has nothing to do with Java run time exceptions), are coded separately. The data structure of PForDelta is shown in the following figure.
In the above figure, a block of 256 integers is compressed as a block of 256 b-bit slots plus some additional data appended to those slots. Most of the integers in the above block can fit into b=5 bits except two exceptions, 55 and 70. The two exceptions are coded in the same way as follows: we append their offsets within the block (expPos), that is, where the two exceptions occur, to the end of the 256 b-bit slots. We store the lower b bits of the exceptions into their corresponding slots(Lo55 and Lo70) and append their higher 32-b bits(Hi55 and Hi70) to the end of expPos, that is, expHighBits. ExpPos and expHighBits are concatenated together and compressed by another compression methods called Simple 16.
If we apply PForDelta to blocks containing some multiple of 32 values, then decompression involves extracting groups of 32 b-bit values, and finally patching the result by decoding a smaller number of exceptions. This process can be implemented extremely efficiently by providing, for each value of b, an optimized method for extracting 32 b-bit values from b memory words. PForDelta can be modified and tuned in various ways by choosing different thresholds for the number of exceptions allowed, and by encoding the exceptions in different ways.