A variation on Simple9 in between VByte and PFOR
to improve compression for position deltas.
VByte is byte aligned, and Simple9 and PFOR are int aligned.
For decoding VByte uses a test on the high bit of each byte, and Simple9 and PFOR use
a selector value for a switch statement.
The variation discussed here is byte aligned, and uses a switch statement
on a selector value for decoding.
For position data (see the prx data posted in October 2008 at
LUCENE-1410 for an example)
VByte has two problems. VByte can encode data lengths of multiples of 7 bits quite well,
but for other numbers of bits the higher multiple of 7 has to be chosen
and all remaining bits are unused.
Also there is some correlation between successive values, and this results
in redundancies in the encoding bits of VByte.
So to improve the compression short series of 2-5 bits
and 8-12 bits would be good, as well as short series of about 7 and 14 bits
in order not to lose the strong parts of VByte.
The first question is then how long these short series should be.
VByte currently encodes one number at a time, so one could try by using rather
short series of 2-4 numbers.
Taking another look at the data at
LUCENE-1410, one can see that for example
3 times 11 bits would be quite effective, and these would fit in 5 bytes,
but Simple9 only has 4 bytes.
On the other hand, Simple9 has among others 2 times 14 and 4 times 7 bits in 4 bytes
and those cases should not be lost when comparing to VByte.
That means that the number of bytes used to encode position data should be variable,
in other words there should be some cases with 4 bytes, and some with 5 bytes.
The next question is then how many bits in the selector value (4 bits in Simple9)
should be dedicated to the number of bytes, and how many bits should be dedicated
to the number of encoded values.
There are quite a few variations possible for that, and the following
one is worth discussing some more.
I have called this compression method Simple4In36, and the name means that
it can encode 1 to 4 values in 3 to 6 bytes. It uses 4 bits to select these
16 cases, 2 bits for the number of encoded values, and 2 bits for the number of bytes.
The following table shows the field sizes in bits, ((8 * #bytes - 4) / #values)
for all these cases (the number of unused bits is in brackets):
#bytes 3 4 5 6
1 20 28 36 44
2 10 14 18 22
3 6(2) 9(1) 12 14(2)
4 5 7 9 11
To encode these cases, the encoder tries to use a minimal number of bytes
for the given input, and outputs the selector value and the input data into
the output bytes once no more input will fit.
The decoder does a switch statement on the selector value for the cases, and
each case decodes the values with constant masks and shifts.
Compared to the data bits of VByte, that only has multiples of 7 data bits, this
table shows quite a few cases with 9-12 bits, so there is good opportunity to save
encoded data bits, and there are also cases for 7, 14, 22 (3*7+1) and 28 bits.
Comparing the encoding bits, Simple4In36 always uses 4, and VByte uses 1 per byte,
so to improve on VByte, one would prefer to use 5 or 6 bytes in Simple4In36.
For proximity data the most frequent case would be 5 bytes with 3 values of 12 bits each,
so that should improve compression.
That brings the question of why this table has a column for 3 bytes, and not one
for 7 bytes. The reason is the case of 3 bytes and 4 values of 5 bits each.
VByte cannot get below 7 data bits for such small values, and I think this case
would help encoding stop words or other rather frequent terms.
This is also the area where PFOR should do well for very frequent terms,
and cases smaller than 5 bits are probably better left to PFOR.
Long gene data (four terms A, C, T, G) for example would be a good candidate for PFOR.
Since there is no point in using fewer bytes than 3, the question is whether
the column with 6 byte cases could have been replaced by for example a row
with 5 encoded values. The main reason for that is that I would expect the
6 byte cases with 3 values of 14 bits and 4 values of 11 bits to be effective
for sparse terms in longer documents. Also encoding more than 4 values will
affect the compression ratio because all numbers should fit, and the maximum of these
numbers determines the field size, so each value smaller than the maximum
tends to unused data bits in the fields.
For position data distributed as the data at
LUCENE-1410 it does not make sense
to use Simple4In36 to encode a single position. Nevertheless there is a row
with 1 value to be encoded. The reason for this is that when the first number
has too many data bits (more than 22 bits for Simple4In36) only a single number
can be encoded.
To allow (arbitrarily) long values to be encoded, the longest one of the cases
for 1 value could be extended to encode extra bits in the same way as VByte does:
as long as the highest bit is set, there is another VByte of data available.
For efficient decoding of this case, the first bit of the data could be added
to the selector value.
As for the decoding speed of Simple4In36 compared to VByte, I would expect
that a switch statement for the selector value of Simple4In36 takes about the
same time as 4 conditional jumps of VByte. That means that for the most frequent
cases (o.a. 5 bytes, 3 values, 12 bit data) Simple4In36 should be somewhat
faster than VByte. For infrequent cases of 3 bytes and 3 or fewer numbers,
Simple4In36 will be somewhat slower.
Many variations on the theme of using some selector bits for the maximum number
of encoded values and the rest for the number of bytes of encoded data are possible.
For example one could use one bit more in the selector to allow 8 values to be encoded,
or to allow 2-9 bytes to be used.
One could also use one bit less in the selector and drop two
of the columns for the numbers of encoded bytes, leaving for example only
the 4 and 5 byte cases.
Adding or removing one bit in the selector however leaves an odd number of bits for the fields,
so in many of these cases some data bits might be unused. A provision could be
made to encode some fields with one bit more to use these bits, but that would
add some complexity to the encoder. Simple4In36 has no unused bits
except in 3 of the cases with 3 encoded numbers, so there is real no need to make
such a provision.
Adding two bits to the selector would allow 64 cases to be encoded. This would
allow more decompression speed at the cost of the compression ratio.
One could also encode 63 cases (7*9) to allow 9 possibilities for the number of
bytes and 1-7 for the values encoded, or the other way round. It is an open question
whether using this many bits for the selector is competitive with PFOR.
PFOR without patching could be seen as a limit case.
No program code yet, just these design ideas. Any takers?