Details

Type: Improvement

Status: Resolved

Priority: Minor

Resolution: Won't Fix

Affects Version/s: None

Fix Version/s: None

Component/s: core/index

Labels:None

Lucene Fields:New
Description
Simple9 is an alternative for VInt.

 Simple16Codec.java
 12 kB
 Renaud Delbru

 LUCENE2189a.patch
 32 kB
 Paul Elschot

 LUCENE2189a.patch
 31 kB
 Paul Elschot
Activity
Simple9 encoder/decoder and passing tests.
This 2189a patch still has a fixme at the encoder to not use more elements than given.
Just a comment on the switch:
As far as I know: Java switch statements are very fast if there are few cases and these cases are near together and therefore small numbers. I would suggest to not switch on the raw ANDed status, but better instead shift the status >>> 28 (and remove the &) and then only list the status values 0..9 in the switch statement.
Here the explanation: http://java.sun.com/docs/books/jvms/first_edition/html/Compiling.doc.html
About the switch: I had the shift down in there initially, but then I left it out for speed of decoding. I could move the status bits to the lower part so that the shift is not needed at all if that does not affect data decoding. I'll have a look at it. Thanks.
2010, Jan 5: this patch has a good encoder, and includes a good number of test cases for normal use. I also changed the decoder switch statement to use only small label values.
Still to be tested: out of bounds situations for offsets, sizes and input element values,
and performance comparisons with VInt and PFor.
Hi Paul,
I have uploaded the version of the Simple16 I have. It is a old piece of code (I worked on it one year ago) which cannot be used directly for Lucene, and it looks less advanced than your version of Simple9. But maybe it could be interesting for you (for inspiration).
It'd be best if the interface to Simple9/16 (and in general any variable sized int block encoder that we can tap into as a codec) were an incremental one.
EG I could instantiate the encoder, then call .add(int value) to add a new value, which'd return maybe a boolean indicating that a new output block has been flushed/created as a result of adding that value.
I'm trying to make a simpletoreuse variable sized int block codec where we could just drop in Simple9/16/etc, and this incremental API would work well for this.
A variation on Simple9 in between VByte and PFOR
to improve compression for position deltas.
VByte is byte aligned, and Simple9 and PFOR are int aligned.
For decoding VByte uses a test on the high bit of each byte, and Simple9 and PFOR use
a selector value for a switch statement.
The variation discussed here is byte aligned, and uses a switch statement
on a selector value for decoding.
For position data (see the prx data posted in October 2008 at LUCENE1410 for an example)
VByte has two problems. VByte can encode data lengths of multiples of 7 bits quite well,
but for other numbers of bits the higher multiple of 7 has to be chosen
and all remaining bits are unused.
Also there is some correlation between successive values, and this results
in redundancies in the encoding bits of VByte.
So to improve the compression short series of 25 bits
and 812 bits would be good, as well as short series of about 7 and 14 bits
in order not to lose the strong parts of VByte.
The first question is then how long these short series should be.
VByte currently encodes one number at a time, so one could try by using rather
short series of 24 numbers.
Taking another look at the data at LUCENE1410, one can see that for example
3 times 11 bits would be quite effective, and these would fit in 5 bytes,
but Simple9 only has 4 bytes.
On the other hand, Simple9 has among others 2 times 14 and 4 times 7 bits in 4 bytes
and those cases should not be lost when comparing to VByte.
That means that the number of bytes used to encode position data should be variable,
in other words there should be some cases with 4 bytes, and some with 5 bytes.
The next question is then how many bits in the selector value (4 bits in Simple9)
should be dedicated to the number of bytes, and how many bits should be dedicated
to the number of encoded values.
There are quite a few variations possible for that, and the following
one is worth discussing some more.
I have called this compression method Simple4In36, and the name means that
it can encode 1 to 4 values in 3 to 6 bytes. It uses 4 bits to select these
16 cases, 2 bits for the number of encoded values, and 2 bits for the number of bytes.
The following table shows the field sizes in bits, ((8 * #bytes  4) / #values)
for all these cases (the number of unused bits is in brackets):
#bytes 3 4 5 6 #values 1 20 28 36 44 2 10 14 18 22 3 6(2) 9(1) 12 14(2) 4 5 7 9 11
To encode these cases, the encoder tries to use a minimal number of bytes
for the given input, and outputs the selector value and the input data into
the output bytes once no more input will fit.
The decoder does a switch statement on the selector value for the cases, and
each case decodes the values with constant masks and shifts.
Compared to the data bits of VByte, that only has multiples of 7 data bits, this
table shows quite a few cases with 912 bits, so there is good opportunity to save
encoded data bits, and there are also cases for 7, 14, 22 (3*7+1) and 28 bits.
Comparing the encoding bits, Simple4In36 always uses 4, and VByte uses 1 per byte,
so to improve on VByte, one would prefer to use 5 or 6 bytes in Simple4In36.
For proximity data the most frequent case would be 5 bytes with 3 values of 12 bits each,
so that should improve compression.
That brings the question of why this table has a column for 3 bytes, and not one
for 7 bytes. The reason is the case of 3 bytes and 4 values of 5 bits each.
VByte cannot get below 7 data bits for such small values, and I think this case
would help encoding stop words or other rather frequent terms.
This is also the area where PFOR should do well for very frequent terms,
and cases smaller than 5 bits are probably better left to PFOR.
Long gene data (four terms A, C, T, G) for example would be a good candidate for PFOR.
Since there is no point in using fewer bytes than 3, the question is whether
the column with 6 byte cases could have been replaced by for example a row
with 5 encoded values. The main reason for that is that I would expect the
6 byte cases with 3 values of 14 bits and 4 values of 11 bits to be effective
for sparse terms in longer documents. Also encoding more than 4 values will
affect the compression ratio because all numbers should fit, and the maximum of these
numbers determines the field size, so each value smaller than the maximum
tends to unused data bits in the fields.
For position data distributed as the data at LUCENE1410 it does not make sense
to use Simple4In36 to encode a single position. Nevertheless there is a row
with 1 value to be encoded. The reason for this is that when the first number
has too many data bits (more than 22 bits for Simple4In36) only a single number
can be encoded.
To allow (arbitrarily) long values to be encoded, the longest one of the cases
for 1 value could be extended to encode extra bits in the same way as VByte does:
as long as the highest bit is set, there is another VByte of data available.
For efficient decoding of this case, the first bit of the data could be added
to the selector value.
As for the decoding speed of Simple4In36 compared to VByte, I would expect
that a switch statement for the selector value of Simple4In36 takes about the
same time as 4 conditional jumps of VByte. That means that for the most frequent
cases (o.a. 5 bytes, 3 values, 12 bit data) Simple4In36 should be somewhat
faster than VByte. For infrequent cases of 3 bytes and 3 or fewer numbers,
Simple4In36 will be somewhat slower.
Many variations on the theme of using some selector bits for the maximum number
of encoded values and the rest for the number of bytes of encoded data are possible.
For example one could use one bit more in the selector to allow 8 values to be encoded,
or to allow 29 bytes to be used.
One could also use one bit less in the selector and drop two
of the columns for the numbers of encoded bytes, leaving for example only
the 4 and 5 byte cases.
Adding or removing one bit in the selector however leaves an odd number of bits for the fields,
so in many of these cases some data bits might be unused. A provision could be
made to encode some fields with one bit more to use these bits, but that would
add some complexity to the encoder. Simple4In36 has no unused bits
except in 3 of the cases with 3 encoded numbers, so there is real no need to make
such a provision.
Adding two bits to the selector would allow 64 cases to be encoded. This would
allow more decompression speed at the cost of the compression ratio.
One could also encode 63 cases (7*9) to allow 9 possibilities for the number of
bytes and 17 for the values encoded, or the other way round. It is an open question
whether using this many bits for the selector is competitive with PFOR.
PFOR without patching could be seen as a limit case.
No program code yet, just these design ideas. Any takers?
Any one test the performance of it?
a comparision of index size(frq file size) and decoding speeding with normal dgap VINT
Any one test the performance of it?
Not that I know of.
Someone needs to convert the Simple9/16 impl here to match the incremental API that VariableIntBlockCodec wants, ie provide a .add(int value), returning an int count of how many values were just flushed to the output. Ie, it'd return a series of 0's and then suddenly like 7 indicating 7 values were just written.
Once we do that, it's trivial to make a codec out of it...
hao yan has posted a patch at LUCENE1410 on 16 Dec 2010 that contains a Simple16 implementation.
Meanwhile a Simple64 implementation has become available, so the chance that there is real use for a Simple9 implementation has become very low.
I've started some work on Simple9, also known as S9. It works like this:
A compressed 32bit word contains 4 status bits and 28 data bits.
There are nine different ways of dividing up the 28
data bits: 28 1bit numbers, 14 2bit numbers, 9 3bit numbers (one
bit unused), 7 4bit numbers, 5 5numbers (three bits unused), 4
7bit numbers, 3 9bit numbers (one bit unused), 2 14bit numbers,
or 1 28bit number. The four status bits store which of the
9 cases is used. Decompression can be done by doing a switch operation
on the status bits, where each of the 9 cases applies a ﬁxed bit mask to
extract the integers.
I've used this paper:
Jiangong Zhang, Xiaohui Long, Torsten Suel,
"Performance of Compressed Inverted List Caching in Search Engines",
WWW 2008 / Refereed Track: Search  Corpus Characterization & Search
Performance, Beijing, China
A decoder is working decently. The encoder still needs a bit of work,
mostly because I initially assumed it would be ok to encode more numbers
than given. I'll post a patch later this week, hopefully with a better encoder.
From the paper one can conclude that, when compared to VInt, Simple9 is useful for encoding:
LUCENE1410, which is blocked by this issue.No performance measurements yet, but comparing the code to the PFOR code
supports the view of the paper that Simple9 improves VInt, but will never be as good
as PFOR.