Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Simple9 is an alternative for VInt.

      1. LUCENE-2189a.patch
        32 kB
        Paul Elschot
      2. LUCENE-2189a.patch
        31 kB
        Paul Elschot
      3. Simple16Codec.java
        12 kB
        Renaud Delbru

        Activity

        Hide
        Paul Elschot added a comment -

        I've started some work on Simple9, also known as S9. It works like this:

        A compressed 32-bit word contains 4 status bits and 28 data bits.
        There are nine different ways of dividing up the 28
        data bits: 28 1-bit numbers, 14 2-bit numbers, 9 3-bit numbers (one
        bit unused), 7 4-bit numbers, 5 5-numbers (three bits unused), 4
        7-bit numbers, 3 9-bit numbers (one bit unused), 2 14-bit numbers,
        or 1 28-bit number. The four status bits store which of the
        9 cases is used. Decompression can be done by doing a switch operation
        on the status bits, where each of the 9 cases applies a fixed bit mask to
        extract the integers.

        I've used this paper:
        Jiangong Zhang, Xiaohui Long, Torsten Suel,
        "Performance of Compressed Inverted List Caching in Search Engines",
        WWW 2008 / Refereed Track: Search - Corpus Characterization & Search
        Performance, Beijing, China

        A decoder is working decently. The encoder still needs a bit of work,
        mostly because I initially assumed it would be ok to encode more numbers
        than given. I'll post a patch later this week, hopefully with a better encoder.

        From the paper one can conclude that, when compared to VInt, Simple9 is useful for encoding:

        • frequencies and positions, (smaller compressed size, higher decompression speed)
        • exceptions for PFOR, as in LUCENE-1410, which is blocked by this issue.

        No performance measurements yet, but comparing the code to the PFOR code
        supports the view of the paper that Simple9 improves VInt, but will never be as good
        as PFOR.

        Show
        Paul Elschot added a comment - I've started some work on Simple9, also known as S9. It works like this: A compressed 32-bit word contains 4 status bits and 28 data bits. There are nine different ways of dividing up the 28 data bits: 28 1-bit numbers, 14 2-bit numbers, 9 3-bit numbers (one bit unused), 7 4-bit numbers, 5 5-numbers (three bits unused), 4 7-bit numbers, 3 9-bit numbers (one bit unused), 2 14-bit numbers, or 1 28-bit number. The four status bits store which of the 9 cases is used. Decompression can be done by doing a switch operation on the status bits, where each of the 9 cases applies a fixed bit mask to extract the integers. I've used this paper: Jiangong Zhang, Xiaohui Long, Torsten Suel, "Performance of Compressed Inverted List Caching in Search Engines", WWW 2008 / Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China A decoder is working decently. The encoder still needs a bit of work, mostly because I initially assumed it would be ok to encode more numbers than given. I'll post a patch later this week, hopefully with a better encoder. From the paper one can conclude that, when compared to VInt, Simple9 is useful for encoding: frequencies and positions, (smaller compressed size, higher decompression speed) exceptions for PFOR, as in LUCENE-1410 , which is blocked by this issue. No performance measurements yet, but comparing the code to the PFOR code supports the view of the paper that Simple9 improves VInt, but will never be as good as PFOR.
        Hide
        Paul Elschot added a comment -

        Simple9 encoder/decoder and passing tests.
        This 2189a patch still has a fixme at the encoder to not use more elements than given.

        Show
        Paul Elschot added a comment - Simple9 encoder/decoder and passing tests. This 2189a patch still has a fixme at the encoder to not use more elements than given.
        Hide
        Uwe Schindler added a comment - - edited

        Just a comment on the switch:
        As far as I know: Java switch statements are very fast if there are few cases and these cases are near together and therefore small numbers. I would suggest to not switch on the raw ANDed status, but better instead shift the status >>> 28 (and remove the &) and then only list the status values 0..9 in the switch statement.

        Show
        Uwe Schindler added a comment - - edited Just a comment on the switch: As far as I know: Java switch statements are very fast if there are few cases and these cases are near together and therefore small numbers. I would suggest to not switch on the raw ANDed status, but better instead shift the status >>> 28 (and remove the &) and then only list the status values 0..9 in the switch statement.
        Show
        Uwe Schindler added a comment - Here the explanation: http://java.sun.com/docs/books/jvms/first_edition/html/Compiling.doc.html
        Hide
        Paul Elschot added a comment -

        About the switch: I had the shift down in there initially, but then I left it out for speed of decoding. I could move the status bits to the lower part so that the shift is not needed at all if that does not affect data decoding. I'll have a look at it. Thanks.

        Show
        Paul Elschot added a comment - About the switch: I had the shift down in there initially, but then I left it out for speed of decoding. I could move the status bits to the lower part so that the shift is not needed at all if that does not affect data decoding. I'll have a look at it. Thanks.
        Hide
        Paul Elschot added a comment -

        2010, Jan 5: this patch has a good encoder, and includes a good number of test cases for normal use. I also changed the decoder switch statement to use only small label values.

        Still to be tested: out of bounds situations for offsets, sizes and input element values,
        and performance comparisons with VInt and PFor.

        Show
        Paul Elschot added a comment - 2010, Jan 5: this patch has a good encoder, and includes a good number of test cases for normal use. I also changed the decoder switch statement to use only small label values. Still to be tested: out of bounds situations for offsets, sizes and input element values, and performance comparisons with VInt and PFor.
        Hide
        Renaud Delbru added a comment -

        Hi Paul,

        I have uploaded the version of the Simple16 I have. It is a old piece of code (I worked on it one year ago) which cannot be used directly for Lucene, and it looks less advanced than your version of Simple9. But maybe it could be interesting for you (for inspiration).

        Show
        Renaud Delbru added a comment - Hi Paul, I have uploaded the version of the Simple16 I have. It is a old piece of code (I worked on it one year ago) which cannot be used directly for Lucene, and it looks less advanced than your version of Simple9. But maybe it could be interesting for you (for inspiration).
        Hide
        Michael McCandless added a comment -

        It'd be best if the interface to Simple9/16 (and in general any variable sized int block encoder that we can tap into as a codec) were an incremental one.

        EG I could instantiate the encoder, then call .add(int value) to add a new value, which'd return maybe a boolean indicating that a new output block has been flushed/created as a result of adding that value.

        I'm trying to make a simple-to-reuse variable sized int block codec where we could just drop in Simple9/16/etc, and this incremental API would work well for this.

        Show
        Michael McCandless added a comment - It'd be best if the interface to Simple9/16 (and in general any variable sized int block encoder that we can tap into as a codec) were an incremental one. EG I could instantiate the encoder, then call .add(int value) to add a new value, which'd return maybe a boolean indicating that a new output block has been flushed/created as a result of adding that value. I'm trying to make a simple-to-reuse variable sized int block codec where we could just drop in Simple9/16/etc, and this incremental API would work well for this.
        Hide
        Paul Elschot added a comment -

        A variation on Simple9 in between VByte and PFOR
        to improve compression for position deltas.

        VByte is byte aligned, and Simple9 and PFOR are int aligned.
        For decoding VByte uses a test on the high bit of each byte, and Simple9 and PFOR use
        a selector value for a switch statement.
        The variation discussed here is byte aligned, and uses a switch statement
        on a selector value for decoding.

        For position data (see the prx data posted in October 2008 at LUCENE-1410 for an example)
        VByte has two problems. VByte can encode data lengths of multiples of 7 bits quite well,
        but for other numbers of bits the higher multiple of 7 has to be chosen
        and all remaining bits are unused.
        Also there is some correlation between successive values, and this results
        in redundancies in the encoding bits of VByte.
        So to improve the compression short series of 2-5 bits
        and 8-12 bits would be good, as well as short series of about 7 and 14 bits
        in order not to lose the strong parts of VByte.

        The first question is then how long these short series should be.
        VByte currently encodes one number at a time, so one could try by using rather
        short series of 2-4 numbers.

        Taking another look at the data at LUCENE-1410, one can see that for example
        3 times 11 bits would be quite effective, and these would fit in 5 bytes,
        but Simple9 only has 4 bytes.

        On the other hand, Simple9 has among others 2 times 14 and 4 times 7 bits in 4 bytes
        and those cases should not be lost when comparing to VByte.

        That means that the number of bytes used to encode position data should be variable,
        in other words there should be some cases with 4 bytes, and some with 5 bytes.
        The next question is then how many bits in the selector value (4 bits in Simple9)
        should be dedicated to the number of bytes, and how many bits should be dedicated
        to the number of encoded values.
        There are quite a few variations possible for that, and the following
        one is worth discussing some more.

        I have called this compression method Simple4In36, and the name means that
        it can encode 1 to 4 values in 3 to 6 bytes. It uses 4 bits to select these
        16 cases, 2 bits for the number of encoded values, and 2 bits for the number of bytes.

        The following table shows the field sizes in bits, ((8 * #bytes - 4) / #values)
        for all these cases (the number of unused bits is in brackets):

            #bytes  3      4      5   6
        #values
        1          20     28     36  44
        2          10     14     18  22
        3           6(2)   9(1)  12  14(2)
        4           5      7      9  11
        

        To encode these cases, the encoder tries to use a minimal number of bytes
        for the given input, and outputs the selector value and the input data into
        the output bytes once no more input will fit.
        The decoder does a switch statement on the selector value for the cases, and
        each case decodes the values with constant masks and shifts.

        Compared to the data bits of VByte, that only has multiples of 7 data bits, this
        table shows quite a few cases with 9-12 bits, so there is good opportunity to save
        encoded data bits, and there are also cases for 7, 14, 22 (3*7+1) and 28 bits.

        Comparing the encoding bits, Simple4In36 always uses 4, and VByte uses 1 per byte,
        so to improve on VByte, one would prefer to use 5 or 6 bytes in Simple4In36.
        For proximity data the most frequent case would be 5 bytes with 3 values of 12 bits each,
        so that should improve compression.

        That brings the question of why this table has a column for 3 bytes, and not one
        for 7 bytes. The reason is the case of 3 bytes and 4 values of 5 bits each.
        VByte cannot get below 7 data bits for such small values, and I think this case
        would help encoding stop words or other rather frequent terms.
        This is also the area where PFOR should do well for very frequent terms,
        and cases smaller than 5 bits are probably better left to PFOR.
        Long gene data (four terms A, C, T, G) for example would be a good candidate for PFOR.

        Since there is no point in using fewer bytes than 3, the question is whether
        the column with 6 byte cases could have been replaced by for example a row
        with 5 encoded values. The main reason for that is that I would expect the
        6 byte cases with 3 values of 14 bits and 4 values of 11 bits to be effective
        for sparse terms in longer documents. Also encoding more than 4 values will
        affect the compression ratio because all numbers should fit, and the maximum of these
        numbers determines the field size, so each value smaller than the maximum
        tends to unused data bits in the fields.

        For position data distributed as the data at LUCENE-1410 it does not make sense
        to use Simple4In36 to encode a single position. Nevertheless there is a row
        with 1 value to be encoded. The reason for this is that when the first number
        has too many data bits (more than 22 bits for Simple4In36) only a single number
        can be encoded.
        To allow (arbitrarily) long values to be encoded, the longest one of the cases
        for 1 value could be extended to encode extra bits in the same way as VByte does:
        as long as the highest bit is set, there is another VByte of data available.
        For efficient decoding of this case, the first bit of the data could be added
        to the selector value.

        As for the decoding speed of Simple4In36 compared to VByte, I would expect
        that a switch statement for the selector value of Simple4In36 takes about the
        same time as 4 conditional jumps of VByte. That means that for the most frequent
        cases (o.a. 5 bytes, 3 values, 12 bit data) Simple4In36 should be somewhat
        faster than VByte. For infrequent cases of 3 bytes and 3 or fewer numbers,
        Simple4In36 will be somewhat slower.

        Many variations on the theme of using some selector bits for the maximum number
        of encoded values and the rest for the number of bytes of encoded data are possible.
        For example one could use one bit more in the selector to allow 8 values to be encoded,
        or to allow 2-9 bytes to be used.
        One could also use one bit less in the selector and drop two
        of the columns for the numbers of encoded bytes, leaving for example only
        the 4 and 5 byte cases.
        Adding or removing one bit in the selector however leaves an odd number of bits for the fields,
        so in many of these cases some data bits might be unused. A provision could be
        made to encode some fields with one bit more to use these bits, but that would
        add some complexity to the encoder. Simple4In36 has no unused bits
        except in 3 of the cases with 3 encoded numbers, so there is real no need to make
        such a provision.
        Adding two bits to the selector would allow 64 cases to be encoded. This would
        allow more decompression speed at the cost of the compression ratio.
        One could also encode 63 cases (7*9) to allow 9 possibilities for the number of
        bytes and 1-7 for the values encoded, or the other way round. It is an open question
        whether using this many bits for the selector is competitive with PFOR.
        PFOR without patching could be seen as a limit case.

        No program code yet, just these design ideas. Any takers?

        Show
        Paul Elschot added a comment - A variation on Simple9 in between VByte and PFOR to improve compression for position deltas. VByte is byte aligned, and Simple9 and PFOR are int aligned. For decoding VByte uses a test on the high bit of each byte, and Simple9 and PFOR use a selector value for a switch statement. The variation discussed here is byte aligned, and uses a switch statement on a selector value for decoding. For position data (see the prx data posted in October 2008 at LUCENE-1410 for an example) VByte has two problems. VByte can encode data lengths of multiples of 7 bits quite well, but for other numbers of bits the higher multiple of 7 has to be chosen and all remaining bits are unused. Also there is some correlation between successive values, and this results in redundancies in the encoding bits of VByte. So to improve the compression short series of 2-5 bits and 8-12 bits would be good, as well as short series of about 7 and 14 bits in order not to lose the strong parts of VByte. The first question is then how long these short series should be. VByte currently encodes one number at a time, so one could try by using rather short series of 2-4 numbers. Taking another look at the data at LUCENE-1410 , one can see that for example 3 times 11 bits would be quite effective, and these would fit in 5 bytes, but Simple9 only has 4 bytes. On the other hand, Simple9 has among others 2 times 14 and 4 times 7 bits in 4 bytes and those cases should not be lost when comparing to VByte. That means that the number of bytes used to encode position data should be variable, in other words there should be some cases with 4 bytes, and some with 5 bytes. The next question is then how many bits in the selector value (4 bits in Simple9) should be dedicated to the number of bytes, and how many bits should be dedicated to the number of encoded values. There are quite a few variations possible for that, and the following one is worth discussing some more. I have called this compression method Simple4In36, and the name means that it can encode 1 to 4 values in 3 to 6 bytes. It uses 4 bits to select these 16 cases, 2 bits for the number of encoded values, and 2 bits for the number of bytes. The following table shows the field sizes in bits, ((8 * #bytes - 4) / #values) for all these cases (the number of unused bits is in brackets): #bytes 3 4 5 6 #values 1 20 28 36 44 2 10 14 18 22 3 6(2) 9(1) 12 14(2) 4 5 7 9 11 To encode these cases, the encoder tries to use a minimal number of bytes for the given input, and outputs the selector value and the input data into the output bytes once no more input will fit. The decoder does a switch statement on the selector value for the cases, and each case decodes the values with constant masks and shifts. Compared to the data bits of VByte, that only has multiples of 7 data bits, this table shows quite a few cases with 9-12 bits, so there is good opportunity to save encoded data bits, and there are also cases for 7, 14, 22 (3*7+1) and 28 bits. Comparing the encoding bits, Simple4In36 always uses 4, and VByte uses 1 per byte, so to improve on VByte, one would prefer to use 5 or 6 bytes in Simple4In36. For proximity data the most frequent case would be 5 bytes with 3 values of 12 bits each, so that should improve compression. That brings the question of why this table has a column for 3 bytes, and not one for 7 bytes. The reason is the case of 3 bytes and 4 values of 5 bits each. VByte cannot get below 7 data bits for such small values, and I think this case would help encoding stop words or other rather frequent terms. This is also the area where PFOR should do well for very frequent terms, and cases smaller than 5 bits are probably better left to PFOR. Long gene data (four terms A, C, T, G) for example would be a good candidate for PFOR. Since there is no point in using fewer bytes than 3, the question is whether the column with 6 byte cases could have been replaced by for example a row with 5 encoded values. The main reason for that is that I would expect the 6 byte cases with 3 values of 14 bits and 4 values of 11 bits to be effective for sparse terms in longer documents. Also encoding more than 4 values will affect the compression ratio because all numbers should fit, and the maximum of these numbers determines the field size, so each value smaller than the maximum tends to unused data bits in the fields. For position data distributed as the data at LUCENE-1410 it does not make sense to use Simple4In36 to encode a single position. Nevertheless there is a row with 1 value to be encoded. The reason for this is that when the first number has too many data bits (more than 22 bits for Simple4In36) only a single number can be encoded. To allow (arbitrarily) long values to be encoded, the longest one of the cases for 1 value could be extended to encode extra bits in the same way as VByte does: as long as the highest bit is set, there is another VByte of data available. For efficient decoding of this case, the first bit of the data could be added to the selector value. As for the decoding speed of Simple4In36 compared to VByte, I would expect that a switch statement for the selector value of Simple4In36 takes about the same time as 4 conditional jumps of VByte. That means that for the most frequent cases (o.a. 5 bytes, 3 values, 12 bit data) Simple4In36 should be somewhat faster than VByte. For infrequent cases of 3 bytes and 3 or fewer numbers, Simple4In36 will be somewhat slower. Many variations on the theme of using some selector bits for the maximum number of encoded values and the rest for the number of bytes of encoded data are possible. For example one could use one bit more in the selector to allow 8 values to be encoded, or to allow 2-9 bytes to be used. One could also use one bit less in the selector and drop two of the columns for the numbers of encoded bytes, leaving for example only the 4 and 5 byte cases. Adding or removing one bit in the selector however leaves an odd number of bits for the fields, so in many of these cases some data bits might be unused. A provision could be made to encode some fields with one bit more to use these bits, but that would add some complexity to the encoder. Simple4In36 has no unused bits except in 3 of the cases with 3 encoded numbers, so there is real no need to make such a provision. Adding two bits to the selector would allow 64 cases to be encoded. This would allow more decompression speed at the cost of the compression ratio. One could also encode 63 cases (7*9) to allow 9 possibilities for the number of bytes and 1-7 for the values encoded, or the other way round. It is an open question whether using this many bits for the selector is competitive with PFOR. PFOR without patching could be seen as a limit case. No program code yet, just these design ideas. Any takers?
        Hide
        LiLi added a comment -

        Any one test the performance of it?
        a comparision of index size(frq file size) and decoding speeding with normal d-gap VINT

        Show
        LiLi added a comment - Any one test the performance of it? a comparision of index size(frq file size) and decoding speeding with normal d-gap VINT
        Hide
        Michael McCandless added a comment -

        Any one test the performance of it?

        Not that I know of.

        Someone needs to convert the Simple9/16 impl here to match the incremental API that VariableIntBlockCodec wants, ie provide a .add(int value), returning an int count of how many values were just flushed to the output. Ie, it'd return a series of 0's and then suddenly like 7 indicating 7 values were just written.

        Once we do that, it's trivial to make a codec out of it...

        Show
        Michael McCandless added a comment - Any one test the performance of it? Not that I know of. Someone needs to convert the Simple9/16 impl here to match the incremental API that VariableIntBlockCodec wants, ie provide a .add(int value), returning an int count of how many values were just flushed to the output. Ie, it'd return a series of 0's and then suddenly like 7 indicating 7 values were just written. Once we do that, it's trivial to make a codec out of it...
        Hide
        Paul Elschot added a comment -

        hao yan has posted a patch at LUCENE-1410 on 16 Dec 2010 that contains a Simple16 implementation.

        Show
        Paul Elschot added a comment - hao yan has posted a patch at LUCENE-1410 on 16 Dec 2010 that contains a Simple16 implementation.
        Hide
        Paul Elschot added a comment - - edited

        Meanwhile a Simple64 implementation has become available, so the chance that there is real use for a Simple9 implementation has become very low.

        Show
        Paul Elschot added a comment - - edited Meanwhile a Simple64 implementation has become available, so the chance that there is real use for a Simple9 implementation has become very low.

          People

          • Assignee:
            Unassigned
            Reporter:
            Paul Elschot
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development