Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 6.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve significantly better performance then Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is Kamikaze 3.0.x uses the much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate the highly efficient PForDelta implementation into Lucene Codec.

      1. LUCENE-2750.patch
        21 kB
        Adrien Grand
      2. LUCENE-2750.patch
        19 kB
        Adrien Grand
      3. LUCENE-2750.patch
        19 kB
        Adrien Grand

        Activity

        Hide
        Otis Gospodnetic added a comment -

        This issue (actually its parent - LUCENE-1969) was mentioned in Adrien Grand's "LUCENE-5081 Compress doc ID sets". Is this issue still relevant, Robert Muir, Michael McCandless?

        Show
        Otis Gospodnetic added a comment - This issue (actually its parent - LUCENE-1969 ) was mentioned in Adrien Grand 's " LUCENE-5081 Compress doc ID sets". Is this issue still relevant, Robert Muir , Michael McCandless ?
        Hide
        Michael McCandless added a comment -

        I think this is still relevant? Compressed bit sets should have many compelling uses in Lucene...

        Show
        Michael McCandless added a comment - I think this is still relevant? Compressed bit sets should have many compelling uses in Lucene...
        Hide
        Adrien Grand added a comment -

        Agreed. I will perform some evaluation of these impls compared to the one on LUCENE-5081 and the one Paul Elschot is working on to see which ones (potentially all if they offer different useful trade-offs) make sense to include in Lucene.

        Show
        Adrien Grand added a comment - Agreed. I will perform some evaluation of these impls compared to the one on LUCENE-5081 and the one Paul Elschot is working on to see which ones (potentially all if they offer different useful trade-offs) make sense to include in Lucene.
        Hide
        Adrien Grand added a comment -

        For the record, Daniel Lemire wrote a post about PFOR delta decompression speed at http://lemire.me/blog/archives/2013/07/08/fast-integer-compression-in-java/. I have not dug the reasons that may explain why his implementation is faster than kamikaze or if the benchmark itself is relevant to Lucene but I think it's something worth looking into.

        Show
        Adrien Grand added a comment - For the record, Daniel Lemire wrote a post about PFOR delta decompression speed at http://lemire.me/blog/archives/2013/07/08/fast-integer-compression-in-java/ . I have not dug the reasons that may explain why his implementation is faster than kamikaze or if the benchmark itself is relevant to Lucene but I think it's something worth looking into.
        Hide
        Adrien Grand added a comment -

        FYI I ran his benchmark and the thing is that the version of kamikaze he is using decompresses ints one by one instead of using routines that decompress a full block in one go. Here is the relevant part of the kamikaze code base: https://github.com/linkedin/kamikaze/blob/master/src/main/java/com/kamikaze/pfordelta/PForDelta.java#L114 decompressBBitSlotsWithHardCodes is commented out in favor of decompressBBitSlots.

        Show
        Adrien Grand added a comment - FYI I ran his benchmark and the thing is that the version of kamikaze he is using decompresses ints one by one instead of using routines that decompress a full block in one go. Here is the relevant part of the kamikaze code base: https://github.com/linkedin/kamikaze/blob/master/src/main/java/com/kamikaze/pfordelta/PForDelta.java#L114 decompressBBitSlotsWithHardCodes is commented out in favor of decompressBBitSlots.
        Hide
        Adrien Grand added a comment -

        I wrote an implementation of a PForDeltaDocIdSet based on the ones in Kamikaze and D. Lemire's JavaFastPFOR (both are licensed under the ASL 2.0).

        On the contrary to the original implementation, it uses FOR to encode exceptions (this was easier given that we already have lots of utility methods to pack integers).

        Show
        Adrien Grand added a comment - I wrote an implementation of a PForDeltaDocIdSet based on the ones in Kamikaze and D. Lemire's JavaFastPFOR (both are licensed under the ASL 2.0). On the contrary to the original implementation, it uses FOR to encode exceptions (this was easier given that we already have lots of utility methods to pack integers).
        Hide
        Paul Elschot added a comment - - edited

        For a moment ignoring the fact Elias-Fano requires ordered input and PFor allows random input:
        PFor normally has better compression than Elias-Fano when the exceptional values have a lot more bits than the normal values that fit in the low bits.
        In such cases Elias-Fano has to use an upperbound that is too large for effective compression.

        Also Elias-Fano could still be improved by adding block decoding and using that when most of the block is needed.

        Show
        Paul Elschot added a comment - - edited For a moment ignoring the fact Elias-Fano requires ordered input and PFor allows random input: PFor normally has better compression than Elias-Fano when the exceptional values have a lot more bits than the normal values that fit in the low bits. In such cases Elias-Fano has to use an upperbound that is too large for effective compression. Also Elias-Fano could still be improved by adding block decoding and using that when most of the block is needed.
        Hide
        Adrien Grand added a comment -

        Updated patch: DISI.cost() now returns the cardinality of the set, computed at building time.

        Show
        Adrien Grand added a comment - Updated patch: DISI.cost() now returns the cardinality of the set, computed at building time.
        Hide
        Adrien Grand added a comment -

        Updated patch:

        • The number of exceptions per block (128 documents) is limited to 24 (~20%). Higher values are almost never picked so adding this limit helps make building faster by not taking into consideration numbers of bits per value that would involve more exceptions.
        • In order to never be significantly larger than a FixedBitSet, the blocks switch to unary encoding when it saves space.

        I will commit this patch soon if there is no objection.

        Show
        Adrien Grand added a comment - Updated patch: The number of exceptions per block (128 documents) is limited to 24 (~20%). Higher values are almost never picked so adding this limit helps make building faster by not taking into consideration numbers of bits per value that would involve more exceptions. In order to never be significantly larger than a FixedBitSet, the blocks switch to unary encoding when it saves space. I will commit this patch soon if there is no objection.
        Hide
        Jack Krupansky added a comment -

        The summary references release 3.0.1, but Kamikaze seems to be up to 3.0.7 now. Is the summary correct, or is a newer Kamikaze intended, or... is this simply a re-implementation of Kamikaze's algorithm rather than integration of some release of Kamikaze?

        Show
        Jack Krupansky added a comment - The summary references release 3.0.1, but Kamikaze seems to be up to 3.0.7 now. Is the summary correct, or is a newer Kamikaze intended, or... is this simply a re-implementation of Kamikaze's algorithm rather than integration of some release of Kamikaze?
        Hide
        Adrien Grand added a comment -

        Hi Jack. Indeed the patch consists more of a reimplementation in order to leverage existing pieces of code we have in Lucene for bit packing and to borrow some ideas from Daniel Lemire's FastPFor (https://github.com/lemire/JavaFastPFOR).

        Show
        Adrien Grand added a comment - Hi Jack. Indeed the patch consists more of a reimplementation in order to leverage existing pieces of code we have in Lucene for bit packing and to borrow some ideas from Daniel Lemire's FastPFor ( https://github.com/lemire/JavaFastPFOR ).
        Hide
        ASF subversion and git services added a comment -

        Commit 1516375 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1516375 ]

        LUCENE-2750: PForDeltaDocIdSet, an in-memory DocIdSet impl based on PFOR encoding.

        Show
        ASF subversion and git services added a comment - Commit 1516375 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1516375 ] LUCENE-2750 : PForDeltaDocIdSet, an in-memory DocIdSet impl based on PFOR encoding.
        Hide
        ASF subversion and git services added a comment -

        Commit 1516380 from Adrien Grand in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1516380 ]

        LUCENE-2750: PForDeltaDocIdSet, an in-memory DocIdSet impl based on PFOR encoding.

        Show
        ASF subversion and git services added a comment - Commit 1516380 from Adrien Grand in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1516380 ] LUCENE-2750 : PForDeltaDocIdSet, an in-memory DocIdSet impl based on PFOR encoding.
        Hide
        Adrien Grand added a comment -

        4.5 release -> bulk close

        Show
        Adrien Grand added a comment - 4.5 release -> bulk close

          People

          • Assignee:
            Adrien Grand
            Reporter:
            hao yan
          • Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 336h
              336h
              Remaining:
              Remaining Estimate - 336h
              336h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development