Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be able to encode the inverse set to also compress very dense sets.

      1. LUCENE-5150.patch
        13 kB
        Adrien Grand

        Activity

        Adrien Grand added a comment -

        Here is a patch. It reserves an additional bit in the header to say whether the encoding should be "inversed" (meaning clean words are actually 0xFF instead of 0x00).

        It should reduce the amount of memory required to build and store dense sets. In spite of this change, compression ratios remain the same for sparse sets.

        For random dense sets, I observed compression ratios of 87% when the load factor is 90% and 20% when the load factor is 99% (vs. 100% before).
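        To illustrate the idea of an "inversed" flag, here is a minimal sketch. It is not the actual WAH8DocIdSet format or the code in the patch: the class name, the header layout, and the (clean run length, dirty count, dirty bytes) grouping are all assumptions made for the example. It only shows why treating 0xFF as the clean word lets a dense set compress as well as a sparse one.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch, NOT the real WAH8DocIdSet encoding.
final class InverseEncodingSketch {

    // Layout (assumed for this example): 1 header byte (1 = inverse, i.e. clean
    // words are 0xFF), then repeated groups of
    // [cleanRunLength][dirtyCount][dirty bytes...], each count at most 255.
    static byte[] encode(byte[] words) {
        int zeros = 0, ones = 0;
        for (byte w : words) {
            if (w == 0) zeros++;
            else if (w == (byte) 0xFF) ones++;
        }
        // Pick whichever clean word is more frequent.
        boolean inverse = ones > zeros;
        byte clean = inverse ? (byte) 0xFF : 0;

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(inverse ? 1 : 0);
        int i = 0;
        while (i < words.length) {
            int cleanRun = 0;
            while (i < words.length && words[i] == clean && cleanRun < 255) {
                i++;
                cleanRun++;
            }
            int dirtyStart = i;
            while (i < words.length && words[i] != clean && i - dirtyStart < 255) {
                i++;
            }
            out.write(cleanRun);
            out.write(i - dirtyStart);
            out.write(words, dirtyStart, i - dirtyStart);
        }
        return out.toByteArray();
    }

    // Reverses encode(); numWords is the number of 8-bit words in the original set.
    static byte[] decode(byte[] encoded, int numWords) {
        byte clean = encoded[0] == 1 ? (byte) 0xFF : 0;
        byte[] words = new byte[numWords];
        int in = 1, out = 0;
        while (in < encoded.length) {
            int cleanRun = encoded[in++] & 0xFF;
            int dirty = encoded[in++] & 0xFF;
            Arrays.fill(words, out, out + cleanRun, clean);
            out += cleanRun;
            System.arraycopy(encoded, in, words, out, dirty);
            in += dirty;
            out += dirty;
        }
        return words;
    }
}
```

        In this sketch, a set whose words are almost all 0xFF collapses into a few run-length groups, exactly as an almost-empty set would with 0x00 as the clean word. Without the flag, every 0xFF word would have to be stored as a dirty byte.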

        Adrien Grand added a comment -

        I'll commit soon if there is no objection. These dense sets can be common, e.g. in cases where users are allowed to see everything except a few documents.

        Robert Muir added a comment -

        Thanks Adrien. I am also curious whether it's possible for you to re-run http://people.apache.org/~jpountz/doc_id_sets.html

        Because now, with smaller sets in the dense case, maybe there is no need for wacky heuristics in CachingWrapperFilter and we could just always cache (I am sure some cases would be slower, but if in general it's faster...). This would really simplify LUCENE-5101.

        ASF subversion and git services added a comment -

        Commit 1512422 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1512422 ]

        LUCENE-5150: Better compression of dense sets with WAH8DocIdSet.

        ASF subversion and git services added a comment -

        Commit 1512423 from Adrien Grand in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1512423 ]

        LUCENE-5150: Better compression of dense sets with WAH8DocIdSet.

        Adrien Grand added a comment -

        Robert, I commented on LUCENE-5101 with an updated version of the benchmark.


          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0
            Watchers:
            2
