Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Robert pointed me to this paper: http://arxiv.org/pdf/1402.6407v4 that describes an interesting way to build doc id sets: The bit space is divided into blocks of 2^16 bits so that you can store the bits which are set either in a short[] (2 bytes per doc ID) or in a FixedBitSet. The choice is easy, if less than 2^12 bits are set, then the short[] representation is more compact otherwise a FixedBitSet would be more compact. It's quite similar to the way that Solr builds DocSets in SolrIndexSearcher.getDocSet(DocsEnumState).

      1. LUCENE-5983.patch
        24 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        Here are a patch and the results of the benchmark on this new DocIdSet: http://people.apache.org/~jpountz/doc_id_sets4.html

        Compared to the paper, I also optimized the super-dense case in order to encode the negation of the set to save space (like WAH8DocIdSet does). I like the fact that this set is overall very fast, especially to build, which is an interesting characteristic for caching, so I replaced WAH8DocIdSet with this new DocIdSet in CachingWrapperFilter. Another nice property is that it is naturally indexed, it doesn't need a side-car data-structure like the WAH8 and PFOR doc id sets: in the array case, you can just perform a binary search, and in the FixedBitSet case you already have random-access.

        In order to avoid the profusion of doc id sets, I'm thinking of removing the WAH8 and PFOR doc id sets since this one looks better to me in most areas.

        Show
        Adrien Grand added a comment - Here are a patch and the results of the benchmark on this new DocIdSet: http://people.apache.org/~jpountz/doc_id_sets4.html Compared to the paper, I also optimized the super-dense case in order to encode the negation of the set to save space (like WAH8DocIdSet does). I like the fact that this set is overall very fast, especially to build, which is an interesting characteristic for caching, so I replaced WAH8DocIdSet with this new DocIdSet in CachingWrapperFilter. Another nice property is that it is naturally indexed, it doesn't need a side-car data-structure like the WAH8 and PFOR doc id sets: in the array case, you can just perform a binary search, and in the FixedBitSet case you already have random-access. In order to avoid the profusion of doc id sets, I'm thinking of removing the WAH8 and PFOR doc id sets since this one looks better to me in most areas.
        Hide
        David Smiley added a comment -

        Cool. I also like that it's worst-case in advance() is not too shabby.

        Why remove WAH8 & PFOR yet not also Elias-Fano? Because EF compresses the most and doesn't perform as bad as those two in most advance() scenarios?

        Show
        David Smiley added a comment - Cool. I also like that it's worst-case in advance() is not too shabby. Why remove WAH8 & PFOR yet not also Elias-Fano? Because EF compresses the most and doesn't perform as bad as those two in most advance() scenarios?
        Hide
        Adrien Grand added a comment -

        Why remove WAH8 & PFOR yet not also Elias-Fano? Because EF compresses the most and doesn't perform as bad as those two in most advance() scenarios?

        I have to admit I don't know this set as well as the PFOR and WAH8 ones, but indeed it seems to compress very efficiently sparse sets and also has the nice property that WAH8 and PFOR miss that it is naturally indexed, ie. it doesn't need a side-car data-structure in order to be able to skip efficiently.

        Show
        Adrien Grand added a comment - Why remove WAH8 & PFOR yet not also Elias-Fano? Because EF compresses the most and doesn't perform as bad as those two in most advance() scenarios? I have to admit I don't know this set as well as the PFOR and WAH8 ones, but indeed it seems to compress very efficiently sparse sets and also has the nice property that WAH8 and PFOR miss that it is naturally indexed, ie. it doesn't need a side-car data-structure in order to be able to skip efficiently.
        Hide
        ASF subversion and git services added a comment -

        Commit 1629605 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1629605 ]

        LUCENE-5983: CachingWrapperFilter now uses RoaringDocIdSet.

        Show
        ASF subversion and git services added a comment - Commit 1629605 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1629605 ] LUCENE-5983 : CachingWrapperFilter now uses RoaringDocIdSet.
        Hide
        ASF subversion and git services added a comment -

        Commit 1629606 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1629606 ]

        LUCENE-5983: CachingWrapperFilter now uses RoaringDocIdSet.

        Show
        ASF subversion and git services added a comment - Commit 1629606 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1629606 ] LUCENE-5983 : CachingWrapperFilter now uses RoaringDocIdSet.
        Hide
        Adrien Grand added a comment -

        Committed to trunk and branch_5x. I also opened /LUCENE-5993 to discuss the removal of the WAH8 and PForDelta sets.

        Show
        Adrien Grand added a comment - Committed to trunk and branch_5x. I also opened / LUCENE-5993 to discuss the removal of the WAH8 and PForDelta sets.
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            1 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development