Lucene - Core
LUCENE-6645

BKD tree queries should use BitDocIdSet.Builder

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When I was iterating on BKD tree originally I remember trying to use this builder (which makes a sparse bit set at first and then upgrades to dense if enough bits get set) and being disappointed with its performance.

      I wound up just making a FixedBitSet every time, but this is obviously wasteful for small queries.

      It could be that the perf was poor because I was always .or'ing in DISIs that had 512-1024 hits each time (the size of each leaf cell in the BKD tree)? I also had to make my own DISI wrapper around each leaf cell... maybe that was the source of the slowness, not sure.

      I also sort of wondered whether the SmallDocSet in the spatial module (backed by a SentinelIntSet) might be faster ... though it'd need to be sorted in the end, after building, before returning to Lucene.
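
      For illustration, the sparse-then-dense idea boils down to something like this (a conceptual sketch with plain JDK types and an assumed threshold, not Lucene's BitDocIdSet.Builder):

      import java.util.BitSet;
      import java.util.HashSet;
      import java.util.Set;

      // Conceptual sketch only: collect hits in a cheap sparse structure first,
      // and switch to a dense bit set once enough docs have been collected.
      final class SparseThenDenseCollector {
        private final int maxDoc;
        private final int threshold;               // illustrative cut-over point
        private Set<Integer> sparse = new HashSet<>();
        private BitSet dense;                      // null until upgraded

        SparseThenDenseCollector(int maxDoc) {
          this.maxDoc = maxDoc;
          this.threshold = Math.max(1, maxDoc / 128);
        }

        void add(int doc) {
          if (dense != null) {
            dense.set(doc);
            return;
          }
          sparse.add(doc);
          if (sparse.size() >= threshold) {        // "enough bits set": go dense
            dense = new BitSet(maxDoc);
            for (int d : sparse) {
              dense.set(d);
            }
            sparse = null;
          }
        }
      }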

      Attachments

      1. LUCENE-6645.patch
        4 kB
        Michael McCandless
      2. LUCENE-6645.patch
        3 kB
        Michael McCandless
      3. LUCENE-6645.patch
        59 kB
        Adrien Grand
      4. LUCENE-6645.patch
        51 kB
        Adrien Grand
      5. LUCENE-6645.patch
        25 kB
        Adrien Grand
      6. LUCENE-6645.patch
        13 kB
        Michael McCandless
      7. LUCENE-6645-spatial.patch
        10 kB
        Adrien Grand

        Activity

        David Smiley added a comment -

        FYI I originally used FixedBitSet in the RPT based spatial filters (technically OpenBitSet originally) because at that time there was no Sparse impl with a builder. That's also why SmallDocSet exists. I'm curious to see what you find here regarding the performance; I'd like to use the bitset builder and remove SmallDocSet.

        Michael McCandless added a comment -

        Here's a patch, just restoring what I had in earlier iterations on the original BKD tree issue (LUCENE-6477) ... maybe I am doing something silly?

        The BKD test passes ... I'll compare performance vs current trunk (FixedBitSet every time).

        I had to make little reusable DISI classes to pass to the BitDocIdSet.Builder.or method.

        It could be that if we made the BKD tree wasteful by indexing prefix terms, so that the number of DISIs we need to or together is small, the perf hit wouldn't be so large ...
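
        For illustration, such a reusable DISI wrapper over one leaf cell's sorted docIDs could look roughly like this (a sketch, not the patch's actual class):

        import java.io.IOException;
        import org.apache.lucene.search.DocIdSetIterator;

        // Illustrative wrapper: iterates a sorted int[] of docIDs (e.g. one BKD leaf cell).
        final class SortedIntArrayDISI extends DocIdSetIterator {
          private final int[] docs;   // sorted, distinct docIDs
          private final int length;   // number of valid entries in docs
          private int i = -1;         // -1 means iteration has not started

          SortedIntArrayDISI(int[] docs, int length) {
            this.docs = docs;
            this.length = length;
          }

          @Override
          public int docID() {
            if (i < 0) {
              return -1;
            }
            return i < length ? docs[i] : NO_MORE_DOCS;
          }

          @Override
          public int nextDoc() throws IOException {
            i++;
            return docID();
          }

          @Override
          public int advance(int target) throws IOException {
            // a linear scan is enough for tiny cells (512-1024 docs)
            do {
              i++;
            } while (i < length && docs[i] < target);
            return docID();
          }

          @Override
          public long cost() {
            return length;
          }
        }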

        Michael McCandless added a comment -

        The lat/lons to index are here:

        http://people.apache.org/~mikemccand/latlon.subsetPlusAllLondon.txt.lzma

        It uncompresses to ~1.9 GB.

        Then run IndexAndSearchOpenStreetMaps.java in luceneutil/src/main/perf. (You have to edit the hard-coded path to this lat/lons input file.)

        Run it first with the createIndex call uncommented, then comment it out (you can just re-use that index to test searching).

        When I run this on trunk I get 1.54 sec for 225 "bboxes around London", and with the patch 3.89 seconds, or ~2.5X slower ...

        Adrien Grand added a comment -

        I played a bit with the benchmark and have similar results (1.76 sec for trunk and more than 4 sec with the patch). It's a worst case for BitDocIdSetBuilder given that it always starts to build a SparseFixedBitSet to eventually upgrade it to a FixedBitSet. But still it's disappointing that it's so slow compared to building a FixedBitSet directly.

        I've experimented with a more brute-force approach (see attached patch) that uses a plain int[] instead of a SparseFixedBitSet for the sparse case, and it seems to perform better: the benchmark runs in 1.76 sec on trunk and 2.70 sec with the patch if the builder is configured to use an int[] for up to maxDoc / 128 docs. It goes down to 1.96 sec with a threshold of maxDoc / 2048. Maybe this is what we should use instead of BitDocIdSetBuilder?
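
        For illustration, the build step for the sparse int[] case could look roughly like this (a sketch only; the attached patch uses a radix sort and Lucene's DocIdSet types):

        import java.util.Arrays;

        // Illustrative only: sort the raw int[] buffer once when building the final
        // doc ID set, dropping any duplicates, instead of paying per-add costs.
        final class DocIdBufferUtil {
          static int[] buildSortedDocIds(int[] buffer, int count) {
            Arrays.sort(buffer, 0, count);          // the patch uses a radix sort here
            int unique = 0;
            for (int i = 0; i < count; i++) {
              if (unique == 0 || buffer[i] != buffer[unique - 1]) {
                buffer[unique++] = buffer[i];
              }
            }
            return Arrays.copyOf(buffer, unique);
          }
        }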

        I tried to see how this affects our luceneutil benchmark and there is barely any change:

                            TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                          Fuzzy1       74.41     (18.3%)       69.59     (19.4%)   -6.5% ( -37% -   38%)
                         LowTerm      761.39      (2.4%)      749.20      (3.6%)   -1.6% (  -7% -    4%)
                    OrNotHighLow      877.81      (2.2%)      867.60      (5.3%)   -1.2% (  -8% -    6%)
                    OrHighNotMed       76.63      (2.1%)       75.89      (2.7%)   -1.0% (  -5% -    3%)
                         MedTerm      309.75      (1.3%)      306.86      (2.6%)   -0.9% (  -4% -    2%)
                      OrHighHigh       26.86      (5.4%)       26.64      (3.3%)   -0.8% (  -9% -    8%)
                   OrNotHighHigh       67.94      (1.0%)       67.42      (2.0%)   -0.8% (  -3% -    2%)
                        HighTerm      132.28      (1.4%)      131.29      (1.7%)   -0.7% (  -3% -    2%)
                         Respell       78.71      (2.8%)       78.14      (3.2%)   -0.7% (  -6% -    5%)
                       LowPhrase      121.23      (0.8%)      120.47      (1.3%)   -0.6% (  -2% -    1%)
                    OrHighNotLow      112.94      (2.3%)      112.25      (2.5%)   -0.6% (  -5% -    4%)
                    OrNotHighMed      223.81      (2.4%)      222.52      (3.8%)   -0.6% (  -6% -    5%)
                       OrHighLow       71.79      (4.3%)       71.39      (3.3%)   -0.6% (  -7% -    7%)
                     MedSpanNear       23.33      (1.1%)       23.21      (1.8%)   -0.5% (  -3% -    2%)
                     AndHighHigh       62.01      (1.9%)       61.71      (3.6%)   -0.5% (  -5% -    5%)
                       OrHighMed       41.79      (5.5%)       41.61      (3.6%)   -0.4% (  -9% -    9%)
                      AndHighMed       90.86      (2.0%)       90.61      (2.8%)   -0.3% (  -5% -    4%)
                HighSloppyPhrase       47.43      (4.6%)       47.33      (4.8%)   -0.2% (  -9% -    9%)
                      HighPhrase       28.36      (1.6%)       28.30      (1.3%)   -0.2% (  -3% -    2%)
                       MedPhrase      147.25      (1.4%)      146.99      (1.6%)   -0.2% (  -3% -    2%)
                 LowSloppyPhrase       37.07      (2.2%)       37.03      (2.3%)   -0.1% (  -4% -    4%)
                 MedSloppyPhrase      156.95      (3.7%)      156.80      (3.6%)   -0.1% (  -7% -    7%)
                     LowSpanNear       29.05      (2.2%)       29.02      (2.0%)   -0.1% (  -4% -    4%)
                   OrHighNotHigh       61.13      (1.5%)       61.08      (1.6%)   -0.1% (  -3% -    3%)
                    HighSpanNear       15.36      (1.7%)       15.36      (1.8%)    0.0% (  -3% -    3%)
                        Wildcard      111.57      (3.1%)      113.05      (2.1%)    1.3% (  -3% -    6%)
                          IntNRQ        7.49      (7.3%)        7.60      (5.2%)    1.4% ( -10% -   14%)
                         Prefix3       72.81      (4.6%)       74.18      (4.1%)    1.9% (  -6% -   11%)
                      AndHighLow      974.36      (3.0%)      994.46      (2.9%)    2.1% (  -3% -    8%)
                          Fuzzy2       47.42     (16.1%)       53.71     (16.5%)   13.3% ( -16% -   54%)
        

        I suspect this is because our multi-term queries in this benchmark match some high-frequency terms so the upgrade to a FixedBitSet happens quickly anyway.

        Michael McCandless added a comment -

        Thanks Adrien Grand, this is impressive! A radix sorter, an Adder abstraction to skip hot-loop if statements... looks like you had fun with hotspot.

        Did the Adder make a substantial improvement over the more straightforward if-per-add? Maybe we could just add a .grow which would pre-grow the int[] if necessary ... not sure.

        Most of the changes are in DocIdSetBuilder, and less so in BKDTreeReader. You might get a bit more speedup if instead of prepareAdd(1), you did the prepareAdd up front outside the loop with the worst-case count (number of docs in the leaf block)? I.e. this would reserve for the worst case (all docs pass the filter). These leaf blocks are smallish by default (up to 1024 docs), and worst case is this upgrades to a bitset a bit early? Not sure it will help that much, since most of the cells are visited via addAll...

        In each cell of the BKD tree the docIDs are sorted ... but we don't take advantage of that here (not sure how we would).

        Maybe we should try to contain the added hairiness to BKDTreeReader instead of DocIdSetBuilder if indeed this is the only user of this API that is so strange (or of tons of tiny sorted docID blocks)

        Adrien Grand added a comment -

        Thanks for having a look!

        Not sure it will help that much, since most of the cells are visited via addAll...

        Indeed I did not try to optimize here since my profiler did not see it as a hotspot.

        Maybe we should try to contain the added hairiness to BKDTreeReader instead of DocIdSetBuilder if indeed this is the only user of this API that is so strange (or of tons of tiny sorted docID blocks)

        I agree the API is a bit crazy now. It was a way to avoid checking the array length on every call to add(). I'll see how much it costs to remove this 'Adder' class in favour of a grow() method. However, adding tons of tiny sorted doc ID blocks is something that MultiTermQueries can do all the time too, so I'm quite happy that your benchmark revealed this inefficiency.

        Adrien Grand added a comment -

        Here is a patch with less API craziness that should still provide similar performance. It removes the Adder intermediate and uses a grow() method instead, as suggested.
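
        Roughly, the intended usage pattern looks like this (a sketch under assumed names, not the committed DocIdSetBuilder): callers reserve capacity with grow() once per block of docs, so add() needs no bounds check in the hot loop.

        import java.util.Arrays;

        // Sketch of the grow()/add() contract: grow() reserves space up front,
        // add() just writes (illustrative class, not Lucene's DocIdSetBuilder).
        final class GrowThenAddBuffer {
          private int[] docs = new int[16];
          private int count;

          /** Reserve room for up to numDocs additional doc IDs. */
          void grow(int numDocs) {
            int required = count + numDocs;
            if (required > docs.length) {
              docs = Arrays.copyOf(docs, Math.max(required, 2 * docs.length));
            }
          }

          /** Caller must have reserved enough room via grow() first. */
          void add(int doc) {
            docs[count++] = doc;   // no capacity check in the hot loop
          }
        }

        A BKD leaf visitor could then call grow() once with the leaf's doc count before scanning a cell (the up-front, worst-case reservation suggested above), rather than reserving per document.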

        Michael McCandless added a comment -

        Thanks Adrien Grand!

        The spatial module is angry with the patch... but I was able to run the "2% of OpenStreetMap points" benchmark.

        Today (FixedBitSet) it's 1.56 seconds to run 225 queries, and with the last patch it's 2.35 seconds which is a great improvement over where I started here (3.89 seconds)!

        Adrien Grand added a comment -

        Whoops, I forgot to svn add a file; will upload a fixed patch tomorrow. Thanks for confirming it is faster now!

        Adrien Grand added a comment -

        Here is a fixed patch.

        ASF subversion and git services added a comment -

        Commit 1690026 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1690026 ]

        LUCENE-6645: Optimized DocIdSet building for the "many small postings lists" case.

        Michael McCandless added a comment -

        Thanks Adrien Grand!

        ASF subversion and git services added a comment -

        Commit 1690041 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1690041 ]

        LUCENE-6645: Optimized DocIdSet building for the "many small postings lists" case.

        Michael McCandless added a comment -

        I think with this small change to the builder, we can save one if check in the add method.

        In the BKD test this gives a good gain ... from 3.08 sec in trunk down to 2.32 sec ...

        Adrien Grand added a comment -

        +1 I'm curious if you know what part makes it so much faster? Is it limiting the maximum array size to threshold at most, or the fact that some ArrayUtil logic is now inlined in the builder?

        Michael McCandless added a comment -

        Slightly improved patch: I moved an assert higher up, and made the calls to growBuffer consistently happen only when < threshold.

        I'm not sure why the gains are so high, it must be hotspot silliness...

        Adrien Grand added a comment -

        +1 thanks for fixing the javadocs

        ASF subversion and git services added a comment -

        Commit 1690175 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1690175 ]

        LUCENE-6645: optimize DocIdSetBuilder a bit more

        ASF subversion and git services added a comment -

        Commit 1690176 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1690176 ]

        LUCENE-6645: optimize DocIdSetBuilder a bit more

        David Smiley added a comment -

        I took a look at the code out of curiosity and I see that "BitDocIdSetBuilder" is the old implementation and that it's been kept around in the spatial module for the IntersectsRPTVerifyQuery. I think that deserved mention in the commentary here. Can't we remove it, and have IntersectsRPTVerifyQuery use the new DocIdSetBuilder? I suspect this was done because of BitDocIdSetBuilder.isDefinitelyEmpty(); yes? If so can't we add a similar method to DocIdSetBuilder?

        Shouldn't QueryBitSetProducer in the "join" module use RoaringDocIdSet for its cached docIdSets instead of the Fixed/Sparse choice chosen by the new BitSet.of method added in this patch? RoaringDocIdSet is ideal for caches; no?

        Adrien Grand added a comment -

        Sorry David, indeed I should have mentioned it. When I looked at IntersectsRPTVerifyQuery, I saw it was using the produced bits so I thought it actually needs bit sets, but maybe it doesn't and we could just use advance()? Regarding isDefinitelyEmpty, I'm wondering if we could keep the builders initially empty and then instantiate them the first time we need to add data? Then we could use a null check to know whether they have any content at all; would it work?
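
        For illustration, that lazy-initialization idea could look roughly like this (class and method names are made up):

        // Keep the builder null until the first hit arrives; a null check then
        // answers "definitely empty?" without a dedicated isDefinitelyEmpty() method.
        final class LazyDocIdCollector {
          /** Minimal stand-in for a doc-ID builder; illustrative only. */
          static final class DocIdBuffer {
            private int[] docs = new int[8];
            private int count;
            void add(int doc) {
              if (count == docs.length) {
                docs = java.util.Arrays.copyOf(docs, 2 * docs.length);
              }
              docs[count++] = doc;
            }
          }

          private DocIdBuffer buffer;   // null means "no hits collected yet"

          void collect(int doc) {
            if (buffer == null) {
              buffer = new DocIdBuffer();
            }
            buffer.add(doc);
          }

          boolean isEmpty() {
            return buffer == null;
          }
        }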

        Shouldn't QueryBitSetProducer in the "join" module use RoaringDocIdSet for its cached docIdSets instead of the Fixed/Sparse choice chosen by the new BitSet.of method added in this patch? RoaringDocIdSet is ideal for caches; no?

        RoaringDocIdSet is indeed our best option for caching. However, the join module needs random access, and in particular nextSetBit/prevSetBit operations, which we can't provide with RoaringDocIdSet. RoaringDocIdSet could potentially use binary search on blocks that are represented with a short[] for random access, which ought to be fast given that short[] blocks can only contain 4096 documents at most (when there are more docs, we use a bit set), but it was still much slower than random access on SparseFixedBitSet and FixedBitSet, so I preferred not to expose random access which might be a performance trap.
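
        For illustration, random access on such a short[] block boils down to an unsigned binary search (a sketch of the idea only; RoaringDocIdSet does not expose this):

        // One Roaring-style "array" block: the low 16 bits of up to 4096 doc IDs,
        // stored sorted, so membership is a binary search over unsigned shorts.
        final class ShortArrayBlock {
          private final short[] docs;   // sorted low 16 bits of the docs in this block

          ShortArrayBlock(short[] sortedDocs) {
            this.docs = sortedDocs;
          }

          boolean contains(int docInBlock) {   // docInBlock in [0, 65536)
            int lo = 0, hi = docs.length - 1;
            while (lo <= hi) {
              int mid = (lo + hi) >>> 1;
              int midDoc = docs[mid] & 0xFFFF;  // compare as unsigned
              if (midDoc < docInBlock) {
                lo = mid + 1;
              } else if (midDoc > docInBlock) {
                hi = mid - 1;
              } else {
                return true;
              }
            }
            return false;
          }
        }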

        David Smiley added a comment -

        When I looked at IntersectsRPTVerifyQuery, I saw it was using the produced bits so I thought it actually needs bit sets, but maybe it doesn't and we could just use advance()?

        It doesn't need bit sets (random-access). I did a little playing around just now and saw DocIdSetBuilder plugged in easily except for isDefinitelyEmpty(). ...

        Regarding isDefinitelyEmpty, I'm wondering if we could keep the builders initially empty and then instantiate them the first time we need to add data? Then we could use a null check to know whether they have any content at all; would it work?

        We could, but that's a little more error-prone (null check) & more code than simply having an isDefinitelyEmpty() method. In fact it would simply be isEmpty() for DocIdSetBuilder, as it has a definitive answer. Nonetheless, if you feel this method is somehow a bad idea then we can proceed with your suggestion.

        RE RoaringDocIdSet – that's very interesting; thanks for the background. Perhaps a comment in QueryBitSetProducer would clarify why this choice is made.

        Adrien Grand added a comment -

        I gave it a try without adding a method to DocIdSetBuilder and by using iterators of the returned DocIdSets instead of using random-access. David, maybe you could have a look?
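
        For illustration, the iterator-based consumption pattern is just a nextDoc() loop, something like this (a sketch, not the actual IntersectsRPTVerifyQuery change):

        import java.io.IOException;
        import java.util.function.IntConsumer;
        import org.apache.lucene.search.DocIdSet;
        import org.apache.lucene.search.DocIdSetIterator;

        // Walk a DocIdSet sequentially instead of asking it for random-access bits.
        final class DocIdSetWalker {
          static void forEachDoc(DocIdSet set, IntConsumer consumer) throws IOException {
            DocIdSetIterator it = set.iterator();
            if (it == null) {   // a null iterator means the set matches no documents
              return;
            }
            for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
              consumer.accept(doc);
            }
          }
        }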

        David Smiley added a comment -

        Looks great Adrien! I took a careful look at it. +1

        Adrien Grand added a comment -

        Thanks for reviewing David, I'll commit shortly.

        ASF subversion and git services added a comment -

        Commit 1691335 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1691335 ]

        LUCENE-6645: Remove BitDocIdSetBuilder from lucene-spatial as well.

        ASF subversion and git services added a comment -

        Commit 1691344 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1691344 ]

        LUCENE-6645: Remove BitDocIdSetBuilder from lucene-spatial as well.

        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release


          People

          • Assignee:
            Unassigned
          • Reporter:
            Michael McCandless