Lucene - Core
  1. Lucene - Core
  2. LUCENE-4236

clean up booleanquery conjunction optimizations a bit

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      After LUCENE-3505, I want to do a slight cleanup:

      • compute the term conjunctions optimization in scorer(), so its applied even if we have optional and prohibited clauses that dont exist in the segment (e.g. return null)
      • use the term conjunctions optimization when optional.size() == minShouldMatch, as that means they are all mandatory, too.
      • don't return booleanscorer1 when optional.size() == minShouldMatch, because it means we have required clauses and in general BS2 should do a much better job (e.g. use advance).
      1. LUCENE-4236.patch
        35 kB
        Robert Muir
      2. LUCENE-4236.patch
        35 kB
        Robert Muir
      3. LUCENE-4236.patch
        73 kB
        Robert Muir
      4. LUCENE-4236.patch
        46 kB
        Robert Muir
      5. LUCENE-4236.patch
        6 kB
        Robert Muir
      6. LUCENE-4236.patch
        6 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          updated patch with a code comment

          Show
          Robert Muir added a comment - updated patch with a code comment
          Hide
          Simon Willnauer added a comment -

          cool nice little optimization! looks good to me while you could add that minShouldMatch only applies to optional scorers so nobody needs to hunt that down again

          Show
          Simon Willnauer added a comment - cool nice little optimization! looks good to me while you could add that minShouldMatch only applies to optional scorers so nobody needs to hunt that down again
          Hide
          Robert Muir added a comment -

          I removed reference to the SOLR issue, i thought for some reason it was a performance problem, but that was just faulty brain cells (its just some unrelated parsing bug).

          Still it seems like apps like Solr might be generating these disjunctions with minShouldmatch = optional.size() and we should handle them as conjunctions always.

          Show
          Robert Muir added a comment - I removed reference to the SOLR issue, i thought for some reason it was a performance problem, but that was just faulty brain cells (its just some unrelated parsing bug). Still it seems like apps like Solr might be generating these disjunctions with minShouldmatch = optional.size() and we should handle them as conjunctions always.
          Hide
          Robert Muir added a comment -

          The previous patch had a bug: coord was computed for the conjunction as scorers.length/scorers.length but it should be scorers.length/maxCoord (in the case of optional clauses that happened to not exist in the segment).

          Random testing is great.

          This patch removes the specialized TermConjunctionScorer, making the TermConjunctionScorer algorithm ConjunctionScorer. I added a method to Scorer so that any scorer can return an estimated cost of how many postings it will read:

          • term: docFreq()
          • disjunction: sum(cost)
          • conjunction/phrase: min(cost) * numSubs

          and so on, which ConjunctionScorer uses.

          No perf degradation... refactoring continues

          Show
          Robert Muir added a comment - The previous patch had a bug: coord was computed for the conjunction as scorers.length/scorers.length but it should be scorers.length/maxCoord (in the case of optional clauses that happened to not exist in the segment). Random testing is great. This patch removes the specialized TermConjunctionScorer, making the TermConjunctionScorer algorithm ConjunctionScorer. I added a method to Scorer so that any scorer can return an estimated cost of how many postings it will read: term: docFreq() disjunction: sum(cost) conjunction/phrase: min(cost) * numSubs and so on, which ConjunctionScorer uses. No perf degradation... refactoring continues
          Hide
          Robert Muir added a comment -

          Current benchmarks with the generalized conjunction scorer:

                          Task   QPS trunkStdDev trunk   QPS patchStdDev patch      Pct diff
                        Phrase       15.69        0.51       15.29        0.30   -7% -    2%
                    AndHighMed       44.35        0.55       43.34        0.63   -4% -    0%
                        Fuzzy1       83.95        2.84       83.05        3.35   -8% -    6%
                       Respell       75.00        3.43       74.30        3.79  -10% -    9%
                  SloppyPhrase        7.05        0.44        7.00        0.17   -8% -    8%
                TermBGroup1M1P       58.41        1.15       58.17        0.96   -3% -    3%
                   TermGroup1M       30.72        0.14       30.73        0.32   -1% -    1%
                        Fuzzy2       33.81        1.67       33.87        1.26   -8% -    9%
                  TermBGroup1M       48.73        0.34       48.89        0.27    0% -    1%
                   AndHighHigh        8.04        0.16        8.09        0.10   -2% -    3%
                      PKLookup      296.11        2.98      298.19        3.29   -1% -    2%
                          Term      109.66        3.70      110.69        2.76   -4% -    7%
                      Wildcard       61.99        0.80       63.00        2.53   -3% -    7%
                       Prefix3       70.64        1.63       72.07        3.21   -4% -    9%
                     OrHighMed       21.48        1.29       21.94        1.05   -8% -   13%
                    OrHighHigh        8.48        0.47        8.68        0.40   -7% -   13%
                      SpanNear        7.70        0.37        7.96        0.41   -6% -   14%
                        IntNRQ        8.89        0.52        9.40        0.96  -10% -   23%
          

          Luceneutil doesnt yet benchmark more complicated BQs (e.g. nested ones, or minShouldMatch, or whatever).
          So we don't see any benefit in these benchmarks.

          Show
          Robert Muir added a comment - Current benchmarks with the generalized conjunction scorer: Task QPS trunkStdDev trunk QPS patchStdDev patch Pct diff Phrase 15.69 0.51 15.29 0.30 -7% - 2% AndHighMed 44.35 0.55 43.34 0.63 -4% - 0% Fuzzy1 83.95 2.84 83.05 3.35 -8% - 6% Respell 75.00 3.43 74.30 3.79 -10% - 9% SloppyPhrase 7.05 0.44 7.00 0.17 -8% - 8% TermBGroup1M1P 58.41 1.15 58.17 0.96 -3% - 3% TermGroup1M 30.72 0.14 30.73 0.32 -1% - 1% Fuzzy2 33.81 1.67 33.87 1.26 -8% - 9% TermBGroup1M 48.73 0.34 48.89 0.27 0% - 1% AndHighHigh 8.04 0.16 8.09 0.10 -2% - 3% PKLookup 296.11 2.98 298.19 3.29 -1% - 2% Term 109.66 3.70 110.69 2.76 -4% - 7% Wildcard 61.99 0.80 63.00 2.53 -3% - 7% Prefix3 70.64 1.63 72.07 3.21 -4% - 9% OrHighMed 21.48 1.29 21.94 1.05 -8% - 13% OrHighHigh 8.48 0.47 8.68 0.40 -7% - 13% SpanNear 7.70 0.37 7.96 0.41 -6% - 14% IntNRQ 8.89 0.52 9.40 0.96 -10% - 23% Luceneutil doesnt yet benchmark more complicated BQs (e.g. nested ones, or minShouldMatch, or whatever). So we don't see any benefit in these benchmarks.
          Hide
          Robert Muir added a comment -

          Updated patch... I tried to speed up BS2 some more here in this patch:

                         Task   QPS trunkStdDev trunk   QPS patchStdDev patch      Pct diff
                       Respell       76.69        1.75       74.06        1.36   -7% -    0%
                    AndHighMed       89.69        1.85       86.70        2.83   -8% -    1%
                      SpanNear        2.75        0.09        2.70        0.07   -7% -    4%
                        Fuzzy2       34.94        0.66       34.46        0.54   -4% -    2%
                        Fuzzy1      115.28        2.24      113.81        1.59   -4% -    2%
                   AndHighHigh       12.97        0.27       12.84        0.34   -5% -    3%
                  TermBGroup1M       47.87        0.27       47.42        0.51   -2% -    0%
                TermBGroup1M1P       53.84        1.08       53.35        0.81   -4% -    2%
                   TermGroup1M       42.55        0.51       42.36        0.56   -2% -    2%
                      PKLookup      298.51        0.80      297.81        2.68   -1% -    0%
                      Wildcard       20.66        1.13       20.75        1.09   -9% -   11%
                        Phrase        6.64        0.26        6.67        0.31   -7% -    9%
                       Prefix3       32.82        1.69       33.07        1.72   -9% -   11%
                  SloppyPhrase       26.12        0.38       26.46        0.42   -1% -    4%
                        IntNRQ       11.24        1.55       11.39        1.38  -21% -   31%
                          Term      138.34        0.93      141.27        8.71   -4% -    9%
                    OrHighHigh        7.28        0.38        7.71        0.50   -5% -   19%
                     OrHighMed       20.29        1.21       21.81        1.61   -6% -   22%
          

          It could use some code comments and cleanup but its time for a break

          Show
          Robert Muir added a comment - Updated patch... I tried to speed up BS2 some more here in this patch: Task QPS trunkStdDev trunk QPS patchStdDev patch Pct diff Respell 76.69 1.75 74.06 1.36 -7% - 0% AndHighMed 89.69 1.85 86.70 2.83 -8% - 1% SpanNear 2.75 0.09 2.70 0.07 -7% - 4% Fuzzy2 34.94 0.66 34.46 0.54 -4% - 2% Fuzzy1 115.28 2.24 113.81 1.59 -4% - 2% AndHighHigh 12.97 0.27 12.84 0.34 -5% - 3% TermBGroup1M 47.87 0.27 47.42 0.51 -2% - 0% TermBGroup1M1P 53.84 1.08 53.35 0.81 -4% - 2% TermGroup1M 42.55 0.51 42.36 0.56 -2% - 2% PKLookup 298.51 0.80 297.81 2.68 -1% - 0% Wildcard 20.66 1.13 20.75 1.09 -9% - 11% Phrase 6.64 0.26 6.67 0.31 -7% - 9% Prefix3 32.82 1.69 33.07 1.72 -9% - 11% SloppyPhrase 26.12 0.38 26.46 0.42 -1% - 4% IntNRQ 11.24 1.55 11.39 1.38 -21% - 31% Term 138.34 0.93 141.27 8.71 -4% - 9% OrHighHigh 7.28 0.38 7.71 0.50 -5% - 19% OrHighMed 20.29 1.21 21.81 1.61 -6% - 22% It could use some code comments and cleanup but its time for a break
          Hide
          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Show
          Robert Muir added a comment - rmuir20120906-bulk-40-change
          Hide
          Michael McCandless added a comment -

          +1

          This patch looks great!

          It cleans up BS2 and specialized term conjunction scorer, and makes more accurate decisions about which sub-scorer to enumerate first (no more first docID heuristic).

          We could also use the cost estimate to sometimes let BooleanScorer take MUST clauses.

          Show
          Michael McCandless added a comment - +1 This patch looks great! It cleans up BS2 and specialized term conjunction scorer, and makes more accurate decisions about which sub-scorer to enumerate first (no more first docID heuristic). We could also use the cost estimate to sometimes let BooleanScorer take MUST clauses.
          Hide
          Michael McCandless added a comment -

          Maybe we should make it .estimateHitCount instead of estimateCost, so it's more explicit?

          Show
          Michael McCandless added a comment - Maybe we should make it .estimateHitCount instead of estimateCost, so it's more explicit?
          Hide
          Robert Muir added a comment -

          There's a lot of things i'm not happy with in the patch, i think it was more of an exploration of ideas.

          I think we could split out the cost/hitcount/conjunctionscorer idea into a separate issue as a start?

          This would keep things contained.

          Show
          Robert Muir added a comment - There's a lot of things i'm not happy with in the patch, i think it was more of an exploration of ideas. I think we could split out the cost/hitcount/conjunctionscorer idea into a separate issue as a start? This would keep things contained.
          Hide
          Simon Willnauer added a comment -

          Parts of this issue are contained in LUCENE-4607.

          Maybe we should make it .estimateHitCount instead of estimateCost, so it's more explicit?

          I changed this to estimateDocCount since its now on DISI

          Show
          Simon Willnauer added a comment - Parts of this issue are contained in LUCENE-4607 . Maybe we should make it .estimateHitCount instead of estimateCost, so it's more explicit? I changed this to estimateDocCount since its now on DISI
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.
          Hide
          Robert Muir added a comment -

          Patch just synced to trunk. Some of the previous stuff in it has been committed.

          Its still not ready to be committed: there is a lot of messy/bogus stuff I did. Will maybe make a branch and try to clean it up more.

          Show
          Robert Muir added a comment - Patch just synced to trunk. Some of the previous stuff in it has been committed. Its still not ready to be committed: there is a lot of messy/bogus stuff I did. Will maybe make a branch and try to clean it up more.
          Hide
          ASF subversion and git services added a comment -

          Commit 1596372 from Robert Muir in branch 'dev/branches/lucene4236'
          [ https://svn.apache.org/r1596372 ]

          LUCENE-4236: create branch to try to cleanup

          Show
          ASF subversion and git services added a comment - Commit 1596372 from Robert Muir in branch 'dev/branches/lucene4236' [ https://svn.apache.org/r1596372 ] LUCENE-4236 : create branch to try to cleanup
          Hide
          ASF subversion and git services added a comment -

          Commit 1596373 from Robert Muir in branch 'dev/branches/lucene4236'
          [ https://svn.apache.org/r1596373 ]

          LUCENE-4236: commit current state

          Show
          ASF subversion and git services added a comment - Commit 1596373 from Robert Muir in branch 'dev/branches/lucene4236' [ https://svn.apache.org/r1596373 ] LUCENE-4236 : commit current state
          Hide
          ASF subversion and git services added a comment -

          Commit 1596440 from Robert Muir in branch 'dev/branches/lucene4236'
          [ https://svn.apache.org/r1596440 ]

          LUCENE-4236: try to make more palatable

          Show
          ASF subversion and git services added a comment - Commit 1596440 from Robert Muir in branch 'dev/branches/lucene4236' [ https://svn.apache.org/r1596440 ] LUCENE-4236 : try to make more palatable
          Hide
          ASF subversion and git services added a comment -

          Commit 1596442 from Robert Muir in branch 'dev/branches/lucene4236'
          [ https://svn.apache.org/r1596442 ]

          LUCENE-4236: move all crazies into one place

          Show
          ASF subversion and git services added a comment - Commit 1596442 from Robert Muir in branch 'dev/branches/lucene4236' [ https://svn.apache.org/r1596442 ] LUCENE-4236 : move all crazies into one place
          Hide
          Robert Muir added a comment -

          I tried to clean this up as much as I can...

          OR tasks look fine (with BS1 disabled on both checkouts):

                              Task   QPS trunk      StdDev   QPS patch      StdDev                Pct diff
                      OrNotHighLow      958.58      (2.1%)      988.37      (2.7%)    3.1% (  -1% -    8%)
                        OrHighHigh       18.76      (8.2%)       19.83     (15.1%)    5.7% ( -16% -   31%)
                         OrHighMed       43.12      (8.3%)       45.64     (15.3%)    5.9% ( -16% -   32%)
                         OrHighLow       44.76      (9.1%)       47.92     (16.9%)    7.1% ( -17% -   36%)
                      OrNotHighMed      168.34      (3.4%)      189.12      (4.3%)   12.3% (   4% -   20%)
                     OrNotHighHigh       45.16      (3.5%)       60.43     (12.2%)   33.8% (  17% -   51%)
                     OrHighNotHigh       27.07      (3.4%)       36.27     (12.6%)   34.0% (  17% -   51%)
                      OrHighNotMed       79.81      (3.7%)      111.10     (15.3%)   39.2% (  19% -   60%)
                      OrHighNotLow       96.78      (4.0%)      137.49     (17.1%)   42.1% (  20% -   65%)
          

          It would also be nice to run the "eg.filter.tasks" but these are currently broken in luceneutil.

          Show
          Robert Muir added a comment - I tried to clean this up as much as I can... OR tasks look fine (with BS1 disabled on both checkouts): Task QPS trunk StdDev QPS patch StdDev Pct diff OrNotHighLow 958.58 (2.1%) 988.37 (2.7%) 3.1% ( -1% - 8%) OrHighHigh 18.76 (8.2%) 19.83 (15.1%) 5.7% ( -16% - 31%) OrHighMed 43.12 (8.3%) 45.64 (15.3%) 5.9% ( -16% - 32%) OrHighLow 44.76 (9.1%) 47.92 (16.9%) 7.1% ( -17% - 36%) OrNotHighMed 168.34 (3.4%) 189.12 (4.3%) 12.3% ( 4% - 20%) OrNotHighHigh 45.16 (3.5%) 60.43 (12.2%) 33.8% ( 17% - 51%) OrHighNotHigh 27.07 (3.4%) 36.27 (12.6%) 34.0% ( 17% - 51%) OrHighNotMed 79.81 (3.7%) 111.10 (15.3%) 39.2% ( 19% - 60%) OrHighNotLow 96.78 (4.0%) 137.49 (17.1%) 42.1% ( 20% - 65%) It would also be nice to run the "eg.filter.tasks" but these are currently broken in luceneutil.
          Hide
          ASF subversion and git services added a comment -

          Commit 1596606 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1596606 ]

          LUCENE-4236: add a new test for crazy corner cases of coord() handling

          Show
          ASF subversion and git services added a comment - Commit 1596606 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1596606 ] LUCENE-4236 : add a new test for crazy corner cases of coord() handling
          Hide
          ASF subversion and git services added a comment -

          Commit 1596607 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1596607 ]

          LUCENE-4236: add a new test for crazy corner cases of coord() handling

          Show
          ASF subversion and git services added a comment - Commit 1596607 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1596607 ] LUCENE-4236 : add a new test for crazy corner cases of coord() handling
          Hide
          Robert Muir added a comment -

          Here's the patch. I think its ready.

          I committed the new test already to trunk/4.x.

          Show
          Robert Muir added a comment - Here's the patch. I think its ready. I committed the new test already to trunk/4.x.
          Hide
          Michael McCandless added a comment -

          +1, this is a great cleanup: more understandable than what we have today.

          Maybe we should leave FilterScorer package private until there's a need for public?

          Show
          Michael McCandless added a comment - +1, this is a great cleanup: more understandable than what we have today. Maybe we should leave FilterScorer package private until there's a need for public?
          Hide
          Robert Muir added a comment -

          Good idea. If we have a need somewhere else we can open it up.

          Show
          Robert Muir added a comment - Good idea. If we have a need somewhere else we can open it up.
          Hide
          ASF subversion and git services added a comment -

          Commit 1596640 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1596640 ]

          LUCENE-4236: cleanup/optimize BooleanScorer in-order creation

          Show
          ASF subversion and git services added a comment - Commit 1596640 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1596640 ] LUCENE-4236 : cleanup/optimize BooleanScorer in-order creation
          Hide
          ASF subversion and git services added a comment -

          Commit 1596646 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1596646 ]

          LUCENE-4236: cleanup/optimize BooleanScorer in-order creation

          Show
          ASF subversion and git services added a comment - Commit 1596646 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1596646 ] LUCENE-4236 : cleanup/optimize BooleanScorer in-order creation
          Hide
          Mikhail Khludnev added a comment -

          Robert Muir great job!

          btw, i wonder if Solr is allowed to search with BooleanScorer (term-at-time)?

          Show
          Mikhail Khludnev added a comment - Robert Muir great job! btw, i wonder if Solr is allowed to search with BooleanScorer (term-at-time)?

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development