Lucene - Core
  1. Lucene - Core
  2. LUCENE-3328

Specialize BooleanQuery if all clauses are TermQueries

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.4, 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      During work on LUCENE-3319 I ran into issues with BooleanQuery compared to PhraseQuery in the exact case. If I disable scoring on PhraseQuery and bypass the position matching, essentially doing a conjunction match, ExactPhraseScorer beats plain boolean scorer by 40% which is a sizeable gain. I converted a ConjunctionScorer to use DocsEnum directly but still didn't get all the 40% from PhraseQuery. Yet, it turned out with further optimizations this gets very close to PhraseQuery. The biggest gain here came from converting the hand crafted loop in ConjunctionScorer#doNext to a for loop which seems to be less confusing to hotspot. In this particular case I think code specialization makes lots of sense since BQ with TQ is by far one of the most common queries.

      I will upload a patch shortly

      1. LUCENE-3328.patch
        16 kB
        Simon Willnauer
      2. LUCENE-3328.patch
        8 kB
        Robert Muir
      3. LUCENE-3328.patch
        11 kB
        Simon Willnauer

        Activity

        Hide
        Simon Willnauer added a comment -

        here is a first patch that rewrites to ConjunctinTermQuery if possible.

        Show
        Simon Willnauer added a comment - here is a first patch that rewrites to ConjunctinTermQuery if possible.
        Hide
        Robert Muir added a comment -

        do we really need an extra query or rewrite? Can't booleanweight just pick ConjunctionTermScorer in this case?

        Show
        Robert Muir added a comment - do we really need an extra query or rewrite? Can't booleanweight just pick ConjunctionTermScorer in this case?
        Hide
        Robert Muir added a comment -

        here is the same thing, only as a scorer that booleanweight picks.

        this is much simpler imo:

        • no state-keeping of 'rewriteToConjunctionTermsQuery' in the Query
        • no hassles for any highlighters or anything doing instanceof
        • plays correct with booleanquery's rewrite: e.g. hoisting of single-clauses into termqueries.

        In general i think Query.rewrite should be reserved for simplifying Queries, this is not a simpler query, just a faster scorer

        Show
        Robert Muir added a comment - here is the same thing, only as a scorer that booleanweight picks. this is much simpler imo: no state-keeping of 'rewriteToConjunctionTermsQuery' in the Query no hassles for any highlighters or anything doing instanceof plays correct with booleanquery's rewrite: e.g. hoisting of single-clauses into termqueries. In general i think Query.rewrite should be reserved for simplifying Queries, this is not a simpler query, just a faster scorer
        Hide
        Simon Willnauer added a comment - - edited

        here is the same thing, only as a scorer that booleanweight picks.

        I like the size of the patch! Thanks for moving this into the weight. I had it separate to make BW less complex but this looks good though.

        In general i think Query.rewrite should be reserved for simplifying Queries, this is not a simpler query, just a faster scorer

        I disagree here, if this would be the case it should be called simplify(Query). In general its a rewrite method and should not be judged if it simplifies or not.

        Here are some benchmark results trunk vs. patch (10M medium wiki docs):

        Task QPS Trunk StdDev QPS Patch  StdDev Pct diff
        Prefix3 29.84 1.14 29.02 1.37 -10% - 5%
        IntNRQ  5.82 0.67 5.68  0.55 -20% - 20%
        Wildcard 15.96  0.77 15.62  0.63 -10% - 7%
        Term 79.10  4.32 77.43  3.25 -11% - 7%
        OrHighMed  15.67  0.94 15.44  1.00 -13% - 11%
        TermGroup1M   10.82  0.76 10.77  0.69 -13% -   13%
        OrHighHigh  3.31  0.37  3.29  0.37 -20% - 24%
        Respell 15.99  0.59 15.95  0.52 -6% - 6%
        TermBGroup1M 12.87  1.09 12.86  0.94 -14% - 17%
        Fuzzy1 24.38  1.19 24.39  0.84 -7% - 8%
        TermBGroup1M1P 17.67  1.33 17.79  1.14 -12% - 15%
        Fuzzy2  7.60  0.64  7.67  0.59 -14% - 18%
        Phrase  6.84  0.64  6.91  0.62 -15% - 21%
        SpanNear  1.90  0.24  1.92  0.22 -20% - 29%
        PKLookup 76.01  4.56 76.99  3.26 -8% - 12%
        SloppyPhrase  2.49  0.25  2.53  0.23 -16% - 23%
        AndHighMed 29.80  1.11 33.50  1.31  4% - 21%
        AndHighHigh 10.74  0.67 12.26  0.55  2% - 27%
        Show
        Simon Willnauer added a comment - - edited here is the same thing, only as a scorer that booleanweight picks. I like the size of the patch! Thanks for moving this into the weight. I had it separate to make BW less complex but this looks good though. In general i think Query.rewrite should be reserved for simplifying Queries, this is not a simpler query, just a faster scorer I disagree here, if this would be the case it should be called simplify(Query). In general its a rewrite method and should not be judged if it simplifies or not. Here are some benchmark results trunk vs. patch (10M medium wiki docs): Task QPS Trunk StdDev QPS Patch  StdDev Pct diff Prefix3 29.84 1.14 29.02 1.37 -10% - 5% IntNRQ  5.82 0.67 5.68  0.55 -20% - 20% Wildcard 15.96  0.77 15.62  0.63 -10% - 7% Term 79.10  4.32 77.43  3.25 -11% - 7% OrHighMed  15.67  0.94 15.44  1.00 -13% - 11% TermGroup1M   10.82  0.76 10.77  0.69 -13% -   13% OrHighHigh  3.31  0.37  3.29  0.37 -20% - 24% Respell 15.99  0.59 15.95  0.52 -6% - 6% TermBGroup1M 12.87  1.09 12.86  0.94 -14% - 17% Fuzzy1 24.38  1.19 24.39  0.84 -7% - 8% TermBGroup1M1P 17.67  1.33 17.79  1.14 -12% - 15% Fuzzy2  7.60  0.64  7.67  0.59 -14% - 18% Phrase  6.84  0.64  6.91  0.62 -15% - 21% SpanNear  1.90  0.24  1.92  0.22 -20% - 29% PKLookup 76.01  4.56 76.99  3.26 -8% - 12% SloppyPhrase  2.49  0.25  2.53  0.23 -16% - 23% AndHighMed 29.80  1.11 33.50  1.31  4% - 21% AndHighHigh 10.74  0.67 12.26  0.55  2% - 27%
        Hide
        Robert Muir added a comment -

        I disagree here, if this would be the case it should be called simplify(Query). In general its a rewrite method and should not be judged if it simplifies or not.

        I think this is really important to hash out: if we want to optimize query execution, we should do this totally internally at the lowest level possible.
        If the optimization is to use a specialized scorer, then I think the right place to do this is inside the Weight.

        I don't think we should create a bunch of queries that are really the same and rewrite to each other: because this is more 'exposed' to end users, e.g.
        highlighting, caching, and who knows what people are doing in their custom code.

        It also requires a heavy maintenance burden of duplicate logic and testing for explain, hashcode, equals, etc.

        Show
        Robert Muir added a comment - I disagree here, if this would be the case it should be called simplify(Query). In general its a rewrite method and should not be judged if it simplifies or not. I think this is really important to hash out: if we want to optimize query execution, we should do this totally internally at the lowest level possible. If the optimization is to use a specialized scorer, then I think the right place to do this is inside the Weight. I don't think we should create a bunch of queries that are really the same and rewrite to each other: because this is more 'exposed' to end users, e.g. highlighting, caching, and who knows what people are doing in their custom code. It also requires a heavy maintenance burden of duplicate logic and testing for explain, hashcode, equals, etc.
        Hide
        Simon Willnauer added a comment -

        I agree with you that in this case a scorer is more elegant. Yet, the rewrite process is a very powerful process that can be used for query optimization etc. from the outside. What I am trying to say is that this is not necessarily bound to "simplification". All your points are valid!

        Show
        Simon Willnauer added a comment - I agree with you that in this case a scorer is more elegant. Yet, the rewrite process is a very powerful process that can be used for query optimization etc. from the outside. What I am trying to say is that this is not necessarily bound to "simplification". All your points are valid!
        Hide
        Michael McCandless added a comment -

        That's a nice speedup for AND of terms!!

        Show
        Michael McCandless added a comment - That's a nice speedup for AND of terms!!
        Hide
        Robert Muir added a comment -

        Also, I am usually positive on backporting improvements to 3.x to get them back to the users quickly, but I think
        we should do this only in trunk.

        The reason is: 3.x is going to be very hairy here, since the computation of scoring (including score caches and shit) is
        conflated into the postings list matching for termquery, since it has no termstate to pull from, ...

        Show
        Robert Muir added a comment - Also, I am usually positive on backporting improvements to 3.x to get them back to the users quickly, but I think we should do this only in trunk. The reason is: 3.x is going to be very hairy here, since the computation of scoring (including score caches and shit) is conflated into the postings list matching for termquery, since it has no termstate to pull from, ...
        Hide
        Robert Muir added a comment -

        setting 4.0-only.

        if someone volunteers to comes up with some smart not-buggy and not-hairy patch, thats cool, I just can't visualize it now.

        Show
        Robert Muir added a comment - setting 4.0-only. if someone volunteers to comes up with some smart not-buggy and not-hairy patch, thats cool, I just can't visualize it now.
        Hide
        Simon Willnauer added a comment -

        next iteration. I added two util methods to get a TermsEnum and a ExactDocScorer from a TermWeight which allowed to make all the members private again. This looks little cleaner now and eventually if we cut over PhraseQuery to use BQ + PosIterators it can simply get all it needs from the TermsEnum like DocsAndPosEnum & totalTermFreq.

        I also tried to optimize the doNext() loop even further to save one more comparison to hopefully help hotspot even further. I need to benchmark this patch one more time but overall this looks close.

        I also added a CHANGES.TXT entry.

        Show
        Simon Willnauer added a comment - next iteration. I added two util methods to get a TermsEnum and a ExactDocScorer from a TermWeight which allowed to make all the members private again. This looks little cleaner now and eventually if we cut over PhraseQuery to use BQ + PosIterators it can simply get all it needs from the TermsEnum like DocsAndPosEnum & totalTermFreq. I also tried to optimize the doNext() loop even further to save one more comparison to hopefully help hotspot even further. I need to benchmark this patch one more time but overall this looks close. I also added a CHANGES.TXT entry.
        Hide
        Michael McCandless added a comment -

        Results from latest patch:

        Task QPS base StdDev base QPS bqterms StdDev bqterms Pct diff
        Fuzzy2 34.74 0.40 34.58 0.49 2%-2%
        Fuzzy1 23.85 0.25 23.78 0.31 2%-2%
        Respell 31.26 0.40 31.24 1.08 4%-4%
        PKLookup 137.63 2.79 138.19 6.90 6%-7%
        OrHighHigh 18.91 0.29 19.04 0.49 3%-4%
        Wildcard 50.35 1.12 50.82 1.32 3%-5%
        IntNRQ 9.14 0.38 9.24 0.42 7%-10%
        OrHighMed 10.38 0.36 10.55 0.39 5%-9%
        Prefix3 15.22 0.34 15.51 0.30 2%-6%
        TermBGroup1M1P 57.45 2.11 58.81 1.75 4%-9%
        SpanNear 4.28 0.28 4.40 0.08 5%-11%
        TermBGroup1M 46.36 1.38 47.93 1.88 3%-10%
        TermGroup1M 47.63 0.93 49.31 1.83 2%-9%
        Term 137.86 6.74 142.82 6.55 5%-13%
        SloppyPhrase 19.72 0.87 20.62 0.01 0%-9%
        Phrase 3.28 0.19 3.50 0.11 2%-16%
        AndHighMed 87.52 2.55 100.80 4.80 6%-24%
        AndHighHigh 14.92 0.55 17.33 0.91 6%-26%

        Looks great!

        Show
        Michael McCandless added a comment - Results from latest patch: Task QPS base StdDev base QPS bqterms StdDev bqterms Pct diff Fuzzy2 34.74 0.40 34.58 0.49 2% - 2% Fuzzy1 23.85 0.25 23.78 0.31 2% - 2% Respell 31.26 0.40 31.24 1.08 4% - 4% PKLookup 137.63 2.79 138.19 6.90 6% - 7% OrHighHigh 18.91 0.29 19.04 0.49 3% - 4% Wildcard 50.35 1.12 50.82 1.32 3% - 5% IntNRQ 9.14 0.38 9.24 0.42 7% - 10% OrHighMed 10.38 0.36 10.55 0.39 5% - 9% Prefix3 15.22 0.34 15.51 0.30 2% - 6% TermBGroup1M1P 57.45 2.11 58.81 1.75 4% - 9% SpanNear 4.28 0.28 4.40 0.08 5% - 11% TermBGroup1M 46.36 1.38 47.93 1.88 3% - 10% TermGroup1M 47.63 0.93 49.31 1.83 2% - 9% Term 137.86 6.74 142.82 6.55 5% - 13% SloppyPhrase 19.72 0.87 20.62 0.01 0% - 9% Phrase 3.28 0.19 3.50 0.11 2% - 16% AndHighMed 87.52 2.55 100.80 4.80 6% - 24% AndHighHigh 14.92 0.55 17.33 0.91 6% - 26% Looks great!
        Hide
        Michael Busch added a comment -

        Nice improvements!

        I'm wondering if you considered having ConjunctionTermScorer use the terms' IDF values to decide which iterator to advance when all are on the same docID? It should always be best to pick the rarest term.

        We've talked about doing that in the past, but it's hard to support this for any type of subclause, because you'd have to add the ability to estimate the IDFs of possible subclauses.

        But with this change it seems very feasible to try for BQs that only have TQ clauses.

        Show
        Michael Busch added a comment - Nice improvements! I'm wondering if you considered having ConjunctionTermScorer use the terms' IDF values to decide which iterator to advance when all are on the same docID? It should always be best to pick the rarest term. We've talked about doing that in the past, but it's hard to support this for any type of subclause, because you'd have to add the ability to estimate the IDFs of possible subclauses. But with this change it seems very feasible to try for BQs that only have TQ clauses.
        Hide
        Simon Willnauer added a comment -

        I'm wondering if you considered having ConjunctionTermScorer use the terms' IDF values to decide which iterator to advance when all are on the same docID? It should always be best to pick the rarest term.

        The ConjunctionTermScorer sorts the DocsEnums by their frequency in the ctor. The leader will always be the lowest frequent term in the set. is this what you mean here?

        We could even optimize the doNext loop and advance the lead to the last document we stepped out of the inner loop since this is guaranteed to be greater than the document the lead enum is on. I just wonder if we at some point step into the slowness of DocsEnum#advance(). It very important to make #advance(doc+1) as fast as #nextDoc() in order to keep our algs clean!

        Show
        Simon Willnauer added a comment - I'm wondering if you considered having ConjunctionTermScorer use the terms' IDF values to decide which iterator to advance when all are on the same docID? It should always be best to pick the rarest term. The ConjunctionTermScorer sorts the DocsEnums by their frequency in the ctor. The leader will always be the lowest frequent term in the set. is this what you mean here? We could even optimize the doNext loop and advance the lead to the last document we stepped out of the inner loop since this is guaranteed to be greater than the document the lead enum is on. I just wonder if we at some point step into the slowness of DocsEnum#advance(). It very important to make #advance(doc+1) as fast as #nextDoc() in order to keep our algs clean!
        Hide
        Simon Willnauer added a comment -

        I think this is ready. I will commit this tomorrow if nobody objects.

        Show
        Simon Willnauer added a comment - I think this is ready. I will commit this tomorrow if nobody objects.
        Hide
        Michael Busch added a comment -

        The ConjunctionTermScorer sorts the DocsEnums by their frequency in the ctor. The leader will always be the lowest frequent term in the set. is this what you mean here?

        Cool, yeah that's roughly what I meant. In general, it's best to always pick the lowest-df enum as leader:
        1) after initialization
        2) after a hit was found
        3) whenever a doc matched m out of n enums, 1 < m < n

        I think what you described covers situation 1), does it also cover 2) and 3)?

        Show
        Michael Busch added a comment - The ConjunctionTermScorer sorts the DocsEnums by their frequency in the ctor. The leader will always be the lowest frequent term in the set. is this what you mean here? Cool, yeah that's roughly what I meant. In general, it's best to always pick the lowest-df enum as leader: 1) after initialization 2) after a hit was found 3) whenever a doc matched m out of n enums, 1 < m < n I think what you described covers situation 1), does it also cover 2) and 3)?
        Hide
        Simon Willnauer added a comment -

        2) after a hit was found
        3) whenever a doc matched m out of n enums, 1 < m < n

        maybe I miss something here but really how can we know how many docs are left in an enum during 2. and 3. We could use the doc an enum has advanced to in 3. to also advance the leader but as I said in a comment above I am still afraid of the advance call since it can give a serious perf hit if docs are close together.

        Show
        Simon Willnauer added a comment - 2) after a hit was found 3) whenever a doc matched m out of n enums, 1 < m < n maybe I miss something here but really how can we know how many docs are left in an enum during 2. and 3. We could use the doc an enum has advanced to in 3. to also advance the leader but as I said in a comment above I am still afraid of the advance call since it can give a serious perf hit if docs are close together.
        Hide
        Simon Willnauer added a comment -

        I just committed the latest patch to trunk. should we open a new issue for disjunction?

        simon

        Show
        Simon Willnauer added a comment - I just committed the latest patch to trunk. should we open a new issue for disjunction? simon
        Hide
        Michael McCandless added a comment -

        should we open a new issue for disjunction?

        +1

        Show
        Michael McCandless added a comment - should we open a new issue for disjunction? +1
        Hide
        Paul Elschot added a comment -

        Sorry for not looking at this earlier, but in trunk now in ConjunctionTermScorer in the doNext method the lead TermScorer is not advanced when breaking to the advanceHead label even though the comment at the break states so.
        I would expect this to works correctly but it is probably not as efficient as intended.

        I think advancing the lead in program code at the place of the break comment could fix this.

        Show
        Paul Elschot added a comment - Sorry for not looking at this earlier, but in trunk now in ConjunctionTermScorer in the doNext method the lead TermScorer is not advanced when breaking to the advanceHead label even though the comment at the break states so. I would expect this to works correctly but it is probably not as efficient as intended. I think advancing the lead in program code at the place of the break comment could fix this.
        Hide
        Simon Willnauer added a comment - - edited

        I think advancing the lead in program code at the place of the break comment could fix this.

        Paul this works and looks as expected. Once we break to the advanceHead label we step out the inner do {} while; and advance the head. Maybe I don't understand your comment correctly?

        There is certainly space for improvement here. for instance could the head be advanced to the doc we break on but the advance call there actually yields a perf hit. Yet, we can play some tricks like if (DF / maxdoc > X) enum.advance( n ) else while(n > enum.nextDoc()); which I think I'll look into after vacation

        Show
        Simon Willnauer added a comment - - edited I think advancing the lead in program code at the place of the break comment could fix this. Paul this works and looks as expected. Once we break to the advanceHead label we step out the inner do {} while; and advance the head. Maybe I don't understand your comment correctly? There is certainly space for improvement here. for instance could the head be advanced to the doc we break on but the advance call there actually yields a perf hit. Yet, we can play some tricks like if (DF / maxdoc > X) enum.advance( n ) else while(n > enum.nextDoc()); which I think I'll look into after vacation
        Hide
        Paul Elschot added a comment -

        this works and looks as expected.

        Indeed so. I mistook the working of a labeled break for jumping to the beginning instead of to the end.
        (The last time I used a label was actually with another programming language...)

        Show
        Paul Elschot added a comment - this works and looks as expected. Indeed so. I mistook the working of a labeled break for jumping to the beginning instead of to the end. (The last time I used a label was actually with another programming language...)
        Hide
        Uwe Schindler added a comment -

        Indeed breaks in Java are no real gotos. The label must always be placed before the loop that you want to break out (it would give syntax error otherwise). In fact it simply exits the loop to this level and proceeds working after the loop. A simple break without a label is identical to a non-hierarchical loop with a autogenerated label in front of it.

        Show
        Uwe Schindler added a comment - Indeed breaks in Java are no real gotos. The label must always be placed before the loop that you want to break out (it would give syntax error otherwise). In fact it simply exits the loop to this level and proceeds working after the loop. A simple break without a label is identical to a non-hierarchical loop with a autogenerated label in front of it.

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development