Lucene - Core
  1. Lucene - Core
  2. LUCENE-6889

BooleanQuery.rewrite could easily optimize some simple cases

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Follow-up of SOLR-8251: APIs and user interfaces sometimes encourage to write BooleanQuery instances that are not optimal, for instance a typical case that happens often with Solr/Elasticsearch is to send a request that has a MatchAllDocsQuery as a query and some filter, which could be executed more efficiently by directly wrapping the filter into a ConstantScoreQuery.

      Here are some ideas of rewrite operations that BooleanQuery could perform:

      • remove FILTER clauses when they are also a MUST clause
      • rewrite queries of the form "+: #filter" to a ConstantScoreQuery(filter)
      • rewrite to a MatchNoDocsQuery when a clause that is a MUST or FILTER clause is also a MUST_NOT clause
      1. LUCENE-6889.patch
        26 kB
        Adrien Grand
      2. LUCENE-6889.patch
        25 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          Terry Smith added a comment -

          I like the last one but believe that the other two aren't correct.

          remove FILTER clauses when they are also a MUST clause

          Seeing as a FILTER is a non scoring MUST this just doesn't sound right. The FILTER could constrain the result set more than just the MUST alone.

          e.g. +foo #(+foo +bar)

          rewrite queries of the form +: #filter" to a ConstantScoreQuery(filter)

          I don't think you can drop a +:: without affecting the score, but you could drop a #: if the BooleanQuery has something else to force inclusion (other MUST, FILTER or some SHOULD with an appropriate minNumShouldMatch).

          For this case could Solr/ElasticSearch add the MatchAllDocs as a FILTER instead of a MUST to allow for this optimization?

          We could detect duplicate FILTER and MUST_NOT clauses as described in LUCENE-6787.

          Jira is turning star colon star (:) to a bold colon, so apologies if this doesn't read well through the web interface.

          Show
          Terry Smith added a comment - I like the last one but believe that the other two aren't correct. remove FILTER clauses when they are also a MUST clause Seeing as a FILTER is a non scoring MUST this just doesn't sound right. The FILTER could constrain the result set more than just the MUST alone. e.g. +foo #(+foo +bar) rewrite queries of the form + : #filter" to a ConstantScoreQuery(filter) I don't think you can drop a +: : without affecting the score, but you could drop a # : if the BooleanQuery has something else to force inclusion (other MUST, FILTER or some SHOULD with an appropriate minNumShouldMatch). For this case could Solr/ElasticSearch add the MatchAllDocs as a FILTER instead of a MUST to allow for this optimization? We could detect duplicate FILTER and MUST_NOT clauses as described in LUCENE-6787 . Jira is turning star colon star ( : ) to a bold colon, so apologies if this doesn't read well through the web interface.
          Hide
          Uwe Schindler added a comment - - edited

          Seeing as a FILTER is a non scoring MUST this just doesn't sound right. The FILTER could constrain the result set more than just the MUST alone

          We were talking about a FILTER and a identical MUST clause on the base level without any additional brackets/bbols inside. In those cases the FILTER does not do more than the MUST clause already did.

          rewrite queries of the form +*:* #filter to a ConstantScoreQuery(filter)

          This works, because that is the reason for this issue and was done exactly like that since Lucene 4.0 in FilteredQuery, which did exactly that: FilteredQuery(MatchAllDocs,Filter) => ConstantScoreQuery(Filter). You just have to copy the boost of the matchall docs to the constant score query. Of course you have to make sure, you don't break coordination factors, which is not a problem for filters (they have no effect on coordination factor).

          Show
          Uwe Schindler added a comment - - edited Seeing as a FILTER is a non scoring MUST this just doesn't sound right. The FILTER could constrain the result set more than just the MUST alone We were talking about a FILTER and a identical MUST clause on the base level without any additional brackets/bbols inside. In those cases the FILTER does not do more than the MUST clause already did. rewrite queries of the form +*:* #filter to a ConstantScoreQuery(filter) This works, because that is the reason for this issue and was done exactly like that since Lucene 4.0 in FilteredQuery, which did exactly that: FilteredQuery(MatchAllDocs,Filter) => ConstantScoreQuery(Filter) . You just have to copy the boost of the matchall docs to the constant score query. Of course you have to make sure, you don't break coordination factors, which is not a problem for filters (they have no effect on coordination factor).
          Hide
          Terry Smith added a comment -

          Ah, that makes sense. I didn't realize the first scenario was dropping the MUST when the FILTER and MUST wrapped identical clauses or that the second scenario also included boost handling to avoid the scoring issue. Given that, this sounds like a great optimization.

          I'll summarize the rules below, mind shouting out if I still misunderstand?

          Rule 1

          #a +a -> +a
          

          Rule 2

          +*:*^b #f -> ConstantScoreQuery(f)^b
          

          Rule 3

          -a +a -> MatchNoDocsQuery
          -a #a -> MatchNoDocsQuery
          
          Show
          Terry Smith added a comment - Ah, that makes sense. I didn't realize the first scenario was dropping the MUST when the FILTER and MUST wrapped identical clauses or that the second scenario also included boost handling to avoid the scoring issue. Given that, this sounds like a great optimization. I'll summarize the rules below, mind shouting out if I still misunderstand? Rule 1 #a +a -> +a Rule 2 +*:*^b #f -> ConstantScoreQuery(f)^b Rule 3 -a +a -> MatchNoDocsQuery -a #a -> MatchNoDocsQuery
          Hide
          Adrien Grand added a comment -

          Here is a patch that does the following rewrites:

          Removal of FILTER clauses that are also MUST clauses

          #a +a -> +a
          

          FilteredQuery rewrite when the query is a MatchAllDocsQuery

          +*:*^b #f -> ConstantScoreQuery(f)^b
          

          Removal of filters on MatchAllDocsQuery if they are a MUST clause as well

          +a #*:* -> +a
          

          Deduplication of FILTER and MUST_NOT clauses

          +a #f #f -f -f -> +a #f -f
          

          They have the nice property of being able to execute things that we used to execute as a disjunction or a conjunction as a simple term query.

          I also wanted to rewrite queries to a MatchAllDocsQuery when there was an intersection between required and prohibited clauses (Terry's rule 3) or when the mininumShouldMatch is greater than the number of SHOULD clauses but this broke weight normalization. We can probably solve the MUST_NOT/MUST intersection at the Scorer level but I propose to defer it to another issue.

          The patch includes unit tests for the above rewrite rules as well as a random test that makes sure that the same set of matches and scores are produced if no rewriting is performed.

          Show
          Adrien Grand added a comment - Here is a patch that does the following rewrites: Removal of FILTER clauses that are also MUST clauses #a +a -> +a FilteredQuery rewrite when the query is a MatchAllDocsQuery +*:*^b #f -> ConstantScoreQuery(f)^b Removal of filters on MatchAllDocsQuery if they are a MUST clause as well +a #*:* -> +a Deduplication of FILTER and MUST_NOT clauses +a #f #f -f -f -> +a #f -f They have the nice property of being able to execute things that we used to execute as a disjunction or a conjunction as a simple term query. I also wanted to rewrite queries to a MatchAllDocsQuery when there was an intersection between required and prohibited clauses (Terry's rule 3) or when the mininumShouldMatch is greater than the number of SHOULD clauses but this broke weight normalization. We can probably solve the MUST_NOT/MUST intersection at the Scorer level but I propose to defer it to another issue. The patch includes unit tests for the above rewrite rules as well as a random test that makes sure that the same set of matches and scores are produced if no rewriting is performed.
          Hide
          Hoss Man added a comment -

          maybe i'm reading the patch wrong, but it looks like the "actuallyRewritten" check from the recursive rewriting will return before several of the optimizations. Shouldn't things like "remove duplicate FILTER and MUST_NOT clauses" and "remove FILTER clauses that are also MUST clauses" still be tested/done against the rewritten sub-clauses?

          likewise isn't doing the "remove FILTER clauses that are also MUST clauses" optimization still worthwhile even if the "remove duplicate FILTER and MUST_NOT clauses" optimization finds & removes things? (it also looks like it has a short-circuit return)

          ... as well as a random test that makes sure that the same set of matches and scores are produced if no rewriting is performed.

          why not randomize the docs/fields in the index as well? At first glance one concern i have is that no doc has a single term more then once, so spotting subtle score discrepancies between the query and it's optimize version may never come into play with this test.

          other small concerns about the current random query generation:

          • is rarely() really appropriate for the BoostQuery wrapping? is that something that really makes sense to increase depending on wether the test is nightly? ... seems like something more straight forward like 0==TestUtil.nextInt(random(), 0, 10) would make more sense for these tests
          • randomizing setMinimumNumberShouldMatch between 0 and numClauses means that it's going to be very rare for the minimumNumberShouldMatch setting to actually impact the query unless there are also a lot of random SHOULD clauses (ie: if there are 5 clauses but only 2 SHOULD clauses there's only a 2:5 chance of that setting getting a random value that actually affects anything) ... probably better to count the actual # of SHOULD clauses generated randomly and then randomize the setting between 0 and #+1.
          • MatchNoDocsQuery should probably be included in the randomization to better coverage of all the optimization situations.
          Show
          Hoss Man added a comment - maybe i'm reading the patch wrong, but it looks like the "actuallyRewritten" check from the recursive rewriting will return before several of the optimizations. Shouldn't things like "remove duplicate FILTER and MUST_NOT clauses" and "remove FILTER clauses that are also MUST clauses" still be tested/done against the rewritten sub-clauses? likewise isn't doing the "remove FILTER clauses that are also MUST clauses" optimization still worthwhile even if the "remove duplicate FILTER and MUST_NOT clauses" optimization finds & removes things? (it also looks like it has a short-circuit return) ... as well as a random test that makes sure that the same set of matches and scores are produced if no rewriting is performed. why not randomize the docs/fields in the index as well? At first glance one concern i have is that no doc has a single term more then once, so spotting subtle score discrepancies between the query and it's optimize version may never come into play with this test. other small concerns about the current random query generation: is rarely() really appropriate for the BoostQuery wrapping? is that something that really makes sense to increase depending on wether the test is nightly? ... seems like something more straight forward like 0==TestUtil.nextInt(random(), 0, 10) would make more sense for these tests randomizing setMinimumNumberShouldMatch between 0 and numClauses means that it's going to be very rare for the minimumNumberShouldMatch setting to actually impact the query unless there are also a lot of random SHOULD clauses (ie: if there are 5 clauses but only 2 SHOULD clauses there's only a 2:5 chance of that setting getting a random value that actually affects anything) ... probably better to count the actual # of SHOULD clauses generated randomly and then randomize the setting between 0 and #+1. MatchNoDocsQuery should probably be included in the randomization to better coverage of all the optimization situations.
          Hide
          Adrien Grand added a comment -

          but it looks like the "actuallyRewritten" check from the recursive rewriting will return before several of the optimizations

          likewise isn't doing the "remove FILTER clauses that are also MUST clauses" optimization still worthwhile even if the "remove duplicate FILTER and MUST_NOT clauses" optimization finds & removes things? (it also looks like it has a short-circuit return)

          It works fine because rewrite's contract is that it should be called until the return value is the same as the provided query. So for instance maybe one optimization will be performed on the first call, then another one, etc.

          Other comments make sense to me, I'll update the patch.

          Show
          Adrien Grand added a comment - but it looks like the "actuallyRewritten" check from the recursive rewriting will return before several of the optimizations likewise isn't doing the "remove FILTER clauses that are also MUST clauses" optimization still worthwhile even if the "remove duplicate FILTER and MUST_NOT clauses" optimization finds & removes things? (it also looks like it has a short-circuit return) It works fine because rewrite's contract is that it should be called until the return value is the same as the provided query. So for instance maybe one optimization will be performed on the first call, then another one, etc. Other comments make sense to me, I'll update the patch.
          Hide
          Hoss Man added a comment -

          It works fine because rewrite's contract is that it should be called until the return value is the same as the provided query...

          Ah ... yeah, I totally forgot about that. sorry for the noise.

          Show
          Hoss Man added a comment - It works fine because rewrite's contract is that it should be called until the return value is the same as the provided query... Ah ... yeah, I totally forgot about that. sorry for the noise.
          Hide
          Adrien Grand added a comment -

          Here is an updated patch. I did not add MatchNoDocsQuery to the randomly-generated queries because it makes the IndexSearcher that has rewriting disabled fail since calling createWeight directly on a MatchNoDocsQuery fails with an UnsupportedOperationException: "Query does not implement createWeight".

          Show
          Adrien Grand added a comment - Here is an updated patch. I did not add MatchNoDocsQuery to the randomly-generated queries because it makes the IndexSearcher that has rewriting disabled fail since calling createWeight directly on a MatchNoDocsQuery fails with an UnsupportedOperationException: "Query does not implement createWeight".
          Hide
          ASF subversion and git services added a comment -

          Commit 1717757 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1717757 ]

          LUCENE-6889: Add some basic rewrite rules to BooleanQuery that can make it run significantly faster in some cases.

          Show
          ASF subversion and git services added a comment - Commit 1717757 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1717757 ] LUCENE-6889 : Add some basic rewrite rules to BooleanQuery that can make it run significantly faster in some cases.
          Hide
          ASF subversion and git services added a comment -

          Commit 1717782 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1717782 ]

          LUCENE-6889: Add some basic rewrite rules to BooleanQuery that can make it run significantly faster in some cases.

          Show
          ASF subversion and git services added a comment - Commit 1717782 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1717782 ] LUCENE-6889 : Add some basic rewrite rules to BooleanQuery that can make it run significantly faster in some cases.
          Hide
          Stephan Lagraulet added a comment -

          Hello Adrien Grand do you think this change can be included on the 5.4 branch for the upcoming 5.4.1 release?

          Show
          Stephan Lagraulet added a comment - Hello Adrien Grand do you think this change can be included on the 5.4 branch for the upcoming 5.4.1 release?
          Hide
          Adrien Grand added a comment -

          A 5.4.1 release candidate is currently being voted, so it is too late to fold more changes in, unless they are critical like a corruption fix. However nothing prevents us from releasing 5.5 soon.

          Show
          Adrien Grand added a comment - A 5.4.1 release candidate is currently being voted, so it is too late to fold more changes in, unless they are critical like a corruption fix. However nothing prevents us from releasing 5.5 soon.
          Hide
          Stephan Lagraulet added a comment -

          It depends what means "soon". Can it be included in a 5.4.2 release or do you prefer to push an early release of 5.5 ?

          Show
          Stephan Lagraulet added a comment - It depends what means "soon". Can it be included in a 5.4.2 release or do you prefer to push an early release of 5.5 ?
          Hide
          Adrien Grand added a comment -

          I would much prefer push for a 5.5 release. Introducing bugs in bugfix releases is pretty bad so we strive at only backporting safe changes to bugfix releases.

          Show
          Adrien Grand added a comment - I would much prefer push for a 5.5 release. Introducing bugs in bugfix releases is pretty bad so we strive at only backporting safe changes to bugfix releases.
          Hide
          Hoss Man added a comment -

          Improvements/Optimizations are also not typically candidates for inclusion in bug fix releases except in special circumstances (ie: sometimes while working on an optimization it's discovered that the new code also fixes a bug, and backporting the optimization to the bug fix branch may be the safest way to fix the bug there as well)

          Show
          Hoss Man added a comment - Improvements/Optimizations are also not typically candidates for inclusion in bug fix releases except in special circumstances (ie: sometimes while working on an optimization it's discovered that the new code also fixes a bug, and backporting the optimization to the bug fix branch may be the safest way to fix the bug there as well)

            People

            • Assignee:
              Unassigned
              Reporter:
              Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development