Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 7.0
    • Fix Version/s: 7.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      A couple of BooleanQuery rewrites / optimizations.

      First, as discussed on the user group, a BooleanQuery with a query that is both a SHOULD and a FILTER can be rewritten as a single MUST query, but care must be taken to decrement minShouldMatch by 1.

      Another case is if a query is both required (MUST or FILTER) and MUST_NOT at the same time, it can be converted to a MatchNoDocsQuery (although I haven't discussed this yet so hopefully I'm not missing something!).

      1. LUCENE-7146.patch
        6 kB
        Spyros Kapnissis
      2. LUCENE-7146-MatchAllMustNot.patch
        4 kB
        Uwe Schindler
      3. LUCENE-7146-simplific1.patch
        1 kB
        Uwe Schindler
      4. LUCENE-7146-simplific1.patch
        1 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          spyk Spyros Kapnissis added a comment -

          A patch for review, with corresponding tests.

          Show
          spyk Spyros Kapnissis added a comment - A patch for review, with corresponding tests.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1ac9609caedbf739379bdabdac909f77fee2f5c6 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1ac9609 ]

          LUCENE-7416: Rewrite optimizations for BooleanQuery.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1ac9609caedbf739379bdabdac909f77fee2f5c6 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1ac9609 ] LUCENE-7416 : Rewrite optimizations for BooleanQuery.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 81c796a1fa30a9bec77712a8f8e188b347dc490a in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=81c796a ]

          LUCENE-7416: Rewrite optimizations for BooleanQuery.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 81c796a1fa30a9bec77712a8f8e188b347dc490a in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=81c796a ] LUCENE-7416 : Rewrite optimizations for BooleanQuery.
          Hide
          jpountz Adrien Grand added a comment -

          Merged, thanks!

          Show
          jpountz Adrien Grand added a comment - Merged, thanks!
          Hide
          rcmuir Robert Muir added a comment -

          This seems to break testrandom?

          I am concerned about us making the wrong tradeoff here and breaking common cases by optimizing rare cases .

          Show
          rcmuir Robert Muir added a comment - This seems to break testrandom? I am concerned about us making the wrong tradeoff here and breaking common cases by optimizing rare cases .
          Hide
          rcmuir Robert Muir added a comment -

          Here is the fail I see on 6.x:

             [junit4] Started J0 PID(10069@localhost).
             [junit4] Suite: org.apache.lucene.search.TestBooleanRewrites
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBooleanRewrites -Dtests.method=testRandom -Dtests.seed=506700689E36A59D -Dtests.slow=true -Dtests.locale=sr-ME -Dtests.timezone=Africa/Kinshasa -Dtests.asserts=true -Dtests.file.encoding=UTF-8
             [junit4] FAILURE 0.39s | TestBooleanRewrites.testRandom <<<
             [junit4]    > Throwable #1: java.lang.AssertionError: expected:<1.5392522811889648> but was:<1.4700312614440918>
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([506700689E36A59D:222B25672F5613EE]:0)
             [junit4]    > 	at org.apache.lucene.search.TestBooleanRewrites.assertEquals(TestBooleanRewrites.java:404)
             [junit4]    > 	at org.apache.lucene.search.TestBooleanRewrites.testRandom(TestBooleanRewrites.java:348)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
             [junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): {body=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, maxPointsInLeafNode=1393, maxMBSortInHeap=5.498728834322439, sim=ClassicSimilarity, locale=sr-ME, timezone=Africa/Kinshasa
             [junit4]   2> NOTE: Linux 4.4.0-31-generic amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=225724712,total=239075328
             [junit4]   2> NOTE: All tests run in this JVM: [TestBooleanRewrites]
             [junit4] Completed [1/1 (1!)] in 0.61s, 1 test, 1 failure <<< FAILURES!
          

          I suspect the change is safe for master branch but requires more work if we want it to not break scoring with stuff like coord() at play. Additionally given the optimization is fairly complicated, maybe we can have a few targeted tests if we really want to push it to 6.x, that would help alleviate some concerns.

          Show
          rcmuir Robert Muir added a comment - Here is the fail I see on 6.x: [junit4] Started J0 PID(10069@localhost). [junit4] Suite: org.apache.lucene.search.TestBooleanRewrites [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestBooleanRewrites -Dtests.method=testRandom -Dtests.seed=506700689E36A59D -Dtests.slow=true -Dtests.locale=sr-ME -Dtests.timezone=Africa/Kinshasa -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] FAILURE 0.39s | TestBooleanRewrites.testRandom <<< [junit4] > Throwable #1: java.lang.AssertionError: expected:<1.5392522811889648> but was:<1.4700312614440918> [junit4] > at __randomizedtesting.SeedInfo.seed([506700689E36A59D:222B25672F5613EE]:0) [junit4] > at org.apache.lucene.search.TestBooleanRewrites.assertEquals(TestBooleanRewrites.java:404) [junit4] > at org.apache.lucene.search.TestBooleanRewrites.testRandom(TestBooleanRewrites.java:348) [junit4] > at java.lang.Thread.run(Thread.java:745) [junit4] 2> NOTE: test params are: codec=Asserting(Lucene62): {body=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, maxPointsInLeafNode=1393, maxMBSortInHeap=5.498728834322439, sim=ClassicSimilarity, locale=sr-ME, timezone=Africa/Kinshasa [junit4] 2> NOTE: Linux 4.4.0-31-generic amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=225724712,total=239075328 [junit4] 2> NOTE: All tests run in this JVM: [TestBooleanRewrites] [junit4] Completed [1/1 (1!)] in 0.61s, 1 test, 1 failure <<< FAILURES! I suspect the change is safe for master branch but requires more work if we want it to not break scoring with stuff like coord() at play. Additionally given the optimization is fairly complicated, maybe we can have a few targeted tests if we really want to push it to 6.x, that would help alleviate some concerns.
          Hide
          thetaphi Uwe Schindler added a comment -

          I suspect the change is safe for master branch but requires more work if we want it to not break scoring with stuff like coord() at play. Additionally given the optimization is fairly complicated, maybe we can have a few targeted tests if we really want to push it to 6.x, that would help alleviate some concerns.

          +1 I had the same issues in mind with 6.x (coord factor), although filter clauses should not affect coord at all - maybe another bug

          I'd also like to see special targeted tests on those complicated scoring differences. In general the patch is fine:

          • The part that return MatchNoDocs is safe, because it does not affect score at all (no results). It is just a bit of "set" logic, although I have the feeling the whole thing might be easier with Java 8 streams, because no coping of sets is needed - you just use a simple lambda/method reference expression to find the set's overlap
          • The tricky part is the other optimization. I'd would like to see more thorough tests here!

          Should I rewrite the first part to a stream/lambda?

          Show
          thetaphi Uwe Schindler added a comment - I suspect the change is safe for master branch but requires more work if we want it to not break scoring with stuff like coord() at play. Additionally given the optimization is fairly complicated, maybe we can have a few targeted tests if we really want to push it to 6.x, that would help alleviate some concerns. +1 I had the same issues in mind with 6.x (coord factor), although filter clauses should not affect coord at all - maybe another bug I'd also like to see special targeted tests on those complicated scoring differences. In general the patch is fine: The part that return MatchNoDocs is safe, because it does not affect score at all (no results). It is just a bit of "set" logic, although I have the feeling the whole thing might be easier with Java 8 streams, because no coping of sets is needed - you just use a simple lambda/method reference expression to find the set's overlap The tricky part is the other optimization. I'd would like to see more thorough tests here! Should I rewrite the first part to a stream/lambda?
          Hide
          thetaphi Uwe Schindler added a comment -

          Here my simplification for the first part without crazy hash set copying. This just joins both sets, uses distict (its just top optimze) and then anyOf(Predicate) with a predicate that does a set.contains() on the mustnot.

          Show
          thetaphi Uwe Schindler added a comment - Here my simplification for the first part without crazy hash set copying. This just joins both sets, uses distict (its just top optimze) and then anyOf(Predicate) with a predicate that does a set.contains() on the mustnot.
          Hide
          dsmiley David Smiley added a comment -

          Nice bit of Java Streams API there Uwe. But I think it can be more efficient – no need to combine to a common set (via distinct) when ultimately we want to know if either the FILTER or MUST set has a MUST_NOT clause. They can be checked separately.

          Show
          dsmiley David Smiley added a comment - Nice bit of Java Streams API there Uwe. But I think it can be more efficient – no need to combine to a common set (via distinct) when ultimately we want to know if either the FILTER or MUST set has a MUST_NOT clause. They can be checked separately.
          Hide
          thetaphi Uwe Schindler added a comment -

          Yeah. Remove the distinct() and done.

          Show
          thetaphi Uwe Schindler added a comment - Yeah. Remove the distinct() and done.
          Hide
          thetaphi Uwe Schindler added a comment -

          I'd also think we should inverse the check. Must Not clauses are generally seldom. So do 2 checks here with anyOf, which is linear.

          Show
          thetaphi Uwe Schindler added a comment - I'd also think we should inverse the check. Must Not clauses are generally seldom. So do 2 checks here with anyOf, which is linear.
          Hide
          thetaphi Uwe Schindler added a comment -

          Patch that inverses the check.

          Show
          thetaphi Uwe Schindler added a comment - Patch that inverses the check.
          Hide
          jpountz Adrien Grand added a comment -

          The part that rewrites boolean queries into a MatchNoDocsQuery if some clauses are bothe excluded and required is indeed broken on 6.x because of queryNorm. I'm leaning towards not having this rewrite rule on 6.x to keep things simple rather than checking the coord values. This is already complicated enough. I'll do that for now to get the tests passing again.

          Uwe Schindler +1 to your patch

          Show
          jpountz Adrien Grand added a comment - The part that rewrites boolean queries into a MatchNoDocsQuery if some clauses are bothe excluded and required is indeed broken on 6.x because of queryNorm. I'm leaning towards not having this rewrite rule on 6.x to keep things simple rather than checking the coord values. This is already complicated enough. I'll do that for now to get the tests passing again. Uwe Schindler +1 to your patch
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit da6a502f261ab1e482eb6709cbd8f26cd77c7fb5 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=da6a502 ]

          LUCENE-7416: Rewriting `BooleanQuery` to a `MatchNoDocsQuery` is not safe because of `queryNorm`.

          Show
          jira-bot ASF subversion and git services added a comment - Commit da6a502f261ab1e482eb6709cbd8f26cd77c7fb5 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=da6a502 ] LUCENE-7416 : Rewriting `BooleanQuery` to a `MatchNoDocsQuery` is not safe because of `queryNorm`.
          Hide
          jpountz Adrien Grand added a comment -

          I am concerned about us making the wrong tradeoff here and breaking common cases by optimizing rare cases .

          I'm not sure about these rules in particular, but I know some existing rules are useful, such as the one that rewrites to a CSQ around the filters when the single MUST clause is a MatchAllDocsQuery. I think one reason why this happens is that apps have a fixed filter and then a configurable query that depends on the user and that defaults to a MatchAllDocsQuery.

          I can't think of one reason why someone would put a clause both is MUST and MUST_NOT, or both SHOULD and FILTER but I wouldn't be surprised that this happens sometimes. I think the testRandom test should be quite good at finding cases that we did not think about, but otherwise another approach I'd be fine with would be to back this change out and wait until we have concrete examples of users building such queries before adding rewrite rules for them.

          Show
          jpountz Adrien Grand added a comment - I am concerned about us making the wrong tradeoff here and breaking common cases by optimizing rare cases . I'm not sure about these rules in particular, but I know some existing rules are useful, such as the one that rewrites to a CSQ around the filters when the single MUST clause is a MatchAllDocsQuery. I think one reason why this happens is that apps have a fixed filter and then a configurable query that depends on the user and that defaults to a MatchAllDocsQuery. I can't think of one reason why someone would put a clause both is MUST and MUST_NOT, or both SHOULD and FILTER but I wouldn't be surprised that this happens sometimes. I think the testRandom test should be quite good at finding cases that we did not think about, but otherwise another approach I'd be fine with would be to back this change out and wait until we have concrete examples of users building such queries before adding rewrite rules for them.
          Hide
          rcmuir Robert Muir added a comment -

          Well my primary concern is on 6.x (where the test failed). 6.x is our stable branch but also its more complex due to coord() and queryNorm(). Without those two bad guys around, then these kinds of optimizations are less risky. Still, we should be careful with testing.

          Show
          rcmuir Robert Muir added a comment - Well my primary concern is on 6.x (where the test failed). 6.x is our stable branch but also its more complex due to coord() and queryNorm(). Without those two bad guys around, then these kinds of optimizations are less risky. Still, we should be careful with testing.
          Hide
          thetaphi Uwe Schindler added a comment - - edited

          The part that rewrites boolean queries into a MatchNoDocsQuery if some clauses are bothe excluded and required is indeed broken on 6.x because of queryNorm.

          That was my second guess But I agree with your question that checking the query norm for a query that does not match anything is not needed (even on 6.x), because something that does not match anything does not produce any score that could have wrongly normalized
          Although it is a should clause in a top-level BQ, then it affects norm (I think that's the issue, right?)
          I will commit the simplification patch on master later on.

          Show
          thetaphi Uwe Schindler added a comment - - edited The part that rewrites boolean queries into a MatchNoDocsQuery if some clauses are bothe excluded and required is indeed broken on 6.x because of queryNorm. That was my second guess But I agree with your question that checking the query norm for a query that does not match anything is not needed (even on 6.x), because something that does not match anything does not produce any score that could have wrongly normalized Although it is a should clause in a top-level BQ, then it affects norm (I think that's the issue, right?) I will commit the simplification patch on master later on.
          Hide
          thetaphi Uwe Schindler added a comment - - edited

          I'm not sure about these rules in particular, but I know some existing rules are useful, such as the one that rewrites to a CSQ around the filters when the single MUST clause is a MatchAllDocsQuery. I think one reason why this happens is that apps have a fixed filter and then a configurable query that depends on the user and that defaults to a MatchAllDocsQuery.

          This was a huge issue for Solr where people used q=*:* as query and only applied filter claused (fq), the common case for faceted navigation without query.

          Anyways, we can add another consition to the above set: If any MUST_NOT clause is a MatchAllDocs() we can return MatchNoDocs, too

          Show
          thetaphi Uwe Schindler added a comment - - edited I'm not sure about these rules in particular, but I know some existing rules are useful, such as the one that rewrites to a CSQ around the filters when the single MUST clause is a MatchAllDocsQuery. I think one reason why this happens is that apps have a fixed filter and then a configurable query that depends on the user and that defaults to a MatchAllDocsQuery. This was a huge issue for Solr where people used q=*:* as query and only applied filter claused ( fq ), the common case for faceted navigation without query. Anyways, we can add another consition to the above set: If any MUST_NOT clause is a MatchAllDocs() we can return MatchNoDocs, too
          Hide
          thetaphi Uwe Schindler added a comment -

          Here the patch adding the MatchAllDocs must not query. What do you think? Maybe its not needed because there is a shortcut otherwhere

          Show
          thetaphi Uwe Schindler added a comment - Here the patch adding the MatchAllDocs must not query. What do you think? Maybe its not needed because there is a shortcut otherwhere
          Hide
          thetaphi Uwe Schindler added a comment -

          The inverse (any required clause is a MatchNoDocsQuery) is not needed, because that's already shortcutted when building scorers (socrer == null for any required clause shortcuts).

          Show
          thetaphi Uwe Schindler added a comment - The inverse (any required clause is a MatchNoDocsQuery) is not needed, because that's already shortcutted when building scorers (socrer == null for any required clause shortcuts).
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 35382a815566f8f64e7b26dd4c0ac1129d860887 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=35382a8 ]

          Revert "LUCENE-7416: Rewrite optimizations for BooleanQuery."

          This reverts commit 1ac9609caedbf739379bdabdac909f77fee2f5c6.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 35382a815566f8f64e7b26dd4c0ac1129d860887 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=35382a8 ] Revert " LUCENE-7416 : Rewrite optimizations for BooleanQuery." This reverts commit 1ac9609caedbf739379bdabdac909f77fee2f5c6.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 63b2e8052f4f8bdb28e04d026e8388a0a77fd970 in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=63b2e80 ]

          LUCENE-7416: Make 7.0 only.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 63b2e8052f4f8bdb28e04d026e8388a0a77fd970 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=63b2e80 ] LUCENE-7416 : Make 7.0 only.
          Hide
          jpountz Adrien Grand added a comment -

          Well my primary concern is on 6.x (where the test failed). 6.x is our stable branch but also its more complex due to coord() and queryNorm(). Without those two bad guys around, then these kinds of optimizations are less risky. Still, we should be careful with testing.

          Agreed. I backed it out on the 6.x branch.

          Show
          jpountz Adrien Grand added a comment - Well my primary concern is on 6.x (where the test failed). 6.x is our stable branch but also its more complex due to coord() and queryNorm(). Without those two bad guys around, then these kinds of optimizations are less risky. Still, we should be careful with testing. Agreed. I backed it out on the 6.x branch.
          Hide
          thetaphi Uwe Schindler added a comment -

          Adrien Grand: did you see my latest patch? Should I add this optimization in master, too?

          Show
          thetaphi Uwe Schindler added a comment - Adrien Grand : did you see my latest patch? Should I add this optimization in master, too?
          Hide
          jpountz Adrien Grand added a comment -

          Uwe Schindler Your change makes sense to me!

          Show
          jpountz Adrien Grand added a comment - Uwe Schindler Your change makes sense to me!
          Hide
          thetaphi Uwe Schindler added a comment -

          OK I will commit in a minute.

          Show
          thetaphi Uwe Schindler added a comment - OK I will commit in a minute.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 646b6bfd2c23df36f911a99fd2807b85a961a36b in lucene-solr's branch refs/heads/master from Uwe Schindler
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=646b6bf ]

          LUCENE-7416: Simplify MatchNoDocs rewrite in BQ (using Java 8 streams); add another special case: MUST_NOT with MatchAllDocsQuery also produces no results

          Show
          jira-bot ASF subversion and git services added a comment - Commit 646b6bfd2c23df36f911a99fd2807b85a961a36b in lucene-solr's branch refs/heads/master from Uwe Schindler [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=646b6bf ] LUCENE-7416 : Simplify MatchNoDocs rewrite in BQ (using Java 8 streams); add another special case: MUST_NOT with MatchAllDocsQuery also produces no results

            People

            • Assignee:
              Unassigned
              Reporter:
              spyk Spyros Kapnissis
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development