  1. Lucene - Core
  2. LUCENE-7337

MultiTermQuery are sometimes rewritten into an empty boolean query

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2, 7.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      A MultiTermQuery is sometimes rewritten to an empty boolean query, depending on the rewrite method. This can happen, for instance, when no expansions are found for a fuzzy query.
      This is problematic when the multi-term query is boosted.
      For instance, consider the following query:

      `((title:bar~1)^100 text:bar)`

      This is a boolean query with two optional clauses. The first one is a fuzzy query on the field title with a boost of 100.
      If there is no expansion for "title:bar~1" the query is rewritten into:

      `(()^100 text:bar)`

      ... and when expansions are found:

      `((title:bars | title:bar)^100 text:bar)`

      The scoring of those two queries will differ. For the first query, the normalization factor and the norm will both be equal to 1: the boost is ignored because the empty boolean query is not taken into account when computing the normalization factor. The second query will have a normalization factor of 10,000 (100*100) and a norm equal to 0.01.
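
The figures above are consistent with the classic TF-IDF query normalization, 1/sqrt(sumOfSquaredWeights). Below is a minimal Java sketch of that arithmetic only. The class and method names are hypothetical, and the inputs 1 and 10,000 are taken from the description rather than computed from a real index.

```java
public class QueryNormSketch {
    // Classic TF-IDF style query normalization: 1 / sqrt(sumOfSquaredWeights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        float boost = 100f;

        // Expansions found: the boosted clause contributes boost * boost.
        float withExpansions = boost * boost;

        // No expansions: the description reports a normalization factor of 1,
        // because the empty boolean query (and its boost) is not counted.
        float withoutExpansions = 1f;

        System.out.println("norm with expansions:    " + queryNorm(withExpansions));
        System.out.println("norm without expansions: " + queryNorm(withoutExpansions));
    }
}
```

The same clause therefore scales scores by a factor that differs by two orders of magnitude depending on whether a given segment or index happened to produce an expansion.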

      This kind of discrepancy can happen in a single index because the expansions for the fuzzy query are done at the segment level. It can also happen when multiple indices are requested (Solr/ElasticSearch case).

      A simple fix would be to replace the empty boolean query produced by the multi term query with a MatchNoDocsQuery but I am not sure that it's the best way to fix. WDYT ?

      1. LUCENE-7337.patch
        9 kB
        Michael McCandless

        Activity

        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.

        jira-bot ASF subversion and git services added a comment -

        Commit a3fc7efbccfa547add864e58268e40960bff571b in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a3fc7ef ]

        LUCENE-7337: empty boolean query now rewrites to MatchNoDocsQuery instead of vice/versa

        jira-bot ASF subversion and git services added a comment -

        Commit 7b5d7b396254998c0d4d1a6139134639aea1904f in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7b5d7b3 ]

        LUCENE-7337: empty boolean query now rewrites to MatchNoDocsQuery instead of vice/versa

        jim.ferenczi Jim Ferenczi added a comment - - edited

        Wooo thanks Michael McCandless

        I think getting proper distributed queries working is really out of scope here: that would really require a distributed rewrite to work correctly.

        Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway, and I think it's more important to make an empty-clause boolean query behave exactly the same as the MatchNoDocsQuery.

        mikemccand Michael McCandless added a comment -

        OK, here's a patch, giving MatchNoDocsQuery its own Weight
        that returns 0 for queryNorm, and fixing an empty BooleanQuery
        to rewrite to it.

        Scoring wise, this behaves the same as an empty-clause boolean query,
        and I think this will make LUCENE-7276 much easier!

        It can also happen when multiple indices are requested (Solr/ElasticSearch case).

        I think getting proper distributed queries working is really out of
        scope here: that would really require a distributed rewrite to work
        correctly.

        I think this patch is a good baby-step.
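
As a rough illustration of why the two forms can score identically, here is a self-contained toy model in Java. It does not use Lucene's API: `ToyWeight`, `MatchNoDocsWeight`, `TermWeight` and `BoostedBooleanWeight` are hypothetical names, and the rule "boost squared times the sum over the clauses" is a simplifying assumption about how the value for normalization is accumulated.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NormalizationSketch {
    /** Hypothetical stand-in for a query weight; not Lucene's Weight class. */
    interface ToyWeight {
        float valueForNormalization();
    }

    /** Matches nothing, so it contributes 0 to the normalization sum. */
    static final class MatchNoDocsWeight implements ToyWeight {
        @Override public float valueForNormalization() { return 0f; }
    }

    /** A single term with an assumed raw weight (for example an idf of 1). */
    static final class TermWeight implements ToyWeight {
        private final float rawWeight;
        TermWeight(float rawWeight) { this.rawWeight = rawWeight; }
        @Override public float valueForNormalization() { return rawWeight * rawWeight; }
    }

    /** Boosted boolean weight: boost^2 times the sum over its clauses. */
    static final class BoostedBooleanWeight implements ToyWeight {
        private final float boost;
        private final List<ToyWeight> clauses;
        BoostedBooleanWeight(float boost, List<ToyWeight> clauses) {
            this.boost = boost;
            this.clauses = clauses;
        }
        @Override public float valueForNormalization() {
            float sum = 0f;
            for (ToyWeight clause : clauses) {
                sum += clause.valueForNormalization();
            }
            return boost * boost * sum;
        }
    }

    /** Boosted boolean query with no clauses at all. */
    static float emptyBooleanValue(float boost) {
        return new BoostedBooleanWeight(boost, Collections.<ToyWeight>emptyList())
            .valueForNormalization();
    }

    /** Boosted clause whose only content is the match-nothing weight. */
    static float matchNoDocsValue(float boost) {
        return new BoostedBooleanWeight(boost, Arrays.<ToyWeight>asList(new MatchNoDocsWeight()))
            .valueForNormalization();
    }

    /** Boosted clause with one expanded term of raw weight 1. */
    static float expandedValue(float boost) {
        return new BoostedBooleanWeight(boost, Arrays.<ToyWeight>asList(new TermWeight(1f)))
            .valueForNormalization();
    }

    public static void main(String[] args) {
        System.out.println("empty boolean: " + emptyBooleanValue(100f));
        System.out.println("match no docs: " + matchNoDocsValue(100f));
        System.out.println("one expansion: " + expandedValue(100f));
    }
}
```

In this model the empty boolean clause and the match-nothing clause both contribute 0 whatever the boost, which is the sense in which the replacement is scoring-neutral. The gap to the expanded case remains, matching the remark that the distributed case is out of scope.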

        dsmiley David Smiley added a comment -

        I do really like your idea of having an empty clause BQ rewrite to MatchNoDocsQuery: I think we should have one, unambiguous query class that's used for this "matches nothing" rewrite case, if we can get the scoring to work out correctly!

        +1 ! Empty BQ is weird.

        mikemccand Michael McCandless added a comment -

        Really, the max score that MatchNoDocsQuery can return is undefined, right, since it returns nothing? (I.e., the max value over an empty set of elements is not defined.)

        Maybe, instead of adding a new query that also matches no documents, we could just enhance the existing one so you could pass it the norm factor you'd like it to "use"?

        I do really like your idea of having an empty clause BQ rewrite to MatchNoDocsQuery: I think we should have one, unambiguous query class that's used for this "matches nothing" rewrite case, if we can get the scoring to work out correctly!

        jim.ferenczi Jim Ferenczi added a comment -

        A simple fix would be to replace the empty boolean query produced by the multi term query with a MatchNoDocsQuery but I am not sure that it's the best way to fix.

        I am not sure of this statement anymore. Conceptually, a MatchNoDocsQuery and a BooleanQuery with no clauses are similar. However, what I proposed assumed that the value for normalization of the MatchNoDocsQuery is 1. I think that doing this would cause confusion, since this value is supposed to reflect the max score that the query can get (which is 0 in this case). Currently, a boolean query or a disjunction query with no clauses returns 0 for the normalization. I think that is the expected behavior, even though it breaks the distributed case as explained in my previous comment.
        For empty queries that are the result of an expansion (multi-term query), maybe we could add yet another special query, something like MatchNoExpansionQuery, that would use a ConstantScoreWeight? I am proposing this because it would distinguish between a query that matches no documents no matter what the context is and a query that matches no documents because of the context (useful for the distributed case).

        mikemccand Michael McCandless added a comment -

        A simple fix would be to replace the empty boolean query produced by the multi term query with a MatchNoDocsQuery but I am not sure that it's the best way to fix.

        +1

        Or more generally can we have an empty-clause BQ rewrite to MatchNoDocsQuery? I had folded this into my attempt to fix the world's-hardest-toString-issue (LUCENE-7276) but it was too many changes to try at once, so breaking it out here is great.

        However, before we can do this, we need to fix MatchNoDocsQuery so that it does not rewrite to an empty BQ, or else we get a never-terminating rewrite.
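
The rewrite cycle can be sketched with a toy model in Java. Nothing here is Lucene code: `ToyQuery`, `MatchNoDocs`, `EmptyBoolean` and `rewriteSteps` are hypothetical names, and the loop only assumes that a searcher keeps calling `rewrite()` until a query returns itself. A step limit is added so that the cyclic case can be observed without hanging.

```java
public class RewriteLoopSketch {
    /** Hypothetical toy query; rewrite() returns itself once fully rewritten. */
    interface ToyQuery {
        ToyQuery rewrite();
    }

    static final class MatchNoDocs implements ToyQuery {
        private final boolean rewritesToEmptyBoolean;
        MatchNoDocs(boolean rewritesToEmptyBoolean) {
            this.rewritesToEmptyBoolean = rewritesToEmptyBoolean;
        }
        @Override public ToyQuery rewrite() {
            // true models the behaviour that has to be fixed first;
            // false models a query that is already in its final form.
            return rewritesToEmptyBoolean ? new EmptyBoolean(true) : this;
        }
    }

    static final class EmptyBoolean implements ToyQuery {
        private final boolean targetRewritesBack;
        EmptyBoolean(boolean targetRewritesBack) {
            this.targetRewritesBack = targetRewritesBack;
        }
        @Override public ToyQuery rewrite() {
            // The proposed change: an empty boolean query rewrites to MatchNoDocs.
            return new MatchNoDocs(targetRewritesBack);
        }
    }

    /** Rewrites to a fixed point; returns the step count, or -1 if maxSteps is exceeded. */
    static int rewriteSteps(ToyQuery query, int maxSteps) {
        int steps = 0;
        for (ToyQuery next = query.rewrite(); next != query; next = query.rewrite()) {
            query = next;
            if (++steps > maxSteps) {
                return -1; // no fixed point reached: the two rewrites feed each other
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println("cyclic: " + rewriteSteps(new EmptyBoolean(true), 1000));
        System.out.println("fixed:  " + rewriteSteps(new EmptyBoolean(false), 1000));
    }
}
```

With both rewrites pointing at each other the loop never reaches a fixed point. Once the match-nothing query returns itself, the empty boolean query settles after a single step.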


          People

          • Assignee:
            Unassigned
            Reporter:
            jim.ferenczi Jim Ferenczi