Lucene - Core
  1. Lucene - Core
  2. LUCENE-933

QueryParser can produce empty sub BooleanQueries when Analyzer proudces no tokens for input

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      as triggered by SOLR-261, if you have a query like this...

      +foo:BBB +(yak:AAA baz:CCC)

      ...where the analyzer produces no tokens for the "yak:AAA" or "baz:CCC" portions of the query (posisbly because they are stop words) the resulting query produced by the QueryParser will be...

      +foo:BBB +()

      ...that is a BooleanQuery with two required clauses, one of which is an empty BooleanQuery with no clauses.

      this does not appear to be "good" behavior.

      In general, QueryParser should be smarter about what it does when parsing encountering parens whose contents result in an empty BooleanQuery – but what exactly it should do in the following situations...

      a) +foo:BBB +()
      b) +foo:BBB ()
      c) +foo:BBB -()

      ...is up for interpretation. I would think situation (b) clearly lends itself to dropping the sub-BooleanQuery completely. situation (c) may also lend itself to that solution, since semanticly it means "don't allow a match on any queries in the empty set of queries". .... I have no idea what the "right" thing to do for situation (a) is.

        Issue Links

          Activity

          Hoss Man created issue -
          Hoss Man made changes -
          Field Original Value New Value
          Link This issue blocks SOLR-261 [ SOLR-261 ]
          Hide
          Hoss Man added a comment -

          linking to origin of report from SOLR

          Show
          Hoss Man added a comment - linking to origin of report from SOLR
          Hide
          Doron Cohen added a comment -

          > a) +foo:BBB +()
          > I have no idea what the "right" thing to do for situation (a) is.

          Interestingly, see TestQueryParser.testQPA():
          assertQueryEquals("term +stop term", qpAnalyzer, "term term");
          assertQueryEquals("term -stop term", qpAnalyzer, "term term");

          So today already requiring word W to not/appear become a non-requirement in case W is a stopword.

          Currently adding any of these would cause failure:
          assertQueryEquals("term +(stop) term", qpAnalyzer, "term term");
          assertQueryEquals("term -(stop) term", qpAnalyzer, "term term");
          assertQueryEquals("term +(stop stop) term", qpAnalyzer, "term term");
          assertQueryEquals("term -(stop stop) term", qpAnalyzer, "term term");

          I feel comfortable with applying the logic we have for a single (stop)word on a group of (stop)words, i.e. making the added lines pass.

          Interestingly, consider this query:
          A B +(+C -C)
          Regularly it would have no match, because
          X AND NOT X == FALSE
          but if C is a stopword, with the fixed logic the query would become:
          A B
          and might have matches.
          Now is that a glitch? I'd like to think not.

          Show
          Doron Cohen added a comment - > a) +foo:BBB +() > I have no idea what the "right" thing to do for situation (a) is. Interestingly, see TestQueryParser.testQPA(): assertQueryEquals("term +stop term", qpAnalyzer, "term term"); assertQueryEquals("term -stop term", qpAnalyzer, "term term"); So today already requiring word W to not/appear become a non-requirement in case W is a stopword. Currently adding any of these would cause failure: assertQueryEquals("term +(stop) term", qpAnalyzer, "term term"); assertQueryEquals("term -(stop) term", qpAnalyzer, "term term"); assertQueryEquals("term +(stop stop) term", qpAnalyzer, "term term"); assertQueryEquals("term -(stop stop) term", qpAnalyzer, "term term"); I feel comfortable with applying the logic we have for a single (stop)word on a group of (stop)words, i.e. making the added lines pass. Interestingly, consider this query: A B +(+C -C) Regularly it would have no match, because X AND NOT X == FALSE but if C is a stopword, with the fixed logic the query would become: A B and might have matches. Now is that a glitch? I'd like to think not.
          Hide
          Hoss Man added a comment -

          > I feel comfortable with applying the logic we have for a single (stop)word on a group of
          > (stop)words, i.e. making the added lines pass.

          +1

          > Interestingly, consider this query:
          > A B +(+C -C)

          perhaps an alternate way to view this problem would be to ask: what should QueryParser do, if asked to parse this string...
          A B +()

          ...if the answer is "treat it like 'A B'" then i think we're okay with the approach you described above. if the answer is "an empty query doesn't match anything, so requiring a match on a clause which is an empty query should result in the outer query matching nothing" then we've got a problem ... mainly that it contradicts the example you cited from TestQueryParser.testQPA() if you replace "an empty query" in the previous statement with "a query on a stop word"

          personally, i think it's okay to say "A B +(+C -C)" == "A B" if the analyzer doesn't produce any tokens for C.

          Show
          Hoss Man added a comment - > I feel comfortable with applying the logic we have for a single (stop)word on a group of > (stop)words, i.e. making the added lines pass. +1 > Interestingly, consider this query: > A B +(+C -C) perhaps an alternate way to view this problem would be to ask: what should QueryParser do, if asked to parse this string... A B +() ...if the answer is "treat it like 'A B'" then i think we're okay with the approach you described above. if the answer is "an empty query doesn't match anything, so requiring a match on a clause which is an empty query should result in the outer query matching nothing" then we've got a problem ... mainly that it contradicts the example you cited from TestQueryParser.testQPA() if you replace "an empty query" in the previous statement with "a query on a stop word" personally, i think it's okay to say "A B +(+C -C)" == "A B" if the analyzer doesn't produce any tokens for C.
          Doron Cohen made changes -
          Assignee Doron Cohen [ doronc ]
          Hide
          Doron Cohen added a comment -

          So an acceptable solution is:
          Query parser will ignore empty clauses (e.g. ' ( ) ' ) resulted from words filtering, the same as it already does for single words.

          A straightforward fix is for QueryParser to avoid adding null (inner) queries into (outer) clauses sets. (It makes sense, too.)

          However this has a side effect:
          For queries that became "empty" as result of filtering (stopping), QueryParser would now return null.

          This is an API semantics change, because applications that used to get a BooleanQuery with 0 clauses as parse result, would now get a null query.

          Here is a closer look on the behavior change:

          Original behavior:
          (1) parse(" ") == ParseException
          (2) parse("( )") == ParseException
          (3) parse("stop") == " "
          (actually a boolean query with 0 clauses)
          (4) parse("(stop)") == " "
          (actually a boolean query with 0 clauses)
          (5) parse("a stop b") == "a b"
          (6) parse("a (stop) b") == "a () b"
          (middle part is a boolean query with 0 clauses)
          (7) parse("a ((stop)) b") == "a () b"
          (again middle part is a boolean query with 0 clauses)

          Modified behavior:
          (3) parse("stop") == null
          (4) parse("(stop)") == null
          (6) parse("a (stop) b") == "a b"
          (7) parse("a ((stop)) b") == "a b"

          I think the modified behavior is the right one - applications can test a query for being null and realize that it is a no-op.

          However backwards compatibility is important - would this change break existing applications with annoying new NPEs?

          As an alternative, QueryParser parse() methods can be modified to return a phony empty BQ instead of returning null, for the sake of backwards compatibility.

          Thoughts?

          Show
          Doron Cohen added a comment - So an acceptable solution is: Query parser will ignore empty clauses (e.g. ' ( ) ' ) resulted from words filtering, the same as it already does for single words. A straightforward fix is for QueryParser to avoid adding null (inner) queries into (outer) clauses sets. (It makes sense, too.) However this has a side effect: For queries that became "empty" as result of filtering (stopping), QueryParser would now return null. This is an API semantics change, because applications that used to get a BooleanQuery with 0 clauses as parse result, would now get a null query. Here is a closer look on the behavior change: Original behavior: (1) parse(" ") == ParseException (2) parse("( )") == ParseException (3) parse("stop") == " " (actually a boolean query with 0 clauses) (4) parse("(stop)") == " " (actually a boolean query with 0 clauses) (5) parse("a stop b") == "a b" (6) parse("a (stop) b") == "a () b" (middle part is a boolean query with 0 clauses) (7) parse("a ((stop)) b") == "a () b" (again middle part is a boolean query with 0 clauses) Modified behavior: (3) parse("stop") == null (4) parse("(stop)") == null (6) parse("a (stop) b") == "a b" (7) parse("a ((stop)) b") == "a b" I think the modified behavior is the right one - applications can test a query for being null and realize that it is a no-op. However backwards compatibility is important - would this change break existing applications with annoying new NPEs? As an alternative, QueryParser parse() methods can be modified to return a phony empty BQ instead of returning null, for the sake of backwards compatibility. Thoughts?
          Hide
          Doron Cohen added a comment -

          Ok attaching two different fixes (as discussed above)
          (1) lucene-933_backwards_comapatible.patch
          (2) lucene-933_nullify.patch

          All tests pass with either of these.

          The "nullify" approach requires more changes especially tests as well as in MemoryIndex, so, after while fixing as required for tests to pass in this (nullifying) approach I cane to conclusion that it is better to continue to not return null queries as result of parsing, otherwise there'll be lots of "noise".

          So I would like to commit patch (1) - unless someone points a problem that I missed.

          Show
          Doron Cohen added a comment - Ok attaching two different fixes (as discussed above) (1) lucene-933_backwards_comapatible.patch (2) lucene-933_nullify.patch All tests pass with either of these. The "nullify" approach requires more changes especially tests as well as in MemoryIndex, so, after while fixing as required for tests to pass in this (nullifying) approach I cane to conclusion that it is better to continue to not return null queries as result of parsing, otherwise there'll be lots of "noise". So I would like to commit patch (1) - unless someone points a problem that I missed.
          Doron Cohen made changes -
          Attachment lucene-933_backwards_comapatible.patch [ 12360237 ]
          Attachment lucene-933_nullify.patch [ 12360238 ]
          Hide
          Doron Cohen added a comment -

          committed the bakwards-compatible patch (parsed query is not null).

          Show
          Doron Cohen added a comment - committed the bakwards-compatible patch (parsed query is not null).
          Doron Cohen made changes -
          Lucene Fields [New] [Patch Available]
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          Hide
          Hoss Man added a comment -

          woops ... sorry doron, i actually reviewed these patches the other day, but aparently i got side tracked and never commented.

          i think you made the right choice with the backwards_comapatible.patch

          Show
          Hoss Man added a comment - woops ... sorry doron, i actually reviewed these patches the other day, but aparently i got side tracked and never commented. i think you made the right choice with the backwards_comapatible.patch
          Hide
          Doron Cohen added a comment -

          great, thanks Hoss!

          Show
          Doron Cohen added a comment - great, thanks Hoss!
          Mark Thomas made changes -
          Workflow jira [ 12406438 ] Default workflow, editable Closed status [ 12562556 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12562556 ] jira [ 12583519 ]

            People

            • Assignee:
              Doron Cohen
              Reporter:
              Hoss Man
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development