Lucene - Core
  1. Lucene - Core
  2. LUCENE-167

[PATCH] QueryParser not handling queries containing AND and OR

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/queryparser
    • Labels:
      None
    • Environment:

      Operating System: Linux
      Platform: PC

      Description

      The QueryParser does not seem to handle boolean queries containing AND and OR
      operators correctly:
      e.g.
      a AND b OR c AND d gets parsed as +a +b +c +d.

      The attached patch fixes this by changing the vector of boolean clauses into a
      vector of vectors of boolean clauses in the addClause method of the query
      parser. A new sub-vector is created whenever an explicit OR operator is used.

      Queries using explicit AND/OR are grouped by precedence of AND over OR. That is
      a OR b AND c gets a OR (b AND c).

      Queries using implicit AND/OR (depending on the default operator) are handled as
      before (so one can still use a +b -c to create one boolean query, where b is
      required, c forbidden and a optional).

      It's less clear how a query using both explizit AND/OR and implicit operators
      should be handled.
      Since the patch groups on explicit OR operators a query
      a OR b c is read as a (b c)
      whereas
      a AND b c as +a +b c
      (given that default operator or is used).

      There's one issue left:
      The old query parser reads a query
      `a OR NOT b' as `a -b' which is the same as `a AND NOT b'.
      The modified query parser reads this as `a (-b)'.
      While this looks better (at least to me), it does not produce the result of a OR
      NOT b. Instead the (-b) part seems to be silently dropped.
      While I understand that this query is illegal (just searching for one negative
      term) I don't think that silently dropping this part is an appropriate way to
      deal with that. But I don't think that's a query parser issue.
      The only question is, if the query parser should take care of that.

      I attached the patch (made against 1.3rc3 but working for 1.3final as well) and
      a test program.
      The test program parses a number of queries with default-or and default-and
      operator and reparses the result of the toString method of the created query.
      It outputs the initial query, the parsed query with default or, the reparesed
      query, the parsed query with the default and it's reparsed query.
      If called with a -q option, it also run's the queries against an index
      consisting of all documentes containing one or none a b c or d.
      Using an unpatched and a patched version of lucene in the classpath one can look
      at the effect of the patch in detail.

        Issue Links

          Activity

          Morus Walter created issue -
          Hide
          Morus Walter added a comment -

          Created an attachment (id=9750)
          Patch for QueryParser.jj

          Show
          Morus Walter added a comment - Created an attachment (id=9750) Patch for QueryParser.jj
          Hide
          Morus Walter added a comment -

          Created an attachment (id=9751)
          Test program that shows the result of the query parser

          Show
          Morus Walter added a comment - Created an attachment (id=9751) Test program that shows the result of the query parser
          Hide
          Morus Walter added a comment -

          I extended the patch to avoid ArrayIndexOutOfBoundsExceptions
          in case of stopwords.

          Empty queries and queries containing only prohibited clauses return null
          now.
          Note that this also aplies to subqueries, e.g.
          `a (-b -c)' gets `a' since -b -c only contains prohibited clauses.

          Show
          Morus Walter added a comment - I extended the patch to avoid ArrayIndexOutOfBoundsExceptions in case of stopwords. Empty queries and queries containing only prohibited clauses return null now. Note that this also aplies to subqueries, e.g. `a (-b -c)' gets `a' since -b -c only contains prohibited clauses.
          Hide
          Morus Walter added a comment -

          Created an attachment (id=10051)
          new version of patch taking care of stopwords

          Show
          Morus Walter added a comment - Created an attachment (id=10051) new version of patch taking care of stopwords
          Hide
          Otis Gospodnetic added a comment -

          Thanks for the patch!
          However, I applied your patch locally, but a number of QueryParser unit tests
          are now failing.
          Are they passing for you?

          Here are some failures I'm seeing:

          [junit] Testcase: testSimple(org.apache.lucene.queryParser.TestQueryParser):
          FAILED
          [junit] Query /a OR !b/ yielded /a/, expecting /a -b/
          [junit] junit.framework.AssertionFailedError: Query /a OR !b/ yielded /a/,
          expecting /a -b/
          [junit] at
          org.apache.lucene.queryParser.TestQueryParser.assertQueryEquals(TestQueryParser.java:162)

          [junit] Testcase: testNumber(org.apache.lucene.queryParser.TestQueryParser):
          Caused an ERROR
          [junit] null
          [junit] java.lang.NullPointerException
          [junit] at
          org.apache.lucene.queryParser.TestQueryParser.assertQueryEquals(TestQueryParser.java:160)

          etc.

          Show
          Otis Gospodnetic added a comment - Thanks for the patch! However, I applied your patch locally, but a number of QueryParser unit tests are now failing. Are they passing for you? Here are some failures I'm seeing: [junit] Testcase: testSimple(org.apache.lucene.queryParser.TestQueryParser): FAILED [junit] Query /a OR !b/ yielded /a/, expecting /a -b/ [junit] junit.framework.AssertionFailedError: Query /a OR !b/ yielded /a/, expecting /a -b/ [junit] at org.apache.lucene.queryParser.TestQueryParser.assertQueryEquals(TestQueryParser.java:162) [junit] Testcase: testNumber(org.apache.lucene.queryParser.TestQueryParser): Caused an ERROR [junit] null [junit] java.lang.NullPointerException [junit] at org.apache.lucene.queryParser.TestQueryParser.assertQueryEquals(TestQueryParser.java:160) etc.
          Hide
          Morus Walter added a comment -

          Thanks for you effort Otis.

          I wasn't aware of the test cases and used only my own ones.

          When I run the tests now, I do get these errors.
          I had a closer look and find that these errors are due to intentional changes.

          First I changed the test cases to deal with null queries returned from the query
          parser, that is I changed the function assertQueryEquals to
          public void assertQueryEquals(String query, Analyzer a, String result)
          throws Exception {
          Query q = getQuery(query, a);
          String s = q != null ? q.toString("field") : "null";
          if (!s.equals(result))

          { fail("Query /" + query + "/ yielded /" + s + "/, expecting /" + result + "/"); }

          }

          (the change is the construction of the result string 'String s = ...').

          Then I get the following errors:

          [junit] Testcase:
          testSimple(org.apache.lucene.queryParser.TestQueryParser): FAILED
          [junit] Query /a OR !b/ yielded /a/, expecting /a -b/

          intentional:
          a OR !b means all documents that contain a or that don't contain b.
          So this cannot be expressed by a single boolean query since it would implement a
          AND !b. So a OR !b gets a OR (!b) and !b is a boolean query containing only
          prohibited terms, which cannot be searched.
          Therefor it's dropped.
          I asked how to deal with that on the mailing list
          (http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=6497)
          and Otis suggested to return null for such queries,
          which means that it will get dropped.

          [junit] Testcase:
          testNumber(org.apache.lucene.queryParser.TestQueryParser): FAILED
          [junit] Query /3/ yielded /null/, expecting //
          ...
          [junit] Testcase: testQPA(org.apache.lucene.queryParser.TestQueryParser): FAILED
          [junit] Query /stop/ yielded /null/, expecting //

          Also intentional. Empty queries return null now, instead of a boolean
          query containing no query.
          Same question on the mailing list.

          [junit] Testcase: testBoost(org.apache.lucene.queryParser.TestQueryParser):
          FAILED
          [junit] null
          This is from
          StandardAnalyzer oneStopAnalyzer = new StandardAnalyzer(new String[]

          {"on"}

          );
          QueryParser qp = new QueryParser("field", oneStopAnalyzer);
          Query q = qp.parse("on^1.0");
          assertNotNull(q);
          which is also a query containing only stopwords, that returns null now.

          So I think this raises the question again, what should be the result of an empty
          query?
          A boolean query containing no clauses doesn't make much sense to me (the only
          reason to use this would be compatiblity, since this is the old behavior).
          For queries (and subqueries) containing only stopwords, an alternative would be
          to raise an exception.

          Show
          Morus Walter added a comment - Thanks for you effort Otis. I wasn't aware of the test cases and used only my own ones. When I run the tests now, I do get these errors. I had a closer look and find that these errors are due to intentional changes. First I changed the test cases to deal with null queries returned from the query parser, that is I changed the function assertQueryEquals to public void assertQueryEquals(String query, Analyzer a, String result) throws Exception { Query q = getQuery(query, a); String s = q != null ? q.toString("field") : "null"; if (!s.equals(result)) { fail("Query /" + query + "/ yielded /" + s + "/, expecting /" + result + "/"); } } (the change is the construction of the result string 'String s = ...'). Then I get the following errors: [junit] Testcase: testSimple(org.apache.lucene.queryParser.TestQueryParser): FAILED [junit] Query /a OR !b/ yielded /a/, expecting /a -b/ intentional: a OR !b means all documents that contain a or that don't contain b. So this cannot be expressed by a single boolean query since it would implement a AND !b. So a OR !b gets a OR (!b) and !b is a boolean query containing only prohibited terms, which cannot be searched. Therefor it's dropped. I asked how to deal with that on the mailing list ( http://issues.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=6497 ) and Otis suggested to return null for such queries, which means that it will get dropped. [junit] Testcase: testNumber(org.apache.lucene.queryParser.TestQueryParser): FAILED [junit] Query /3/ yielded /null/, expecting // ... [junit] Testcase: testQPA(org.apache.lucene.queryParser.TestQueryParser): FAILED [junit] Query /stop/ yielded /null/, expecting // Also intentional. Empty queries return null now, instead of a boolean query containing no query. Same question on the mailing list. [junit] Testcase: testBoost(org.apache.lucene.queryParser.TestQueryParser): FAILED [junit] null This is from StandardAnalyzer oneStopAnalyzer = new StandardAnalyzer(new String[] {"on"} ); QueryParser qp = new QueryParser("field", oneStopAnalyzer); Query q = qp.parse("on^1.0"); assertNotNull(q); which is also a query containing only stopwords, that returns null now. So I think this raises the question again, what should be the result of an empty query? A boolean query containing no clauses doesn't make much sense to me (the only reason to use this would be compatiblity, since this is the old behavior). For queries (and subqueries) containing only stopwords, an alternative would be to raise an exception.
          Hide
          cutting@apache.org added a comment -

          It is convenient to have an empty query string parsed as an empty BooleanQuery.
          That makes it much simpler to, e.g., construct a query from a form, where some
          fields may be empty. One needn't check for a null result from the parser, but
          can always unconditionally add the parsed query to the full query being composed.

          Back-compatiblity is also a good reason. Changing this will likely break a lot
          of applications when they upgrade, which is never nice.

          Show
          cutting@apache.org added a comment - It is convenient to have an empty query string parsed as an empty BooleanQuery. That makes it much simpler to, e.g., construct a query from a form, where some fields may be empty. One needn't check for a null result from the parser, but can always unconditionally add the parsed query to the full query being composed. Back-compatiblity is also a good reason. Changing this will likely break a lot of applications when they upgrade, which is never nice.
          Hide
          Jean-François Halleux added a comment -

          I had the same question when I submitted the last two patches to the
          QueryParser: what should an empty query return? This was the cause of some
          internal problems for handling stopwords within the QueryParser.

          Returning null from BooleanQuery broke "only" broke three tests in
          TestQueryParser, but fixed a lot of abnormal behaviors. Testing for an empty
          string appeared more of a dirty hack, but would do the trick too.

          I noticed that they were several patches in the queue for the QueryParser. Is
          there anybody that is taking care of them?

          Jean-Francois Halleux

          Show
          Jean-François Halleux added a comment - I had the same question when I submitted the last two patches to the QueryParser: what should an empty query return? This was the cause of some internal problems for handling stopwords within the QueryParser. Returning null from BooleanQuery broke "only" broke three tests in TestQueryParser, but fixed a lot of abnormal behaviors. Testing for an empty string appeared more of a dirty hack, but would do the trick too. I noticed that they were several patches in the queue for the QueryParser. Is there anybody that is taking care of them? Jean-Francois Halleux
          Hide
          Otis Gospodnetic added a comment -

          One of the Lucene developers will take care of them.
          I noticed there are several of them, which means there will be problems applying
          them.
          I am currently very busy, and will not get to this in at least another week or two.

          Show
          Otis Gospodnetic added a comment - One of the Lucene developers will take care of them. I noticed there are several of them, which means there will be problems applying them. I am currently very busy, and will not get to this in at least another week or two.
          Hide
          Morus Walter added a comment -

          Two comments on Dougs arguments:

          • an empty query will not result in an empty (or null) query but raise a parser
            exception. So if it should, this would have to be changed.
            The only situation where empty (or null) queries occur is, when a query only
            contains stopwords (or, with my patch, contains only prohibited terms).
          • while it may be conveniant in some cases to have an empty query as the result,
            I would like to distinguish between empty queries and queries returning an empty
            result. So if an empty query is returned I'd like to have a simple way of
            testing for that.
            I suggest a 'isEmpty()' method for the query object, which can be implemented in
            the query object always returning false and be overwritten for boolean queries,
            where it returns true unless one of clauses is a nonempty query.
            Currently you can only use the toString() method to test for empty queries or
            have to distinguish the type of query object and if its boolean query look into
            it's details yourself.
          Show
          Morus Walter added a comment - Two comments on Dougs arguments: an empty query will not result in an empty (or null) query but raise a parser exception. So if it should, this would have to be changed. The only situation where empty (or null) queries occur is, when a query only contains stopwords (or, with my patch, contains only prohibited terms). while it may be conveniant in some cases to have an empty query as the result, I would like to distinguish between empty queries and queries returning an empty result. So if an empty query is returned I'd like to have a simple way of testing for that. I suggest a 'isEmpty()' method for the query object, which can be implemented in the query object always returning false and be overwritten for boolean queries, where it returns true unless one of clauses is a nonempty query. Currently you can only use the toString() method to test for empty queries or have to distinguish the type of query object and if its boolean query look into it's details yourself.
          Hide
          Otis Gospodnetic added a comment -

          How about this.
          It seems to me that the optimal way to approach this problem at this point is to
          change the patch to return an empty Query instead of a null one, after all.
          This way we get everything that Doug mentioned (no need to test results, and we
          remain compatible with previous versions). This way we also fix the bug that
          you identified in the first place, which is incorrect parsing of AND/OR query
          expressions, and we also get the solution to only prohibited terms being present
          in the query.

          What we do not get is the way of figuring out why a returned Query is empty...
          and you (Morus) even say that blank Queries will raise parser exceptions.

          So I suggest we patch QueryParser.jj without this last bit for now, and add
          support for that later.

          Show
          Otis Gospodnetic added a comment - How about this. It seems to me that the optimal way to approach this problem at this point is to change the patch to return an empty Query instead of a null one, after all. This way we get everything that Doug mentioned (no need to test results, and we remain compatible with previous versions). This way we also fix the bug that you identified in the first place, which is incorrect parsing of AND/OR query expressions, and we also get the solution to only prohibited terms being present in the query. What we do not get is the way of figuring out why a returned Query is empty... and you (Morus) even say that blank Queries will raise parser exceptions. So I suggest we patch QueryParser.jj without this last bit for now, and add support for that later.
          Hide
          Morus Walter added a comment -

          Ok for me.
          As I said, I'd just like to have an easy way of checking if a query is empty in
          this case.

          If you want me to change the patch, let me know. That no big deal.

          Show
          Morus Walter added a comment - Ok for me. As I said, I'd just like to have an easy way of checking if a query is empty in this case. If you want me to change the patch, let me know. That no big deal.
          Hide
          Erik Hatcher added a comment -

          I tried applying the latest QueryParser.jj patch but it failed. Could you update this patch to the current
          svn version of QueryParser.jj? And we need to have some test cases to prove this works as expected.
          I'll apply it.

          There are other ways to achieve this with JavaCC which I'm exploring myself in a custom query parser.

          Show
          Erik Hatcher added a comment - I tried applying the latest QueryParser.jj patch but it failed. Could you update this patch to the current svn version of QueryParser.jj? And we need to have some test cases to prove this works as expected. I'll apply it. There are other ways to achieve this with JavaCC which I'm exploring myself in a custom query parser.
          Hide
          Morus Walter added a comment -

          OK, I'll look into this.
          But it might take some time (1-2 weeks).

          If you're going to fix it otherwise, let me know.
          Of course it should be able to do the operator precenedence on the javacc level,
          though I'm not sure about the details of the needed recursion.
          But that might end in a major rewrite of qp, something I wanted to avoid, when I
          wrote the patch.

          Show
          Morus Walter added a comment - OK, I'll look into this. But it might take some time (1-2 weeks). If you're going to fix it otherwise, let me know. Of course it should be able to do the operator precenedence on the javacc level, though I'm not sure about the details of the needed recursion. But that might end in a major rewrite of qp, something I wanted to avoid, when I wrote the patch.
          Hide
          Erik Hatcher added a comment -

          I have created a PrecedenceQueryParser in Lucene's trunk repository. This implements the precedence
          with AND/OR as desired with AND taking a higher precedence, and also seems to deal nicely with the
          edge cases mentioned in the comments of this issue such that A OR B C translates to A B C.

          Let me know if PQP is acceptable and I'll make the switch and rename it to QueryParser moving the old
          QueryParser to OldQueryParser and deprecating it in case anyone needs the previous behavior.

          Show
          Erik Hatcher added a comment - I have created a PrecedenceQueryParser in Lucene's trunk repository. This implements the precedence with AND/OR as desired with AND taking a higher precedence, and also seems to deal nicely with the edge cases mentioned in the comments of this issue such that A OR B C translates to A B C. Let me know if PQP is acceptable and I'll make the switch and rename it to QueryParser moving the old QueryParser to OldQueryParser and deprecating it in case anyone needs the previous behavior.
          Jeff Turner made changes -
          Field Original Value New Value
          issue.field.bugzillaimportkey 25820 12314317
          Hide
          Christoph Bammann added a comment -

          Hi, I would like to know if this new Parser is integrated or even
          the standard QueryParser in current releases?

          Show
          Christoph Bammann added a comment - Hi, I would like to know if this new Parser is integrated or even the standard QueryParser in current releases?
          Hide
          Erik Hatcher added a comment -

          the PrecedenceQueryParser is in the contrib/miscellaneous codebase (in Lucene's repo) and in released "miscellaneous" JAR. But it has some issues that are documented in the test case, so it is definitely not ready for prime time.

          Show
          Erik Hatcher added a comment - the PrecedenceQueryParser is in the contrib/miscellaneous codebase (in Lucene's repo) and in released "miscellaneous" JAR. But it has some issues that are documented in the test case, so it is definitely not ready for prime time.
          Hide
          Graham Maloon added a comment -

          I see that very little has been done with this since 2005. Are there any plans to incorporate a fix into the current build? How can I get my hands on a copy of the fix that will work with 2.3.0?

          Show
          Graham Maloon added a comment - I see that very little has been done with this since 2005. Are there any plans to incorporate a fix into the current build? How can I get my hands on a copy of the fix that will work with 2.3.0?
          Luis Alves made changes -
          Link This issue is part of LUCENE-1823 [ LUCENE-1823 ]
          Hide
          Luis Alves added a comment -

          This feature is part of the new queryparser implementation
          It is disabled by the by default, to implement the lucene 2.4 behaviour.
          It can be enabled by removing the GroupQueryNodeProcessor from the StandardQueryNodeProcessorPipeline.

          In LUCENE-1823 we plan to expose it in a new syntax parser and also handle OR, AND, NOT as logical operators with precedence.

          Would be this enough to address this issue?

          Show
          Luis Alves added a comment - This feature is part of the new queryparser implementation It is disabled by the by default, to implement the lucene 2.4 behaviour. It can be enabled by removing the GroupQueryNodeProcessor from the StandardQueryNodeProcessorPipeline. In LUCENE-1823 we plan to expose it in a new syntax parser and also handle OR, AND, NOT as logical operators with precedence. Would be this enough to address this issue?
          Hide
          Morus Walter added a comment -

          From my point of view (as the one who opened the ticket) this seems sufficient, so I'm closing the ticket.

          Sorry, I was not aware that it is still open.

          Show
          Morus Walter added a comment - From my point of view (as the one who opened the ticket) this seems sufficient, so I'm closing the ticket. Sorry, I was not aware that it is still open.
          Morus Walter made changes -
          Status Open [ 1 ] Closed [ 6 ]
          Resolution Fixed [ 1 ]
          Mark Thomas made changes -
          Workflow jira [ 12324322 ] Default workflow, editable Closed status [ 12564440 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12564440 ] jira [ 12585717 ]
          Steve Rowe made changes -
          Affects Version/s unspecified [ 12310280 ]

            People

            • Assignee:
              Erik Hatcher
              Reporter:
              Morus Walter
            • Votes:
              2 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development