Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-800

Incorrect parsing by QueryParser.parse() when it encounters backslashes (always eats one backslash.)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Test code and output follow. Tested Lucene 1.9 version only. Affects hose who would index/search for Lucene's reserved characters.

      Description: When an input search string has a sequence of N (java-escaped) backslashes, where N >= 2, the QueryParser will produce a query in which that sequence has N-1 backslashes.

      TEST CODE:
      Analyzer analyzer = new WhitespaceAnalyzer();
      String[] queryStrs =

      {"item:\\\\", "item:\\\\*", "(item:\\\\ item:ABCD\\\\))", "(item:\\\\ item:ABCD\\\\)"}

      ;
      for (String queryStr : queryStrs) {
      System.out.println("--------------------------------------");
      System.out.println("String queryStr = " + queryStr);
      Query luceneQuery = null;
      try

      { luceneQuery = new QueryParser("_default_", analyzer).parse(queryStr); System.out.println("luceneQuery.toString() = " + luceneQuery.toString()); }

      catch (Exception e)

      { System.out.println(e.getClass().toString()); }

      }

      OUTPUT (with remarks in comment notation
      --------------------------------------
      String queryStr = item:
      luceneQuery.toString() = item:\ //One backslash has disappeared. Searcher will fail on this query.
      --------------------------------------
      String queryStr = item:
      *
      luceneQuery.toString() = item:* //One backslash has disappeared. This query will search for something unintended.
      --------------------------------------
      String queryStr = (item:
      item:ABCD
      ))
      luceneQuery.toString() = item:\ item:ABCD) //This should have thrown a ParseException because of an unescaped ')'. It did not.
      --------------------------------------
      String queryStr = (item:
      item:ABCD
      )
      class org.apache.lucene.queryParser.ParseException //...and this one should not have, but it did.

      1. Lucene-800.patch
        11 kB
        Michael Busch
      2. Lucene-800-more-tests.patch
        3 kB
        Doron Cohen

        Activity

        Hide
        doronc Doron Cohen added a comment -

        Michael, I've been looking into this and think I made some progress. Are you just starting, or do you have it solved already?

        Show
        doronc Doron Cohen added a comment - Michael, I've been looking into this and think I made some progress. Are you just starting, or do you have it solved already?
        Hide
        michaelbusch Michael Busch added a comment -

        Hi Dilip,

        the backslash is the escape character in Lucene's queryparser syntax. So if you want to search for a backslash you have to escape it. That means that the first two examples you provides are working as expected:

        item:
        -> item:\ is correct
        item:
        * -> item:* is correct too

        If you want to search for two backslashes you have to escape both, meaning you have to put four backslashes in the query string:
        item:\\\\* -> item:
        *

        But you indeed found two other problems. You are right, the last example should not throw a ParseException.
        In (item:
        item:ABCD
        ) the queryparser falsely thinks that the closing parenthesis is escaped, but actually the backslash is the escaped character. I will provide a patch for this problem soon.

        And as you said the third example should throw a ParseException because there are too many closing parenthesis. There is already a patch for this problem in JIRA:
        http://issues.apache.org/jira/browse/LUCENE-372

        I will commit fixes for both problems soon.

        Thanks again, Dilip! Good catches

        Show
        michaelbusch Michael Busch added a comment - Hi Dilip, the backslash is the escape character in Lucene's queryparser syntax. So if you want to search for a backslash you have to escape it. That means that the first two examples you provides are working as expected: item: -> item:\ is correct item: * -> item:* is correct too If you want to search for two backslashes you have to escape both, meaning you have to put four backslashes in the query string: item:\\\\* -> item: * But you indeed found two other problems. You are right, the last example should not throw a ParseException. In (item: item:ABCD ) the queryparser falsely thinks that the closing parenthesis is escaped, but actually the backslash is the escaped character. I will provide a patch for this problem soon. And as you said the third example should throw a ParseException because there are too many closing parenthesis. There is already a patch for this problem in JIRA: http://issues.apache.org/jira/browse/LUCENE-372 I will commit fixes for both problems soon. Thanks again, Dilip! Good catches
        Hide
        michaelbusch Michael Busch added a comment -

        Doron,

        the problem here is that a backslash is a valid TERM_CHAR and an ESCAPE_CHAR at the same time. The fix is to exclude \ from the TERM_CHAR list. I tried this fix and it works fine for me. I'm going to attach a patch today. Would be great if you could review it before I commit it, Doron!

        Show
        michaelbusch Michael Busch added a comment - Doron, the problem here is that a backslash is a valid TERM_CHAR and an ESCAPE_CHAR at the same time. The fix is to exclude \ from the TERM_CHAR list. I tried this fix and it works fine for me. I'm going to attach a patch today. Would be great if you could review it before I commit it, Doron!
        Hide
        dnimkar Dilip Nimkar added a comment -

        In my test code, I took care of the difference between \ as the Java escape character and \ as the Lucene escape character.

        System.out.println(new QueryParser("default", analyzer).parse( "item:\\\\")) //note the 4 backslashes.
        should print on the console item:
        But it is printing item:\
        Same is the case with the second string in the test code.

        in general, the boolean test
        str.equals(QueryParser("default", analyzer).parse( str).toString())
        should always evaluate to true if the analyzer is not changing the string. But in our case it is evaluating to false.

        The behavior I have consitently found is that - "Whenever and wherever a java String contains an unbroken sequence of N escaped backslashes (that is, N pairs of unescaped backslashes, totalling 2N backslashes) where N>= 2, the parse() method creates a Query that has only n-1 escaped backslashes in the corresponding place. " If you have 20 escaped backslashes in a java string, the Lucene query will end up with 19.

        Thank you much for your time, attention and efforts.
        Thanks.

        Show
        dnimkar Dilip Nimkar added a comment - In my test code, I took care of the difference between \ as the Java escape character and \ as the Lucene escape character. System.out.println(new QueryParser(" default ", analyzer).parse( "item:\\\\")) //note the 4 backslashes. should print on the console item: But it is printing item:\ Same is the case with the second string in the test code. in general, the boolean test str.equals(QueryParser(" default ", analyzer).parse( str).toString()) should always evaluate to true if the analyzer is not changing the string. But in our case it is evaluating to false. The behavior I have consitently found is that - "Whenever and wherever a java String contains an unbroken sequence of N escaped backslashes (that is, N pairs of unescaped backslashes, totalling 2N backslashes) where N>= 2, the parse() method creates a Query that has only n-1 escaped backslashes in the corresponding place. " If you have 20 escaped backslashes in a java string, the Lucene query will end up with 19. Thank you much for your time, attention and efforts. Thanks.
        Hide
        michaelbusch Michael Busch added a comment -

        Dilip,

        are you using Lucene 1.9? The problem you are referring to (a sequence of N escaped backslashes) has been fixed in Lucene 2.1:
        http://issues.apache.org/jira/browse/LUCENE-573

        Could you test your code with the new version, please?

        However, the two other problems you pointed out and which I talked about in my previous comment are still there (but I'm working on it )

        Thanks,
        Michael

        Show
        michaelbusch Michael Busch added a comment - Dilip, are you using Lucene 1.9? The problem you are referring to (a sequence of N escaped backslashes) has been fixed in Lucene 2.1: http://issues.apache.org/jira/browse/LUCENE-573 Could you test your code with the new version, please? However, the two other problems you pointed out and which I talked about in my previous comment are still there (but I'm working on it ) Thanks, Michael
        Hide
        michaelbusch Michael Busch added a comment -

        just lowering the severity to minor

        Show
        michaelbusch Michael Busch added a comment - just lowering the severity to minor
        Hide
        michaelbusch Michael Busch added a comment -

        With this patch a query like
        (item:
        item:ABCD
        )
        does not throw a ParseException anymore. I excluded the backslash from the TERM_CHAR list, because a backslash should always be escaped.

        I also changed the list ESCAPED_CHAR. Every character that follows a backslash should be considered as escaped. Until now, the query \a would cause a ParseException, the query + would work fine, which is not consistent. So now every character following a backslash is an ESCAPED_CHAR. Any objections?

        All unit tests pass.

        Show
        michaelbusch Michael Busch added a comment - With this patch a query like (item: item:ABCD ) does not throw a ParseException anymore. I excluded the backslash from the TERM_CHAR list, because a backslash should always be escaped. I also changed the list ESCAPED_CHAR. Every character that follows a backslash should be considered as escaped. Until now, the query \a would cause a ParseException, the query + would work fine, which is not consistent. So now every character following a backslash is an ESCAPED_CHAR. Any objections? All unit tests pass.
        Hide
        doronc Doron Cohen added a comment -

        Hi Michael, I reviewed this fix and it looks good and correct.
        All tests are passing, including the new ones. (well, a few backwards compatibility tests fail - I would check that later - but it is unrelated to this fix).
        While reviewing I added a few test cases just to make sure - attached Lucene-800-more-tests.patch in case you find that worthy to add.
        Regards,
        Doron

        Show
        doronc Doron Cohen added a comment - Hi Michael, I reviewed this fix and it looks good and correct. All tests are passing, including the new ones. (well, a few backwards compatibility tests fail - I would check that later - but it is unrelated to this fix). While reviewing I added a few test cases just to make sure - attached Lucene-800-more-tests.patch in case you find that worthy to add. Regards, Doron
        Hide
        michaelbusch Michael Busch added a comment -

        Thanks Doron for reviewing and for the additional tests!

        I just committed this and LUCENE-372. Together these patches fix the two problems descibed in this issue.

        Show
        michaelbusch Michael Busch added a comment - Thanks Doron for reviewing and for the additional tests! I just committed this and LUCENE-372 . Together these patches fix the two problems descibed in this issue.

          People

          • Assignee:
            michaelbusch Michael Busch
            Reporter:
            dnimkar Dilip Nimkar
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development