Lucene - Core
  1. Lucene - Core
  2. LUCENE-573

Escaped quotes inside a phrase cause a ParseException

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: None
    • Component/s: core/queryparser
    • Labels:
      None
    • Environment:

      Debian Sarge, Sun JDK 1.4.2

    • Lucene Fields:
      Patch Available

      Description

      QueryParser cannot handle escaped quotes when inside a phrase. Escaped quotes not in a phrase are not a problem. This can be added to TestQueryParser.testEscaped() to demonstrate the issue - the second assert throws an exception:

      assertQueryEquals("a \\\"b c\\\" d", a, "a \"b c\" d");
      assertQueryEquals("\"a \\\"b c\\\" d\"", a, "\"a \"b c\" d\"");

      See also this thread:
      http://www.nabble.com/ParseException-with-escaped-quotes-in-a-phrase-t1647115.html

        Activity

        Hide
        Michael Busch added a comment -

        Tomislav,

        thanks for finding this problem in the queryparser. I attach a patch file that fixes this bug. Now the queryparser does not recognize an escaped quote inside a phrase as the phrase terminator anymore.

        I also updated the testcase org.apache.lucene.queryParser.TestQueryParser to test escaped quotes within phrases. This testcase fails with the old version of the queryparser and runs successful with the patched version. I added the following three tests to the testEscaped() method:

        (1) assertQueryEquals("a \\\"b c\\\" d", a, "a \"b c\" d");
        (2) assertQueryEquals("\"a
        +b c d\"", a, "\"a
        +b c d\"");
        (3) assertQueryEquals("\"a \\\"b c\\\" d\"", a, "\"a \\\"b c\\\" d\"");

        Please notice that (3) is different from your second suggested assert. You assume that the queryparser unescapes the quotes inside the phrase, but the queryparser does not unescape any escaped characters inside a phrase. You can see that in (2), where the escaped + (plus) character does not become unescaped.

        Michael

        Show
        Michael Busch added a comment - Tomislav, thanks for finding this problem in the queryparser. I attach a patch file that fixes this bug. Now the queryparser does not recognize an escaped quote inside a phrase as the phrase terminator anymore. I also updated the testcase org.apache.lucene.queryParser.TestQueryParser to test escaped quotes within phrases. This testcase fails with the old version of the queryparser and runs successful with the patched version. I added the following three tests to the testEscaped() method: (1) assertQueryEquals("a \\\"b c\\\" d", a, "a \"b c\" d"); (2) assertQueryEquals("\"a +b c d\"", a, "\"a +b c d\""); (3) assertQueryEquals("\"a \\\"b c\\\" d\"", a, "\"a \\\"b c\\\" d\""); Please notice that (3) is different from your second suggested assert. You assume that the queryparser unescapes the quotes inside the phrase, but the queryparser does not unescape any escaped characters inside a phrase. You can see that in (2), where the escaped + (plus) character does not become unescaped. Michael
        Hide
        Hoss Man added a comment -

        > You assume that the queryparser unescapes the quotes inside
        > the phrase, but the queryparser does not unescape any
        > escaped characters inside a phrase. You can see that in (2),
        > where the escaped + (plus) character does not become
        > unescaped.

        ...I'm not sure if this is a fair comaprison. Query parser may not unescape things like the + in this example....
        foo "bar +baz"
        ...but i believe that is because there is no reason for it to do so, if what you intended was to match on the + character without a preceding s\ you could have just written...
        foo "bar +baz"
        ...ie: the plus is already escaped by the nature of being inside the phrase bounding quotes

        this same justification does not hold for the double-quote character however ... there is no way to inlcude a " character inside of a phrase without escaping, so it seems like query parser should be unescaping it automatically for you.

        correct?

        Show
        Hoss Man added a comment - > You assume that the queryparser unescapes the quotes inside > the phrase, but the queryparser does not unescape any > escaped characters inside a phrase. You can see that in (2), > where the escaped + (plus) character does not become > unescaped. ...I'm not sure if this is a fair comaprison. Query parser may not unescape things like the + in this example.... foo "bar +baz" ...but i believe that is because there is no reason for it to do so, if what you intended was to match on the + character without a preceding s\ you could have just written... foo "bar +baz" ...ie: the plus is already escaped by the nature of being inside the phrase bounding quotes this same justification does not hold for the double-quote character however ... there is no way to inlcude a " character inside of a phrase without escaping, so it seems like query parser should be unescaping it automatically for you. correct?
        Hide
        Yonik Seeley added a comment -

        > it seems like query parser should be unescaping it automatically for you

        That's my take. \" should return the user "
        To enable representation of a real backslash followed by a quote, or a backslash at the end of a line, backslashes also need escaping (with another backslash... same as C, Java, etc).

        Now the question is if + should be returned as + or +
        If we were starting from scratch, I'd say remove the backslash (return +) since that's most consistent with escaping mechanisms in other languages, and it does the right thing if a user escapes something they don't need to. So does anyone currently search for backslash in a phrase query???

        Show
        Yonik Seeley added a comment - > it seems like query parser should be unescaping it automatically for you That's my take. \" should return the user " To enable representation of a real backslash followed by a quote, or a backslash at the end of a line, backslashes also need escaping (with another backslash... same as C, Java, etc). Now the question is if + should be returned as + or + If we were starting from scratch, I'd say remove the backslash (return +) since that's most consistent with escaping mechanisms in other languages, and it does the right thing if a user escapes something they don't need to. So does anyone currently search for backslash in a phrase query???
        Hide
        Michael Krkoska added a comment -

        I just experienced this bug in my search. The patch works for me, though I found it rather cumbersome to build lucene including javacc.

        Why is the priority of this bug low? What do we have to do to include tha patch in the release?

        Show
        Michael Krkoska added a comment - I just experienced this bug in my search. The patch works for me, though I found it rather cumbersome to build lucene including javacc. Why is the priority of this bug low? What do we have to do to include tha patch in the release?
        Hide
        Yonik Seeley added a comment -

        > What do we have to do to include tha patch in the release?

        IMO, the minimal amount that needs to be done is
        1) map \" to "
        2) map
        to \

        Show
        Yonik Seeley added a comment - > What do we have to do to include tha patch in the release? IMO, the minimal amount that needs to be done is 1) map \" to " 2) map to \
        Hide
        Michael Busch added a comment -

        Sorry, it took me a while to take care of this patch.

        You are right, my patch only prevents the ParseException to be thrown. However, it is still not possible to search for a phrase query that contains quotes.

        I think the simple solution is to call discardEscapeChar(String) for the phrase string. Then all escaped characters within a phrase become unescaped.

        Now the question is again (Yonik raised it already): Do we want this behavior or do we rather want to maintain backward compatibility for already existing phrase queries that include a backslash? If we prefer compatibility over consistency then one solution would be to add a new method discardEscapeChar(String, char[]) that only unescapes certain characters, in this case " and \.

        If we can make a decision here I'm happy to provide a new patch.

        Show
        Michael Busch added a comment - Sorry, it took me a while to take care of this patch. You are right, my patch only prevents the ParseException to be thrown. However, it is still not possible to search for a phrase query that contains quotes. I think the simple solution is to call discardEscapeChar(String) for the phrase string. Then all escaped characters within a phrase become unescaped. Now the question is again (Yonik raised it already): Do we want this behavior or do we rather want to maintain backward compatibility for already existing phrase queries that include a backslash? If we prefer compatibility over consistency then one solution would be to add a new method discardEscapeChar(String, char[]) that only unescapes certain characters, in this case " and \. If we can make a decision here I'm happy to provide a new patch.
        Hide
        Yonik Seeley added a comment -

        The number of people searching for '\' must be minuscule (to non-existent), so I'm personally OK with doing what makes the most sense.

        In general, I think it's more user friendly and less surprising to allow a backslash escape of something that doesn't need it (while removing the backslash).

        So
        => \
        \" => "
        + => + // not a necessary escape if in a quoted string, but consistent

        Then there are the "control" chars, \r\n\t, etc.

        At some point we might want to allow unicode escapes, and keeping backslash as a general escape char will facilitate this.

        Show
        Yonik Seeley added a comment - The number of people searching for '\' must be minuscule (to non-existent), so I'm personally OK with doing what makes the most sense. In general, I think it's more user friendly and less surprising to allow a backslash escape of something that doesn't need it (while removing the backslash). So => \ \" => " + => + // not a necessary escape if in a quoted string, but consistent Then there are the "control" chars, \r\n\t, etc. At some point we might want to allow unicode escapes, and keeping backslash as a general escape char will facilitate this.
        Hide
        Michael Busch added a comment -

        I actually found another problem in the current unescaping code:
        junit.framework.AssertionFailedError: Query /a\\\+b/ yielded /a
        +b/, expecting /a+b/

        The reason is that the method discardEscapeChar() cannot handle an escaped backslash followed by another escaped character.

        I attach a new patch that includes the following:

        • the QueryParser unescapes any escaped characters inside a phrase and can handle escaped quotes inside phrases
        • for consistency reasons unescaping now is also being done in quoted range queries
        • new implementation of discardEscapeChar() that solves the above mentioned problem (escaped backslash followed by another escaped character)
        • new tests for TestQueryParser that tests the changes

        All unit tests pass.

        Show
        Michael Busch added a comment - I actually found another problem in the current unescaping code: junit.framework.AssertionFailedError: Query /a\\\+b/ yielded /a +b/, expecting /a+b/ The reason is that the method discardEscapeChar() cannot handle an escaped backslash followed by another escaped character. I attach a new patch that includes the following: the QueryParser unescapes any escaped characters inside a phrase and can handle escaped quotes inside phrases for consistency reasons unescaping now is also being done in quoted range queries new implementation of discardEscapeChar() that solves the above mentioned problem (escaped backslash followed by another escaped character) new tests for TestQueryParser that tests the changes All unit tests pass.
        Hide
        Yonik Seeley added a comment -

        Thanks Michael, I haven't had a chance to look at the code yet, but I agree with your description of all the changes. So unless concerns arise, I'll commit this after review.

        Show
        Yonik Seeley added a comment - Thanks Michael, I haven't had a chance to look at the code yet, but I agree with your description of all the changes. So unless concerns arise, I'll commit this after review.
        Hide
        Yonik Seeley added a comment -

        Looks good, and I just committed this. Thanks Michael!

        Show
        Yonik Seeley added a comment - Looks good, and I just committed this. Thanks Michael!

          People

          • Assignee:
            Michael Busch
            Reporter:
            Tomislav Gountchev
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development