Lucene - Core
  1. Lucene - Core
  2. LUCENE-588

Escaped wildcard character in wildcard term not handled correctly

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/queryparser
    • Labels:
      None
    • Environment:

      Windows XP SP2

    • Lucene Fields:
      Patch Available

      Description

      If an escaped wildcard character is specified in a wildcard query, it is treated as a wildcard instead of a literal.
      e.g., t??t is converted by the QueryParser to t??t - the escape character is discarded.

      1. LUCENE-588.patch
        17 kB
        Robert Muir
      2. LUCENE-588.patch
        3 kB
        Robert Muir
      3. LUCENE-588.patch
        10 kB
        Terry Yang

        Issue Links

          Activity

          Hide
          Daniel Naber added a comment - - edited

          The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This mean "t??t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.

          Show
          Daniel Naber added a comment - - edited The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This mean "t??t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.
          Hide
          Daniel Naber added a comment -

          Also, the original report and my comment look confusing because Jira removes the backslash. Imagine a backslash in front of one of the question marks.

          Show
          Daniel Naber added a comment - Also, the original report and my comment look confusing because Jira removes the backslash. Imagine a backslash in front of one of the question marks.
          Hide
          Michael Busch added a comment -

          True... a solution might be to have the queryparser map escaped chars to some
          unused unicode codepoints. Then the WildcardQuery could distinguish escaped
          chars. I'd guess that other classes, like FuzzyQuery might have the same problem?

          The advantage of such a char mapping is that we can keep the String API and
          don't have to add special APIs to the Query objects for the queryparser.

          Show
          Michael Busch added a comment - True... a solution might be to have the queryparser map escaped chars to some unused unicode codepoints. Then the WildcardQuery could distinguish escaped chars. I'd guess that other classes, like FuzzyQuery might have the same problem? The advantage of such a char mapping is that we can keep the String API and don't have to add special APIs to the Query objects for the queryparser.
          Hide
          Sunil Kamath added a comment -

          The documentation does state that escaping of the "?" character by prepending a "\" character is supported.

          Show
          Sunil Kamath added a comment - The documentation does state that escaping of the "?" character by prepending a "\" character is supported.
          Hide
          Hoss Man added a comment -

          you're refering to the documentation for the querysyntax, used by the QueryParser ... which is in fact true: you can \ escape both * and ? as far as the QueryParser goes – but the WildcardQuery class doesn't support (nor does it's documentation advertise) any escape characters.

          In a nut shell: you can escape the characters so QueryParser doesn't consider them wildcards – which will influence whether or not QP builds a WIldcardQuery or a TermQuery, but WildcardQuery doesn't know or care about escape characters.

          Consider these examples, and assume a whitespace analyzer....

          parse("lucene") -> new TermQuery("lucene")
          parse("lu?ene") -> new WIldcardQuery("lu?ene")
          parse("lu\?ene") -> new TermQuery("lu?ene")
          parse("lu\?e?e") -> new WIldcardQuery("lu?e?e")
          parse("lu\?e\?e") -> new TermQuery("lu?e?e")
          

          that's why it works the way it does.

          as for how to improve it: It seems reasonable for WildcardQuery to have a boolean constructor arg indicating whether or not it should respect "standard" backslash escape sequences ... and then QueryParser could have an option to pass the "raw" string (with escapes) to this new constructor.

          Show
          Hoss Man added a comment - you're refering to the documentation for the querysyntax, used by the QueryParser ... which is in fact true: you can \ escape both * and ? as far as the QueryParser goes – but the WildcardQuery class doesn't support (nor does it's documentation advertise) any escape characters. In a nut shell: you can escape the characters so QueryParser doesn't consider them wildcards – which will influence whether or not QP builds a WIldcardQuery or a TermQuery, but WildcardQuery doesn't know or care about escape characters. Consider these examples, and assume a whitespace analyzer.... parse( "lucene" ) -> new TermQuery( "lucene" ) parse( "lu?ene" ) -> new WIldcardQuery( "lu?ene" ) parse( "lu\?ene" ) -> new TermQuery( "lu?ene" ) parse( "lu\?e?e" ) -> new WIldcardQuery( "lu?e?e" ) parse( "lu\?e\?e" ) -> new TermQuery( "lu?e?e" ) that's why it works the way it does. as for how to improve it: It seems reasonable for WildcardQuery to have a boolean constructor arg indicating whether or not it should respect "standard" backslash escape sequences ... and then QueryParser could have an option to pass the "raw" string (with escapes) to this new constructor.
          Hide
          Terry Yang added a comment -

          I wrote my first patch to this issue. if QueryParser knows the query is wildcard, it will directly pass the original query string to WildcardQuery which knows exactly which character is wildcard or not. i copied part of discardEscapeChar method from QueryParser because discardEscapeChar will throw ParseException which will causes WildcardQuery changed much. i am looking for a help/idea about what is the better way to process this exception?

          Show
          Terry Yang added a comment - I wrote my first patch to this issue. if QueryParser knows the query is wildcard, it will directly pass the original query string to WildcardQuery which knows exactly which character is wildcard or not. i copied part of discardEscapeChar method from QueryParser because discardEscapeChar will throw ParseException which will causes WildcardQuery changed much. i am looking for a help/idea about what is the better way to process this exception?
          Hide
          Michael Busch added a comment -

          I think we should add a new constructor to WildcardQuery like this:

            public WildcardQuery(Term term) {
              this(term, WildcardTermEnum.WILDCARD_CHAR, WildcardTermEnum.WILDCARD_STRING);
            }
            
            public WildcardQuery(Term term, char wildcardChar, char wildcardString) {
              super(term);
              this.wildcardChar = wildcardChar;
              this.wildcardString = wildcardString;
              
              this.termContainsWildcard = (term.text().indexOf(wildcardChar) != -1) 
                                          || (term.text().indexOf(wildcardString) != -1);
            }
          

          Then the WildcardQuery doesn't need to know anything about escaping and the QueryParser
          can just map wildcard characters that weren't escaped to some other unused chars and can
          unescape * and ? chars.

          The only disadvantage of this design would be that the WildcardQuery.toString() would not
          produce a String anymore that the QueryParser could parse. However, I think the requirement
          of Query.toString() to produce a parseable String is not very good anyways, because it adds
          a dependency between the Query classes and the QueryParser. I'd prefer to add a method
          like:

          String toQueryString(Query);
          

          to the QueryParser. Then Query.toString() wouldn't be tied anymore to a specific QueryParser
          implementation and syntax. Thoughts?

          Terry are you still around? Would you like to work on a new patch here?

          Show
          Michael Busch added a comment - I think we should add a new constructor to WildcardQuery like this: public WildcardQuery(Term term) { this (term, WildcardTermEnum.WILDCARD_CHAR, WildcardTermEnum.WILDCARD_STRING); } public WildcardQuery(Term term, char wildcardChar, char wildcardString) { super (term); this .wildcardChar = wildcardChar; this .wildcardString = wildcardString; this .termContainsWildcard = (term.text().indexOf(wildcardChar) != -1) || (term.text().indexOf(wildcardString) != -1); } Then the WildcardQuery doesn't need to know anything about escaping and the QueryParser can just map wildcard characters that weren't escaped to some other unused chars and can unescape * and ? chars. The only disadvantage of this design would be that the WildcardQuery.toString() would not produce a String anymore that the QueryParser could parse. However, I think the requirement of Query.toString() to produce a parseable String is not very good anyways, because it adds a dependency between the Query classes and the QueryParser. I'd prefer to add a method like: String toQueryString(Query); to the QueryParser. Then Query.toString() wouldn't be tied anymore to a specific QueryParser implementation and syntax. Thoughts? Terry are you still around? Would you like to work on a new patch here?
          Hide
          Luis Alves added a comment -

          The new QP has built in support for this.

          If someone has time to test it or write some testcases I can help fix any problems that might appear.
          The new query parser is in contrib/queryparser. See for more details LUCENE-1567.

          Show
          Luis Alves added a comment - The new QP has built in support for this. If someone has time to test it or write some testcases I can help fix any problems that might appear. The new query parser is in contrib/queryparser. See for more details LUCENE-1567 .
          Hide
          Michael Busch added a comment -

          We'll try to fix this in combination with LUCENE-1823.

          Show
          Michael Busch added a comment - We'll try to fix this in combination with LUCENE-1823 .
          Hide
          Robert Muir added a comment -

          In the flex branch, WildcardQuery (like RegexpQuery) is just a parser for AutomatonQuery.

          its pretty easy to add support for things like this if we still want it. Attached is a patch.

          Show
          Robert Muir added a comment - In the flex branch, WildcardQuery (like RegexpQuery) is just a parser for AutomatonQuery. its pretty easy to add support for things like this if we still want it. Attached is a patch.
          Hide
          Robert Muir added a comment -

          This one came up recently on the mailing list: see http://www.lucidimagination.com/search/document/637848ac365edda6/problem_escaping_question_marks

          Again its technically easy to fix in 4.0, but we just need to decide if WildcardQuery/QueryParser should just
          support escaped terms, instead of backwards compatibility stuff/Version/etc?

          Even if we make it "fully backwards compatible" some things will need to change,
          for instance Solr's "reverse wildcard" support should instead reverse the actual Automaton
          instead of the query string, this way it won't screw up the escapes... but this is
          really cleaner anyway.

          Show
          Robert Muir added a comment - This one came up recently on the mailing list: see http://www.lucidimagination.com/search/document/637848ac365edda6/problem_escaping_question_marks Again its technically easy to fix in 4.0, but we just need to decide if WildcardQuery/QueryParser should just support escaped terms, instead of backwards compatibility stuff/Version/etc? Even if we make it "fully backwards compatible" some things will need to change, for instance Solr's "reverse wildcard" support should instead reverse the actual Automaton instead of the query string, this way it won't screw up the escapes... but this is really cleaner anyway.
          Hide
          Yonik Seeley added a comment -

          Again its technically easy to fix in 4.0, but we just need to decide if WildcardQuery/QueryParser should just support escaped terms, instead of backwards compatibility stuff/Version/etc?

          People wouldn't be escaping wildcards and expecting them to still be treated as wildcards, so IMO back compat has no value here.

          Show
          Yonik Seeley added a comment - Again its technically easy to fix in 4.0, but we just need to decide if WildcardQuery/QueryParser should just support escaped terms, instead of backwards compatibility stuff/Version/etc? People wouldn't be escaping wildcards and expecting them to still be treated as wildcards, so IMO back compat has no value here.
          Hide
          Robert Muir added a comment -

          People wouldn't be escaping wildcards and expecting them to still be treated as wildcards, so IMO back compat has no value here.

          Ok, i would prefer to not create a mess and just change the behavior: add support for this to WildcardQuery and
          the QP (it is a trivial change there), and fix the reversing in Solr.

          I'll make a patch.

          Show
          Robert Muir added a comment - People wouldn't be escaping wildcards and expecting them to still be treated as wildcards, so IMO back compat has no value here. Ok, i would prefer to not create a mess and just change the behavior: add support for this to WildcardQuery and the QP (it is a trivial change there), and fix the reversing in Solr. I'll make a patch.
          Hide
          Robert Muir added a comment -

          here's a patch: i also fixed SolrQueryParser not to reverse the query string, but instead the actual automaton.
          This way, it has no problems with what is escaped and what isn't.

          additionally its ReverseWildcard tests needed some revamping... they were using Query.toString to check
          if a term was reversed or not... but with this approach its transparent and we don't work on Strings.

          But i couldn't figure out the contrib/queryparser... i left a test with @Ignore for now that fails.

          Show
          Robert Muir added a comment - here's a patch: i also fixed SolrQueryParser not to reverse the query string, but instead the actual automaton. This way, it has no problems with what is escaped and what isn't. additionally its ReverseWildcard tests needed some revamping... they were using Query.toString to check if a term was reversed or not... but with this approach its transparent and we don't work on Strings. But i couldn't figure out the contrib/queryparser... i left a test with @Ignore for now that fails.
          Hide
          Robert Muir added a comment -

          Committed revision 1031765

          Show
          Robert Muir added a comment - Committed revision 1031765

            People

            • Assignee:
              Robert Muir
              Reporter:
              Sunil Kamath
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development