Lucene - Core / LUCENE-949

AnalyzingQueryParser can't work with leading wildcards.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 4.4, 5.0
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The getWildcardQuery method in AnalyzingQueryParser.java needs the following changes to accept leading wildcards:

      protected Query getWildcardQuery(String field, String termStr) throws ParseException {
        // Requires: java.io.IOException, java.io.StringReader, java.util.ArrayList, java.util.List,
        // org.apache.lucene.analysis.TokenStream, org.apache.lucene.search.MatchAllDocsQuery.
        String useTermStr = termStr;
        String leadingWildcard = null;
        if ("*".equals(field)) {
          if ("*".equals(useTermStr)) return new MatchAllDocsQuery();
        }

        boolean hasLeadingWildcard = useTermStr.startsWith("*") || useTermStr.startsWith("?");

        if (!getAllowLeadingWildcard() && hasLeadingWildcard)
          throw new ParseException("'*' or '?' not allowed as first character in WildcardQuery");

        if (getLowercaseExpandedTerms()) {
          useTermStr = useTermStr.toLowerCase();
        }

        if (hasLeadingWildcard) {
          // strip the leading wildcard; it is re-appended after analysis
          leadingWildcard = useTermStr.substring(0, 1);
          useTermStr = useTermStr.substring(1);
        }

        List tlist = new ArrayList(); // token chunks
        List wlist = new ArrayList(); // wildcard chunks
        /*
         * somewhat of a hack: find/store wildcard chars in order to put them back
         * after analyzing
         */
        boolean isWithinToken = (!useTermStr.startsWith("?") && !useTermStr.startsWith("*"));
        isWithinToken = true;
        StringBuffer tmpBuffer = new StringBuffer();
        char[] chars = useTermStr.toCharArray();
        for (int i = 0; i < useTermStr.length(); i++) {
          if (chars[i] == '?' || chars[i] == '*') {
            if (isWithinToken) {
              tlist.add(tmpBuffer.toString());
              tmpBuffer.setLength(0);
            }
            isWithinToken = false;
          } else {
            if (!isWithinToken) {
              wlist.add(tmpBuffer.toString());
              tmpBuffer.setLength(0);
            }
            isWithinToken = true;
          }
          tmpBuffer.append(chars[i]);
        }
        if (isWithinToken) {
          tlist.add(tmpBuffer.toString());
        } else {
          wlist.add(tmpBuffer.toString());
        }

        // get Analyzer from superclass and tokenize the term
        TokenStream source = getAnalyzer().tokenStream(field, new StringReader(useTermStr));
        org.apache.lucene.analysis.Token t;

        int countTokens = 0;
        while (true) {
          try {
            t = source.next();
          } catch (IOException e) {
            t = null;
          }
          if (t == null) {
            break;
          }
          if (!"".equals(t.termText())) {
            try {
              tlist.set(countTokens++, t.termText());
            } catch (IndexOutOfBoundsException ioobe) {
              countTokens = -1;
            }
          }
        }
        try {
          source.close();
        } catch (IOException e) {
          // ignore
        }

        if (countTokens != tlist.size()) {
          /*
           * this means that the analyzer used either added or consumed
           * (common for a stemmer) tokens, and we can't build a WildcardQuery
           */
          throw new ParseException("Cannot build WildcardQuery with analyzer "
              + getAnalyzer().getClass() + " - tokens added or lost");
        }

        if (tlist.size() == 0) {
          return null;
        } else if (tlist.size() == 1) {
          if (wlist.size() == 1) {
            /*
             * if wlist contains one wildcard, it must be at the end, because:
             * 1) wildcards at the 1st position of a term were stripped off above,
             * 2) if the wildcard were not at the end, there would be two or more tokens
             */
            StringBuffer sb = new StringBuffer();
            if (hasLeadingWildcard) {
              sb.append(leadingWildcard); // put the leading wildcard back
            }
            sb.append((String) tlist.get(0));
            sb.append(wlist.get(0).toString());
            return super.getWildcardQuery(field, sb.toString());
          } else if (wlist.size() == 0 && hasLeadingWildcard) {
            /*
             * if wlist contains no wildcard, the only wildcard was the leading one
             */
            StringBuffer sb = new StringBuffer();
            sb.append(leadingWildcard); // put the leading wildcard back
            sb.append((String) tlist.get(0));
            return super.getWildcardQuery(field, sb.toString());
          } else {
            /*
             * we should never get here! if so, this method was called with
             * a termStr containing no wildcard ...
             */
            throw new IllegalArgumentException("getWildcardQuery called without wildcard");
          }
        } else {
          /*
           * the term was tokenized, let's rebuild to one token with the wildcards
           * put back in position
           */
          StringBuffer sb = new StringBuffer();
          if (hasLeadingWildcard) {
            sb.append(leadingWildcard); // put the leading wildcard back
          }
          for (int i = 0; i < tlist.size(); i++) {
            sb.append((String) tlist.get(i));
            if (wlist != null && wlist.size() > i) {
              sb.append((String) wlist.get(i));
            }
          }
          return super.getWildcardQuery(field, sb.toString());
        }
      }
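
      For context, here is a minimal usage sketch of the desired behavior. It is not part of the patch: the field name "contents" and the analyzer choice are illustrative, and the constructor shown is the Lucene 4.x-style AnalyzingQueryParser(Version, String, Analyzer), which may differ in other versions.

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.queryparser.analyzing.AnalyzingQueryParser;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.util.Version;

      public class LeadingWildcardDemo {
        public static void main(String[] args) throws Exception {
          Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
          AnalyzingQueryParser parser =
              new AnalyzingQueryParser(Version.LUCENE_44, "contents", analyzer);
          // Without this flag, "*ucene" is rejected with a ParseException.
          parser.setAllowLeadingWildcard(true);
          Query q = parser.parse("*ucene");
          System.out.println(q); // expected: a wildcard query on "contents" for *ucene
        }
      }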

      1. LUCENE-949.patch
        24 kB
        Steve Rowe
      2. LUCENE-949.patch
        33 kB
        Tim Allison
      3. LUCENE-949.patch
        10 kB
        Tim Allison

        Activity

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues

        Steve Rowe added a comment -

        Although this is labelled as a Major Bug, it feels more like a new feature to me, so I'm not motivated to backport this to 4.3.1.

        Steve Rowe added a comment -

        Committed to trunk and branch_4x.

        Thanks Tim!

        Tim Allison added a comment -

        Steve, no problem on the delay. Thank you for your help! Changes sound great. Thank you.

        Steve Rowe added a comment -

        One other change I forgot to mention, Tim: I substituted MockAnalyzer where you used StandardAnalyzer in the test code - this allowed me to remove the analyzers-common dependency you introduced (and also the memory dependency, which didn't seem to be used for anything in your patch).
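
        For illustration only, a hedged sketch of what such a test might look like. This is not the committed test; it assumes the test framework's MockAnalyzer(Random) constructor, LuceneTestCase.random(), and TEST_VERSION_CURRENT, plus the hypothetical field name "field".

        import org.apache.lucene.analysis.MockAnalyzer;
        import org.apache.lucene.queryparser.analyzing.AnalyzingQueryParser;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.util.LuceneTestCase;

        public class TestLeadingWildcardSketch extends LuceneTestCase {
          public void testLeadingWildcard() throws Exception {
            // MockAnalyzer keeps the test free of the analyzers-common dependency.
            AnalyzingQueryParser qp =
                new AnalyzingQueryParser(TEST_VERSION_CURRENT, "field", new MockAnalyzer(random()));
            qp.setAllowLeadingWildcard(true);
            Query q = qp.parse("*ucene");
            assertTrue(q.toString().endsWith("*ucene"));
          }
        }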

        Steve Rowe added a comment -

        Hi Tim Allison,

        Sorry it took so long, I've attached a patch based on your patch with some fixes:

        • Removed tabs.
        • Restored license header and class javadoc to AnalyzingQueryParser.java (your patch removed them for some reason?).
        • Converted all code indentation to 2 spaces per level (you had a lot of 3 space per level indentation).
        • Converted the wildcardPattern to allow anything to be escaped, not just backslashes and wildcard chars '?' and '*'. Also removed the optional backslashes from group 2 (the actual wildcards) - when iterating over wildcardPattern matches, your patch would throw away any number of real wildcards following an escaped wildcard. I added a test for this. (A standalone sketch of this kind of pattern follows this comment.)
        • When multiple output tokens are produced (and there should only be one), now reporting all of them in the exception message instead of just the first two.
        • Removed all references to "chunklet" in favor of "output token" - this non-standard terminology made the code harder to read.
        • Changed descriptions of multiple output tokens to not necessarily be as the result of splitting (e.g. synonyms).
        • In analyzeSingleChunk(), moved exception throwing to the source of problems.

        I also added a CHANGES.txt entry.

        Tim, let me know if you think my changes are okay - if so, I think it's ready to commit.
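
        For readers following along, a standalone sketch of the kind of pattern described in the bullet above (hypothetical class and names, not the committed code): group 1 matches any escaped character, group 2 a run of unescaped wildcards, so real wildcards after an escaped one are not thrown away.

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class WildcardSplitDemo {
          private static final Pattern WILDCARD_PATTERN = Pattern.compile("(\\\\.)|([?*]+)");

          public static void main(String[] args) {
            String termStr = "H\\*us*r"; // an escaped '*' followed later by a real wildcard
            Matcher m = WILDCARD_PATTERN.matcher(termStr);
            int last = 0;
            while (m.find()) {
              if (m.group(2) != null) { // only group 2 is an actual wildcard run
                System.out.println("chunk to analyze: " + termStr.substring(last, m.start(2)));
                System.out.println("wildcards:        " + m.group(2));
                last = m.end(2);
              }
            }
            if (last < termStr.length()) {
              System.out.println("trailing chunk:   " + termStr.substring(last));
            }
          }
        }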

        Tim Allison added a comment -

        Refactored a bit and added a few more tests.

        Steve Rowe added a comment -

        Tim, thanks, I'll take a look at your new patch later today.

        FYI, you shouldn't remove old patches - when you upload a file with the same name, the older versions still appear, but their names appear in grey, and you can see the date/time each was uploaded. See e.g. SOLR-3251, where I've uploaded the same-named patch multiple times.

        Tim Allison added a comment -

        Thank you very much for the feedback. I refactored a bit and added escaped wildcard handling. Let me know how this looks.

        Steve Rowe added a comment -

        One more minor thing: the javadoc on getWildcardQuery() needs fixing - it probably should previously have read "but not for <code>user*</code>" to illustrate not being called for prefix queries, but with your patch it's just plain wrong:

        * Example: will be called for <code>H?user</code> or for <code>H*user</code> 
        * but not for <code>*user</code>.
        
        Steve Rowe added a comment -

        Hi Timothy, I agree with Robert, the patch reduces line count while improving clarity and adding functionality - sweet!

        A few nitpicks (+1 otherwise):

        • You don't handle escaped metachars ? and * in the query, though the original doesn't either, so this shouldn't stop the patch from being committed.
        • The (?s) has no effect in "(?s)([^\\?\\*]+)", since it only affects the expansion of the dot metachar, which isn't present (see http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#DOTALL): an inverted charset will always match newlines unless the class being inverted explicitly includes them (see the short demonstration after this list). Of course, this is all moot, since the query parser will already have split on whitespace before sending termStr to getWildcardQuery().
        • You can drop the following line in normalizeMustBeSingleTerm(): nonWildcardMatcher.reset(termStr);, since nonWildcardMatcher is never reused.
        • The presence of term breaking chars is not the only condition under which analysis of a single term can produce multiple terms (e.g. synonyms, multiple readings), so the following exception message should be generalized: "There is a term breaking character between %s and %s".
        • In your new test, the following assertion message seems to have too many words? "Testing wildcard with wildcard with initial wildcard not allowed"
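
        A quick standalone demonstration of the (?s) point above (hypothetical class name; the pattern is simplified from the one in the patch):

        import java.util.regex.Pattern;

        public class InvertedCharsetDemo {
          public static void main(String[] args) {
            // An inverted character class matches '\n' whether or not DOTALL is in effect;
            // (?s) only changes what the '.' metacharacter matches.
            String input = "foo\nbar";
            System.out.println(Pattern.matches("[^?*]+", input));     // true: newline is matched
            System.out.println(Pattern.matches("(?s)[^?*]+", input)); // true: (?s) changes nothing
            System.out.println(Pattern.matches(".+", input));         // false: '.' excludes '\n'
            System.out.println(Pattern.matches("(?s).+", input));     // true: DOTALL in effect
          }
        }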
        Robert Muir added a comment -

        Thank you Timothy. the patch looks very good to me, thanks also for adding tests!

        A few questions:

        • The regex simplification looks good to me, but I'm not a regex expert. Maybe someone that is better with regex like Steve Rowe can have a look. If nobody objects after a few days I'm inclined to move forward though.
        • What about the case where someone has escaped wildcards? I'm not sure what's even happening today in this case... perhaps it already has surprising behavior and should really be a separate bug, or maybe it's working and I just don't see it. I doubt it's tested though... but it seems the regex would need to accommodate that?
        • The String.format() invocations should probably pass getLocale() from the superclass as the first argument, rather than depending on the default locale. Since they are locale sensitive, I think it's best to use the one that someone configured on the queryparser (e.g. via setLocale); see the sketch after this list.
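
        A small illustration of that last point (hypothetical class; the parser's getLocale() is stood in for by a hard-coded locale):

        import java.util.Locale;

        public class LocaleFormatDemo {
          // Stand-in for QueryParserBase.getLocale(): the locale configured via setLocale().
          private static Locale getLocale() {
            return Locale.GERMANY;
          }

          public static void main(String[] args) {
            double value = 1234.5;
            // Relying on the JVM default locale vs. using the parser's configured locale.
            System.out.println(String.format("value=%.1f", value));              // depends on the JVM default
            System.out.println(String.format(getLocale(), "value=%.1f", value)); // "value=1234,5"
          }
        }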
        Tim Allison added a comment -

        First patch. Let me know if this actually works.

        Robert Muir added a comment -

        Hello Timothy, can you turn these changes into a patch?

        See http://wiki.apache.org/lucene-java/HowToContribute#Creating_a_patch

        Thanks!

        Tim Allison added a comment (edited) -

        This allows for leading wildcards in the AnalyzingQueryParser if setAllowLeadingWildcard is set to true.

        David Smiley added a comment -

        Reopening at the request of Tim Allison who intends to attach a patch. Regular users can't re-open JIRA issues, apparently.

        David Herrera added a comment -

        Sorry, this is my first post here and I didn't know that special characters like

        *

        make text bold. Because of that mistake, my previous comment has one bold line that I didn't want to be bold. This is what I meant to write:

        '*ucene' is converted into a WildcardQuery like this: 'ucene*'
        David Herrera added a comment -

        Hi.

        Is there some way to re-open and fix this behavior/bug in AnalyzingQueryParser?
        I have discovered this bug, which was opened (and then closed 4 years later). We are working with Lucene 3.2, and we use AnalyzingQueryParser because we need to run every query through the analyzer, even wildcard queries.

        This works great with most queries; for the ones that don't work (for example, cases where the analyzer adds/removes words and the query has wildcards) we use QueryParser, although it doesn't analyze wildcard queries.

        In our application there are some cases where we need to allow leading-wildcard queries, and AnalyzingQueryParser fails even though I set the 'AllowLeadingWildcard' flag to true. A string like '*ucene' is converted into a WildcardQuery like 'ucene*'. This ending wildcard is another strange behavior.

        I know QueryParser doesn't have this leading-wildcard bug, but I need to analyze the query: I am Spanish and we have special characters (ñ, ü, vowels with accents), and since we analyze the indexed data, we also need to analyze the query to search it.

        Thanks in advance. Regards!

        Shai Erera added a comment -

        Way outdated code, and it's not really clear what needs to be done. If this is still a problem, I suggest reopening and posting a proper patch.


          People

          • Assignee:
            Steve Rowe
            Reporter:
            Stefan Klein
          • Votes:
            0
            Watchers:
            6
