Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: Lucene.Net 2.9.2, Lucene.Net 2.9.4
    • Fix Version/s: None
    • Labels:
      None
    • Environment:

      Windows 7, Visual Studio 2010, .net 4.0

      Description

      The Lucene 'QueryParser' doesn't analyze wildcard queries. The function 'GetPrefixQuery' (QueryParser.cs) builds the query from the raw term string without running it through the analyzer.

      I have performed some queries to show the problem. The analyzer is the 'Contrib.Analyzers.DE.GermanAnalyzer'

      ---------- indexed word: 'Häuser'; in the index stemmed as: 'hau' ----------

      query: Hau*; hit: yes
      query: Hause*; hit: no; This should be a hit.....

      ---------- indexed word: 'Angebote'; in the index stemmed as: 'angebo' ----------

      query: Angebo*; hit: yes
      query: Angebot*; hit: no; This should be a hit.....
      query: Angebote*; hit: no; This should be a hit.....

      ---------- indexed word: 'Björn'; in the index stemmed as: 'bjor' ----------

      query: Bjor*; hit: yes
      query: Björ*; hit: no; This should be a hit.....
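
      The hits and misses above follow from how a prefix query matches: the (lower-cased, but otherwise unanalyzed) prefix must be a literal prefix of the stemmed term stored in the index. The following is a minimal sketch of that matching rule in plain C#, not Lucene.Net code; `PrefixDemo` and `PrefixHit` are hypothetical names, and the stemmed terms are the ones quoted in this report.

      ```csharp
      using System;

      // Toy model of prefix matching against already-stemmed index terms.
      // Hypothetical helper for illustration only; it does not call Lucene.Net.
      public static class PrefixDemo
      {
          // indexedTerm: the term as stored in the index (e.g. "hau" for "Häuser").
          // query: the user's prefix query, including the trailing '*'.
          public static bool PrefixHit(string indexedTerm, string query)
          {
              // Drop the trailing wildcard and lower-case the rest,
              // mirroring the lowercaseExpandedTerms step; no stemming is applied.
              string prefix = query.TrimEnd('*').ToLowerInvariant();

              // Ordinal comparison: 'ö' and 'o' are different characters.
              return indexedTerm.StartsWith(prefix, StringComparison.Ordinal);
          }
      }
      ```

      With this model, `PrefixDemo.PrefixHit("angebo", "Angebo*")` is true, while `PrefixDemo.PrefixHit("angebo", "Angebot*")` and `PrefixDemo.PrefixHit("bjor", "Björ*")` are false, which matches the results listed above.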

        Activity

        Björn added a comment - - edited

        My test project requires .net 4.0

        Digy added a comment - http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
        Björn added a comment -

        Quote: "The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*""

        This statement is senseless. If I use an analyzer for indexing the word "dogs" it is stemmed to "dog". So a search for "dogs*" will never be a hit......

        Björn added a comment -

        A supplement to my last comment: The 'StandardAnalyzer' doesn't stem 'dogs' to 'dog'. 'dogs' is indexed as 'dogs'. So this analyzer works correctly. The 'GermanAnalyzer' doesn't work correctly.

        Digy added a comment -

        Björn,

        Lucene (Java) decided not to use the analyzer with wildcard queries, and Lucene.Net follows its path. If you want your "GermanAnalyzer" to work, then download the source of Lucene.Net and make your own modification. Think of other Lucene (.Net + Java) users and don't expect the default behaviour to change just because you are not comfortable with it.

        DIGY

        Björn added a comment -

        Of course I have solved the problem for myself. But this isn't good behaviour.... Especially because the FAQ entry is wrong. Not finding a word is better behaviour than maybe finding too many terms? Ok....

        How can I explain to a customer that he can't search for a word that he can "see" in the index?

        Digy added a comment -

        "Of course I have solved the problem for myself."

        Then no problem

        DIGY

        Prescott Nasser added a comment -

        We could potentially add a developer flag in the future if they want to flip the behavior. The behavior Bjorn is asking for has merit (as well as the current implementation).

        Fix from Bjorn:

        --- C:\Users\xyt\AppData\Local\Temp\AnkhSVN\3179\QueryParser.27919.cs 20.04.2012 08:55:36
        +++ C:_VS.NET\se\Lucene-2_9_4\core\QueryParser\QueryParser.cs 17.04.2012 12:25:02

        /// <summary> Factory method for generating a query (similar to
        /// <see cref="GetWildcardQuery" />). Called when parser parses an input term
        /// token that uses prefix notation; that is, contains a single '*' wildcard
        /// character as its last character. Since this is a special case
        /// of generic wildcard term, and such a query can be optimized easily,
        /// this usually results in a different query object.
        /// <p/>
        /// Depending on settings, a prefix term may be lower-cased
        /// automatically. It will not go through the default Analyzer,
        /// however, since normal Analyzers are unlikely to work properly
        /// with wildcard templates.
        /// <p/>
        /// Can be overridden by extending classes, to provide custom handling for
        /// wild card queries, which may be necessary due to missing analyzer calls.
        ///
        /// </summary>
        /// <param name="field">Name of the field query will use.
        /// </param>
        /// <param name="termStr">Term token to use for building term for the query
        /// (<b>without</b> trailing '*' character!)
        ///
        /// </param>
        /// <returns> Resulting <see cref="Query" /> built for the term
        /// </returns>
        /// <exception cref="ParseException">throw in overridden method to disallow
        /// </exception>
         public /*protected internal*/ virtual Query GetPrefixQuery(System.String field, System.String termStr)
         {
             if (!allowLeadingWildcard && termStr.StartsWith("*"))
                 throw new ParseException("'*' not allowed as first character in PrefixQuery");
             if (lowercaseExpandedTerms)
             {
                 termStr = termStr.ToLower();
             }
        -    Term t = new Term(field, termStr);
        +    Term t = null;
        +    TermQuery q = null;
        +    try
        +    {
        +        q = GetFieldQuery(field, termStr) as TermQuery;
        +    }
        +    catch(Exception ex)
        +    {
        +    }
        +    if (q != null)
        +    {
        +        t = new Term(field, q.GetTerm().text);
        +    }
        +    else
        +    {
        +        t = new Term(field, termStr);
        +    }
             return NewPrefixQuery(t);
         }

        Itamar Syn-Hershko added a comment -

        @Prescott No, it doesn't.

        Many analyzers will analyze incorrectly when not given a complete word, and I agree the example given may be senseless in the context of full-text search (FTS).

        This is mainly why Lucene made this design decision, and I don't think Lucene.NET needs to deviate from it (not even by introducing a developer flag).

        Christopher Currens added a comment -

        I think this affects other languages more than it does English; at the very least it affects the German analyzer, since it does umlaut conversions. While I don't think a design change to Lucene.NET is necessary, it might be beneficial to expose the logic that converts umlauts in terms, so that developers can manually sanitize the terms in the query themselves (even by overriding methods in QueryParser) and get the same behavior. I think that might be a reasonable compromise, and it only affects the GermanAnalyzer in Contrib.
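
        As a rough illustration of the kind of term sanitizing suggested here, the sketch below folds German umlauts in a prefix term before it is used. It is plain C# with no Lucene.Net dependency; `GermanPrefixFolding` and `FoldUmlauts` are hypothetical names, and the mapping (ä to a, ö to o, ü to u) is an assumption based on the index terms quoted in this issue, not the GermanAnalyzer's actual conversion logic.

        ```csharp
        using System;
        using System.Text;

        // Hypothetical helper: folds umlauts in a prefix term so that it can
        // match index terms whose umlauts were converted at indexing time.
        public static class GermanPrefixFolding
        {
            // termStr: the prefix term without the trailing '*'.
            public static string FoldUmlauts(string termStr)
            {
                var sb = new StringBuilder(termStr.Length);
                foreach (char c in termStr.ToLowerInvariant())
                {
                    switch (c)
                    {
                        case 'ä': sb.Append('a'); break;
                        case 'ö': sb.Append('o'); break;
                        case 'ü': sb.Append('u'); break;
                        default: sb.Append(c); break; // other characters pass through unchanged
                    }
                }
                return sb.ToString();
            }
        }
        ```

        A developer could call such a helper on `termStr` from an overridden `GetPrefixQuery` before delegating to the base implementation. It only addresses the umlaut case: `FoldUmlauts("Björ")` yields "bjor", which is a prefix of the indexed term 'bjor', but it does no stemming, so a query such as "Angebot*" would still miss 'angebo'.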


          People

          • Assignee:
            Unassigned
            Reporter:
            Björn
          • Votes:
            0
            Watchers:
            1
