SOLR-219

Determine if prefix, wildcard, fuzzy queries should be lowercased

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      Solr should be able to "do the right thing" when doing prefix/wildcard/fuzzy queries on fields with respect to lowercasing or not.

      1. lowercase_prefix.patch
        6 kB
        Yonik Seeley
      2. wildcardlowercase.patch
        7 kB
        Claus Brod

          Activity

          Yonik Seeley added a comment -

          Here's a demo patch that optionally lowercases prefix query by testing the analyzer for the fieldType. No tests, no wildcard/fuzzy implementation yet. This is for evaluation of approach.

          I delegated complete query construction to the fieldType (as opposed to just lowercasing the term) because I'm thinking ahead to more efficiently supporting other types of wildcard queries in the future based on the field type. As an example, *foo* could be turned into a simple term query if the field contained the right ngram filter.
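The probing idea in this comment can be sketched roughly as follows. This is a minimal illustration and not the code from lowercase_prefix.patch: the analyzer is stood in for by a plain `UnaryOperator<String>`, and the class name, method names, and probe string are all hypothetical.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

public class PrefixCaseProbe {
    // Hypothetical probe string; mixed case so that lowercasing is observable.
    static final String PROBE = "AbCdE";

    /** Returns true if the stand-in analyzer turns the probe into its lowercase form. */
    public static boolean looksCaseInsensitive(UnaryOperator<String> analyzer) {
        String out = analyzer.apply(PROBE);
        return out != null && out.equals(PROBE.toLowerCase(Locale.ROOT));
    }

    /** Lowercases the prefix only when the field's analyzer appears to lowercase. */
    public static String preparePrefix(String prefix, UnaryOperator<String> analyzer) {
        return looksCaseInsensitive(analyzer) ? prefix.toLowerCase(Locale.ROOT) : prefix;
    }

    public static void main(String[] args) {
        UnaryOperator<String> lowercasing = s -> s.toLowerCase(Locale.ROOT);
        UnaryOperator<String> caseSensitive = s -> s;
        System.out.println(preparePrefix("SolR", lowercasing));
        System.out.println(preparePrefix("SolR", caseSensitive));
    }
}
```

A single probe string only shows how the analyzer treats that one input, so a check like this is a heuristic and can misjudge analyzers that change case only for some tokens.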

          Hoss Man added a comment -

          I'm not opposed to an approach like this ... but it seems like a slippery slope to go down, with hard-coded test strings, and assumptions about how analyzers will behave in all cases based on one test case.

          Perhaps a simpler approach that requires less guesswork would be adding the ability for Fields and FieldTypes to contain arbitrary key/val pair options that can be accessed as a map, and document that SolrQueryParser looks at some of these to make query parsing decisions?

          <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            </analyzer>
            <option name="lowerCaseForPrefix">false</option>
          </fieldType>
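On the query parser side, consulting such an option map might look roughly like the sketch below. Only the option name `lowerCaseForPrefix` comes from the example above; the class, the method names, and the default of `false` when the option is absent are assumptions made for illustration, not an existing Solr API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class FieldTypeOptions {
    /** Reads a boolean option, falling back to the given default when it is absent. */
    public static boolean getBoolean(Map<String, String> options, String name, boolean dflt) {
        String value = options.get(name);
        return value == null ? dflt : Boolean.parseBoolean(value.trim());
    }

    /** Lowercases the prefix only if the field type opts in via its option map. */
    public static String preparePrefix(String prefix, Map<String, String> options) {
        // Assumption: a missing option means "leave the prefix untouched".
        boolean lower = getBoolean(options, "lowerCaseForPrefix", false);
        return lower ? prefix.toLowerCase(Locale.ROOT) : prefix;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("lowerCaseForPrefix", "true");
        System.out.println(preparePrefix("SolR", options));
        System.out.println(preparePrefix("SolR", new HashMap<>()));
    }
}
```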

          Michael Pelz-Sherman added a comment -

          IMHO, if this is implemented, it should be optional (via schema configuration) and NOT the default behavior. I personally much prefer having direct control over query case sensitivity on a per-field basis, thanks!

          Yonik Seeley added a comment -

          > I personally much prefer having direct control over query case sensitivity on a per-field basis, thanks!

          Sure, if Solr is going to get it incorrect.

          I'm inclined to wait until someone comes up with an analyzer where we can't figure out if it's case insensitive or not before adding more configuration complexity... for the sake of both Solr developers and users.

          David Smiley added a comment -

          I'm totally with you, Yonik. I was surprised today to see that my prefix queries (part of an auto-complete feature I'm adding to my app) were turning up nothing because I was using upper case characters. It's silly because Solr is otherwise smart enough in other basic queries yet not in this case.

          Shalin Shekhar Mangar added a comment -

          Marking for 1.5

          Claus Brod added a comment - - edited

          We also needed lowercase query support. We extended Yonik's patch to wildcard queries. Seems to work well in our environment. I added the patch as wildcardlowercase.patch; it's probably more useful for illustration purposes than as an industrial-strength final solution, but maybe it's useful for somebody.

          Needless to say, we'd love to see official support for case-insensitive searches in 1.5.

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Patrick Allaert added a comment -

          Any plan to implement this?

          Peter Sturge added a comment -

          Case Insensitive Search for Wildcard Queries

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Gunnar Wagenknecht added a comment -

          Any progress on the issue? We are also hit by this issue. Ideally, it would be nice if I could configure the analyzers to run for wildcard queries. For example, I still want to do lowercasing and character normalization (umlauts) for wildcard queries.

          Mike Sokolov added a comment - - edited

          Is there a reason this issue can't be dealt with by including an appropriate MappingCharFilter in the field definition?

          Jan Høydahl added a comment -

          Agree with Gunnar that the problem is wider than lowercasing. How hard would it be to let each filter choose whether to work on prefix terms or not, and run them through analysis?

          A use case is for the Nordic characters æøåäö. A Norwegian name "Øyvind" would typically be normalized and indexed as "oeyvind", and when a Swede searches for "Öyvin*", he'd get a match if at least the MappingCharFilter and LowerCaseFilter were allowed to run and turn the query into "oeyvin*".
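The normalization described here, lowercasing plus character mapping applied to a wildcard term while leaving the wildcard characters alone, can be sketched as below. This is an illustration under assumptions, not Solr code: the class is hypothetical, and of the folding table only ø → oe and ö → oe follow from the example above; the other entries are guesses at a typical mapping. Escaped wildcards are not handled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class WildcardNormalizer {
    // Hypothetical folding table, written with Unicode escapes to stay encoding-safe.
    static final Map<Character, String> FOLD = new HashMap<>();
    static {
        FOLD.put('\u00f8', "oe"); // o with stroke
        FOLD.put('\u00f6', "oe"); // o with diaeresis
        FOLD.put('\u00e6', "ae"); // ae ligature (assumed mapping)
        FOLD.put('\u00e4', "ae"); // a with diaeresis (assumed mapping)
        FOLD.put('\u00e5', "aa"); // a with ring (assumed mapping)
    }

    /** Lowercases and folds a wildcard term, copying '*' and '?' through unchanged. */
    public static String normalize(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c == '*' || c == '?') {
                sb.append(c); // wildcard characters are never mapped
                continue;
            }
            String mapped = FOLD.get(c);
            sb.append(mapped != null ? mapped : String.valueOf(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Swedish spelling of the query from the comment, with a trailing wildcard.
        System.out.println(normalize("\u00d6yvin*"));
    }
}
```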

          Robert Muir added a comment -

          A lot of analysis things like stemming are not prepared to deal with wildcard characters in the term, and returning multiple tokens (because a tokenizer splits on a * or whatever) makes no sense either.

          In my opinion, a good solution here is to allow you to specify in your schema: this is the analysis chain for these multitermqueries, so it would be a different chain rather than "query" or "index" (similar to SOLR-2477 where I propose allowing you to specify one for "phrase"). The QP would use this chain for things like wildcards, and throw an exception if the analyzer returns more than one token from a wildcard term.

          This way you can use KeywordTokenizer + lowercase/fold characters or whatever, but in general doing things like WDF or synonyms makes no sense here. If you want to do things like stemming, that's fine, you can shoot yourself in the foot this way and we won't stop you.

          But in no case should we try to magically apply the analysis chain... too ambiguous what would happen.
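The "one token or an exception" rule proposed here can be sketched as follows. This is not Lucene or Solr API: the analysis chain is modeled as a plain `Function<String, List<String>>`, all names are hypothetical, and treating zero tokens as an error too is an assumption that goes beyond what the comment states.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class MultiTermAnalysis {
    /** Keyword-style stand-in chain: the whole input is one token, lowercased. */
    public static final Function<String, List<String>> KEYWORD_LOWER =
            s -> Collections.singletonList(s.toLowerCase(Locale.ROOT));

    /** Whitespace-splitting stand-in chain; may yield several tokens. */
    public static final Function<String, List<String>> WHITESPACE =
            s -> Arrays.asList(s.trim().split("\\s+"));

    /** Runs the multiterm chain and insists on exactly one resulting token. */
    public static String analyzeSingleTerm(String term, Function<String, List<String>> chain) {
        List<String> tokens = chain.apply(term);
        if (tokens.size() != 1) {
            // Assumption: zero tokens is rejected as well as more than one.
            throw new IllegalArgumentException("multiterm analysis of '" + term
                    + "' produced " + tokens.size() + " tokens; expected exactly 1");
        }
        return tokens.get(0);
    }

    public static void main(String[] args) {
        System.out.println(analyzeSingleTerm("Foo*", KEYWORD_LOWER));
        try {
            analyzeSingleTerm("foo bar*", WHITESPACE);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing loudly keeps the behavior explicit: a chain that splits a wildcard term is reported as a configuration problem instead of being silently turned into some other query.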

          Gunnar Wagenknecht added a comment -

          > But in no case should we try to magically apply the analysis chain... too ambiguous what would happen.

          Agreed. I just need a way in the schema when configuring fields to say which analyzers should run for wildcard and/or prefix queries.

          Jan Høydahl added a comment -

          I like your idea @Robert. It's explicit and backwards compat, and would allow us to shoot our issues as well as our feet.

          Mike Sokolov added a comment -

          I wonder whether there should be some kind of explicit mapping from analysis "type" to query. If I write some new kind of query (say AnagramQuery - I'll post a patch if anyone wants it), how do I specify whether its terms are analyzed with the wildcard chain or the phrase chain, or the default query chain? Can I make up my own new analysis type and map it to my query type?

          Robert Muir added a comment -

          Mike, I don't totally understand the question: in general there are only several categories of queries supported by the queryparser:

          • Core queries like Term, Phrase, SloppyPhrase, MultiPhrase: these go thru the analyzer.
          • MultiTermQueries like WildcardQuery, PrefixQuery, FuzzyQuery, RegexpQuery, which are patterns that rewrite against the term index into some simpler form (e.g. into TermQueries).

          If you were to write an AnagramQuery, you would first have to add queryparser support for it anyway. But, if you want anagrams you could just write an anagram tokenfilter that sorts the characters in the termbuffer: then you wouldn't need to write a custom query, nor custom queryparser integration, and it would be fast.
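The character-sorting trick can be illustrated on plain strings. The sketch below is not a Lucene TokenFilter; the class and method names are hypothetical and only show why sorting a term's characters makes all anagrams of a word collapse to the same indexed term.

```java
import java.util.Arrays;

public class AnagramKey {
    /** Returns the term's characters in sorted order, so anagrams share one key. */
    public static String sortChars(String term) {
        char[] chars = term.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Both words are anagrams of each other, so they produce the same key.
        System.out.println(sortChars("listen"));
        System.out.println(sortChars("silent"));
    }
}
```

If the same transformation runs at index time and at query time, an ordinary term query on the sorted form finds every anagram.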

          Mike Sokolov added a comment -

          Yes, I've implemented anagram querying as you indicated, by sorting the letters, but the query I have in mind would allow some wildcards as well. An example comes up in Scrabble with the blanks, and we've been asked to implement this for some dictionary sites. I was wondering if that could be implemented in Lucene as an FST: I suspect it could, but my brain went numb trying to come up with a regex as a way to get there, and then I ended up building it using a direct, hand-coded term-scanning approach.

          Re: the question of mapping queries, I may very well be missing something here. Maybe I've misunderstood your plan: isn't it that Phrase-type queries go through the phrase-analyzer, TermQuery goes through the regular (query) analyzer, and MultiTermQueries go through the wildcard-analyzer?

          It just seemed to me that there might be new Queries written in the future that might not easily be categorized into one of those classes, or that it might not be obvious how to indicate which class is the right one, and it could be handy to have a way to associate them with an analysis chain in the way you've described. Although it seems that my one example probably falls into the MTQ category and I guess would just pick up the wildcard analysis chain, which is probably the right thing.

          Robert Muir added a comment -

          > It just seemed to me that there might be new Queries written in the future that might not easily be categorized into one of those classes

          I'm not worried about this to be honest... an inverted index has terms and positions so there are really only so many possibilities.
          I think it's enough to say, here is the analysis chain for terms, for positions, and for multitermqueries that rewrite to these.

          Even if there were 200,000 new queries about to be added, it doesn't make sense to worry about that here, because first they would need queryparser support.

          Mike Sokolov added a comment -

          Fair enough - And by the way +1 on all this - I hated having to hack QueryParser just to prevent stop words getting stripped from phrases. "The the" and "The who" were problematic.

          Yongtao Liu added a comment -

          Is it possible for each field to check whether a LowerCaseFilterFactory filter is used for that field? If so, each field can implement its own getPrefixQuery/getWildcardQuery to convert, or not convert, to lower case.

          Robert Muir added a comment -

          3.4 -> 3.5

          Erick Erickson added a comment -

          I'm pretty sure this is taken care of by SOLR-2438, so marking closed.


            People

            • Assignee:
              Erick Erickson
              Reporter:
              Yonik Seeley
            • Votes:
              11
              Watchers:
              16
