SOLR-2438

Case Insensitive Search for Wildcard Queries

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      This patch adds support to allow case-insensitive queries on wildcard searches for configured TextField field types.

      This patch extends the excellent work done by Yonik and Michael in SOLR-219.
      The approach here is different enough (imho) to warrant a separate JIRA issue.

      1. SOLR-2438-3x.patch
        45 kB
        Erick Erickson
      2. SOLR-2438.patch
        43 kB
        Erick Erickson
      3. SOLR-2438.patch
        42 kB
        Yonik Seeley
      4. SOLR-2438.patch
        39 kB
        Erick Erickson
      5. SOLR-2438_3x.patch
        36 kB
        Erick Erickson
      6. SOLR-2438-3x.patch
        34 kB
        Erick Erickson
      7. SOLR-2438.patch
        34 kB
        Erick Erickson
      8. SOLR-2438.patch
        35 kB
        Erick Erickson
      9. SOLR-2438.patch
        22 kB
        Erick Erickson
      10. SOLR-2438.patch
        16 kB
        Erick Erickson
      11. SOLR-2438.patch
        11 kB
        Erick Erickson
      12. SOLR-2438.patch
        5 kB
        Peter Sturge

        Issue Links

          Activity

          Peter Sturge added a comment -

          Attached patch file

          Peter Sturge added a comment -

          If you're like me, you may have often wondered why MyTerm, myterm, myter* and MyTer* can return different (and sometimes empty) results.
          This patch addresses this for wildcard queries by adding an attribute to relevant solr.TextField entries in schema.xml.
          The new attribute is called: ignoreCaseForWildcards

          Example entry in schema.xml:

          schema.xml [excerpt]
          <fieldType name="text_lcws" class="solr.TextField" positionIncrementGap="100" ignoreCaseForWildcards="true">
            <analyzer type="index">
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            </analyzer>
          </fieldType>
          

          It's worth noting that this will lower-case text for ALL terms that match the field type - including synonyms and stemmers.

          For backward compatibility, the default behaviour is as before - i.e. a case-sensitive wildcard search (ignoreCaseForWildcards=false).

          The patch was created against the lucene_solr_3_1 branch. I've not applied it yet on trunk.

          [caveat emptor] I freely admit I'm no schema expert, so committers and community members may see use cases where this approach could pose problems. I'm all for feedback to enhance the functionality...

          The hope here is to re-ignite enthusiasm for case-insensitive wildcard searches in Solr - in line with the 'it just works' Solr philosophy.

          Enjoy!
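The effect of the flag can be illustrated with a small sketch. This is not the patch code itself; the class and method names are invented for illustration:

```java
import java.util.Locale;

// Illustrative sketch (not the actual patch code): an
// ignoreCaseForWildcards-style flag amounts to lower-casing the raw
// wildcard pattern before query construction, so it matches terms that
// were lower-cased by the index-time analyzer.
public class WildcardCaseFolding {
    public static String prepare(String pattern, boolean ignoreCaseForWildcards) {
        // Locale.ROOT avoids locale-sensitive surprises (e.g. Turkish dotless i).
        return ignoreCaseForWildcards ? pattern.toLowerCase(Locale.ROOT) : pattern;
    }
}
```

With the flag on, MyTer* and myter* yield the same pattern and therefore the same results.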

          David Smiley added a comment -

          Nice Peter. So why did you create another JIRA issue instead of putting your patch on SOLR-219? This is yet another issue, and there is already a quasi-community of commenters (including me) on that other issue.

          Peter Sturge added a comment -

          As I mentioned above, the approach is a little bit different from SOLR-219, and its scope is [perhaps] more targeted at case-insensitive wildcards only.

          It's also a completely self-contained patch. I've found that when a JIRA issue contains lots of (>1) 'non-evolutionary' patches, it becomes difficult to know which patch is which.
          I agree that a new issue means commenters of 219 would need to look at this issue. I've added a link on SOLR-219 to relate it to this issue so it's easier to track.
          Hope this helps clarify.

          Erick Erickson added a comment -

          I'd like to nudge this one forward; this has always bugged me and we spend a lot of time on the users' lists etc. explaining this. Plus, this is a nice simple patch that I can understand.

          So, I've assigned it to myself and I'd like to carry it to resolution, and for that I'd like any comments. As far as SOLR-219 is concerned, this seems like a more targeted fix as Peter mentioned. Does the 80/20 rule apply here? Can we consider this a step towards a full resolution but "good enough for now"? Or would introducing a parameter into the schema file be harder to undo later?

          This approach seems to put the whole question squarely on the user to determine if she wants to, while keeping current behavior unless she takes explicit action which I think is a good thing.

          Robert Muir added a comment -

          I don't think we should use the default locale.

          Why not add an optional analysis chain for multitermqueries like SOLR-219?

          Otherwise you do this, and then someone opens an issue about how accents aren't being folded by their analyzer either.

          Yonik Seeley added a comment -

          Why not add an optional analysis chain

          someone opens an issue about how accents aren't being folded by their analyzer either.

          Good point.
          While an optional analysis chain to be used for prefix queries, etc, would be powerful in some ways, it's not very user friendly and makes the schema harder to read if we add it for all our fields. We should be able to figure out the right thing to do in pretty much all cases when our filters are used (i.e. some filters should be run on a prefix and some shouldn't).

          Robert Muir added a comment -

          While an optional analysis chain to be used for prefix queries, etc, would be powerful in some ways, it's not very user friendly and makes the schema harder to read if we add it for all our fields. We should be able to figure out the right thing to do in pretty much all cases when our filters are used (i.e. some filters should be run on a prefix and some shouldn't).

          I agree with this actually... but it kinda sucks from the plugin perspective.

          Can we have both? E.g. we have a pluggable chain, you can declare your own, but the default is 'AutoAnalyzerFactory' or something like that, which is a factory that deduces it from the query chain (e.g. for this i would say look for accent-folding and lowercase and declare success, if you want anything else go configure your own chain).

          Then the schema file wouldn't be messy either.
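The 'AutoAnalyzerFactory' deduction Robert describes could be sketched like this. Everything here is hypothetical (the class name, the notion of a "safe" list, and representing factories as strings); it is not the real Solr API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: walk the query-time chain and keep only the
// factories that are safe and meaningful to apply to a wildcard or
// prefix term (case folding, accent folding). Names are illustrative.
public class AutoMultiTermChain {
    private static final List<String> MULTITERM_SAFE = List.of(
        "solr.LowerCaseFilterFactory",
        "solr.ASCIIFoldingFilterFactory");

    public static List<String> deduce(List<String> queryChainFactories) {
        List<String> chain = new ArrayList<>();
        for (String factory : queryChainFactories) {
            if (MULTITERM_SAFE.contains(factory)) {
                chain.add(factory);  // lowercase / accent folding survive
            }
        }
        return chain;  // synonyms, stemmers, etc. are dropped
    }
}
```

Anything fancier than this deduction, and the user declares an explicit chain, as Robert suggests.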

          Erick Erickson added a comment -

          I was thinking about trying to explain another type of analyzer a-la SOLR-219. And maybe a fourth type if SOLR-2477 were implemented. And it was giving me a headache, especially when that would require the users to understand what is arguably an expert option before they could handle the 80% case...

          So what about making this bug deal with the accents and case folding as Robert suggests and leave the pluggable stuff for 219 if anyone wants to take it on? And if we do it that way, should we make the user specify a per-field flag (suggestions welcome) or really make this the default and change the current behavior? This latter seems like a bad idea for 3.x, but possibly make the default to auto-detect in trunk?

          Robert Muir added a comment -

          We can change the default just by respecting a version bump (if it's < 3.5, we do nothing, and old configs still do what they did before)

          Yonik Seeley added a comment -

          Can we have both?

          Yeah, that sounds like the best option.

          Erick Erickson added a comment -

          bq: We can change the default just by respecting a version bump (if it's < 3.5, we do nothing, and old configs still do what they did before)

          But from a user's perspective, they install 3.5 and the behavior changes. I'm willing to do it the way you suggest but it makes me nervous to do unilaterally.

          David Smiley added a comment -

          I really like the idea of automatically deducing the proper analysis steps by default (if the schema is >= 3.5), and having the option to specify an explicit analysis chain too. FYI there is similar analyzer detection that goes on for handling of ReverseWildcardFilterFactory with wildcards, and also in ExtendedDismax for detecting stop words. It's good to see committers have interest in this again, finally. If there's something I can do to help then let me know.

          Robert Muir added a comment -

          But from a user's perspective, they install 3.5 and the behavior changes. I'm willing to do it the way you suggest but it makes me nervous to do unilaterally.

          It doesn't, because their old configuration file doesn't have matchVersion=3.5.

          Erick Erickson added a comment -

          This is not at all ready for prime-time but I'm inviting comments on the approach. It turns out that all the hard work has already been done, see QueryParserBase. The attached patch is almost all tests...

          But I greatly fear that I'm grossly misusing
          QueryParserBase.lowercaseExpandedTerms, which looks like it's for parameters on the query line? Where did that come from anyway? Or what the heck is it supposed to be used for, anyone know?

          A couple of things make me nervous about this approach. It depends in a pretty hard-coded way on detecting LowerCaseFilterFactory and LowerCaseTokenizerFactory; if anyone adds anything else in there it'll have to be re-coded. Is there a better way? It almost seems like a flag on the <field> definition as Peter suggested is a more robust way of going about things.

          Anyway, I'm getting way past the point of diminishing returns tonight, so I thought I'd at least throw this out for comment.

          Ignore everything with the ASCIIFoldingFilterFactory, I detect it but don't do anything with it yet.

          And I can't seem to make the reversed test work, even without the casing switch. Which means I should put it down for the evening, I'm obviously fried. Anybody feeling kind can uncomment the line that starts:

          // make me work

          and get the test class to work. It's probably trivial but I'm not seeing it.

          Yonik Seeley added a comment -

          QueryParserBase.lowercaseExpandedTerms[...] Where did that come from anyway?

          Legacy. Solr has never used it.

          It almost seems like a flag on the <field> definition as Peter suggested is a more robust way of going about things.

          Since there's normally only one right way, making it configurable (from the user's point of view) isn't optimal.
          That could be the job of the factory though!

          We could add flags to the factory to say "this filter should be run for prefix queries, etc", or a method to return that meta-data.
          Then an analyzer could be built that only included the right filters for the type query (prefix, wildcard, etc).

          Another related alternative is to have the factory actually do the processing on the term. Something like

          public enum QueryType { PREFIX, WILDCARD, ... }
          
          public String LowerCaseFilterFactory.processQueryTerm(String queryTerm, QueryType queryType) {
            return queryTerm.toLowerCase(Locale.US);  // Uhhh, what locale returns the same values as Character.toLowerCase()?  
          }
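A compilable rendering of that sketch might look like the following. The class and enum names are hypothetical, not the real Solr API, and Locale.ROOT is offered as one answer to the locale question in the comment above:

```java
import java.util.Locale;

// Hypothetical sketch: each filter factory exposes a hook that processes
// a raw multiterm (prefix/wildcard/...) query term. This models what a
// LowerCaseFilterFactory-style hook would do.
public class MultiTermHookSketch {
    public enum QueryType { PREFIX, WILDCARD, FUZZY }

    // Locale.ROOT gives locale-independent case mapping, sidestepping
    // surprises such as the Turkish dotless i.
    public static String processQueryTerm(String queryTerm, QueryType queryType) {
        return queryTerm.toLowerCase(Locale.ROOT);
    }
}
```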
          
          Erick Erickson added a comment -

          Here's a rough cut at what I think Yonik might have been talking about, comments?

          I haven't done a thing about efficiency here, just seeing if the new method in the FilterFactories (processQueryTerm) makes sense to y'all.

          One thing I'm not clear on: Would it make more sense to just instantiate a new instance of the filter and run each term through it rather than steal bits from the underlying Filters (see ASCIIFoldingFilterFactory and LowercaseFilterFactory for example). I just hate duplicated code but I'm not sure how efficient creating a new filter and running the token through would be for each and every token.

          Robert Muir added a comment -

          I just hate duplicated code but I'm not sure how efficient creating a new filter and running the token through would be for each and every token.

          Which is why it should be a real chain, not code duplication or a conflation of token filter factories that are also query parser factories.

          The Analyzer etc. takes care of all this reuse stuff already: the only issue people have with a chain is they want an 'auto' one by default... that's separate... that's CONFIGURATION

          Erick Erickson added a comment -

          OK, this isn't nearly finished yet, but I wanted to run it by folks to see if the approach is what, particularly, Robert and Yonik have in mind.

          I'm assuming that the flex stuff is out of scope for this JIRA, right?

          Don't waste your time on details just yet, only the general approach.

          I'm thinking of allowing a flag to the fields to disable this functionality but make this the default, thoughts?

          Haven't even thought about back-porting to 3x, but it looks do-able on a quick glance.

          Robert Muir added a comment -

          This is totally the general approach I wanted to see here!

          So then we are only left with the two configuration cases to think about:
          1. We have the backwards case where no chain is specified, and we want to provide the behavior of today. I think the simplest solution is to just set the analyzer to KeywordAnalyzer for that? This might simplify logic, as then this 'analyzer' for the field is never null, and the QP just always analyzes these queries.
          2. We have the 'auto' case where no chain is specified, and we want to do the right thing automatically. So in that situation I think we always build a chain based off of the user's CharFilter setup + KeywordTokenizer + LowerCaseFilter (if LowerCaseTokenizer/Filter is in use) + ASCIIFoldingFilter (if it's in use). Anything more complicated and someone just declares their own chain.

          Erick Erickson added a comment -

          Yonik says that .../queryparser/classic/QueryParserBase.lowercaseExpandedTerms is something that was never used (but I do see test cases for it). Should I remove the variable, the getter/setter and all that kinda stuff? Are there any back compat issues (this is a public method after all).....

          I can easily be argued out of taking this out. This JIRA is about Solr, I'm not sure messing around in the classic query parser is appropriate when it's not germane to the JIRA...

          Robert Muir added a comment -

          Yonik says that .../queryparser/classic/QueryParserBase.lowercaseExpandedTerms is something that was never used (but I do see test cases for it).

          It's not used by Solr, but it's 'on' by default for Lucene.

          I would say leave it be to make it easier to get this issue resolved for now?

          Yonik Seeley added a comment -

          It's not used by Solr, but it's 'on' by default for Lucene.

          Yep.

          I would say leave it be to make it easier to get this issue resolved for now?

          +1

          Erick Erickson added a comment -

          I think this patch is ready for scrutiny. Tests run successfully.

          I have yet to do several things:
          1> update README
          2> add an example to example/schema.xml
          3> this is going to take a writeup on the Wiki I think, explaining that there's another (optional) section to a <fieldType>. Any suggestions where that should go?

          Originally, I'd hoped to back-port it to 3.5, but the more I look at it the more I'd like it to bake a while before being officially released and target 3.6 instead for the back-port. Can one back-port something like this after the first RC is cut or is it better to wait until after the release? I can always commit this to trunk and open another JIRA to backport after 3.5 is released.
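The "another (optional) section to a <fieldType>" Erick mentions could plausibly look like the excerpt below, following the style of the earlier schema.xml example. The element name and exact shape are illustrative, not confirmed by this issue:

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- The optional extra section: analysis applied to wildcard/prefix
       (multiterm) queries. Omitting it would fall back to the
       auto-deduced or legacy behavior discussed above. -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```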

          Uwe Schindler added a comment -

          Hi Erick, once the 3.5 release is branched off, you are free to commit it to the 3.x branch.

          Robert Muir added a comment -

          3.5 is branched. But just my opinion: we should never worry about this stuff.

          I don't think we should ever freeze trunk or our stable branch.

          If someone is working on a release candidate and hasn't branched, they can always branch
          from a specific revision that they were working with before.

          By the way Erick: nice work on the patch. I just took a quick glance (didn't test it),
          but only have one question.

          If the backwards compatibility path is to have legacyMultiTerm, can't we just control
          its default based upon the schema version (and bump that)? It seems awkward to have
          two booleans that control the backwards compatibility: both legacyMultiTerm and luceneMatchVersion.

          I guess at the end of the day I think the schema variable you added is a better approach,
          because it's not really a behavior of the lucene queryparser that changed, but a change
          to the schema.

          Simon Willnauer added a comment -

          3.5 is branched. But just my opinion: we should never worry about this stuff.

          +1

          Erick Erickson added a comment -

          bq: I guess at the end of the day I think the schema variable you added is a better approach

          I was just thinking about this since someone on the user's list asked "can this be applied to 3.4" and I couldn't make that all work without headaches, precisely because there were two variables to contend with. Alright, I'll make matchVersion determine the default value of legacyMultiTerm, which should allow this patch to be applied to pre-3.6 code lines at the user's risk.

          bq: 3.5 is branched. But just my opinion: we should never worry about this stuff.

          Right, if it had been more than a couple of days it'd have been another story, but I ran into a few surprises when running tests, so delaying for a couple of days to insure no chance of screwing up seemed prudent... thanks.

          Erick Erickson added a comment -

          OK, this patch does a better job with the matchVersion as per Muir. If nobody objects I'll commit it this week, probably not before Wednesday though. Then I should be able to do the backport to 3.6 shortly thereafter.

          I still have to run all the tests yet again, but I don't really expect much of a problem.

          Should SOLR-218, SOLR-219 and SOLR-757 all be closed as part of SOLR-2438?

          Erick Erickson added a comment -

          Here's what the 3x version would look like if anyone's interested. There's some refactoring that was done between 3.x and 4.0 that made reconciling these a bit of a pain.

          Still need to modify the CHANGES files.

          I'll commit these tomorrow sometime if nobody objects.

          Yonik Seeley added a comment -

          Let's iterate on this in trunk a while before we backport to 3x.
          It's always good when we can avoid configuration, and I think that's the case here.

          First of all, the name "legacyMultiTerm" just begs the question "with respect to what?".
          Second, things would only be non-back-compatible if someone were using things incorrectly (i.e. a query of myfield:Foo when myfield contains a lowercase filter). It really doesn't seem like we need a "legacyMultiTerm" flag or a lucene match version for this.

          I'll work up a patch to make the analysis a bit more flexible too.

          Erick Erickson added a comment -

          Too late, I checked all this in before I saw your comment, including into 3.6.

          Yeah, legacyMultiTerm is awkward, and I'm pretty confused by the lucene match version, so anything you want to do there is welcome. What I was trying for was the capability to
          1> keep from changing the behavior on someone installing over an older 3.x version
          2> allow them to keep the old behavior if necessary. I admit I have a hard time
          coming up with a use-case here, but at least it does allow them to sidestep
          the whole process if someone counts on the old behavior.

          At root I'm uncomfortable deciding for them what's best without a way of having the user say "no, you dolt, you didn't anticipate my use case!".

          Anyway, I opened up SOLR-2918

          Erick Erickson added a comment -

          Patches as of the commit

          Erick Erickson added a comment -

          Trunk - r1206229
          3.6 - r1206258

          Note, Yonik has some improvements in mind, see: SOLR-2918

          Erick Erickson added a comment -

          I closed SOLR-219 and SOLR-757 as part of this issue.

          Yonik Seeley added a comment -

          I'm looking at TestFoldingMultitermQuery, and as far as I can tell there only seem to be tests for positive matches.
          For example, there are tests that myfield:A* match "abacus" on a case insensitive field, but there are no tests that ensure that it doesn't match on a case sensitive field?

          It was especially confusing since there are comments like:

          // test the wildcard queries find only one doc  where the query is uppercased and/or accented.
          

          That suggest that case sensitivity is also being tested (because of the "only one" phrase), but that's not the case. It seems like it should really say

          // test the wildcard queries find the doc even when the query is uppercased and/or accented.
          

          This patch also introduced a bug that causes regex queries to be lowercased when they shouldn't be. I've fixed it in my local copy.

          Uwe Schindler added a comment -

          I fixed the example schema in 3.x (which contained a merge error: double field definition). The 3.x code also used Arrays.copyOfRange() that is not available in Java 5. Both errors should have been detected when running tests before committing.
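          As an aside, the Java 5-incompatible call is straightforward to replace. A minimal sketch of a backport-friendly equivalent (hypothetical helper and class names, not the committed fix):

```java
public class CopyOfRangeDemo {
    // Java 5-compatible stand-in for Arrays.copyOfRange(char[], int, int),
    // which only exists since Java 6.
    static char[] copyOfRange(char[] src, int from, int to) {
        char[] dst = new char[to - from];
        // Copy only what actually exists in src; any tail stays '\u0000'-padded,
        // matching Arrays.copyOfRange semantics when `to` exceeds src.length.
        System.arraycopy(src, from, dst, 0, Math.min(src.length, to) - from);
        return dst;
    }

    public static void main(String[] args) {
        System.out.println(new String(copyOfRange("wildcard".toCharArray(), 0, 4))); // wild
    }
}
```

          System.arraycopy is available on every Java 5 JVM, so this avoids bumping the minimum runtime just for an array-slicing convenience method.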

          Yonik Seeley added a comment -

          I see a couple of other issues:

          • hardcoded use of WhitespaceTokenizerFactory will break any fields that allow embedded whitespace (KeywordTokenizerFactory, etc)
          • LowercaseTokenizerFactory won't be detected as producing lowercase tokens
          • No CharFilters are run (i.e. things like MappingCharFilter that can change chars)

          I'll take a crack at addressing these.

          Robert Muir added a comment -

          Seems like unless you specifically ask, the automagic mode should use keywordtokenizer?

          I think the relevant code for analyzing the terms will throw an exception if the tokenizer produces more than 1 term from a wildcard or other MTQ...

          Robert Muir added a comment -

          And supporting other tokenizers by default would be hellacious anyway, since you possibly have syntax in the text...

          Erick Erickson added a comment -

          Thanks, Uwe. I not only ran my new tests in IntelliJ, but I also checked out the code after committing and ran the entire test suite. Then, looking at my stored results again after your note just now I saw that I'd managed to blow right past the error in schema.xml. Siiiggghhhhh. It wasn't in my test cases so I guess I'd just stopped looking by then....My apologies, that shouldn't happen again...

          Yonik Seeley added a comment -

          Here's a patch with quite a few changes:

          • fixed cases that were lowercasing the terms when they shouldn't be
          • moved the multiTermAnalyzer methods up to TextField. This strategy isn't general enough to be used by other field types (think prefix query on a numeric field, etc)
          • added an interface, MultiTermAwareComponent, that can allow a factory to return another factory (or itself)
          • removed legacy_multiTerm... if someone wants that, they can directly configure a keyword tokenizer.
          • made TextField.getRangeQuery() fully functional
          • use KeywordTokenizer by default in the MultiTermAnalyzer
          • removed from the example schema... this should work for 99% of people and only expert users should care (and it could confuse new users about when they need to define a custom multiTermAnalyzer)
          • other minor improvements such as not looking up the fieldType twice all the time in SolrQueryParser
          • SOLR-1982 related - don't introspect all fieldTypes for every query parser instantiation... do it per-field only as required. Also, don't allow wildcard behavior of one field affect another.
          • added some new tests to catch certain cases, fixed some other test cases. A few more tests might be desirable.
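          The MultiTermAwareComponent idea in the list above can be sketched as follows (hypothetical demo names, not the actual Solr source): an analysis factory advertises a variant of itself that is safe to run over multi-term (wildcard/prefix/range) query text.

```java
import java.util.Locale;

// A factory implements this to say which component should be applied at
// multi-term query time; returning `this` means it is multi-term safe as-is.
interface MultiTermAwareComponent {
    Object getMultiTermComponent();
}

// Hypothetical lowercasing filter factory: lowercasing never splits a token,
// so the same factory is safe to reuse for wildcard patterns.
class DemoLowerCaseFilterFactory implements MultiTermAwareComponent {
    public Object getMultiTermComponent() {
        return this;
    }
    String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}

public class MultiTermDemo {
    public static void main(String[] args) {
        DemoLowerCaseFilterFactory f = new DemoLowerCaseFilterFactory();
        // Only the multi-term-safe component touches the raw pattern "MyTer*".
        DemoLowerCaseFilterFactory safe = (DemoLowerCaseFilterFactory) f.getMultiTermComponent();
        System.out.println(safe.normalize("MyTer*")); // myter*
    }
}
```

          The payoff of this design is that the multiterm analyzer can be assembled automatically from the query analyzer by keeping only the components that declare themselves multi-term safe, instead of requiring users to configure a third analyzer chain by hand.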
          Erick Erickson added a comment -

          Yonik:

          Splendid. It's amazing what happens when someone who really understands the code does the work! I took a quick glance over it, I'll be able to look more this evening...

          So what's next, should I carry it forward from here or are you going to commit the patch? And what about 3x?

          Yonik Seeley added a comment -

          So what's next, should I carry it forward from here

          Go for it!

          Erick Erickson added a comment -

          Improved version of this issue. If you want to apply this patch to the source from before 25-Nov, you probably have to apply the patch dated 25-Nov first, both for trunk and 3.6

          Erick Erickson added a comment -

          backport MultiTermAware version of this patch to 3.6. Again, before applying this patch you probably need to apply the 3x patch from 25-Nov.

          Erick Erickson added a comment -

          Hmmm, it's clear that reading this patch is confusing. See the writeup at: http://wiki.apache.org/solr/MultitermQueryAnalysis

          Michael McCandless added a comment -

          Hmm, working on LUCENE-995, where I had to regen QueryParser.java from QueryParser.jj on 3.x... I noticed that when this issue was committed (well, rev 1206258, up above), QueryParser.java was changed yet QueryParser.jj was not...

          I'll fix that.

          But even more spooky, all tests passed with that loss:

          -  protected String analyzeMultitermTerm(String field, String part, Analyzer analyzerIn) {
          -    TokenStream source;
          -
          -    if (analyzerIn == null) analyzerIn = analyzer;
          -
          -    try {
          -      source = analyzerIn.tokenStream(field, new StringReader(part));
          -      source.reset();
          -    } catch (IOException e) {
          -      throw new RuntimeException("Unable to initialize TokenStream to analyze multiTerm term: " + part, e);
          -    }
          -
          -    CharTermAttribute termAtt = source.getAttribute(CharTermAttribute.class);
          -    String termRet = "";
          -
          -    try {
          -      if (!source.incrementToken())
          -        throw new IllegalArgumentException("analyzer returned no terms for multiTerm term: " + part);
          -      termRet = termAtt.toString();
          -      if (source.incrementToken())
          -        throw new IllegalArgumentException("analyzer returned too many terms for multiTerm term: " + part);
          -    } catch (IOException e) {
          -      throw new RuntimeException("error analyzing range part: " + part, e);
          -    }
          -
          -    try {
          -      source.end();
          -      source.close();
          -    } catch (IOException e) {
          -      throw new RuntimeException("Unable to end & close TokenStream after analyzing multiTerm term: " + part, e);
          -    }
          -
          -    return termRet;
          -  }
          -
          

          Do we have no test coverage for this method? Should this method be removed...? I'm confused.

          Michael McCandless added a comment -

          Reopening so we can figure out whether the backport was correct... we should at least see a test fail when that method is removed I think?

          Erick Erickson added a comment -

          Yikes!

          How did you "remove the method"? Comment it out and comment out the call to it or something else? I'll need to do whatever you did to reproduce it.

          And do you expect that this would also be an issue for trunk? As I remember, the two implementations are pretty similar so I'd at least be suspicious there too.

          I'm not very familiar (ok, not at all) with the parser generation step, so there may be a few more questions along the way.

          I should be able to take a quick look tonight sometime...

          Yonik Seeley added a comment -

          And do you expect that this would also be an issue for trunk?

          Nope. For trunk, I previously refactored everything I could out of QueryParser.jj into QueryParserBase.java (and hence the trunk patch only needed to modify that file).

          Erick Erickson added a comment -

          OK, spooky is a kinder way to describe this than some I can think of. The method in lucene...QueryParser was simply dead code, I suspect I managed to put it there through the magic of modern IDEs. And hey, it compiles. Since this solution is a purely Solr construct in the first place, it has no business being in the lucene package at all.

          There's an identical method in solr...TextField that is the code actually used, so somehow I managed to put the same code in two places. Simply removing the code from lucene...QueryParser is the right thing to do here I think, all the tests pass when I did this manually.

          And, as Yonik says, there's nothing similar in trunk.

          So if I remember the patterns from lex and yacc, there's really nothing for me to do except close this JIRA, right? Mike can freely check in the newly generated QueryParser.java and I can remember to actually read the header that says things about this being a generated file that shouldn't be changed....

          Agree? Disagree?

          Nice catch Mike!

          Erick Erickson added a comment - - edited

          Michael:

          Can we close this? Or is there something I need to do here?

          Erick Erickson added a comment -

          Removed methods inadvertently put in lucene...QueryParser.java/jj

          Thanks Mike for pointing this out!

          3x only, r: 1229291

          Erick Erickson added a comment -

          Removed dead code inadvertently put in 3x QueryParser.java. No functional changes, no patch.

          Duto added a comment -

          I tried the 3.6-SNAPSHOT yesterday and noticed something strange:

          raw query        parsed query    comment
          name:LéCTROD*    name:lectrod*   all good
          name:*LéCTROD    name:lectrod    the wildcard is removed!
          name:*LéCTROD*   name:lectrod    all wildcards are removed!

          Is it normal that when the wildcard is in the first position of the raw query, it is removed from the parsed query?

          schema.xml
          <types>
          	<fieldtype name="text_fr" class="solr.TextField">
          		<analyzer type="index">
          			<tokenizer class="solr.StandardTokenizerFactory"/>
          			<filter class="solr.StandardFilterFactory"/>
          			<filter class="solr.ASCIIFoldingFilterFactory"/>
          			<filter class="solr.LowerCaseFilterFactory"/>
          		</analyzer>
          		<analyzer type="query">
          			<tokenizer class="solr.StandardTokenizerFactory"/>
          			<filter class="solr.StandardFilterFactory"/>
          			<filter class="solr.ASCIIFoldingFilterFactory"/>
          			<filter class="solr.LowerCaseFilterFactory"/>
          		</analyzer>
          		<analyzer type="multiterm">
          			<tokenizer class="solr.StandardTokenizerFactory" />
          			<filter class="solr.StandardFilterFactory"/>
          			<filter class="solr.ASCIIFoldingFilterFactory"/>
          			<filter class="solr.LowerCaseFilterFactory"/>
          		</analyzer>
          	</fieldtype>
          </types>
          
          <fields>
          	<field name="name" type="text_fr" indexed="true" stored="true" multiValued="true"/>
          </fields>
          

          Duto

          Erick Erickson added a comment -

          Duto:

          A couple of things. First, in the future could you post this kind of usage question to the users list? See: http://lucene.apache.org/solr/discussion.html. No big deal, but that way more people see the discussion and benefit.

          But to your question:
          Have you enabled leading wildcards? See the ReversedWildcardFilterFactory. Leading wildcards need some special handling because, in the simple case, finding them means examining every term in the field, which can be very expensive.

          Second, you could get away with just one analyzer, since they're all the same:
          <analyzer>
          .
          .
          .
          </analyzer>

          If no 'type=...' is specified, then the index, query, and multiterm chains all use that one analyzer definition.

          I doubt this issue is related to this JIRA, I think it's just the normal leading wildcard issues.

          Here's a discussion of this in some detail if you haven't seen it yet:
          http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

          Erick

          Hide
          Karsten R. added a comment -

          Hi Erick,

          in your commit Revision 1206767 (2011-11-27) there is a change in
          org.apache.solr.search.SolrQueryParser#getReversedWildcardFilterFactory(FieldType).
          Before this commit the Map leadingWildcards was essential; after it, leadingWildcards is only a cache, and a cache that is never actually hit:

          ReversedWildcardFilterFactory fac = leadingWildcards.get(fieldType);
          if (fac == null && leadingWildcards.containsKey(fac)) {
             return fac;
          }
          

          If we want to use this cache it should be

          ReversedWildcardFilterFactory fac = leadingWildcards.get(fieldType);
          if (fac != null || leadingWildcards.containsKey(fieldType)) {
             return fac;
          }
          

          best regards
          Karsten
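          Karsten's point can be seen with a small standalone sketch (hypothetical names; this is not Solr code). A HashMap used as a cache may store null as a legitimate value ("no reversed-wildcard factory for this field type"), and only containsKey on the key, not on the looked-up value, distinguishes a cached null from a missing entry:

```java
import java.util.HashMap;
import java.util.Map;

public class CacheDemo {
    // Cache where null is a legitimate cached value, meaning
    // "this field type has no reversed-wildcard factory".
    static final Map<String, String> cache = new HashMap<>();
    static {
        cache.put("text_fr", null);
    }

    // Pre-fix shape: containsKey is called on the looked-up value (fac),
    // not the key, so a cached entry is never returned.
    static String lookupBroken(String key) {
        String fac = cache.get(key);
        if (fac == null && cache.containsKey(fac)) {
            return fac;
        }
        return "recomputed"; // stands in for rebuilding the factory
    }

    // Fixed shape: a non-null hit returns immediately, and containsKey(key)
    // lets a cached null be returned instead of recomputed.
    static String lookupFixed(String key) {
        String fac = cache.get(key);
        if (fac != null || cache.containsKey(key)) {
            return fac;
        }
        return "recomputed";
    }

    public static void main(String[] args) {
        System.out.println(lookupBroken("text_fr")); // recomputed: cache ignored
        System.out.println(lookupFixed("text_fr"));  // null: cached miss honored
    }
}
```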

          Hide
          Yonik Seeley added a comment -

          Thanks Karsten, I'll fold that change into my patch for SOLR-4093


            People

            • Assignee:
              Erick Erickson
              Reporter:
              Peter Sturge
            • Votes:
              5
              Watchers:
              13