Solr
  1. Solr
  2. SOLR-2921

Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None
    • Environment:

      All

      Description

      SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:

      • ASCIIFoldingFilterFactory
      • LowerCaseFilterFactory
      • LowerCaseTokenizerFactory
      • MappingCharFilterFactory
      • PersianCharFilterFactory

      When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!

      But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that might be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.

      Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a Filter, which is the right thing in this case.

      Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.

      ArabicNormalizationFilterFactory
      GreekLowerCaseFilterFactory
      HindiNormalizationFilterFactory
      ICUFoldingFilterFactory
      ICUNormalizer2FilterFactory
      ICUTransformFilterFactory
      IndicNormalizationFilterFactory
      ISOLatin1AccentFilterFactory
      PersianNormalizationFilterFactory
      RussianLowerCaseFilterFactory
      TurkishLowerCaseFilterFactory

      1. SOLR-2921_rest.patch
        14 kB
        Robert Muir
      2. SOLR-2921-3x.patch
        16 kB
        Erick Erickson
      3. SOLR-2921-3x.patch
        17 kB
        Erick Erickson
      4. SOLR-2921-3x.patch
        16 kB
        Erick Erickson
      5. SOLR-2921-trunk.patch
        14 kB
        Erick Erickson

        Activity

        Hide
        Erick Erickson added a comment -

        Cool! Thanks!

        Show
        Erick Erickson added a comment - Cool! Thanks!
        Hide
        Robert Muir added a comment -

        erick, here's a patch for the rest. I'll commit it shortly.

        Show
        Robert Muir added a comment - erick, here's a patch for the rest. I'll commit it shortly.
        Hide
        Erick Erickson added a comment -

        Let's open up any further issues in a new JIRA?

        Show
        Erick Erickson added a comment - Let's open up any further issues in a new JIRA?
        Hide
        Erick Erickson added a comment -

        3x r: 1303937
        Trunk r: 1303939

        Show
        Erick Erickson added a comment - 3x r: 1303937 Trunk r: 1303939
        Hide
        Robert Muir added a comment -

        Patch looks good: i think you should commit it and I'll follow up with the other ones.

        only one nitpick:

        -/** 
        +/**
          * Factory for {@link TurkishLowerCaseFilter}.
          * <pre class="prettyprint" >
          * &lt;fieldType name="text_trlwr" class="solr.TextField" positionIncrementGap="100"&gt;
        - *   &lt;analyzer&gt;
        - *     &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
        - *     &lt;filter class="solr.TurkishLowerCaseFilterFactory"/&gt;
        - *   &lt;/analyzer&gt;
        - * &lt;/fieldType&gt;</pre> 
        + * &lt;analyzer&gt;
        + * &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
        + * &lt;filter class="solr.TurkishLowerCaseFilterFactory"/&gt;
        + * &lt;/analyzer&gt;
        + * &lt;/fieldType&gt;</pre>
        + *
        

        Did your IDE do this? I don't think we should lose the indentation of the example there.

        Show
        Robert Muir added a comment - Patch looks good: i think you should commit it and I'll follow up with the other ones. only one nitpick: -/** +/** * Factory for {@link TurkishLowerCaseFilter}. * <pre class="prettyprint" > * &lt;fieldType name="text_trlwr" class="solr.TextField" positionIncrementGap="100"&gt; - * &lt;analyzer&gt; - * &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt; - * &lt;filter class="solr.TurkishLowerCaseFilterFactory"/&gt; - * &lt;/analyzer&gt; - * &lt;/fieldType&gt;</pre> + * &lt;analyzer&gt; + * &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt; + * &lt;filter class="solr.TurkishLowerCaseFilterFactory"/&gt; + * &lt;/analyzer&gt; + * &lt;/fieldType&gt;</pre> + * Did your IDE do this? I don't think we should lose the indentation of the example there.
        Hide
        Erick Erickson added a comment -

        Fixes test cases in analysis-extras so it runs from the command line not only in IntelliJ.

        Show
        Erick Erickson added a comment - Fixes test cases in analysis-extras so it runs from the command line not only in IntelliJ.
        Hide
        Erick Erickson added a comment -

        Hey, man, it worked from inside IntelliJ. I thought getting to the schema in core from analysis-extras (where TestFoldingMultitermExtrasQuery lives) was funky. I'll have a patch up for that shortly.

        Show
        Erick Erickson added a comment - Hey, man, it worked from inside IntelliJ. I thought getting to the schema in core from analysis-extras (where TestFoldingMultitermExtrasQuery lives) was funky. I'll have a patch up for that shortly.
        Hide
        Robert Muir added a comment -

        I'm not even going to try the following, I don't even know what to expect as proper results. If nobody steps up I'll split these out into another JIRA and hopefully someone with the appropriate knowledge (and keyboard) can volunteer:

        Just make progress with your patch and keep this issue open. I'll help with these.

        As far as the existing patch, i think the multitermtest for ICU will not work (because the support is in a contrib module: analysis-extras).
        If we want a test for that stuff i think it needs to sit under that contrib.

        Show
        Robert Muir added a comment - I'm not even going to try the following, I don't even know what to expect as proper results. If nobody steps up I'll split these out into another JIRA and hopefully someone with the appropriate knowledge (and keyboard) can volunteer: Just make progress with your patch and keep this issue open. I'll help with these. As far as the existing patch, i think the multitermtest for ICU will not work (because the support is in a contrib module: analysis-extras). If we want a test for that stuff i think it needs to sit under that contrib.
        Hide
        Erick Erickson added a comment -

        Here's a first cut at these. The tests in TestFoldingMultitermExtrasQuery are especially weak, any help here would be extremely welcome....

        Basically, I stole the patterns from the associated filters and removed the ones that failed for reasons I didn't understand. And I haven't checked the remaining all that carefully, I have some stuff coming up for most of the rest of today and wanted to get the first cut out in front of people.

        The attached patch applies against 3x, I'll need to tweak it for trunk but won't bother until after we finalize this.

        I also haven't run the full test suite, so this patch should NOT be committed yet.

        I'm not even going to try the following, I don't even know what to expect as proper results. If nobody steps up I'll split these out into another JIRA and hopefully someone with the appropriate knowledge (and keyboard) can volunteer:
        ArabicNormalizationFilterFactory
        HindiNormalizationFilterFactory
        IndicNormalizationFilterFactory
        PersianNormalizationFilterFactory
        ICUTransformFilterFactory

        Show
        Erick Erickson added a comment - Here's a first cut at these. The tests in TestFoldingMultitermExtrasQuery are especially weak, any help here would be extremely welcome.... Basically, I stole the patterns from the associated filters and removed the ones that failed for reasons I didn't understand. And I haven't checked the remaining all that carefully, I have some stuff coming up for most of the rest of today and wanted to get the first cut out in front of people. The attached patch applies against 3x, I'll need to tweak it for trunk but won't bother until after we finalize this. I also haven't run the full test suite, so this patch should NOT be committed yet. I'm not even going to try the following, I don't even know what to expect as proper results. If nobody steps up I'll split these out into another JIRA and hopefully someone with the appropriate knowledge (and keyboard) can volunteer: ArabicNormalizationFilterFactory HindiNormalizationFilterFactory IndicNormalizationFilterFactory PersianNormalizationFilterFactory ICUTransformFilterFactory
        Hide
        Robert Muir added a comment -

        I guess I'm at a loss as to how to write tests for the various filters and tokenizers I listed, which is why I'm reluctant to just make them MultTermAwareComponents.

        Well, none of them are going to work perfectly with wildcards etc anyway, and we already have:

        • tests that these filters do what they should
        • tests that this MultiTerm stuff works

        Looking at your 'quick cull' list i think these will 'generally work'. Sure I could list corner cases for maybe
        half of them right off the top of my head, but we already know its not going to work perfectly. These aren't bugs
        in the filters if they don't work 100% right when given query syntax instead of text, and its not bugs if they
        have context-sensitive rules which won't work the way people expect with patterns.

        So I don't think we need tests that show their half-way broken behavior? This is sort of a best-effort thing anyway.

        Show
        Robert Muir added a comment - I guess I'm at a loss as to how to write tests for the various filters and tokenizers I listed, which is why I'm reluctant to just make them MultTermAwareComponents. Well, none of them are going to work perfectly with wildcards etc anyway, and we already have: tests that these filters do what they should tests that this MultiTerm stuff works Looking at your 'quick cull' list i think these will 'generally work'. Sure I could list corner cases for maybe half of them right off the top of my head, but we already know its not going to work perfectly. These aren't bugs in the filters if they don't work 100% right when given query syntax instead of text, and its not bugs if they have context-sensitive rules which won't work the way people expect with patterns. So I don't think we need tests that show their half-way broken behavior? This is sort of a best-effort thing anyway.
        Hide
        Erick Erickson added a comment -

        There is some desire (and suggestions) to do some more of these for 3.6, so re-labeling

        Show
        Erick Erickson added a comment - There is some desire (and suggestions) to do some more of these for 3.6, so re-labeling
        Hide
        Erick Erickson added a comment -

        Mike:

        stemmers - not going to make them MultiTermAware. No way. No how. Not on my watch, one succinct example and I'm convinced.

        The beauty of the way Yonik and Robert directed this is that we can take care of the 80% case, not provide things that are that surprising and still have all the flexibility available to those who really need it. As Robert says, if they really want some "interesting" behavior, they can specify the complete chain.

        Robert:

        I guess I'm at a loss as to how to write tests for the various filters and tokenizers I listed, which is why I'm reluctant to just make them MultTermAwareComponents. Do you have any suggestions as to how I could get tests? I had enough surprises when I ran the tests in English that I'm reluctant to just plow ahead. As far as I understand, Arabic is caseless for instance.

        I totally agree with your point that making the analysis components cope with syntax is evil. Not going there either.

        Maybe the right action is to wait for someone to volunteer to be the guinea pig for the various filters, I suppose we could advertise for volunteers...

        Show
        Erick Erickson added a comment - Mike: stemmers - not going to make them MultiTermAware. No way. No how. Not on my watch, one succinct example and I'm convinced. The beauty of the way Yonik and Robert directed this is that we can take care of the 80% case, not provide things that are that surprising and still have all the flexibility available to those who really need it. As Robert says, if they really want some "interesting" behavior, they can specify the complete chain. Robert: I guess I'm at a loss as to how to write tests for the various filters and tokenizers I listed, which is why I'm reluctant to just make them MultTermAwareComponents. Do you have any suggestions as to how I could get tests? I had enough surprises when I ran the tests in English that I'm reluctant to just plow ahead. As far as I understand, Arabic is caseless for instance. I totally agree with your point that making the analysis components cope with syntax is evil. Not going there either. Maybe the right action is to wait for someone to volunteer to be the guinea pig for the various filters, I suppose we could advertise for volunteers...
        Hide
        Robert Muir added a comment -

        well Erick i think the ones you listed here are ok.

        There are cases where they won't work correctly, but trying to do
        multitermqueries with mappingcharfilter and asciifolding
        filter are already problematic (eg ? won't match œ because
        its now 'oe').

        Personally i think this is fine, but we should document
        that things don't work correctly all the time, and we
        should not make changes to analysis components to try
        to make them cope with multiterm queries syntax or
        anything (this would be bad design, it turns them into
        queryparsers).

        If the user cares about the corner cases, then they just
        specify the chain.

        Show
        Robert Muir added a comment - well Erick i think the ones you listed here are ok. There are cases where they won't work correctly, but trying to do multitermqueries with mappingcharfilter and asciifolding filter are already problematic (eg ? won't match œ because its now 'oe'). Personally i think this is fine, but we should document that things don't work correctly all the time, and we should not make changes to analysis components to try to make them cope with multiterm queries syntax or anything (this would be bad design, it turns them into queryparsers). If the user cares about the corner cases, then they just specify the chain.
        Hide
        Mike Sokolov added a comment -

        I spoke hastily, and it's true that stemmers are different from those other multi-token things. It would be kind of nice if it were possible to have a query for "do?s" actually match the a document containing "dogs", even when matching against a stemmed field, but I don't see how to do it without breaking all kinds of other things. Consider how messed up range queries would get. [dogs TO *] would match doge, doggone, and other words in [dog TO dogs] which would be totally counterintuitive.

        Show
        Mike Sokolov added a comment - I spoke hastily, and it's true that stemmers are different from those other multi-token things. It would be kind of nice if it were possible to have a query for "do?s" actually match the a document containing "dogs", even when matching against a stemmed field, but I don't see how to do it without breaking all kinds of other things. Consider how messed up range queries would get. [dogs TO *] would match doge, doggone, and other words in [dog TO dogs] which would be totally counterintuitive.
        Hide
        Robert Muir added a comment -

        I guess it begs the question of what use adding stemmers to the mix would be.

        I agree with Mike.

        Most stemmers are basically suffix-strippers and use heuristics like term length. They are not going to work with the syntax of various multitermqueries. no stemmer is going to stem dogs* to dog*. some might remove any non-alpha characters completely, and its not a bug that they do this. they are heuristical in nature and designed to work on natural language text... not syntax.

        Show
        Robert Muir added a comment - I guess it begs the question of what use adding stemmers to the mix would be. I agree with Mike. Most stemmers are basically suffix-strippers and use heuristics like term length. They are not going to work with the syntax of various multitermqueries. no stemmer is going to stem dogs* to dog*. some might remove any non-alpha characters completely, and its not a bug that they do this. they are heuristical in nature and designed to work on natural language text... not syntax.
        Hide
        Erick Erickson added a comment -

        Not synonyms - agreed.
        Not shinglers - agreed.
        Not anything that might produce multiple tokens - agreed.

        Stemmers... When do stemmers produce multiple tokens? My ignorance of all the possibilities knows no bounds. I was wondering if, in this case, stemmers really reduced to prefix queries. Maybe it's just a bad idea altogether, I guess it begs the question of what use adding stemmers to the mix would be. You want to match the root, just specify the root with an asterisk and be done with it. No need to introduce stemming into the MultiTermAwareComponent mix.

        But this kind of question is exactly why I have this JIRA in place, we can collect reasons I wouldn't think of and record them. Before I mess something up with well-intentioned-but-wrong "help".

        Show
        Erick Erickson added a comment - Not synonyms - agreed. Not shinglers - agreed. Not anything that might produce multiple tokens - agreed. Stemmers... When do stemmers produce multiple tokens? My ignorance of all the possibilities knows no bounds. I was wondering if, in this case, stemmers really reduced to prefix queries. Maybe it's just a bad idea altogether, I guess it begs the question of what use adding stemmers to the mix would be. You want to match the root, just specify the root with an asterisk and be done with it. No need to introduce stemming into the MultiTermAwareComponent mix. But this kind of question is exactly why I have this JIRA in place, we can collect reasons I wouldn't think of and record them. Before I mess something up with well-intentioned-but-wrong "help".
        Hide
        Mike Sokolov added a comment -

        No not stemmers. Not synonyms, not shinglers or anything that might produce multiple tokens.

        Show
        Mike Sokolov added a comment - No not stemmers. Not synonyms, not shinglers or anything that might produce multiple tokens.
        Hide
        Erick Erickson added a comment -

        Hmmm, what about stemmers? Prefix only?

        Show
        Erick Erickson added a comment - Hmmm, what about stemmers? Prefix only?

          People

          • Assignee:
            Erick Erickson
            Reporter:
            Erick Erickson
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development