Solr / SOLR-2921

Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels: None
    • Environment: All

      Description

      SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that applies the right transformations to the individual terms of a multi-term query at query time. Examples are lowercasing, accent folding, etc. As of 27-Nov-2011, the following classes implement MultiTermAwareComponent:

      • ASCIIFoldingFilterFactory
      • LowerCaseFilterFactory
      • LowerCaseTokenizerFactory
      • MappingCharFilterFactory
      • PersianCharFilterFactory

      When users put any of the above in their query analyzer, Solr will "do the right thing" at query time, and the perennial user question, "Why didn't my wildcard query automatically lowercase (or accent-fold, or ...) my terms?" will be gone. Die question die!
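To make the idea concrete, here is a minimal sketch of how a "multiterm" chain might be assembled from a query analysis chain. It is not the Solr implementation: `TermStep`, `LowerCaseStep`, `AccentFoldStep`, `PluralStripStep` and `assembleMultiTermChain` are hypothetical stand-ins, and the `MultiTermAwareComponent` interface is re-declared locally on the assumption that it exposes a single `getMultiTermComponent()` method.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins throughout; none of these are the real Solr classes.
class MultiTermAssemblySketch {

    /** Keeps only the steps that declare themselves multi-term aware. */
    static List<TermStep> assembleMultiTermChain(List<TermStep> queryChain) {
        List<TermStep> multiTerm = new ArrayList<>();
        for (TermStep step : queryChain) {
            if (step instanceof MultiTermAwareComponent) {
                Object c = ((MultiTermAwareComponent) step).getMultiTermComponent();
                if (c instanceof TermStep) {
                    multiTerm.add((TermStep) c);
                }
            }
        }
        return multiTerm;
    }

    /** Runs a single term through every step of the chain, in order. */
    static String analyze(List<TermStep> chain, String term) {
        String out = term;
        for (TermStep step : chain) {
            out = step.apply(out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<TermStep> queryChain = Arrays.asList(
                new LowerCaseStep(), new AccentFoldStep(), new PluralStripStep());
        List<TermStep> multiTerm = assembleMultiTermChain(queryChain);

        // "Caf\u00e9s" stands for the literal part of a wildcard query.
        System.out.println(analyze(queryChain, "Caf\u00e9s")); // full chain: cafe
        System.out.println(analyze(multiTerm, "Caf\u00e9s"));  // multiterm chain: cafes
    }
}

// Assumed shape of the interface introduced by SOLR-2438.
interface MultiTermAwareComponent {
    Object getMultiTermComponent();
}

// Hypothetical model of one analysis step that rewrites a single term.
interface TermStep {
    String apply(String term);
}

class LowerCaseStep implements TermStep, MultiTermAwareComponent {
    public String apply(String term) { return term.toLowerCase(Locale.ROOT); }
    public Object getMultiTermComponent() { return this; }
}

class AccentFoldStep implements TermStep, MultiTermAwareComponent {
    public String apply(String term) {
        // Decompose accented characters, then drop the combining marks.
        return Normalizer.normalize(term, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }
    public Object getMultiTermComponent() { return this; }
}

// A crude plural stripper; deliberately NOT multi-term aware.
class PluralStripStep implements TermStep {
    public String apply(String term) {
        return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
    }
}
```

In this sketch the lowercasing and accent-folding steps carry over to the multiterm chain, while the plural stripper does not, so a wildcard term is normalized without being stemmed.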

      A quick look at the FilterFactories that exist, for instance, turns up a number that might be good candidates for implementing MultiTermAwareComponent, but I don't understand the correct behavior well enough to know whether they should implement the interface or not. And this doesn't include other CharFilters or Tokenizers.

      Actually implementing the interface is often trivial; see the classes above for examples. Note that LowerCaseTokenizerFactory returns a Filter, which is the right thing in this case.
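As a rough illustration of how small the change can be, the sketch below shows the two patterns mentioned above: a filter factory that returns itself, and a tokenizer factory that returns a filter instead. The `Sketch*` classes are hypothetical stand-ins rather than the real Solr factories, and the interface is re-declared locally on the assumption that it has a single `getMultiTermComponent()` method.

```java
import java.util.Locale;

// Hypothetical stand-ins; the real factories live in Solr and wrap Lucene classes.
class MultiTermImplSketch {
    public static void main(String[] args) {
        SketchLowerCaseFilterFactory filter = new SketchLowerCaseFilterFactory();
        SketchLowerCaseTokenizerFactory tokenizer = new SketchLowerCaseTokenizerFactory();

        // The filter factory hands back itself.
        System.out.println(filter.getMultiTermComponent() == filter);
        // The tokenizer factory hands back a filter factory, not a tokenizer.
        System.out.println(
                tokenizer.getMultiTermComponent() instanceof SketchLowerCaseFilterFactory);
    }
}

// Assumed shape of the interface introduced by SOLR-2438.
interface MultiTermAwareComponent {
    Object getMultiTermComponent();
}

class SketchLowerCaseFilterFactory implements MultiTermAwareComponent {
    /** Lowercases one term; safe to apply to a multi-term query term. */
    String filter(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    @Override
    public Object getMultiTermComponent() {
        return this; // the trivial case: the component is its own multiterm form
    }
}

class SketchLowerCaseTokenizerFactory implements MultiTermAwareComponent {
    /** Lowercases the text and splits it on runs of non-letter characters. */
    String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
    }

    @Override
    public Object getMultiTermComponent() {
        // Return only the lowercasing behavior, without the splitting.
        return new SketchLowerCaseFilterFactory();
    }
}
```

The tokenizer case returns a filter presumably because a multi-term query term should remain a single token: only the lowercasing half of the tokenizer's behavior is wanted, not the splitting.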

      Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part; just let me know.

      • ArabicNormalizationFilterFactory
      • GreekLowerCaseFilterFactory
      • HindiNormalizationFilterFactory
      • ICUFoldingFilterFactory
      • ICUNormalizer2FilterFactory
      • ICUTransformFilterFactory
      • IndicNormalizationFilterFactory
      • ISOLatin1AccentFilterFactory
      • PersianNormalizationFilterFactory
      • RussianLowerCaseFilterFactory
      • TurkishLowerCaseFilterFactory

        Attachments

        1. SOLR-2921_rest.patch
          14 kB
          Robert Muir
        2. SOLR-2921-3x.patch
          16 kB
          Erick Erickson
        3. SOLR-2921-3x.patch
          17 kB
          Erick Erickson
        4. SOLR-2921-3x.patch
          16 kB
          Erick Erickson
        5. SOLR-2921-trunk.patch
          14 kB
          Erick Erickson

            People

            • Assignee: Erick Erickson
            • Reporter: Erick Erickson