Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5, 3.1, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      This patch adds support for Unicode collation (searching and sorting).
      Unicode collation is helpful in a search engine because, for many languages, you want things to match or sort differently.
      You might even want to use copyField and support different sort orders/matching schemes if you need to support multiple languages.
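
      To illustrate why sort order depends on collation, here is a minimal sketch using only the JDK's java.text.Collator. The class name, method names, and sample words are assumptions made for illustration; none of them come from the patch.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of SOLR-1571.
class CollationOrderDemo {

    // Sorts by raw UTF-16 code units, which is what String.compareTo does.
    static List<String> sortByCodeUnits(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Sorts with the JDK collator for an explicitly chosen locale.
    static List<String> sortByCollator(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, collator);
        return copy;
    }
}
```

      For the words "Zebra", "apfel" and "\u00c4pfel", plain String ordering puts "Zebra" first, because uppercase Z has a lower code unit than lowercase a, while a German collator puts "Zebra" last.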

      This is simply a factory for Lucene's CollationKeyFilter, which indexes binary collation keys in a special format that preserves binary sort order.
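
      The idea behind binary collation keys can be sketched with the JDK alone: a CollationKey is turned into bytes, and comparing those bytes as unsigned values gives the same ordering as the collator. This is only an illustration of the principle. The class and method names are assumptions, and the index-safe encoding that Lucene's filter applies on top of the raw bytes is not reproduced here.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class, not part of SOLR-1571 or Lucene.
class CollationKeyBytesDemo {

    // Raw collation key bytes for a piece of text (most significant byte first).
    static byte[] keyBytes(Collator collator, String text) {
        return collator.getCollationKey(text).toByteArray();
    }

    // Lexicographic comparison that treats each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int shared = Math.min(a.length, b.length);
        for (int i = 0; i < shared; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // One key is a prefix of the other: the shorter key sorts first.
        return a.length - b.length;
    }
}
```

      Because the ordering is carried by the bytes themselves, sorting at search time needs no Collator, only a byte-wise comparison of the indexed terms.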

      I've added support for creating a Collator in two ways:

      • system collator from a Locale spec (language + country + variant)
      • tailored collator from custom rules in a text file
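
      The two ways of obtaining a Collator listed above might look roughly like this with the JDK API. This is a sketch under assumptions: the class name, method names, and the UTF-8 file encoding are made up for illustration, and the patch's own factory code may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical helper class, not part of SOLR-1571.
class CollatorSources {

    // System collator from an explicit locale spec (language + country + variant).
    // The three-argument Locale constructor is deprecated in recent JDKs but still works.
    @SuppressWarnings("deprecation")
    static Collator fromLocale(String language, String country, String variant) {
        return Collator.getInstance(new Locale(language, country, variant));
    }

    // Tailored collator from a rule string in java.text.RuleBasedCollator syntax.
    static Collator fromRules(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules", e);
        }
    }

    // Tailored collator from rules stored in a text file (UTF-8 is an assumption).
    static Collator fromRulesFile(Path file) {
        try {
            byte[] raw = Files.readAllBytes(file);
            return fromRules(new String(raw, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      For example, CollatorSources.fromRules("< c < b < a") yields a collator that sorts "c" before "a".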

      There is deliberately no option to use the "default" locale of the JVM (I consider this a bit dangerous).
      In this patch, it is mandatory to define the locale explicitly for a system collator.
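
      One way to enforce an explicit locale is to fail fast when no language is configured, instead of falling back to Locale.getDefault(). The sketch below assumes hypothetical argument names ("language", "country", "variant") and a hypothetical helper class; the patch's actual parameter names are not stated in this issue.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of SOLR-1571.
class ExplicitLocale {

    // Builds a Locale from configuration arguments and never consults the JVM default.
    @SuppressWarnings("deprecation")
    static Locale requireLocale(Map<String, String> args) {
        String language = args.get("language");
        if (language == null || language.isEmpty()) {
            // Refuse to guess: a silent fallback would make the index depend on the host JVM.
            throw new IllegalArgumentException("A language must be specified explicitly");
        }
        String country = args.getOrDefault("country", "");
        String variant = args.getOrDefault("variant", "");
        return new Locale(language, country, variant);
    }
}
```

      Rejecting the configuration up front keeps the index from silently changing behavior when the same setup runs on a machine with a different default locale.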

      The required lucene-collation-2.9.1.jar is only 12KB.

      1. SOLR-1571.patch
        12 kB
        Robert Muir

          Activity

          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Shalin Shekhar Mangar added a comment -

          Committed revision 885338.

          Thanks Robert!

          Shalin Shekhar Mangar added a comment -

          "Shalin, yes, I think the ICUCollationFilter is much better (faster and smaller index, more languages), but it should be a separate factory imo.
          I figured I would start with the JDK impl., since it has no external dependency and is the simplest."

          Sure, sounds good. I'll commit this soon.

          Robert Muir added a comment -

          Shalin, yes, I think the ICUCollationFilter is much better (faster and smaller index, more languages), but it should be a separate factory imo.
          I figured I would start with the JDK impl., since it has no external dependency and is the simplest.

          The ICU impl. has slightly different options and behavior. I also don't much like doing something fancy such as detecting which impl. to use via reflection: if the ICU jar file were no longer on the classpath, or its version changed, things could suddenly and silently stop working correctly.

          Shalin Shekhar Mangar added a comment -

          I tried the patch. All tests pass.

          You know more about this topic than I do so if you feel ICUCollationFilter should be a separate issue, that is fine with me. As far as this patch is concerned, it is well baked and I'd be happy to commit it.

          Robert Muir added a comment - edited

          Hi, I wonder if anyone has any comments on this.

          I know this is an invisible/covert JIRA issue right now.

          I am especially curious whether the approach is sound, particularly regarding using the ICUCollationFilter instead.
          In my opinion, ICU support should be a separate integration, even though it would index significantly faster and with much smaller keys.
          The reason is that it is not compatible with the JDK collation keys and has different properties, such as the fact that Collator is thread-safe in the JDK but not in ICU.
          Because of this, I decided to stick with the JDK impl. initially.

          Robert Muir added a comment -

          initial patch.


            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Robert Muir
            • Votes: 4
            • Watchers: 1
