Lucene - Core / LUCENE-2124

move JDK collation to core, ICU collation to ICU contrib

    Details

    • Lucene Fields:
      New

      Description

      As mentioned on the list, I propose we move the JDK-based CollationKeyFilter/CollationKeyAnalyzer, currently located in contrib/collation, into core for collation support (language-sensitive sorting).

      There is not much code here (the heavy-duty part, IndexableBinaryStringTools, is already in core).

      I would also like to move ICUCollationKeyFilter/ICUCollationKeyAnalyzer (along with the jar file they depend on), also currently located in contrib/collation, into a new contrib/icu.

      This way, we can start looking at integrating other functionality from ICU into a fully fleshed-out icu contrib.
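
      For readers unfamiliar with the "language-sensitive sorting" mentioned above, here is a minimal sketch using only the JDK's java.text.Collator, which the JDK-based components are built on. It does not use any Lucene classes; the class and method names (CollationSortSketch, sortByCodeUnit, sortByCollator) are made up for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or the patch.
final class CollationSortSketch {

    /** Sorts by raw UTF-16 code unit order, which is what String.compareTo uses. */
    static List<String> sortByCodeUnit(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    /** Sorts using the JDK Collator for the given locale. */
    static List<String> sortByCollator(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        Collator collator = Collator.getInstance(locale);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cherry", "Banana", "apple");
        // Code unit order puts uppercase 'B' before lowercase 'a'.
        System.out.println(sortByCodeUnit(words));                 // [Banana, apple, cherry]
        // The English collator orders by letter first, case second.
        System.out.println(sortByCollator(words, Locale.ENGLISH)); // [apple, Banana, cherry]
    }
}
```

      The difference between the two orderings is the gap that collation support is meant to close.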

      1. LUCENE-2124.patch
        19 kB
        Robert Muir
      2. LUCENE-2124.patch
        16 kB
        Robert Muir

          Activity

          Robert Muir added a comment -

          attached is a patch to apply after running the following commands (so you can see the real changes):

          mkdir src/java/org/apache/lucene/collation
          svn add src/java/org/apache/lucene/collation
          mkdir src/test/org/apache/lucene/collation
          svn add src/test/org/apache/lucene/collation
          svn move contrib/collation/src/java/org/apache/lucene/collation/CollationKeyFilter.java src/java/org/apache/lucene/collation
          svn move contrib/collation/src/java/org/apache/lucene/collation/CollationKeyAnalyzer.java src/java/org/apache/lucene/collation
          svn move contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java src/test/org/apache/lucene/collation
          svn move contrib/collation/src/test/org/apache/lucene/collation/TestCollationKeyFilter.java src/test/org/apache/lucene/collation
          svn move contrib/collation/src/test/org/apache/lucene/collation/TestCollationKeyAnalyzer.java src/test/org/apache/lucene/collation
          svn copy contrib/collation/src/java/org/apache/lucene/collation/package.html src/java/org/apache/lucene/collation
          mkdir -p contrib/icu/src/java/org/apache/lucene/collation contrib/icu/src/test/org/apache/lucene/collation contrib/icu/lib
          svn add contrib/icu
          svn move contrib/collation/src/java/org/apache/lucene/collation/ICUCollationKeyAnalyzer.java contrib/icu/src/java/org/apache/lucene/collation
          svn move contrib/collation/src/java/org/apache/lucene/collation/ICUCollationKeyFilter.java contrib/icu/src/java/org/apache/lucene/collation
          svn move contrib/collation/src/java/org/apache/lucene/collation/package.html contrib/icu/src/java/org/apache/lucene/collation
          svn move contrib/collation/src/test/org/apache/lucene/collation/TestICUCollationKeyAnalyzer.java contrib/icu/src/test/org/apache/lucene/collation
          svn move contrib/collation/src/test/org/apache/lucene/collation/TestICUCollationKeyFilter.java contrib/icu/src/test/org/apache/lucene/collation
          svn move contrib/collation/build.xml contrib/collation/pom.xml.template contrib/icu
          svn move contrib/collation/src/java/overview.html contrib/icu/src/java
          svn move contrib/collation/lib/icu4j-collation-4.0.jar contrib/icu/lib
          svn move contrib/collation/lib/ICU-LICENSE.txt contrib/icu/lib
          svn delete contrib/collation
          

          The only real changes I made were slight javadocs/build changes and the removal of testFarsiRangeQueryParsing. That test is covered by several other mechanisms and it introduced a dependency on contrib/misc. I don't feel so bad about taking it out, since the same code is in the examples in the javadoc, so it's not as if this removes the example.

          Robert Muir added a comment -

          Here is a patch (same instructions as before), but with the source changes to the website. I don't include the generated website changes, to reduce confusion.

          All the tests pass. If there are no objections I will commit in a few days.

          Robert Muir added a comment -

          Just a reminder: tomorrow I would like to commit this if no one objects.

          this will move the contrib/collation JDK-based components to core, and later we should consider deprecating the alternatives that are not scalable.

          this will move the contrib/collation ICU-based components to contrib/icu, and this is where I want to bring the Unicode 5.2 support.

          Thanks!

          Steve Rowe added a comment -

          this will move the contrib/collation JDK-based components to core

          +1

          and later we should consider deprecating the alternatives that are not scalable.

          The alternatives don't scale well, true, but they don't result in non-human-readable index terms, either, so for people that need human-readable index terms and who have a low-cardinality term set, maybe we should leave the alternatives in place?

          this will move the contrib/collation ICU-based components to contrib/icu, and this is where I want to bring the Unicode 5.2 support.

          +1
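
          To make the "non-human-readable index terms" point concrete, here is a rough sketch using only the JDK. It shows that a java.text.CollationKey is an opaque byte sequence, and that comparing those bytes as unsigned values gives the same order as the Collator (the JDK documents this for CollationKey.toByteArray). How Lucene then encodes the bytes into index terms is not shown; the class and helper names are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or the patch.
final class CollationKeyBytesSketch {

    /** Returns the raw collation key bytes the collator produces for a string. */
    static byte[] keyBytes(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    /** Compares two byte arrays as unsigned values, most significant byte first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // A shorter array that is a prefix of the other sorts first.
        return a.length - b.length;
    }

    /** Renders bytes as hex, only to show that the key is not readable text. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        byte[] apple = keyBytes(collator, "apple");
        byte[] banana = keyBytes(collator, "Banana");
        System.out.println("apple  -> " + toHex(apple));
        System.out.println("Banana -> " + toHex(banana));
        // Plain byte comparison agrees with the collator: apple sorts before Banana.
        System.out.println(compareUnsigned(apple, banana) < 0);
    }
}
```

          Because the key bytes already carry the sort order, ordinary term comparison can do the work at search time, at the cost of terms that no longer look like the original text.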

          Robert Muir added a comment -

          The alternatives don't scale well, true, but they don't result in non-human-readable index terms, either, so for people that need human-readable index terms and who have a low-cardinality term set, maybe we should leave the alternatives in place?

          Yeah, this is why I thought we can discuss the non-scalable alternatives separately. Maybe we leave them alone, but for now I just want to make progress in contrib on Unicode 5.2 support. We can raise the issue later. I agree the non-scalable alternatives are more user-friendly too, because they work with the core queryparser for TermRangeQuery, etc.

          If we really want to deprecate these non-scalable alternatives in the future, we could consider making further improvements towards collation being a "first class citizen". Similar maybe to what happened with NumericRange. Just not sure how this would work yet...
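
          As a rough illustration of the trade-off discussed above, the sketch below assumes the "non-scalable alternatives" compare terms with a Collator at query time; the thread does not spell out their implementation, so treat this as an assumption. The terms stay human-readable, but every term has to be compared, so the cost grows with the number of distinct terms. The class and method names are hypothetical and only JDK classes are used.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or the patch.
final class CollatedRangeScanSketch {

    /**
     * Returns the terms t with lower <= t <= upper under the collator.
     * Every term is compared, so the work is linear in the number of terms.
     */
    static List<String> termsInRange(List<String> allTerms, Collator collator,
                                     String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String term : allTerms) {
            if (collator.compare(term, lower) >= 0 && collator.compare(term, upper) <= 0) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "Banana", "cherry", "date");
        Collator english = Collator.getInstance(Locale.ENGLISH);
        // "Banana" is inside [b, d] for the collator, although 'B' < 'b' in code unit order.
        System.out.println(termsInRange(terms, english, "b", "d")); // [Banana, cherry]
    }
}
```

          With a low-cardinality term set this full scan is cheap, which matches the case for keeping the alternatives around.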

          Simon Willnauer added a comment -

          Robert, the patch looks good to me!
          Go for it!

          Robert Muir added a comment -

          Committed revision 888780.

          I will keep this open until I regenerate the website and commit the changes.

          Robert Muir added a comment -

          website updated in revision 888803

          Steve Rowe added a comment -

          Robert, I noticed something you missed in the move - here's a trivial patch:

          Index: contrib/icu/src/java/overview.html
          ===================================================================
          --- contrib/icu/src/java/overview.html  (revision 892657)
          +++ contrib/icu/src/java/overview.html  (working copy)
          @@ -34,7 +34,7 @@
             <code>CollationKey</code>s.  <code>icu4j-collation-4.0.jar</code>, 
             a trimmed-down version of <code>icu4j-4.0.jar</code> that contains only the 
             code and data needed to support collation, is included in Lucene's Subversion 
          -  repository at <code>contrib/collation/lib/</code>.
          +  repository at <code>contrib/icu/lib/</code>.
           </p>
           
           <h2>Use Cases</h2>
          
          Robert Muir added a comment -

          Robert, I noticed something you missed in the move - here's a trivial patch:

          thanks for catching this Steven!


            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir