Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      While there are separate Case Folding, Normalization, and Ignorable-removal filters in LUCENE-1488,
      the new ICU Normalizer2 API does this all at once with nfkc_cf (based on the new NFKC_Casefold property in Unicode).

      This is great, because it provides a ton of Unicode functionality that is really needed.
      And the new Normalizer2 API takes CharSequence and writes to Appendable...
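      To make the "all at once" idea concrete, here is a rough sketch that uses only the JDK's java.text.Normalizer, not ICU. It is an approximation and an assumption on my part, not the patch: it applies NFKC and then locale-independent lower-casing, whereas nfkc_cf performs full case folding (so it would also turn ß into ss) and removes default-ignorable characters such as the soft hyphen. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical stand-in: approximates nfkc_cf with JDK classes only.
// Unlike ICU's nfkc_cf it does NOT do full case folding (ß stays ß)
// and does NOT remove default-ignorable code points.
class NfkcFoldSketch {

    // Mirrors the "CharSequence in, buffer out" shape described above.
    static void approxFold(CharSequence src, StringBuilder dest) {
        // 1. Compatibility normalization (e.g. fullwidth letters, ligatures).
        String nfkc = Normalizer.normalize(src, Normalizer.Form.NFKC);
        // 2. Locale-independent lower-casing as a stand-in for case folding.
        String lowered = nfkc.toLowerCase(Locale.ROOT);
        // 3. Normalize again, since case mapping can in principle denormalize.
        dest.append(Normalizer.normalize(lowered, Normalizer.Form.NFKC));
    }

    // Convenience wrapper returning a String.
    static String approxFold(CharSequence src) {
        StringBuilder sb = new StringBuilder(src.length());
        approxFold(src, sb);
        return sb.toString();
    }
}
```

      For example, NfkcFoldSketch.approxFold("\uFF21\uFF22") maps the fullwidth letters to plain ASCII and lower-cases them, and the ligature "\uFB01" becomes the two letters f and i.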

          Activity

          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Robert Muir added a comment -

          backported to 3.x, rev 941689

          Robert Muir added a comment -

          Committed revision 935186.

          We can discuss later how to expose this in Solr (maybe an ICU contrib for now?), as I want its faster Collation filter exposed there too.

          Robert Muir added a comment -

          I made the filter non-final and made only incrementToken final instead.

          This way we can implement things like LUCENE-1343, which wants to remove accents in a way that respects normalization (e.g. removes both decomposed and composed forms).

          So we can just extend this filter, pass a statically loaded InputStream for the .nrm file to its constructor, and be done with it.
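          The extension pattern described above can be sketched without Lucene or ICU. Everything here is a hypothetical stand-in: the base class plays the role of the filter, its final process method plays the role of incrementToken, and a small static map replaces the data that the real filter would read from a .nrm InputStream.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the non-final filter.
class MappingFilterSketch {
    private final Map<Character, String> mappings;

    // Subclasses supply the mapping data through the constructor.
    MappingFilterSketch(Map<Character, String> mappings) {
        this.mappings = mappings;
    }

    // final, like incrementToken: subclasses change the data, not the loop.
    final String process(CharSequence term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String replacement = mappings.get(c);
            if (replacement == null) {
                out.append(c);           // unmapped character passes through
            } else {
                out.append(replacement); // mapped (possibly to nothing)
            }
        }
        return out.toString();
    }
}

// Hypothetical subclass: loads its data once, statically, and passes it up.
class AccentStrippingSketch extends MappingFilterSketch {
    private static final Map<Character, String> ACCENT_DATA = loadAccentData();

    private static Map<Character, String> loadAccentData() {
        Map<Character, String> m = new HashMap<>();
        m.put('\u00E9', "e"); // composed e with acute accent
        m.put('\u0301', "");  // combining acute accent (decomposed form)
        return m;
    }

    AccentStrippingSketch() {
        super(ACCENT_DATA);
    }
}
```

          With this toy data, both the composed "caf\u00E9" and the decomposed "cafe\u0301" come out as "cafe", which is the "respects normalization" property the comment asks for.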

          Robert Muir added a comment -

          I added some additional javadocs to try to explain the default behavior (nfkc_cf)

          Robert Muir added a comment -

          "I know, you were running the test without assertion from Eclipse!"

          Yes! So the assertion here worked perfectly!

          "So great and I am happy about the cool interfaces at CharTermAttribute."

          Yes, I'm really excited about this.
          Besides normalization alone, we get "best practice" case folding (see the German and Greek examples in the test), normalization, and ignorable removal all in one simple filter. Users can also write their own .txt files for special mappings, run them through a tool, and use this filter with high performance:

          http://site.icu-project.org/design/normalization/custom
          http://userguide.icu-project.org/transforms/normalization

          Uwe Schindler added a comment -

          I know, you were running the test without assertion from Eclipse!

          [junit] TokenStream implementation classes or at least their incrementToken() implementation must be final
          [junit] junit.framework.AssertionFailedError: TokenStream implementation classes or at least their incrementToken() implementation must be final
          [junit]     at org.apache.lucene.analysis.TokenStream.assertFinal(TokenStream.java:117)
          

          So for me the assertion worked. The second patch of course works with icu-4_4.jar! So great and I am happy about the cool interfaces at CharTermAttribute.

          I just wanted to check that my deputy sheriff did not miss something because of wrong instructions.
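          For readers unfamiliar with the check behind that junit message, here is a minimal sketch of the rule it states: a class passes if the class itself is final or if its incrementToken() is final. This is an assumption about the shape of the check, written with hypothetical stand-in classes; the actual logic in TokenStream.assertFinal may differ in detail. Because the real check runs inside a Java assert, it is skipped when the JVM runs without -ea, which is consistent with the test passing in an IDE run that had assertions disabled.

```java
import java.lang.reflect.Modifier;

// Hypothetical stand-ins for a token stream hierarchy.
abstract class StreamSketch {
    public abstract boolean incrementToken();
}

final class FinalClassFilter extends StreamSketch {
    public boolean incrementToken() { return false; }
}

class FinalMethodFilter extends StreamSketch {
    public final boolean incrementToken() { return false; }
}

class OpenFilter extends StreamSketch {
    public boolean incrementToken() { return false; }
}

class FinalCheckSketch {
    // True if the class is final, or its public incrementToken() is final.
    static boolean classOrMethodIsFinal(Class<?> clazz) {
        if (Modifier.isFinal(clazz.getModifiers())) {
            return true;
        }
        try {
            return Modifier.isFinal(
                clazz.getMethod("incrementToken").getModifiers());
        } catch (NoSuchMethodException e) {
            return false; // no such public method: treat as failing the rule
        }
    }
}
```

          Under this sketch, FinalClassFilter and FinalMethodFilter pass, while OpenFilter fails.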

          Uwe Schindler added a comment -

          Hurrah! You used the StringBuilder as a buffer so that no new String instance is created each time; only the buffer needs to be copied. This could also be a good trick for the PatternReplaceFilter from Solr.

          "I made this filter final, to avoid a ticket from the policeman."

          How did you get the filter through the assert statement without final? Strange...
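          The buffer trick praised above can be illustrated in isolation. This is a minimal sketch under my own assumptions, not the patch: the class name is hypothetical, lower-casing stands in for whatever rewriting the filter does, and a plain char[] stands in for the term attribute's buffer. The point is that one StringBuilder is reset and reused for every term, and its contents are copied straight into the char[] without creating an intermediate String.

```java
// Hypothetical stand-in showing StringBuilder reuse across terms.
class ReusableBufferSketch {
    private final StringBuilder scratch = new StringBuilder(); // reused
    private char[] termBuffer = new char[16];                  // stand-in for the term buffer
    private int termLength;

    void rewrite(CharSequence term) {
        scratch.setLength(0); // reset instead of allocating a new builder
        for (int i = 0; i < term.length(); i++) {
            scratch.append(Character.toLowerCase(term.charAt(i)));
        }
        if (scratch.length() > termBuffer.length) {
            termBuffer = new char[scratch.length()]; // grow only when needed
        }
        termLength = scratch.length();
        // Copy characters directly; no temporary String is created here.
        scratch.getChars(0, termLength, termBuffer, 0);
    }

    // For inspection only; a real filter would hand out the buffer itself.
    String current() {
        return new String(termBuffer, 0, termLength);
    }
}
```

          Calling rewrite repeatedly on the same instance reuses both the builder and, when it is large enough, the character array.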

          Robert Muir added a comment -

          I made this filter final, to avoid a ticket from the policeman.

          Robert Muir added a comment -

          This patch is so simple: one filter instead of 3 hairy tokenfilters.

          I would like to commit tomorrow (upgrading our icu.jar in contrib/icu to 4.4), unless there are any objections.


            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir