Lucene - Core / LUCENE-4072

CharFilter that Unicode-normalizes input

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I'd like to contribute a CharFilter that Unicode-normalizes input with ICU4J.

      The benefit of doing this in a CharFilter is that the tokenizer can work on normalized text, while offset correction ensures that the fast vector highlighter and other offset-dependent features do not break.
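
      To illustrate the offset-correction idea, here is a minimal, self-contained sketch. It is not the contributed implementation and does not use ICU4J or the Lucene CharFilter API: it substitutes the JDK's java.text.Normalizer, and the class NormalizedText and its methods are hypothetical names chosen for this example. As a simplification, it normalizes each code point on its own with NFKC, so it never composes across code points the way a real streaming normalizer must.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Lucene or ICU4J. */
final class NormalizedText {
    /** The normalized text a tokenizer would see. */
    final String text;

    /**
     * originalOffsets[i] is the offset in the original input of the code point
     * that produced normalized char i. The final entry is the input length, so
     * a token's end offset can be corrected as well.
     */
    private final int[] originalOffsets;

    private NormalizedText(String text, int[] originalOffsets) {
        this.text = text;
        this.originalOffsets = originalOffsets;
    }

    /** Normalizes each code point independently (a simplification) with NFKC. */
    static NormalizedText of(String input) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            String piece = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFKC);
            // Every char produced from this code point maps back to offset i.
            for (int k = 0; k < piece.length(); k++) {
                offsets.add(i);
            }
            out.append(piece);
            i += Character.charCount(cp);
        }
        offsets.add(input.length());

        int[] map = new int[offsets.size()];
        for (int k = 0; k < map.length; k++) {
            map[k] = offsets.get(k);
        }
        return new NormalizedText(out.toString(), map);
    }

    /** Maps an offset in the normalized text back to the original input. */
    int correctOffset(int normalizedOffset) {
        return originalOffsets[normalizedOffset];
    }
}
```

      For the input "\uFB01ne" (the "fi" ligature followed by "ne"), the normalized text is "fine". A token covering normalized offsets 0 to 4 is corrected to original offsets 0 to 3, which is the span a highlighter would need to mark in the stored text.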

      The implementation is available in the following repository:
      https://github.com/ippeiukai/ICUNormalizer2CharFilter

      Unfortunately, this is an unpaid side project of mine, and I cannot spend much time merging my work into Lucene to produce a proper patch. I'd appreciate it if anyone could give it a go. I'm happy to relicense it under whatever terms meet your needs.

      1. LUCENE-4072.patch
        17 kB
        Robert Muir
      2. 4072.patch
        16 kB
        David Goldfarb
      3. 4072.patch
        16 kB
        David Goldfarb
      4. LUCENE-4072.patch
        16 kB
        Robert Muir
      5. LUCENE-4072.patch
        14 kB
        David Goldfarb
      6. LUCENE-4072.patch
        14 kB
        David Goldfarb
      7. LUCENE-4072.patch
        14 kB
        Robert Muir
      8. LUCENE-4072.patch
        11 kB
        David Goldfarb
      9. DebugCode.txt
        5 kB
        Ippei UKAI
      10. LUCENE-4072.patch
        16 kB
        Robert Muir
      11. ippeiukai-ICUNormalizer2CharFilter-4752cad.zip
        15 kB
        Ippei UKAI


          People

          • Assignee:
            Unassigned
            Reporter:
            Ippei UKAI
          • Votes:
            0
            Watchers:
            5
