  1. Lucene - Core
  2. LUCENE-5110

DefaultICUTokenizerConfig should use the default ICU behavior for the Khmer script

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0
    • Fix Version/s: None
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Recent versions of ICU ship their own implementation for tokenizing the Khmer script, so Lucene should no longer override ICU's behavior.

      I haven't tested this, but the patch should look something like the following:

      $ diff DefaultICUTokenizerConfig.java.orig DefaultICUTokenizerConfig.java
      67,68d66
      < private static final BreakIterator thaiBreakIterator =
      < BreakIterator.getWordInstance(new ULocale("th_TH"));
      71,72d68
      < private static final BreakIterator khmerBreakIterator =
      < readBreakIterator("Khmer.brk");
      87d82
      < case UScript.THAI: return (BreakIterator)thaiBreakIterator.clone();
      89d83
      < case UScript.KHMER: return (BreakIterator)khmerBreakIterator.clone();

      The Khmer.* files should be removed as well. ICU already does script-specific tokenization these days, so the Thai override should not be needed either, as of ICU 50.
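
      The dispatch that remains after such a patch can be sketched roughly as follows. This is a JDK-only stand-in, not the Lucene code: it assumes `java.text.BreakIterator` and `Character.UnicodeScript` in place of ICU4J's `com.ibm.icu.text.BreakIterator` and `UScript`, and the class name `ScriptBreakIteratorSketch`, the `OVERRIDES` table, and the `hasOverride` helper are hypothetical. The point is only that, with no Thai or Khmer entry registered, both scripts fall through to a clone of the default word iterator.

```java
import java.text.BreakIterator;
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

// JDK-only stand-in for the per-script dispatch in DefaultICUTokenizerConfig.
// Assumption: the real class uses ICU4J's BreakIterator and UScript codes instead.
final class ScriptBreakIteratorSketch {
    // Shared prototype; callers always get a clone because BreakIterator is stateful.
    private static final BreakIterator DEFAULT_ITERATOR =
        BreakIterator.getWordInstance(Locale.ROOT);

    // Hypothetical override table. After the proposed change it holds
    // no THAI or KHMER entry, so those scripts use the default iterator.
    private static final Map<Character.UnicodeScript, BreakIterator> OVERRIDES =
        new EnumMap<>(Character.UnicodeScript.class);

    static boolean hasOverride(Character.UnicodeScript script) {
        return OVERRIDES.containsKey(script);
    }

    static BreakIterator getBreakIterator(Character.UnicodeScript script) {
        BreakIterator prototype = OVERRIDES.getOrDefault(script, DEFAULT_ITERATOR);
        return (BreakIterator) prototype.clone();
    }

    public static void main(String[] args) {
        String text = "hello world";
        // Pick the iterator from the script of the first code point.
        Character.UnicodeScript script = Character.UnicodeScript.of(text.codePointAt(0));
        BreakIterator it = getBreakIterator(script);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println("[" + text.substring(start, end) + "]");
        }
    }
}
```

      Returning a clone per call mirrors the `clone()` calls in the diff above: a break iterator carries its own text and position, so a shared instance should not be handed out directly.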

            People

            • Assignee:
              Unassigned
            • Reporter:
              George Rhoten (grhoten)
