Affects Version/s: 4.0
Fix Version/s: None
Recent versions of ICU have their own implementation for the tokenization of the Khmer script. Lucene should not be overriding ICU's behavior any more.
I haven't tested this, but the patch should look something like the following:
$ diff DefaultICUTokenizerConfig.java.orig DefaultICUTokenizerConfig.java
< private static final BreakIterator thaiBreakIterator =
< BreakIterator.getWordInstance(new ULocale("th_TH"));
< private static final BreakIterator khmerBreakIterator =
< case UScript.THAI: return (BreakIterator)thaiBreakIterator.clone();
< case UScript.KHMER: return (BreakIterator)khmerBreakIterator.clone();
The Khmer.* files should be removed as well. ICU already does script-specific tokenization these days, so the Thai override should not be needed either since ICU 50.
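To illustrate the shape of the config once the per-script overrides are gone, here is a minimal sketch. It is not the actual Lucene class: it uses the JDK's java.text.BreakIterator as a stand-in for ICU4J's BreakIterator so that it compiles without extra dependencies, and the class name DefaultOnlyConfigSketch, the helper words(), and the plain int script code are all hypothetical. The point is only that every script falls through to a clone of one shared default iterator, and the underlying library is left to do any script-specific work.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a config with no Thai/Khmer special cases.
// Assumption: java.text.BreakIterator replaces ICU4J's class here purely
// to keep the sketch self-contained.
final class DefaultOnlyConfigSketch {
    // Shared prototype. BreakIterator is stateful, so callers only ever
    // receive clones of it, never the prototype itself.
    private static final BreakIterator DEFAULT_BREAK_ITERATOR =
        BreakIterator.getWordInstance(Locale.ROOT);

    // The script code is deliberately ignored: there is no per-script switch,
    // so every script gets a fresh clone of the default iterator.
    static BreakIterator getBreakIterator(int scriptCode) {
        return (BreakIterator) DEFAULT_BREAK_ITERATOR.clone();
    }

    // Splits text at word boundaries and drops whitespace-only pieces.
    static List<String> words(String text, int scriptCode) {
        BreakIterator it = getBreakIterator(scriptCode);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }
}
```

Calling DefaultOnlyConfigSketch.words("hello world", 0) goes through the same default path that Thai or Khmer text would take in this sketch. How well that default segments those scripts depends entirely on the library underneath, which is the behavior this issue proposes to rely on.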