Lucene 3.3 (possibly 3.1 onwards) mishandles tokenisation of hiragana when combining marks are in use.
Here's a unit test:
This code fails with:
It seems as if the tokeniser is throwing away the combining mark entirely.
3.0's behaviour was also undesirable:
But at least the token was there, so it was possible to write a filter to work around the issue.
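One workaround of the kind described (a sketch, not the original filter) is to canonically compose the text before tokenisation, so the combining voiced sound mark is folded into the base letter instead of being emitted, or dropped, as a separate token. This uses only the JDK's `java.text.Normalizer`; the class and method names are illustrative:

```java
import java.text.Normalizer;

public class CombiningMarkWorkaround {
    // NFC-compose decomposed kana so the combining katakana-hiragana
    // voiced sound mark (U+3099) is folded into the preceding base
    // letter before the text ever reaches the tokeniser.
    public static String compose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // HIRAGANA LETTER KA (U+304B) + COMBINING VOICED SOUND MARK (U+3099)
        String decomposed = "\u304B\u3099";
        String composed = compose(decomposed);
        // NFC folds the pair into HIRAGANA LETTER GA (U+304C)
        System.out.println(composed.equals("\u304C")); // true
    }
}
```

In a real analyzer chain this normalisation would have to run as a CharFilter (i.e. before the tokeniser), since by the time a TokenFilter sees the stream the combining mark has already been lost.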
Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run are lumped into one token (a problem in its own right, though I'm not sure it's really a bug).
|Fix Version/s||3.4 [ 12316675 ]|
|Status||Open [ 1 ]||Resolved [ 5 ]|
|Assignee||Robert Muir [ rcmuir ]|
|Fix Version/s||4.0 [ 12314025 ]|
|Resolution||Fixed [ 1 ]|
|Status||Resolved [ 5 ]||Closed [ 6 ]|
|Transition|Time In Source Status|Execution Times|Last Executer|Last Execution Date|
|Open → Resolved|1d 15h 58m|1|Robert Muir|04/Aug/11 22:07|
|Resolved → Closed|114d 15h 24m|1|Uwe Schindler|27/Nov/11 12:31|