Lucene 3.3 (possibly 3.1 onwards) tokenises hiragana poorly when combining marks are in use.
Here's a unit test:
This code fails with:
It seems as if the tokeniser is throwing away the combining mark entirely.
3.0's behaviour was also undesirable:
But at least the token was there, so it was possible to write a filter to work around the issue.
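For anyone hitting this in the meantime, here is a minimal sketch of a different workaround from the token filter: composing the text with NFC before it reaches the tokeniser, so that a base kana plus a combining mark becomes a single precomposed character. It uses only the JDK (`java.text.Normalizer`, `Character.getType`) and no Lucene API. The class and method names (`CombiningMarkDemo`, `isCombiningMark`, `compose`) are made up for illustration. It assumes the affected input only contains sequences for which Unicode defines a precomposed form; anything else is left decomposed and would presumably still trigger the problem.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: pre-composes combining marks
// so that the tokeniser never sees them as separate code points.
public class CombiningMarkDemo {

    /** Returns true if the code point is a Unicode combining mark (Mn, Mc or Me). */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Applies NFC, merging base + mark pairs where a precomposed character exists. */
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // U+304B (hiragana KA) followed by U+3099 (combining voiced sound mark)
        String decomposed = "\u304B\u3099";
        String composed = compose(decomposed);

        System.out.println("mark is combining: " + isCombiningMark(0x3099));
        System.out.println("code units before: " + decomposed.length());
        System.out.println("code units after:  " + composed.length());
    }
}
```

The composed string could then be passed to the analyser in place of the raw input. Whether that is acceptable depends on whether the index is allowed to differ from the original text in normalisation form.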
Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run appear to be lumped into a single token. (This is a problem in its own right, but I'm not sure whether it's really a bug.)