Lucene 3.3 (and possibly every release from 3.1 onwards) tokenises hiragana incorrectly when combining marks are in use.
Here's a unit test:
    public void testHiraganaWithCombiningMarkDakuten() throws Exception {
        // Input is U+3055 (hiragana SA) followed by U+3099 (combining dakuten).
        TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
        List<String> expectedTokens = Arrays.asList("\u3055\u3099");
        List<String> actualTokens = new LinkedList<String>();
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        while (stream.incrementToken()) {
            actualTokens.add(term.toString());
        }
        assertEquals("Wrong tokens", expectedTokens, actualTokens);
    }
This code fails with:
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
It seems as if the tokeniser is throwing away the combining mark entirely.
3.0's behaviour was also undesirable:
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
But at least the token was there, so it was possible to write a filter to work around the issue.
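For reference, the merging step such a workaround filter would need on 3.0 output can be sketched in plain Java. This is only a rough illustration: CombiningMarkMerger and its methods are hypothetical names, not Lucene classes, and the sketch works on a plain list of strings rather than a TokenStream. A real filter would presumably also want to check that the two tokens are adjacent in the source text before joining them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: it shows only the merging logic
// that a workaround filter for the 3.0 behaviour would need.
class CombiningMarkMerger {

    // True if the token is non-empty and consists solely of combining marks
    // (Unicode general categories Mn, Mc and Me).
    static boolean isAllCombiningMarks(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK
                    && type != Character.COMBINING_SPACING_MARK
                    && type != Character.ENCLOSING_MARK) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Appends each mark-only token to the token before it. A mark-only token
    // with nothing before it is kept unchanged.
    static List<String> merge(List<String> tokens) {
        List<String> merged = new ArrayList<String>();
        for (String token : tokens) {
            if (!merged.isEmpty() && isAllCombiningMarks(token)) {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + token);
            } else {
                merged.add(token);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The two tokens that 3.0 produced for the test input above.
        List<String> fromLucene30 = Arrays.asList("\u3055", "\u3099");
        System.out.println(merge(fromLucene30).equals(Arrays.asList("\u3055\u3099")));
    }
}
```

No equivalent is possible against 3.3 output, because the mark never reaches the token stream.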
Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run appear to be lumped into a single token (this is a problem in its own right, but I'm not sure whether it's really a bug).