Description
Lucene 3.3 (possibly 3.1 onwards) exhibits less than great behaviour for tokenising hiragana, if combining marks are in use.
Here's a unit test:
@Test public void testHiraganaWithCombiningMarkDakuten() throws Exception { // Hiragana 'S' following by the combining mark dakuten TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099")); // Should be kept together. List<String> expectedTokens = Arrays.asList("\u3055\u3099"); List<String> actualTokens = new LinkedList<String>(); CharTermAttribute term = stream.addAttribute(CharTermAttribute.class); while (stream.incrementToken()) { actualTokens.add(term.toString()); } assertEquals("Wrong tokens", expectedTokens, actualTokens); }
This code fails with:
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
It seems as if the tokeniser is throwing away the combining mark entirely.
3.0's behaviour was also undesirable:
java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
But at least the token was there, so it was possible to write a filter to work around the issue.
Katakana seems to be avoiding this particular problem, because all katakana and combining marks found in a single run seem to be lumped into a single token (this is a problem in its own right, but I'm not sure if it's really a bug.)
Attachments
Attachments
Issue Links
- is related to
-
LUCENE-3361 port url+email tokenizer to standardtokenizerinterface (or similar)
- Closed