Lucene - Core: LUCENE-3358

StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Lucene 3.3 (and possibly 3.1 onwards) tokenises hiragana incorrectly when combining marks are in use.

      Here's a unit test:

          import java.io.StringReader;
          import java.util.Arrays;
          import java.util.LinkedList;
          import java.util.List;

          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.standard.StandardTokenizer;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
          import org.apache.lucene.util.Version;
          import org.junit.Test;

          import static org.junit.Assert.assertEquals;

          @Test
          public void testHiraganaWithCombiningMarkDakuten() throws Exception
          {
              // Hiragana 'SA' (U+3055) followed by the combining mark dakuten (U+3099)
              TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));

              // The base character and its combining mark should be kept together in one token.
              List<String> expectedTokens = Arrays.asList("\u3055\u3099");
              List<String> actualTokens = new LinkedList<String>();
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              while (stream.incrementToken())
              {
                  actualTokens.add(term.toString());
              }

              assertEquals("Wrong tokens", expectedTokens, actualTokens);
          }
      

      This code fails with:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
      

      It seems as if the tokeniser is throwing away the combining mark entirely.

      3.0's behaviour was also undesirable:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
      

      But at least the token was there, so it was possible to write a filter to work around the issue.

      Katakana seems to avoid this particular problem, because all katakana characters and combining marks found in a single run are lumped into a single token (a problem in its own right, though I'm not sure it's really a bug).
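
      One possible workaround for the dropped dakuten, sketched below, is to NFC-compose the input before it reaches the tokenizer: U+3055 U+3099 composes to the precomposed U+3056 (ざ), so there is no separate combining mark for the tokenizer to throw away. This is illustrative only and a partial fix at best (not every combining sequence has a precomposed form), and it assumes NFC normalisation is acceptable for the data:

          import java.io.StringReader;
          import java.text.Normalizer;

          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.standard.StandardTokenizer;
          import org.apache.lucene.util.Version;

          // Sketch: compose combining sequences before tokenization.
          String raw = "\u3055\u3099"; // さ + combining dakuten, two code points
          String composed = Normalizer.normalize(raw, Normalizer.Form.NFC);
          // composed is now "\u3056" (ざ), a single precomposed character,
          // so the tokenizer has no separate combining mark to discard.
          TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader(composed));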

      Attachments

      1. LUCENE-3358.patch (2 kB, Robert Muir)
      2. LUCENE-3358.patch (82 kB, Robert Muir)


          Activity

          Robert Muir added a comment -

          > It is very unfortunate that the Unicode Consortium somehow ended up with a rule which is, quite frankly, undesirable.

          I'm not concerned about this. While your users may not like it, I think we should stick by the standard, for these reasons:

          1. It's not desirable to deviate from the standard here; anyone can customize the behavior to do what they want.
          2. It's not shown that what you say is true; experiments have been done here (see below), and as a default I would say what is happening here is just fine.
          3. Splitting this katakana up in some non-standard way leaves me with performance concerns about long postings lists for common terms.
          > For the Japanese collection (Table 4), it is not clear whether bigram generation should have
          > been done for both Kanji and Katakana characters (left part) or only for Kanji characters
          > (right part of Table 4). When using title-only queries, the Okapi model provided the best
          > mean average precision of 0.2972 (bigram on Kanji only) compared to 0.2873 when
          > generating bigrams on both Kanji and Katakana. This difference is rather small, and is even
          > smaller in the opposite direction for long queries (0.3510 vs. 0.3523). Based on these results
          > we cannot infer that for the Japanese language one indexing procedure is always significantly
          > better than another.
          

          http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6738
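
          For context, the "bigram generation" discussed in the excerpt means indexing every overlapping pair of adjacent characters rather than whole runs. A toy illustration (not from this issue or the cited paper):

              import java.util.ArrayList;
              import java.util.List;

              // bigrams("トークン") -> [トー, ーク, クン]
              static List<String> bigrams(String s) {
                  List<String> out = new ArrayList<String>();
                  for (int i = 0; i + 1 < s.length(); i++) {
                      out.add(s.substring(i, i + 2));
                  }
                  return out;
              }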

          Trejkaz added a comment -

          Thanks for such a fast fix! (I will still wait for 3.4 because it will make backwards-compat much simpler.)

          I am aware of the Unicode word-breaking rules and read the standard through, which is where I discovered that the non-breaking of katakana is part of the standard (which is why I haven't filed a separate bug or improvement about that). It is very unfortunate that the Unicode Consortium somehow ended up with a rule which is, quite frankly, undesirable. When I brought the change up with Japanese users, they were 100% against that behaviour, so it's a wonder that the standard got past the Japanese without any objections (I am, of course, assuming that they actually consulted an expert in the language).

          But breaking it up in a separate filter isn't so hard. It's only a single Unicode area with few combining marks, so the logic is not that difficult, and StandardTokenizer even marks the token as katakana for us.
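
          A minimal sketch of the kind of filter described above (illustrative only, against the Lucene 3.x TokenStream API; it assumes the tokenizer emits the <KATAKANA> token type as noted, and it leaves offsets and position increments untouched for brevity):

              import java.io.IOException;
              import java.util.LinkedList;

              import org.apache.lucene.analysis.TokenFilter;
              import org.apache.lucene.analysis.TokenStream;
              import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
              import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

              // Hypothetical: split a katakana run into one token per base
              // character, keeping combining marks (e.g. U+3099) attached.
              public final class KatakanaSplitFilter extends TokenFilter {
                  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
                  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
                  private final LinkedList<String> pending = new LinkedList<String>();

                  public KatakanaSplitFilter(TokenStream input) {
                      super(input);
                  }

                  @Override
                  public boolean incrementToken() throws IOException {
                      if (!pending.isEmpty()) {
                          termAtt.setEmpty().append(pending.removeFirst());
                          return true;
                      }
                      if (!input.incrementToken()) {
                          return false;
                      }
                      if (!"<KATAKANA>".equals(typeAtt.type())) {
                          return true; // pass all other token types through untouched
                      }
                      // Group each base character with the combining marks that follow it.
                      String term = termAtt.toString();
                      StringBuilder group = new StringBuilder();
                      for (int i = 0; i < term.length(); i++) {
                          char c = term.charAt(i);
                          if (group.length() > 0 && Character.getType(c) != Character.NON_SPACING_MARK) {
                              pending.add(group.toString());
                              group.setLength(0);
                          }
                          group.append(c);
                      }
                      pending.add(group.toString());
                      termAtt.setEmpty().append(pending.removeFirst());
                      return true;
                  }

                  @Override
                  public void reset() throws IOException {
                      super.reset();
                      pending.clear();
                  }
              }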

          Robert Muir added a comment -

          Thanks Trejkaz!

          I opened LUCENE-3361 for the URL+email variant.

          Steve Rowe added a comment -

          +1 to commit.

          I applied the patch, then ran 'ant jflex' and 'ant test' in modules/analysis/common/. All succeeded.

          Robert Muir added a comment -

          Here's a patch with sophisticated backwards compatibility.

          I'd like to commit this and open a follow-up issue for the URL+Email one; that one is more complicated and needs to be ported to Standard's interface first.

          Steve Rowe added a comment -

          +1 Robert's patch looks good.

          Robert Muir added a comment -

          Here's a patch, without re-generation or backwards compat yet.

          We should fix the URL+Email one also, and add backwards compat for both.

          Robert Muir added a comment -

          The rules are wrong here for Han also.

          Robert Muir added a comment -

          Remember, things in StandardTokenizer are only bugs if they differ from http://unicode.org/cldr/utility/breaks.jsp

          But in the hiragana case, that's definitely a bug in the jflex grammar, because we shouldn't be splitting a base character from its combining mark here.
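
          For reference, the UAX #29 word-break rule the grammar should be implementing here is WB4, which absorbs combining marks into the preceding character (U+3099 has Word_Break=Extend, so さ plus the dakuten must stay in one token):

              WB4: X (Extend | Format)* → X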


            People

            • Assignee: Robert Muir
            • Reporter: Trejkaz
            • Votes: 0
            • Watchers: 2
