Lucene - Core / LUCENE-3358

StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Lucene 3.3 (possibly 3.1 onwards) tokenises hiragana incorrectly when combining marks are in use.

      Here's a unit test:

          @Test
          public void testHiraganaWithCombiningMarkDakuten() throws Exception
          {
              // Hiragana 'sa' (U+3055) followed by the combining mark dakuten (U+3099)
              TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
      
              // Should be kept together.
              List<String> expectedTokens = Arrays.asList("\u3055\u3099");
              List<String> actualTokens = new LinkedList<String>();
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              while (stream.incrementToken())
              {
                  actualTokens.add(term.toString());
              }
      
              assertEquals("Wrong tokens", expectedTokens, actualTokens);
      
          }
      

      This code fails with:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
      

      It seems as if the tokeniser is throwing away the combining mark entirely.
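      For context, the two-code-point input in the test is canonically equivalent to the single precomposed character ざ (U+3056). The following is a minimal sketch of one possible stopgap, assuming it is acceptable to apply NFC normalisation to the text before it reaches the tokeniser. It is not the fix from the attached patches, and the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper: composes base character + combining mark sequences
// into their precomposed form (Unicode NFC) before tokenisation.
class DakutenComposition {
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "\u3055\u3099";  // hiragana sa + combining dakuten
        String composed = compose(decomposed);
        System.out.println(composed.length());          // prints 1
        System.out.println(composed.equals("\u3056"));  // prints true
    }
}
```

      With the mark composed into the base character there is nothing left for the tokeniser to drop, but this only helps for sequences that have a precomposed form.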

      3.0's behaviour was also undesirable:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
      

      But at least the token was there, so it was possible to write a filter to work around the issue.
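      The kind of workaround filter meant here could look roughly like the sketch below. It works on a plain list of token strings rather than a real Lucene TokenFilter, and the class and method names are hypothetical. It assumes that a token consisting only of combining marks directly followed the previous token in the source text; a real filter would need to check offsets to confirm that.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing step: glue mark-only tokens back onto the
// token that precedes them.
class CombiningMarkMerger {
    // True if every code point in the token is a combining mark
    // (Unicode general categories Mn, Mc or Me).
    static boolean isAllCombiningMarks(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK
                    && type != Character.COMBINING_SPACING_MARK
                    && type != Character.ENCLOSING_MARK) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Appends each mark-only token to the previous token, if there is one.
    static List<String> merge(List<String> tokens) {
        List<String> merged = new ArrayList<String>();
        for (String token : tokens) {
            if (!merged.isEmpty() && isAllCombiningMarks(token)) {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + token);
            } else {
                merged.add(token);
            }
        }
        return merged;
    }
}
```

      Applied to the 3.0 output above, merge would turn the two tokens さ and ゙ back into the single token expected by the test. No such repair is possible with the 3.3 output, because the mark never reaches the filter.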

      Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run are lumped into a single token (this is a problem in its own right, though I'm not sure it is really a bug).

      1. LUCENE-3358.patch
        82 kB
        Robert Muir
      2. LUCENE-3358.patch
        2 kB
        Robert Muir

            People

            • Assignee: Robert Muir
            • Reporter: Trejkaz
            • Votes: 0
            • Watchers: 2
