  1. Lucene - Core
  2. LUCENE-3358

StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Lucene 3.3 (possibly 3.1 onwards) mishandles hiragana during tokenisation when combining marks are in use.

      Here's a unit test:

          @Test
          public void testHiraganaWithCombiningMarkDakuten() throws Exception
          {
              // Hiragana 'sa' (U+3055) followed by the combining mark dakuten (U+3099)
              TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
      
              // Should be kept together.
              List<String> expectedTokens = Arrays.asList("\u3055\u3099");
              List<String> actualTokens = new LinkedList<String>();
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              while (stream.incrementToken())
              {
                  actualTokens.add(term.toString());
              }
      
              assertEquals("Wrong tokens", expectedTokens, actualTokens);
      
          }
      

      This code fails with:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
      

      It seems as if the tokeniser is throwing away the combining mark entirely.
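
      For context, the expected token prints as ざ because the sequence U+3055 U+3099 is canonically equivalent to the precomposed character U+3056. The sketch below uses only the JDK's java.text.Normalizer to show that equivalence. Normalising input to NFC before it reaches the tokeniser is one possible mitigation, but it is an assumption on my part rather than something proposed in this report, and it only helps where Unicode defines a precomposed form. The class and method names are hypothetical.

```java
import java.text.Normalizer;

public class DakutenNormalization {
    // Composes base characters with following combining marks wherever
    // Unicode defines a precomposed form (normalisation form NFC).
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Hiragana 'sa' followed by the combining dakuten, as in the unit test.
        String decomposed = "\u3055\u3099";
        String composed = compose(decomposed);

        System.out.println(composed.length());          // prints 1
        System.out.println(composed.equals("\u3056"));  // prints true
    }
}
```

      After composition there is no separate combining mark left for a tokeniser to drop, although offsets into the original text would no longer line up with the normalised text.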

      3.0's behaviour was also undesirable:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
      

      But at least the token was there, so it was possible to write a filter to work around the issue.
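
      The kind of workaround that was possible against the 3.0 output can be sketched in plain Java over a list of token strings. This is not a Lucene TokenFilter and not the fix that was applied for this issue. The class and method names are hypothetical, and a real filter would also need to check that the mark is adjacent to the previous token in the original text (for example via offsets) before merging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombiningMarkMerger {
    // True if the token is non-empty and made up only of combining marks
    // (Unicode general categories Mn, Mc and Me).
    static boolean isAllCombiningMarks(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK
                    && type != Character.COMBINING_SPACING_MARK
                    && type != Character.ENCLOSING_MARK) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Appends each mark-only token to the token before it.
    // A mark-only token with no predecessor is kept as-is.
    static List<String> merge(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        for (String token : tokens) {
            if (!merged.isEmpty() && isAllCombiningMarks(token)) {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + token);
            } else {
                merged.add(token);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Token shape reported for 3.0: the base character and the dakuten split apart.
        List<String> tokens = Arrays.asList("\u3055", "\u3099");
        System.out.println(merge(tokens).equals(Arrays.asList("\u3055\u3099"))); // prints true
    }
}
```

      With the 3.3 output this approach has nothing to work with, because the mark never appears as a token.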

      Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run appear to be lumped into a single token (this is a problem in its own right, but I'm not sure whether it's really a bug).

        Attachments

        1. LUCENE-3358.patch
          2 kB
          Robert Muir
        2. LUCENE-3358.patch
          82 kB
          Robert Muir

              People

              • Assignee:
                rcmuir Robert Muir
              • Reporter:
                trejkaz Trejkaz