Lucene - Core / LUCENE-3358

StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Lucene 3.3 (possibly 3.1 onwards) tokenises hiragana incorrectly when combining marks are in use.

      Here's a unit test:

          @Test
          public void testHiraganaWithCombiningMarkDakuten() throws Exception
          {
              // Hiragana 'sa' (U+3055) followed by the combining mark dakuten (U+3099)
              TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
      
              // Should be kept together.
              List<String> expectedTokens = Arrays.asList("\u3055\u3099");
              List<String> actualTokens = new LinkedList<String>();
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              while (stream.incrementToken())
              {
                  actualTokens.add(term.toString());
              }
      
              assertEquals("Wrong tokens", expectedTokens, actualTokens);
      
          }
      

      This code fails with:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
      

      It seems as if the tokeniser is throwing away the combining mark entirely.
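
      To make the failure message easier to read: the expected token is the two-code-point sequence U+3055 U+3099, which merely renders like the precomposed character ざ (U+3056). The sketch below uses only the JDK to print both forms. The class and method names are made up for illustration, and normalising the input to NFC before tokenising is an assumption about what might sidestep the problem, not something this report describes or tests against Lucene.

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of Lucene.
final class DakutenForms {

    /** Formats a string as space-separated U+XXXX code points. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    /** Composes base characters with following combining marks (Unicode NFC). */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "\u3055\u3099";                  // the input from the unit test
        System.out.println(codePoints(decomposed));          // U+3055 U+3099
        System.out.println(codePoints(compose(decomposed))); // U+3056
    }
}
```

      After composition the string contains no separate combining mark, so there is nothing for a tokeniser to drop. This only helps for sequences that have a precomposed form in Unicode.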

      3.0's behaviour was also undesirable:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
      

      But at least the token was there, so it was possible to write a filter to work around the issue.
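
      As a rough illustration of the kind of workaround that was possible with the 3.0 output, the sketch below re-attaches a token that begins with a combining mark to the token before it. It works on a plain list of strings rather than a real Lucene TokenFilter, and the class and method names are hypothetical. A real filter would also need to check offsets to confirm the two tokens were adjacent in the original text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Lucene.
final class CombiningMarkMerger {

    /** True if the token's first code point is a Unicode mark (Mn, Mc or Me). */
    static boolean startsWithCombiningMark(String token) {
        if (token.isEmpty()) {
            return false;
        }
        int type = Character.getType(token.codePointAt(0));
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Appends each token that starts with a combining mark to the preceding token. */
    static List<String> merge(List<String> tokens) {
        List<String> merged = new ArrayList<String>();
        for (String token : tokens) {
            if (!merged.isEmpty() && startsWithCombiningMark(token)) {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + token);
            } else {
                merged.add(token);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The 3.0-style output from the report: base character and dakuten as separate tokens.
        List<String> tokens = Arrays.asList("\u3055", "\u3099");
        System.out.println(merge(tokens).size()); // 1
    }
}
```

      No such repair is possible with the 3.3 output, because the combining mark never reaches the token stream.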

      Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run are lumped into a single token (this is a problem in its own right, but I'm not sure if it's really a bug).

          People

            Assignee: Robert Muir (rcmuir)
            Reporter: Trejkaz (trejkaz)