Lucene - Core
LUCENE-3358

StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Lucene 3.3 (possibly 3.1 onwards) tokenises hiragana incorrectly when combining marks are in use.

      Here's a unit test:

          @Test
          public void testHiraganaWithCombiningMarkDakuten() throws Exception
          {
              // Hiragana 'sa' (U+3055) followed by the combining mark dakuten (U+3099)
              TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));
      
              // Should be kept together.
              List<String> expectedTokens = Arrays.asList("\u3055\u3099");
              List<String> actualTokens = new LinkedList<String>();
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              while (stream.incrementToken())
              {
                  actualTokens.add(term.toString());
              }
      
              assertEquals("Wrong tokens", expectedTokens, actualTokens);
      
          }
      

      This code fails with:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
      

      It seems as if the tokeniser is throwing away the combining mark entirely.
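
      To make the failure concrete, here is a small sketch in plain Java (no Lucene involved) that prints the code points of the test input and of its NFC-composed form. The class and method names are made up for illustration. It shows that the input is two code points, that the second one is a non-spacing mark, and that the pair is canonically equivalent to the single character ざ (U+3056), so dropping the mark changes which character the token represents.

```java
import java.text.Normalizer;

final class DakutenCodePoints {
    // The decomposed input from the unit test: hiragana 'sa' plus the combining dakuten.
    static final String DECOMPOSED = "\u3055\u3099";

    /** Formats each code point of the string as U+XXXX, separated by spaces. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    /** Composes the string using Unicode normalisation form NFC. */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(describe(DECOMPOSED));          // U+3055 U+3099
        System.out.println(describe(compose(DECOMPOSED))); // U+3056
        // The dakuten is a non-spacing mark (general category Mn).
        System.out.println(Character.getType('\u3099') == Character.NON_SPACING_MARK); // true
    }
}
```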

      3.0's behaviour was also undesirable:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
      

      But at least the token was there, so it was possible to write a filter to work around the issue.
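
      As a rough idea of what such a workaround could look like, the sketch below merges any token made up only of combining marks into the token before it. It is an assumption-laden illustration, not the filter anyone actually used: it works on a plain list of strings rather than on Lucene's TokenFilter API, the class and method names are hypothetical, and it ignores offsets, so it would also merge a mark that was not adjacent to the previous token in the original text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CombiningMarkMerger {
    /** True if the token is non-empty and contains only combining marks (categories Mn, Mc, Me). */
    static boolean isAllCombiningMarks(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK
                    && type != Character.COMBINING_SPACING_MARK
                    && type != Character.ENCLOSING_MARK) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Appends each mark-only token to the token before it; a leading mark-only token is kept as is. */
    static List<String> merge(List<String> tokens) {
        List<String> merged = new ArrayList<String>();
        for (String token : tokens) {
            if (!merged.isEmpty() && isAllCombiningMarks(token)) {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + token);
            } else {
                merged.add(token);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The two tokens reported for 3.0: the base character and the stray dakuten.
        List<String> fromLucene30 = Arrays.asList("\u3055", "\u3099");
        System.out.println(merge(fromLucene30).equals(Arrays.asList("\u3055\u3099"))); // true
    }
}
```

      With the 3.3 behaviour described above there is nothing to merge, because the mark never reaches the token stream.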

      Katakana seems to avoid this particular problem, because all katakana and combining marks found in a single run are lumped into a single token (this is a problem in its own right, but I'm not sure it's really a bug).

      1. LUCENE-3358.patch
        2 kB
        Robert Muir
      2. LUCENE-3358.patch
        82 kB
        Robert Muir

          Activity

          Uwe Schindler made changes -
          Status: Resolved [ 5 ] -> Closed [ 6 ]
          Robert Muir made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Assignee: Robert Muir [ rcmuir ]
          Fix Version/s: 4.0 [ 12314025 ]
          Resolution: Fixed [ 1 ]
          Robert Muir made changes -
          Link: This issue is related to LUCENE-3361 [ LUCENE-3361 ]
          Robert Muir made changes -
          Attachment: LUCENE-3358.patch [ 12489386 ]
          Robert Muir made changes -
          Fix Version/s: 3.4 [ 12316675 ]
          Robert Muir made changes -
          Attachment: LUCENE-3358.patch [ 12489187 ]
          Trejkaz created issue -

            People

            • Assignee: Robert Muir
            • Reporter: Trejkaz
            • Votes: 0
            • Watchers: 1
