Lucene - Core: LUCENE-3358

StandardTokenizer disposes of Hiragana combining mark dakuten instead of attaching it to the character it belongs to

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Lucene 3.3 (and possibly 3.1 onwards) tokenises hiragana incorrectly when combining marks are in use.

      Here's a unit test:

          import java.io.StringReader;
          import java.util.Arrays;
          import java.util.LinkedList;
          import java.util.List;

          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.standard.StandardTokenizer;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
          import org.apache.lucene.util.Version;
          import org.junit.Test;

          import static org.junit.Assert.assertEquals;

          @Test
          public void testHiraganaWithCombiningMarkDakuten() throws Exception
          {
              // Hiragana 'SA' (U+3055) followed by the combining mark dakuten (U+3099)
              TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader("\u3055\u3099"));

              // The base character and its combining mark should be kept together in one token.
              List<String> expectedTokens = Arrays.asList("\u3055\u3099");
              List<String> actualTokens = new LinkedList<String>();
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              while (stream.incrementToken())
              {
                  actualTokens.add(term.toString());
              }

              assertEquals("Wrong tokens", expectedTokens, actualTokens);
          }
      

      This code fails with:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ]>
      

      It seems as if the tokeniser is throwing away the combining mark entirely.

      3.0's behaviour was also undesirable:

      java.lang.AssertionError: Wrong tokens expected:<[ざ]> but was:<[さ, ゙]>
      

      But at least the token was there, so it was possible to write a filter to work around the issue.

      Katakana seems to avoid this particular problem, because all katakana characters and combining marks found in a single run are lumped into a single token (a problem in its own right, though I'm not sure it's really a bug).
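
      One possible workaround for the dropped dakuten, sketched below, is to NFC-compose the input before it reaches the tokenizer: U+3055 U+3099 composes to the precomposed U+3056 (ざ), so there is no separate combining mark for the tokenizer to throw away. This is illustrative only and a partial fix at best (not every combining sequence has a precomposed form), and it assumes NFC normalisation is acceptable for the data:

          import java.io.StringReader;
          import java.text.Normalizer;

          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.standard.StandardTokenizer;
          import org.apache.lucene.util.Version;

          // Sketch: compose combining sequences before tokenization.
          String raw = "\u3055\u3099"; // さ + combining dakuten, two code points
          String composed = Normalizer.normalize(raw, Normalizer.Form.NFC);
          // composed is now "\u3056" (ざ), a single precomposed character,
          // so the tokenizer has no separate combining mark to discard.
          TokenStream stream = new StandardTokenizer(Version.LUCENE_33, new StringReader(composed));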

      Attachments

      1. LUCENE-3358.patch (2 kB, Robert Muir)
      2. LUCENE-3358.patch (82 kB, Robert Muir)


          Activity

          Robert Muir added a comment -

          > It is very unfortunate that the Unicode Consortium somehow ended up with a rule which is, quite frankly, undesirable.

          I'm not concerned about this. While your users may not like it, I think we should stick by the standard, for these reasons:

          1. It's not desirable to deviate from the standard here; anyone can customize the behavior to do what they want.
          2. It's not shown that what you say is true; experiments have been done here (see below), and as a default I would say what is happening here is just fine.
          3. Splitting this katakana up in some non-standard way leaves me with performance concerns about long postings lists for common terms.
          > For the Japanese collection (Table 4), it is not clear whether bigram generation should have
          > been done for both Kanji and Katakana characters (left part) or only for Kanji characters
          > (right part of Table 4). When using title-only queries, the Okapi model provided the best
          > mean average precision of 0.2972 (bigram on Kanji only) compared to 0.2873 when
          > generating bigrams on both Kanji and Katakana. This difference is rather small, and is even
          > smaller in the opposite direction for long queries (0.3510 vs. 0.3523). Based on these results
          > we cannot infer that for the Japanese language one indexing procedure is always significantly
          > better than another.
          

          http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6738
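
          For context, the "bigram generation" discussed in the excerpt means indexing every overlapping pair of adjacent characters rather than whole runs. A toy illustration (not from this issue or the cited paper):

              import java.util.ArrayList;
              import java.util.List;

              // bigrams("トークン") -> [トー, ーク, クン]
              static List<String> bigrams(String s) {
                  List<String> out = new ArrayList<String>();
                  for (int i = 0; i + 1 < s.length(); i++) {
                      out.add(s.substring(i, i + 2));
                  }
                  return out;
              }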

          Trejkaz added a comment -

          Thanks for such a fast fix! (I will still wait for 3.4 because it will make backwards-compat much simpler.)

          I am aware of the Unicode word-breaking rules and read the standard through, which is where I discovered that the non-breaking of katakana is part of the standard (which is why I haven't filed a separate bug or improvement about that). It is very unfortunate that the Unicode Consortium somehow ended up with a rule which is, quite frankly, undesirable. When I brought the change up with Japanese users, they were 100% against that behaviour, so it's a wonder that the standard got past the Japanese without any objections (I am, of course, assuming that they actually consulted an expert in the language).

          But breaking it up in a separate filter isn't so hard. It's only a single Unicode area with few combining marks, so the logic is not that difficult, and StandardTokenizer even marks the token as katakana for us.
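
          A minimal sketch of the kind of filter described above (illustrative only, against the Lucene 3.x TokenStream API; it assumes the tokenizer emits the <KATAKANA> token type as noted, and it leaves offsets and position increments untouched for brevity):

              import java.io.IOException;
              import java.util.LinkedList;

              import org.apache.lucene.analysis.TokenFilter;
              import org.apache.lucene.analysis.TokenStream;
              import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
              import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

              // Hypothetical: split a katakana run into one token per base
              // character, keeping combining marks (e.g. U+3099) attached.
              public final class KatakanaSplitFilter extends TokenFilter {
                  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
                  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
                  private final LinkedList<String> pending = new LinkedList<String>();

                  public KatakanaSplitFilter(TokenStream input) {
                      super(input);
                  }

                  @Override
                  public boolean incrementToken() throws IOException {
                      if (!pending.isEmpty()) {
                          termAtt.setEmpty().append(pending.removeFirst());
                          return true;
                      }
                      if (!input.incrementToken()) {
                          return false;
                      }
                      if (!"<KATAKANA>".equals(typeAtt.type())) {
                          return true; // pass all other token types through untouched
                      }
                      // Group each base character with the combining marks that follow it.
                      String term = termAtt.toString();
                      StringBuilder group = new StringBuilder();
                      for (int i = 0; i < term.length(); i++) {
                          char c = term.charAt(i);
                          if (group.length() > 0 && Character.getType(c) != Character.NON_SPACING_MARK) {
                              pending.add(group.toString());
                              group.setLength(0);
                          }
                          group.append(c);
                      }
                      pending.add(group.toString());
                      termAtt.setEmpty().append(pending.removeFirst());
                      return true;
                  }

                  @Override
                  public void reset() throws IOException {
                      super.reset();
                      pending.clear();
                  }
              }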

          Robert Muir added a comment -

          Thanks Trejkaz!

          I opened LUCENE-3361 for the URL+email variant.

          Steve Rowe added a comment -

          +1 to commit.

          I applied the patch, then ran 'ant jflex' and 'ant test' in modules/analysis/common/. All succeeded.

          Robert Muir added a comment -

          Here's a patch with sophisticated backwards compatibility.

          I'd like to commit this and open a follow-up issue for the URL+Email one; that one is more complicated and needs to be ported to Standard's interface first.

          Steve Rowe added a comment -

          +1 Robert's patch looks good.

          Robert Muir added a comment -

          Here's a patch, without re-generation or backwards compat yet.

          We should fix the URL+Email one also, and add backwards compat for both.

          Robert Muir added a comment -

          The rules are wrong here for Han also.

          Robert Muir added a comment -

          Remember, things in StandardTokenizer are only bugs if they differ from http://unicode.org/cldr/utility/breaks.jsp

          But in the hiragana case, that's definitely a bug in the jflex grammar, because we shouldn't be splitting a base character from its combining mark here.
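
          For reference, the UAX #29 word-break rule the grammar should be implementing here is WB4, which absorbs combining marks into the preceding character (U+3099 has Word_Break=Extend, so さ plus the dakuten must stay in one token):

              WB4: X (Extend | Format)* → X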


            People

            • Assignee: Robert Muir
            • Reporter: Trejkaz
            • Votes: 0
            • Watchers: 2
