Lucene - Core / LUCENE-3940

When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA, 3.6.1
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I modified BaseTokenStreamTestCase to assert that the start/end
      offsets match for graph (posLen > 1) tokens, and this caught a bug in
      Kuromoji that occurs when the decompounding of a compound token
      produces a punctuation token that is then dropped.
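
      A minimal sketch of the kind of check described above, assuming it
      means that all tokens leaving the same position share a start offset
      and all tokens arriving at the same position share an end offset.
      GraphTok and GraphOffsetCheck are hypothetical names for illustration
      and are not part of Lucene or of BaseTokenStreamTestCase; the token
      streams in main are made up, not real Kuromoji output.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a graph token; not Lucene's attribute API. */
final class GraphTok {
    final String term;
    final int posInc;      // positions advanced since the previous token
    final int posLen;      // number of positions this token spans
    final int startOffset;
    final int endOffset;

    GraphTok(String term, int posInc, int posLen, int startOffset, int endOffset) {
        this.term = term;
        this.posInc = posInc;
        this.posLen = posLen;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class GraphOffsetCheck {
    /**
     * Returns true when every position has a single start offset for tokens
     * that begin there and a single end offset for tokens that end there.
     */
    static boolean offsetsConsistent(List<GraphTok> tokens) {
        Map<Integer, Integer> startAt = new HashMap<>();
        Map<Integer, Integer> endAt = new HashMap<>();
        int pos = -1;
        for (GraphTok t : tokens) {
            pos += t.posInc;
            Integer seenStart = startAt.putIfAbsent(pos, t.startOffset);
            if (seenStart != null && seenStart != t.startOffset) {
                return false;
            }
            int endPos = pos + t.posLen;
            Integer seenEnd = endAt.putIfAbsent(endPos, t.endOffset);
            if (seenEnd != null && seenEnd != t.endOffset) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Text "ab,cdef": compound "ab,cd" spans three positions; "," is dropped.
        List<GraphTok> withHole = List.of(
            new GraphTok("ab,cd", 1, 3, 0, 5),
            new GraphTok("ab", 0, 1, 0, 2),
            new GraphTok("cd", 2, 1, 3, 5),   // posInc 2 leaves a hole for ","
            new GraphTok("ef", 1, 1, 5, 7));
        List<GraphTok> withoutHole = List.of(
            new GraphTok("ab,cd", 1, 3, 0, 5),
            new GraphTok("ab", 0, 1, 0, 2),
            new GraphTok("cd", 1, 1, 3, 5),   // no hole: later positions shift
            new GraphTok("ef", 1, 1, 5, 7));
        System.out.println(offsetsConsistent(withHole));    // prints true
        System.out.println(offsetsConsistent(withoutHole)); // prints false
    }
}
```

      In the second stream the compound still claims to end at position 3
      with end offset 5, but "ef" now also ends at position 3 with end
      offset 7, which is the mismatch the sketch reports.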

      In this case we should leave hole(s) so that the graph stays intact,
      i.e., the graph should look the same as if the punctuation tokens had
      initially been kept and a StopFilter had then removed them.

      This also affects tokens that have no compound over them, i.e., today
      we fail to leave a hole whenever we remove a punctuation token.
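
      To illustrate what leaving a hole means, here is a minimal sketch in
      which a dropped token's position increment is carried forward onto
      the next kept token, which is how a stop filter that preserves
      position increments behaves. Tok and HoleSketch are hypothetical
      names and the punctuation test is a simplification; this is not the
      code from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a token; not Lucene's attribute API. */
final class Tok {
    final String term;
    final int posInc; // positions advanced since the previous emitted token

    Tok(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }
}

final class HoleSketch {
    /** Simplified rule: a term with no letter or digit counts as punctuation. */
    static boolean isPunctuation(String term) {
        return term.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    /**
     * Drops punctuation tokens but adds each dropped token's increment to
     * the next kept token, so the removed positions remain as holes.
     * Increments skipped after the last kept token are discarded here.
     */
    static List<Tok> dropPunctuation(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (Tok t : in) {
            if (isPunctuation(t.term)) {
                skipped += t.posInc;
            } else {
                out.add(new Tok(t.term, t.posInc + skipped));
                skipped = 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = List.of(new Tok("Tokyo", 1), new Tok(",", 1), new Tok("Osaka", 1));
        for (Tok t : dropPunctuation(in)) {
            // prints "Tokyo posInc=1" then "Osaka posInc=2"
            System.out.println(t.term + " posInc=" + t.posInc);
        }
    }
}
```

      With the hole in place, "Osaka" stays at the position it would have
      had if the comma had been kept, so an exact phrase query for
      "Tokyo Osaka" does not match across the removed punctuation.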

      I'm not sure this is serious enough to warrant fixing in 3.6 at the
      last minute...

      1. LUCENE-3940.patch
        56 kB
        Michael McCandless
      2. LUCENE-3940.patch
        67 kB
        Michael McCandless
      3. LUCENE-3940.patch
        70 kB
        Michael McCandless
      4. LUCENE-3940.patch
        67 kB
        Michael McCandless

          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 2
