Lucene - Core / LUCENE-3940

When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA, 3.6.1
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      I modified BaseTokenStreamTestCase to assert that the start/end
      offsets match for graph (posLen > 1) tokens, and this caught a bug in
      Kuromoji: it occurs when decompounding a compound token produces a
      punctuation token that is then dropped.
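
A minimal sketch of one plausible form of such an offset check, in plain Java with no Lucene dependency. The class `GraphOffsetCheck`, its `Tok` type, and the rule it enforces (all tokens leaving a position share a start offset, all tokens arriving at a position share an end offset) are assumptions made for illustration. This is not the code in BaseTokenStreamTestCase, and it does not reproduce the specific Kuromoji failure.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not Lucene classes.
final class GraphOffsetCheck {

    /** A token with its position increment, position length, and character offsets. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;
        final int startOffset;
        final int endOffset;

        Tok(String term, int posInc, int posLen, int startOffset, int endOffset) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Returns null when the offsets are consistent, otherwise a description
     * of the first token whose offsets disagree with an earlier token.
     */
    static String firstInconsistency(List<Tok> toks) {
        Map<Integer, Integer> startAt = new HashMap<>(); // position -> start offset of tokens leaving it
        Map<Integer, Integer> endAt = new HashMap<>();   // position -> end offset of tokens arriving at it
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            Integer prevStart = startAt.putIfAbsent(pos, t.startOffset);
            if (prevStart != null && prevStart.intValue() != t.startOffset) {
                return "token '" + t.term + "' leaves position " + pos
                        + " with startOffset " + t.startOffset + ", expected " + prevStart;
            }
            int endPos = pos + t.posLen;
            Integer prevEnd = endAt.putIfAbsent(endPos, t.endOffset);
            if (prevEnd != null && prevEnd.intValue() != t.endOffset) {
                return "token '" + t.term + "' arrives at position " + endPos
                        + " with endOffset " + t.endOffset + ", expected " + prevEnd;
            }
        }
        return null;
    }
}
```

For example, a compound `abcd` (posLen 2, offsets 0 to 4) over the parts `ab` (0 to 2) and `cd` (2 to 4) passes this check, while a part that starts at the compound's position with a different start offset is reported.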

      In this case we should leave hole(s) so that the graph stays intact,
      i.e., the graph should look the same as if the tokenizer had kept the
      punctuation tokens and a StopFilter had then removed them.

      This also affects tokens that have no compound over them, i.e., today
      we fail to leave a hole whenever we remove a punctuation token.
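
To make "leaving a hole" concrete, here is a small sketch in plain Java with no Lucene dependency. It assumes the usual convention that each token carries a position increment relative to the previous emitted token, so a removed token shows up as an increment greater than 1 on the next surviving token. The class `HoleSketch`, its `Tok` type, and the `isPunctuation` rule are hypothetical names invented for this illustration. They are not part of Kuromoji or of the patch attached to this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are not Lucene classes.
final class HoleSketch {

    /** A term plus its position increment relative to the previous emitted token. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Assumption for this sketch: a term is punctuation if it has no letter or digit. */
    static boolean isPunctuation(String term) {
        return term.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    /**
     * Drops punctuation terms. When leaveHoles is true, every dropped term adds
     * one to the position increment of the next surviving token.
     */
    static List<Tok> dropPunctuation(List<String> terms, boolean leaveHoles) {
        List<Tok> out = new ArrayList<>();
        int pending = 0; // removed tokens not yet accounted for
        for (String term : terms) {
            if (isPunctuation(term)) {
                if (leaveHoles) {
                    pending++;
                }
                continue;
            }
            out.add(new Tok(term, 1 + pending));
            pending = 0;
        }
        return out;
    }

    /** Absolute positions, counting from 0 for a first token with increment 1. */
    static List<Integer> positions(List<Tok> toks) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            result.add(pos);
        }
        return result;
    }
}
```

With the input `foo`, `,`, `bar`, leaving holes places the surviving tokens at positions 0 and 2, so position 1 stays empty where the comma was. Without holes they land at positions 0 and 1, which hides the fact that something sat between them.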

      I'm not sure this is serious enough to warrant fixing in 3.6 at the
      last minute...

        Attachments

        1. LUCENE-3940.patch
          67 kB
          Michael McCandless
        2. LUCENE-3940.patch
          70 kB
          Michael McCandless
        3. LUCENE-3940.patch
          67 kB
          Michael McCandless
        4. LUCENE-3940.patch
          56 kB
          Michael McCandless

            People

            • Assignee: Michael McCandless (mikemccand)
            • Reporter: Michael McCandless (mikemccand)