  1. Lucene - Core
  2. LUCENE-3940

When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA, 3.6.1
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      I modified BaseTokenStreamTestCase to assert that the start/end
      offsets match for graph (posLen > 1) tokens, and this caught a bug in
      Kuromoji: when decompounding a compound token produces a punctuation
      token, that token is dropped without leaving a hole.
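
To illustrate the kind of check described above, here is a minimal sketch in plain Java. It is not the actual BaseTokenStreamTestCase code: the `Token` class, the `offsetsConsistent` method, and the token values in `main` are all hypothetical, and the rule it assumes is that every token leaving a given position shares one start offset and every token arriving at a given position shares one end offset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphOffsetCheck {
    /** Hypothetical stand-in for a token's position and offset attributes. */
    static final class Token {
        final int posInc, posLen, startOffset, endOffset;
        Token(int posInc, int posLen, int startOffset, int endOffset) {
            this.posInc = posInc;
            this.posLen = posLen;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Returns true if all tokens that start at the same position agree on
     * startOffset and all tokens that end at the same position agree on
     * endOffset (assumed consistency rule, see lead-in).
     */
    static boolean offsetsConsistent(List<Token> tokens) {
        Map<Integer, Integer> startAt = new HashMap<>();
        Map<Integer, Integer> endAt = new HashMap<>();
        int pos = -1; // the first token with posInc=1 lands on position 0
        for (Token t : tokens) {
            pos += t.posInc;
            Integer prevStart = startAt.putIfAbsent(pos, t.startOffset);
            if (prevStart != null && prevStart != t.startOffset) {
                return false;
            }
            Integer prevEnd = endAt.putIfAbsent(pos + t.posLen, t.endOffset);
            if (prevEnd != null && prevEnd != t.endOffset) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical text "a,bc": compound "a,b" (offsets 0-3, posLen 3)
        // over parts "a", ",", "b", followed by "c". The "," is dropped.
        List<Token> withHole = List.of(
            new Token(1, 3, 0, 3),  // compound "a,b"
            new Token(0, 1, 0, 1),  // "a"
            new Token(2, 1, 2, 3),  // "b", posInc 2 leaves a hole for ","
            new Token(1, 1, 3, 4)); // "c"
        List<Token> withoutHole = List.of(
            new Token(1, 3, 0, 3),  // compound "a,b"
            new Token(0, 1, 0, 1),  // "a"
            new Token(1, 1, 2, 3),  // "b", no hole left for ","
            new Token(1, 1, 3, 4)); // "c"
        System.out.println(offsetsConsistent(withHole));
        System.out.println(offsetsConsistent(withoutHole));
    }
}
```

In the second stream the compound and "c" both end at position 3 but disagree on the end offset, which is the kind of mismatch such an assertion would flag.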

      In this case we should leave hole(s) so that the graph stays intact,
      i.e., the graph should look the same as if the punctuation tokens had
      been kept initially and a StopFilter had then removed them.

      This also affects tokens that have no compound over them, i.e., today
      we fail to leave a hole when we remove the punctuation tokens.
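
As a rough sketch of what "leaving a hole" means, the plain-Java model below drops punctuation tokens but carries their position increments over to the next surviving token, so later tokens keep their original positions. This is an illustration under stated assumptions, not Kuromoji's or StopFilter's implementation: the `Token` class, `isPunctuation`, and `dropPunctuationLeavingHoles` are hypothetical names, and punctuation is approximated as "contains no letter or digit".

```java
import java.util.ArrayList;
import java.util.List;

public class HoleSketch {
    /** Hypothetical stand-in for a token: term text plus position increment. */
    static final class Token {
        final String term;
        final int posInc;
        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Assumption: a term with no letter or digit counts as punctuation. */
    static boolean isPunctuation(String term) {
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Drops punctuation and adds its increment to the next kept token. */
    static List<Token> dropPunctuationLeavingHoles(List<Token> in) {
        List<Token> out = new ArrayList<>();
        int skipped = 0; // increments of removed tokens not yet accounted for
        for (Token t : in) {
            if (isPunctuation(t.term)) {
                skipped += t.posInc;
            } else {
                out.add(new Token(t.term, t.posInc + skipped));
                skipped = 0;
            }
        }
        return out;
    }

    /** Absolute positions, with the first posInc=1 token at position 0. */
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> in = List.of(
            new Token("tokyo", 1), new Token(",", 1), new Token("osaka", 1));
        List<Token> out = dropPunctuationLeavingHoles(in);
        // "osaka" stays at position 2; position 1 is the hole left by ","
        System.out.println(positions(out)); // prints [0, 2]
    }
}
```

Without the carried-over increment, "osaka" would move to position 1 and the stream would no longer match what a downstream filter removing the same token would have produced.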

      I'm not sure this is serious enough to warrant fixing in 3.6 at the
      last minute...

      Attachments

        1. LUCENE-3940.patch
          67 kB
          Michael McCandless
        2. LUCENE-3940.patch
          70 kB
          Michael McCandless
        3. LUCENE-3940.patch
          67 kB
          Michael McCandless
        4. LUCENE-3940.patch
          56 kB
          Michael McCandless

          People

            Assignee: Michael McCandless (mikemccand)
            Reporter: Michael McCandless (mikemccand)
            Votes: 0
            Watchers: 1
