Lucene - Core / LUCENE-324

org.apache.lucene.analysis.cn.ChineseTokenizer missing offset decrement

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

    • Bugzilla Id:
      32687

      Description

      Apparently, in ChineseTokenizer, offset should be decremented together with
      bufferIndex when the character type is OTHER_LETTER. This directly affects the
      startOffset and endOffset values of the tokens.

      This is critical for Highlighter to work correctly, because Highlighter marks
      matching text based on these offset values.
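
To illustrate why the offsets matter for highlighting, here is a minimal sketch. `HighlightSketch` and `mark` are hypothetical names invented for this example and are not part of the Lucene Highlighter API; the sketch only assumes that a highlighter wraps the text between a token's start and end offsets in marker tags.

```java
final class HighlightSketch {
    /** Wraps text[start, end) in <B>...</B>, the way an offset-based highlighter would. */
    static String mark(String text, int start, int end) {
        return text.substring(0, start)
                + "<B>" + text.substring(start, end) + "</B>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "a\u5929b"; // "a", one CJK ideograph (U+5929), "b"
        // Correct offsets for the ideograph mark the ideograph itself.
        System.out.println(mark(text, 1, 2));
        // Offsets that are one too high mark the following character instead.
        System.out.println(mark(text, 2, 3));
    }
}
```

With an offset pair that is shifted by one, the marker tags end up around a neighbouring character rather than the matched term.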

        Activity

        saturnism@gmail.com Ray Tsang added a comment -

        Created an attachment (id=13749)
        Patch for ChineseTokenizer to correctly count offsets

        otis@apache.org Otis Gospodnetic added a comment -

        Ray: is there a simple way you can show that this fix is indeed needed? Maybe
        a short class that shows that the offsets are wrong.

        Lucene developers: can anyone confirm whether this is really needed? I don't
        use ChineseTokenizer enough to know for sure whether this is a good fix or
        something that will break the code.

        saturnism@gmail.com Ray Tsang added a comment -

        Created an attachment (id=13758)
        Testcase that tests ChineseTokenizer and OTHER_LETTER offsets

        The problem arises when OTHER_LETTER characters and the rest of the characters
        are mixed together. When given a string "a天b", tokens and corresponding
        offsets should be the following:
        a : (0, 1)
        天 : (1, 2)
        b : (2, 3)
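
The expected offsets above can be reproduced with a simplified, standalone model of the tokenizer loop. This is a sketch only: `OffsetSketch` and its `decrementOffset` flag are hypothetical, the model reads from a String instead of a Reader, and the real tokenizer's I/O buffering and maximum word length are left out. It assumes, as described in this issue, that the tokenizer counts every character it reads and steps its read index back when an OTHER_LETTER character ends a pending token.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetSketch {
    /** Returns tokens formatted as "term:(start,end)". */
    static List<String> tokenize(String text, boolean decrementOffset) {
        List<String> out = new ArrayList<>();
        int offset = 0; // characters the tokenizer believes it has consumed
        int index = 0;  // read position, playing the role of bufferIndex
        boolean done = false;
        while (!done) {
            StringBuilder term = new StringBuilder();
            int start = offset;
            while (true) {
                offset++;
                if (index >= text.length()) { done = true; break; }
                char c = text.charAt(index++);
                int type = Character.getType(c);
                if (type == Character.DECIMAL_DIGIT_NUMBER
                        || type == Character.LOWERCASE_LETTER
                        || type == Character.UPPERCASE_LETTER) {
                    if (term.length() == 0) start = offset - 1;
                    term.append(Character.toLowerCase(c));
                } else if (type == Character.OTHER_LETTER) {
                    if (term.length() > 0) {
                        index--;                       // push the character back
                        if (decrementOffset) offset--; // the decrement this issue asks for
                        break;
                    }
                    start = offset - 1;                // single-character token
                    term.append(c);
                    break;
                } else if (term.length() > 0) {
                    break;                             // delimiter ends the pending token
                }
            }
            if (term.length() > 0) {
                out.add(term + ":(" + start + "," + (start + term.length()) + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "a\u5929b"; // the "a天b" example, with the ideograph escaped
        System.out.println("without decrement: " + tokenize(text, false));
        System.out.println("with decrement:    " + tokenize(text, true));
    }
}
```

In this model, leaving out the decrement counts the pushed-back character twice, so every token after it is reported one position too far to the right.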

        saturnism@gmail.com Ray Tsang added a comment -

        I haven't done a formal trace of the code yet, but I think it would make sense
        for the offset to be incremented only if the character is pushed into the
        buffer. The current code, however, increments offset by default, regardless of
        whether the character is pushed into the buffer.

        If that's the case, then there are more places that need to be fixed.
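
One possible reading of this suggestion, sketched under assumptions: instead of incrementing offset up front and undoing it on pushback, the loop peeks at the next character and advances offset only when it actually consumes one. `PeekingOffsetSketch` is a hypothetical standalone class, not the attached patch and not Lucene code. Note that in this sketch skipped delimiters still advance offset, since offsets refer to positions in the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PeekingOffsetSketch {
    /** Returns tokens formatted as "term:(start,end)". */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int offset = 0; // advanced only when a character is consumed
        while (offset < text.length()) {
            int type = Character.getType(text.charAt(offset)); // peek, nothing consumed yet
            if (type == Character.OTHER_LETTER) {
                out.add(format(text, offset, offset + 1));     // one token per OTHER_LETTER
                offset++;
            } else if (isWordChar(type)) {
                int start = offset;
                while (offset < text.length()
                        && isWordChar(Character.getType(text.charAt(offset)))) {
                    offset++;                                  // consume the whole run
                }
                out.add(format(text, start, offset));
            } else {
                offset++;                                      // delimiter: consumed, not buffered
            }
        }
        return out;
    }

    static boolean isWordChar(int type) {
        return type == Character.DECIMAL_DIGIT_NUMBER
                || type == Character.LOWERCASE_LETTER
                || type == Character.UPPERCASE_LETTER;
    }

    static String format(String text, int start, int end) {
        return text.substring(start, end).toLowerCase(Locale.ROOT)
                + ":(" + start + "," + end + ")";
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ab\u5929 c"));
    }
}
```

Because nothing is ever pushed back in this variant, there is no separate decrement that could be forgotten.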

        ehatcher Erik Hatcher added a comment -

        Ray - ??? (let's see if JIRA can handle Chinese). Sorry for the delay in applying this patch.


          People

          • Assignee:
            Unassigned
            Reporter:
            saturnism@gmail.com Ray Tsang