Lucene - Core
LUCENE-2207

CJKTokenizer generates tokens with incorrect offsets

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      If I index a Japanese multi-valued document with CJKTokenizer and highlight a term with FastVectorHighlighter, the output snippets contain incorrectly highlighted strings. I'll attach a program that reproduces the problem soon.

      1. TestCJKOffset.java
        7 kB
        Koji Sekiguchi
      2. LUCENE-2207.patch
        0.7 kB
        Robert Muir
      3. LUCENE-2207.patch
        2 kB
        Robert Muir
      4. LUCENE-2207.patch
        4 kB
        Robert Muir
      5. LUCENE-2207.patch
        6 kB
        Koji Sekiguchi

        Issue Links

          Activity

          Koji Sekiguchi added a comment -

          Attached is the program that reproduces the problem. In the program, I didn't use FastVectorHighlighter; instead, I printed out the offsets from TermVectorOffsetInfo. You'll see the following results:

          === WhitespaceAnalyzer ===
          あい(0,2)
          うえお(3,6)
          === CJKAnalyzer ===
          あい(0,2)
          うえ(4,6)
          えお(5,7)
          === BasicNGramAnalyzer ===
          あい(0,2)
          うえ(3,5)
          えお(4,6)
          

          For readers who see garbage characters, here is the same output rephrased with 'Cn' symbols:

          === WhitespaceAnalyzer ===
          C1C2(0,2)
          C3C4C5(3,6)
          === CJKAnalyzer ===
          C1C2(0,2)
          C3C4(4,6)
          C4C5(5,7)
          === BasicNGramAnalyzer ===
          C1C2(0,2)
          C3C4(3,5)
          C4C5(4,6)
          

          As you can see, the start offset of 'C3' is 3 with WhitespaceAnalyzer and with BasicNGramAnalyzer (an analyzer that uses BasicNGramTokenizer, the 2-gram tokenizer from the FastVectorHighlighter test code, which handles not only CJK but also ASCII), but it is 4 with CJKAnalyzer, which is incorrect!

          I'll look into it tomorrow or later, but volunteers are welcome!
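The mismatch can be illustrated with a standalone sketch (plain Java, no Lucene classes; the class and method names below are invented for this example): a 2-gram tokenizer should assign each token a start offset equal to its position in the original string, so the first bigram after the whitespace must start at offset 3, as WhitespaceAnalyzer and BasicNGramAnalyzer report.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramOffsets {
    // Emits "term(start,end)" for each 2-gram of each whitespace-separated
    // run, with offsets measured in the original input string.
    static List<String> bigrams(String input) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) { i++; continue; }
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            // The run occupies input[start, i); emit its bigrams.
            for (int j = start; j + 2 <= i; j++) {
                out.add(input.substring(j, j + 2) + "(" + j + "," + (j + 2) + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "AB CDE" stands in for the C1C2 C3C4C5 input above, kept ASCII
        // so the output is readable everywhere.
        System.out.println(bigrams("AB CDE"));
        // prints [AB(0,2), CD(3,5), DE(4,6)]
    }
}
```

Note that the second token starts at offset 3, matching the BasicNGramAnalyzer output; the CJKAnalyzer output above starts it at 4, which is the bug.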

          Robert Muir added a comment -

          Hi Koji, this looks like a bug in the CJK offset calculations, probably involving end().

          Personally, I find the offset logic a little complex. It currently does both additions and subtractions to the offset, and I think there is an off-by-one error in there.

          I will play around and see if I can simplify this logic, but please don't wait on me if you already have an idea of how to fix it!

          Robert Muir added a comment -

          OK, I found the bug: the problem is that incrementToken() unconditionally increments the offset before it starts its main loop:

          line 165:

          offset++;
          

          so, when incrementToken() has no more text to return and returns false, it needs to subtract this increment back out.

          Again, I think in the future we should refactor this offset logic to be simpler, but for the short term this fixes the bug and all tests pass.

          Koji, can you review?
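The bug and fix can be reduced to a standalone sketch (plain Java; `OffsetSketch` and its methods are invented for illustration and are not the actual CJKTokenizer code). The offset is bumped unconditionally at the top of each call, so the final call, which finds no more text, must undo that bump before end() reports the final offset:

```java
public class OffsetSketch {
    private final int length;  // length of the hypothetical input
    private int offset = 0;    // characters consumed so far

    OffsetSketch(int length) { this.length = length; }

    // Models incrementToken(): the offset is incremented unconditionally
    // before the main loop runs (the "line 165" pattern).
    boolean incrementToken() {
        offset++;                  // speculative bump
        if (offset > length) {     // no more text to return
            offset--;              // the fix: undo the bump on the false return
            return false;
        }
        return true;               // consumed one more character
    }

    // Models end(): the final offset must equal the input length.
    int finalOffset() { return offset; }

    // Drains a tokenizer over `length` characters and reports finalOffset().
    static int runAndGetFinalOffset(int length) {
        OffsetSketch t = new OffsetSketch(length);
        while (t.incrementToken()) { /* consume */ }
        return t.finalOffset();
    }

    public static void main(String[] args) {
        System.out.println(runAndGetFinalOffset(5)); // prints 5
    }
}
```

Without the offset-- line, runAndGetFinalOffset(5) would report 6, which is the kind of off-by-one the test output exposes.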

          Robert Muir added a comment -

          I added a test case for end() to my patch; it fails on trunk but passes with the fix.

          Robert Muir added a comment -

          Hello, this tokenizer has more serious offset/end problems than I originally thought.

          Attached are my previous patch and test case, plus 3 more test cases; one still fails.

          Koji Sekiguchi added a comment - edited

          Hi Robert, thank you for looking into this so quickly!

          ok i found the bug. the problem is incrementToken() unconditionally increments the offset before it starts its main loop:

          line 165:

          offset++;

          Indeed.

          In the attached patch, I added one more offset-- line and two more test cases. All tests pass, and this patch fixes the original problem that was found in Solr with FastVectorHighlighter.

          Robert Muir added a comment -

          In the attached patch, I added one more offset-- line and two more test cases. All tests pass, and this patch fixes the original problem that was found in Solr with FastVectorHighlighter.

          Nice, the fix looks good to me.

          Robert Muir added a comment -

          Koji, while testing end() offsets across all the other tokenizers, I noticed that CJKTokenizer does not call correctOffset() in end():

          final int finalOffset = offset;
          

          I think instead this should be

          final int finalOffset = correctOffset(offset);
          

          in case there is a CharFilter present.
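For illustration (a plain-Java sketch with a hypothetical mapping; this is not Lucene's CharFilter API): a CharFilter can shrink the character stream, for example by replacing the 5-character entity "&amp;" with a single "&". Offsets measured in the filtered stream then no longer match positions in the original text, and an offset-correction step is what maps them back:

```java
public class CorrectOffsetSketch {
    // Hypothetical offset correction for a filter that replaced the
    // 5-character entity "&amp;" (starting at original position
    // entityStartOrig) with the single character "&": every filtered
    // offset past the replacement must be shifted right by 4.
    static int correctOffset(int filteredOffset, int entityStartOrig) {
        int shrink = "&amp;".length() - "&".length(); // 4
        return filteredOffset > entityStartOrig ? filteredOffset + shrink
                                                : filteredOffset;
    }

    public static void main(String[] args) {
        // Original "a&amp;b" (length 7) is filtered to "a&b" (length 3).
        // end() measured in the filtered stream is 3; the original
        // final offset is 7.
        System.out.println(correctOffset(3, 1)); // prints 7
    }
}
```

This is why an end() that returns the raw offset is wrong as soon as a CharFilter is in the chain.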

          Koji Sekiguchi added a comment -

          I think instead this should be

          final int finalOffset = correctOffset(offset);

          Agreed, thank you for pointing this out!
          I think this is ready to commit. Robert, can you do it? It'd also be great if it could go to the 2.9 branch so that Solr can use the fix.

          Robert Muir added a comment -

          Koji, sure I can take care of it.

          Also, I filed LUCENE-2219 to find these bugs in the other tokenizers.

          In the future I also want to explore whether we can somehow use a fake CharFilter in BaseTokenStreamTest to ensure that correctOffset() is called when setting offsets in both incrementToken() and end(); I don't yet know how it would work.
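One way the idea could work, sketched in plain Java (all names below are invented; this is not the eventual test code): give the fake CharFilter a correctOffset() that shifts every offset by a fixed, easy-to-spot constant. A tokenizer that routes its offsets through correctOffset() reports values shifted by exactly that constant; one that uses raw offsets does not, and the test can detect it.

```java
public class MockOffsetShift {
    static final int SHIFT = 100; // arbitrary constant, large enough to spot

    // Stands in for the fake CharFilter's correctOffset(): shift every
    // offset so that un-corrected offsets become detectable.
    static int correctOffset(int offset) { return offset + SHIFT; }

    // The check the test harness would make: a corrected offset must be
    // exactly the raw offset plus SHIFT.
    static boolean offsetsWereCorrected(int reportedStart, int rawStart) {
        return reportedStart == rawStart + SHIFT;
    }

    public static void main(String[] args) {
        // A well-behaved tokenizer reports correctOffset(3) == 103.
        System.out.println(offsetsWereCorrected(correctOffset(3), 3)); // prints true
        // A buggy tokenizer reports the raw 3 and fails the check.
        System.out.println(offsetsWereCorrected(3, 3)); // prints false
    }
}
```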

          Robert Muir added a comment -

          Thanks, Koji!


            People

            • Assignee:
              Robert Muir
              Reporter:
              Koji Sekiguchi
            • Votes:
              0
              Watchers:
              1
