Lucene - Core
LUCENE-2219

improve BaseTokenStreamTestCase to test end()

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      If offsetAtt/end() is not implemented correctly, then there can be problems with highlighting: see LUCENE-2207 for an example with CJKTokenizer.

      In my opinion you currently have to write too much code to test this.

      This patch does the following:

      • adds optional Integer finalOffset (can be null for no checking) to assertTokenStreamContents
      • in assertAnalyzesTo, fills this in automatically with the input String's length()

      In my opinion this is correct: for assertTokenStreamContents the check should be optional, since the stream may not even have a Tokenizer. If you are using assertTokenStreamContents with a Tokenizer, simply provide the extra expected value to check it.

      For assertAnalyzesTo, a Tokenizer is implied, so the final offset should always be checked.

      The tests pass for core, but there are failures in contrib even besides CJKTokenizer (apply Koji's patch from LUCENE-2207; it is correct). Specifically, ChineseTokenizer has a similar problem.
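      To make the intent concrete, here is a minimal sketch of the kind of check this adds; the helper name, signature, and structure are illustrative assumptions, not the committed code:

        import java.io.IOException;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

        // Hypothetical sketch of the optional finalOffset check
        // (assertEquals comes from the enclosing JUnit test case):
        static void assertFinalOffset(TokenStream ts, Integer finalOffset) throws IOException {
          OffsetAttribute offsetAtt = ts.getAttribute(OffsetAttribute.class);
          while (ts.incrementToken()) {
            // the real helper asserts terms and start/end offsets per token here
          }
          ts.end(); // a correctly implemented stream must now expose the final offset
          if (finalOffset != null) { // null means: skip the check
            assertEquals("finalOffset", finalOffset.intValue(), offsetAtt.endOffset());
          }
          ts.close();
        }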

      1. LUCENE-2219.patch
        20 kB
        Robert Muir
      2. LUCENE-2219.patch
        14 kB
        Robert Muir
      3. LUCENE-2219.patch
        4 kB
        Robert Muir

        Issue Links

          This issue incorporates LUCENE-2207

          Activity

          Shai Erera made changes -
          Component/s: contrib/analyzers [ 12312333 ]
          Mark Thomas made changes -
          Workflow: Default workflow, editable Closed status [ 12564114 ] → jira [ 12583897 ]
          Mark Thomas made changes -
          Workflow: jira [ 12488463 ] → Default workflow, editable Closed status [ 12564114 ]
          Uwe Schindler made changes -
          Status: Resolved [ 5 ] → Closed [ 6 ]
          Robert Muir made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          rmuir committed 900222 (24 files)
          Reviews: none

          LUCENE-2207: CJKTokenizer generates tokens with incorrect offsets
          LUCENE-2219: Chinese, SmartChinese, Wikipedia tokenizers generate incorrect offsets, test end() in BaseTokenStreamTestCase

          Lucene lucene_2_9
          rmuir committed 900212 (23 files)
          Reviews: none

          LUCENE-2207: CJKTokenizer generates tokens with incorrect offsets
          LUCENE-2219: Chinese, SmartChinese, Wikipedia tokenizers generate incorrect offsets, test end() in BaseTokenStreamTestCase

          Lucene lucene_3_0
          Robert Muir added a comment -

          Committed revision 900196 to trunk.
          Uwe Schindler added a comment -

          I am fine with this patch!
          Robert Muir made changes -
          Attachment: LUCENE-2219.patch [ 12430546 ]
          Robert Muir added a comment -

          I merged Koji's CJK fix and tests from LUCENE-2207 into this patch, and improved CJKTokenizer's tests to always use assertAnalyzesTo, for better checking.

          I plan to commit soon.
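          As a usage illustration, a test along these lines would exercise the new check (a sketch only: the analyzer construction assumes the 3.0-era contrib API, and the tokens and offsets follow CJKTokenizer's documented bigram behavior):

            // "中国人" (length 3) produces overlapping bigrams; with this patch,
            // assertAnalyzesTo also verifies the final offset equals length().
            public void testCJKOffsets() throws Exception {
              Analyzer a = new CJKAnalyzer();
              assertAnalyzesTo(a, "中国人",
                  new String[] { "中国", "国人" }, // expected tokens
                  new int[]    { 0, 1 },           // expected start offsets
                  new int[]    { 2, 3 });          // expected end offsets
              // the implied final offset check: "中国人".length() == 3
            }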
          Robert Muir made changes -
          Fix Version/s: 2.9.2 [ 12314342 ]
          Fix Version/s: 3.0.1 [ 12314401 ]
          Fix Version/s: 3.1 [ 12314025 ]
          Affects Version/s: 2.9 [ 12312682 ]
          Robert Muir made changes -
          Assignee: Robert Muir [ rcmuir ]
          Robert Muir made changes -
          Link: This issue incorporates LUCENE-2207 [ LUCENE-2207 ]
          Robert Muir added a comment -

          > Wikipedia does not call super.end().

          Uwe, thanks for taking a look... even StandardTokenizer does not call super.end()!

          Should we really do this?
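          For context, a correct end() override in a Tokenizer generally looks like the sketch below; offsetAtt and the tracked offset variable are assumed fields of the tokenizer, and whether super.end() must be called is exactly the question raised above:

            @Override
            public void end() throws IOException {
              super.end(); // the convention under discussion in this thread
              // point the offset attribute one past the last character consumed,
              // so consumers such as highlighters see the true end of the input
              final int finalOffset = correctOffset(offset); // 'offset' is assumed tokenizer state
              offsetAtt.setOffset(finalOffset, finalOffset);
            }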
          Uwe Schindler added a comment -

          Wikipedia does not call super.end().

          Looks good!
          Robert Muir made changes -
          Attachment: LUCENE-2219.patch [ 12430542 ]
          Robert Muir added a comment -

          This fixes contrib too, as long as you apply the CJKTokenizer fix from LUCENE-2207.

          end() was incorrect for ChineseTokenizer, SmartChinese, and Wikipedia.
          Robert Muir made changes -
          Attachment: LUCENE-2219.patch [ 12430541 ]
          Robert Muir created issue -

            People

            • Assignee: Robert Muir
            • Reporter: Robert Muir
            • Votes: 0
            • Watchers: 0
