Lucene - Core / LUCENE-3717

Add fake charfilter to BaseTokenStreamTestCase to find offsets bugs


Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Many broken-offset issues have been fixed recently, but it would be nice to improve test
      coverage and verify that offsets are correct across the board (especially with charfilters).

      In BaseTokenStreamTestCase.checkRandomData, we can sometimes pass the analyzer a reader wrapped
      in a "MockCharFilter" (the one in the patch sometimes doubles characters). If the analyzer does
      not call correctOffset() or does incorrect "offset math" (LUCENE-3642, etc.), then eventually
      this will produce invalid offsets and the test will fail.
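
A minimal, self-contained sketch of the idea, not the MockCharFilter from the patch: the class name DoublingFilterSketch, its constructor arguments, and the rule for which characters get doubled are all hypothetical. It doubles every occurrence of one chosen character and records, for each position in the filtered text, how far offsets must be shifted back to point into the original text.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; this is not Lucene's MockCharFilter. */
class DoublingFilterSketch {
    private final String filtered;

    // key:   offset in the filtered text from which the correction applies
    // value: cumulative shift (always negative here) back to the original text
    private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

    /** Doubles every occurrence of {@code doubled} in {@code input}. */
    DoublingFilterSketch(String input, char doubled) {
        StringBuilder sb = new StringBuilder();
        int extra = 0; // number of characters inserted so far
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            sb.append(c);
            if (c == doubled) {
                sb.append(c); // the inserted duplicate
                extra++;
                // Offsets at or after the end of the duplicate are shifted by `extra`.
                corrections.put(sb.length(), -extra);
            }
        }
        filtered = sb.toString();
    }

    /** The text an analyzer would see after filtering. */
    String filtered() {
        return filtered;
    }

    /** Maps an offset in the filtered text back to the original text. */
    int correctOffset(int filteredOffset) {
        Map.Entry<Integer, Integer> e = corrections.floorEntry(filteredOffset);
        return e == null ? filteredOffset : filteredOffset + e.getValue();
    }
}
```

With the input "ab cd" and 'b' doubled, the filtered text is "abb cd". A token covering "cd" sits at offsets 4 to 6 in the filtered text; an analyzer that reports those offsets uncorrected claims an end offset beyond the 5-character original, which is the kind of inconsistency a test can detect. After correction the offsets become 3 to 5.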

      Apart from bugs in the tests themselves, this found 2 real bugs: ICUTokenizer did not call
      correctOffset() in its end(), and ThaiWordFilter did incorrect offset math.
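
To illustrate what "incorrect offset math" can look like, here is a hedged sketch of one common shape of the mistake. It is not necessarily the bug that was fixed in ThaiWordFilter or ICUTokenizer, and the class OffsetMathSketch and its method names are hypothetical. The correction function is passed in as a plain IntUnaryOperator standing in for a tokenizer's correctOffset().

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical illustration of offset arithmetic around a charfilter. */
class OffsetMathSketch {

    // Wrong: corrects the start, then adds a length that was measured in the
    // filtered text. The result mixes two coordinate systems.
    static int endWrong(IntUnaryOperator correct, int start, int length) {
        return correct.applyAsInt(start) + length;
    }

    // Right: do all arithmetic in filtered coordinates, then correct once.
    static int endRight(IntUnaryOperator correct, int start, int length) {
        return correct.applyAsInt(start + length);
    }

    // The final offset reported at end-of-stream needs the same treatment:
    // the number of filtered characters read must be corrected as well.
    static int finalOffset(IntUnaryOperator correct, int filteredCharsRead) {
        return correct.applyAsInt(filteredCharsRead);
    }
}
```

As a usage example, assume the original text "ab cd" was filtered to "abb cd", so the correction is `off -> off >= 3 ? off - 1 : off`. For the token "abb" (start 0, length 3), endWrong gives 3 while endRight gives 2, the true end of "ab" in the original. Likewise, finalOffset for the 6 filtered characters gives 5, the original length.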

      Attachments

        1. LUCENE-3717_ngram.patch
          22 kB
          Robert Muir
        2. LUCENE-3717_more.patch
          39 kB
          Robert Muir
        3. LUCENE-3717.patch
          14 kB
          Robert Muir


          People

            Assignee: Unassigned
            Reporter: Robert Muir (rcmuir)
            Votes: 0
            Watchers: 0
