Lucene - Core / LUCENE-3717

Add fake charfilter to BaseTokenStreamTestCase to find offsets bugs

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Many broken-offset bugs have been fixed recently, but it would be good to improve the
      test coverage and verify that offsets are correct across the board (especially with charfilters).

      In BaseTokenStreamTestCase.checkRandomData, we can sometimes pass the analyzer a reader wrapped
      in a "MockCharFilter" (the one in the patch sometimes doubles characters). If the analyzer does
      not call correctOffset() or does incorrect "offset math" (LUCENE-3642, etc.), then eventually
      this will produce invalid offsets and the test will fail.
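
To illustrate why a character-doubling filter exposes these bugs, here is a minimal, self-contained sketch. It does not use Lucene and is not the MockCharFilter from the patch: the class DoublingCharFilterSketch, its methods, and the whitespace tokenization are all hypothetical stand-ins. The filter lengthens the text, so any offset that is not mapped back through correctOffset() can point past the end of the original input, which is the kind of inconsistency a random test can detect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for a char filter that doubles some characters. */
final class DoublingCharFilterSketch {
    final String original;
    final String filtered;
    // Key: offset in the filtered text; value: number of extra chars inserted before it.
    private final TreeMap<Integer, Integer> extraBefore = new TreeMap<>();

    DoublingCharFilterSketch(String original, IntPredicate shouldDouble) {
        this.original = original;
        StringBuilder sb = new StringBuilder();
        int extra = 0;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            sb.append(c);
            // Spaces are never doubled, so token boundaries stay recognizable.
            if (c != ' ' && shouldDouble.test(i)) {
                sb.append(c);
                extra++;
                extraBefore.put(sb.length(), extra);
            }
        }
        this.filtered = sb.toString();
    }

    /** Maps an offset in the filtered text back to the original text. */
    int correctOffset(int filteredOffset) {
        Map.Entry<Integer, Integer> e = extraBefore.floorEntry(filteredOffset);
        return e == null ? filteredOffset : filteredOffset - e.getValue();
    }

    /** Whitespace-tokenizes the filtered text and returns {start, end} pairs. */
    List<int[]> tokenOffsets(boolean correct) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < filtered.length()) {
            while (i < filtered.length() && filtered.charAt(i) == ' ') i++;
            int start = i;
            while (i < filtered.length() && filtered.charAt(i) != ' ') i++;
            if (i > start) {
                out.add(correct
                        ? new int[] {correctOffset(start), correctOffset(i)}
                        : new int[] {start, i});
            }
        }
        return out;
    }

    /** Final offset, as a tokenizer would report at end of stream. */
    int finalOffset(boolean correct) {
        return correct ? correctOffset(filtered.length()) : filtered.length();
    }

    /** Offsets must be ordered and must stay inside the original text. */
    static boolean offsetsValid(List<int[]> offsets, int originalLength) {
        int lastStart = 0;
        for (int[] o : offsets) {
            if (o[0] < lastStart || o[1] < o[0] || o[1] > originalLength) return false;
            lastStart = o[0];
        }
        return true;
    }

    public static void main(String[] args) {
        // Deterministic choice for the demo; a randomized test would vary this.
        DoublingCharFilterSketch f =
                new DoublingCharFilterSketch("foo bar", idx -> idx % 3 == 0);
        System.out.println(f.filtered);                                    // ffoo barr
        System.out.println(offsetsValid(f.tokenOffsets(true), 7));         // true
        System.out.println(offsetsValid(f.tokenOffsets(false), 7));        // false
        System.out.println(f.finalOffset(true) + " vs " + f.finalOffset(false)); // 7 vs 9
    }
}
```

With corrected offsets, each token maps back to the matching word in the original text. Skipping the correction, whether for token offsets or for the final offset, yields values beyond the original length.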

      Other than test bugs, this found 2 real bugs: ICUTokenizer did not call correctOffset() in its end(),
      and ThaiWordFilter did incorrect offset math.

      Attachments

        1. LUCENE-3717_more.patch
          39 kB
          Robert Muir
        2. LUCENE-3717_ngram.patch
          22 kB
          Robert Muir
        3. LUCENE-3717.patch
          14 kB
          Robert Muir

          People

            Assignee: Unassigned
            Reporter: Robert Muir