Lucene - Core
LUCENE-3717

Add fake charfilter to BaseTokenStreamTestCase to find offsets bugs

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Recently, lots of issues about broken offsets have been fixed, but it would be nice to improve
      the test coverage and verify that offsets work across the board (especially with charfilters).

      In BaseTokenStreamTestCase.checkRandomData, we can sometimes pass the analyzer a reader wrapped
      in a "MockCharFilter" (the one in the patch sometimes doubles characters). If the analyzer does
      not call correctOffset or does incorrect "offset math" (LUCENE-3642, etc.), then eventually
      this will produce wrong offsets and the test will fail.

      Besides test bugs, this found two real bugs: ICUTokenizer did not call correctOffset() in its
      end(), and ThaiWordFilter did incorrect offset math.
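The doubling-and-correction idea is easy to sketch outside Lucene. Below is a hypothetical standalone demo (not the patch's actual MockCharFilter, which is a real CharFilter subclass): every character is emitted twice, and correctOffset maps offsets in the doubled text back to the original, which is what an analyzer must do for its start/end offsets to survive a charfilter.

```java
// Hypothetical standalone sketch of the doubling idea behind MockCharFilter
// (not the actual class from the patch). Every input character is emitted
// twice; correctOffset() maps an offset in the doubled text back to the
// corresponding offset in the original text.
public class DoublingOffsetsDemo {

    // "hello" -> "hheelllloo"
    static String doubleChars(String input) {
        StringBuilder sb = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            sb.append(input.charAt(i)).append(input.charAt(i));
        }
        return sb.toString();
    }

    // Original char i occupies doubled positions 2i and 2i+1, so an offset
    // t in the doubled text corresponds to offset t/2 in the original.
    static int correctOffset(int doubledOffset) {
        return doubledOffset / 2;
    }

    public static void main(String[] args) {
        String orig = "hello world";
        String doubled = doubleChars(orig);   // "hheelllloo  wwoorrlldd"
        int start = doubled.indexOf("ww");    // token "wwoorrlldd" at [12,22)
        int end = start + "wwoorrlldd".length();
        // Corrected offsets point at "world" in the ORIGINAL text:
        System.out.println(orig.substring(correctOffset(start), correctOffset(end)));
        // prints "world"; uncorrected, substring(12, 22) would be out of range
    }
}
```

An analyzer that skips offset correction would report [12,22) against an 11-character original string, which is precisely the class of failure checkRandomData is designed to surface.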

      1. LUCENE-3717.patch
        14 kB
        Robert Muir
      2. LUCENE-3717_more.patch
        39 kB
        Robert Muir
      3. LUCENE-3717_ngram.patch
        22 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        I committed this. I will go through the analyzers and try to make sure they are all using checkRandomData (I think most are), just to see if we have any other bugs sitting out there.

        It would be nice to have these offsets all under control for the next release.

        Robert Muir added a comment -

        I started adding checkRandomData to more analyzers, and found 5 bugs already:

        • broken offsets in TrimFilter and WordDelimiterFilter, along the same lines as here
        • HyphenatedWordsFilter was broken worse: if the text ends with a hyphen, the last token always had an end offset of 0 (because it read uninitialized attributes)
        • PatternAnalyzer was completely broken with charfilters
        • WikipediaTokenizer was broken in many ways; in general the tokenizer keeps a ton of state variables but never resets this state

        The patch fixes these, but I'm sure adding more tests to the remaining filters will find more bugs.
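The WikipediaTokenizer item illustrates a general bug class: a reusable stream that keeps per-input state must clear that state every time it is pointed at new input. A minimal standalone sketch with hypothetical names (not the actual tokenizer):

```java
// Hypothetical sketch of the "state never reset" bug class (not the real
// WikipediaTokenizer): a reusable whitespace tokenizer whose per-input
// state (pos) must be cleared whenever it is given new input.
public class ResettableTokenizerDemo {
    private String input = "";
    private int pos; // per-input state: stale values break the next input

    void setInput(String input) {
        this.input = input;
        reset(); // forgetting to clear state on reuse is exactly the bug
    }

    void reset() {
        pos = 0;
    }

    // Returns the next whitespace-separated token, or null at end of input.
    String nextToken() {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
        if (pos >= input.length()) return null;
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != ' ') pos++;
        return input.substring(start, pos);
    }

    public static void main(String[] args) {
        ResettableTokenizerDemo t = new ResettableTokenizerDemo();
        t.setInput("foo bar");
        System.out.println(t.nextToken()); // prints "foo"
        System.out.println(t.nextToken()); // prints "bar"
        t.setInput("baz");                 // reuse: reset() makes this work
        System.out.println(t.nextToken()); // prints "baz"; with stale pos, null
    }
}
```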

        Robert Muir added a comment -

        Reopening since we have more work to do / more bugs.

        I'll look at committing/backporting the current patch as a start, but I think we should check every tokenizer/filter/etc. and just clean this up.

        Robert Muir added a comment -

        Second patch is committed and backported.

        It just remains to add the random test to all remaining tokenstreams...

        Robert Muir added a comment -

        More bugs in the n-gram tokenizers. They:

        • were wrongly computing end() from the trimmed length
        • were not calling correctOffset()
        • were not checking the return value of Reader.read, causing bugs in some situations (e.g. an empty StringReader)
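The Reader.read pitfall in the last bullet is a classic one: read(char[], int, int) returns -1 at end of stream, so treating the return value as a count without checking it corrupts buffer bookkeeping on, e.g., an empty StringReader. A sketch of the defensive loop (hypothetical helper, not the patch's code):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper (not the patch's code) showing the defensive loop:
// Reader.read returns -1 at end of stream, so the result must be checked
// before it is added to any count or offset.
public class ReadFullyDemo {

    // Fills buf as far as possible, returning the number of chars actually
    // read: 0 for an already-empty reader, never -1.
    static int readFully(Reader reader, char[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = reader.read(buf, total, buf.length - total);
            if (n < 0) break; // end of stream: stop; adding -1 is the bug
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        char[] buf = new char[8];
        System.out.println(readFully(new StringReader(""), buf));    // prints 0
        System.out.println(readFully(new StringReader("abc"), buf)); // prints 3
    }
}
```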

          People

          • Assignee: Unassigned
          • Reporter: Robert Muir
          • Votes: 0
          • Watchers: 1
