Lucene - Core
LUCENE-3717

Add fake charfilter to BaseTokenStreamTestCase to find offsets bugs

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Recently lots of issues have been fixed about broken offsets, but it would be nice to improve the
      test coverage and test that they work across the board (especially with charfilters).

      In BaseTokenStreamTestCase.checkRandomData, we can sometimes pass the analyzer a reader wrapped
      in a "MockCharFilter" (the one in the patch sometimes doubles characters). If the analyzer does
      not call correctOffset or does incorrect "offset math" (LUCENE-3642, etc.), then eventually
      this will create incorrect offsets and the test will fail.

      Besides test bugs, this found two real bugs: ICUTokenizer did not call correctOffset() in its end(),
      and ThaiWordFilter did incorrect offset math.
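      The mechanism behind the test can be sketched roughly as follows. This is a hypothetical, self-contained illustration (not the actual MockCharFilter or CharFilter code): a filter that doubles characters must remember, for each output offset, the shift back to the matching offset in the original input. A tokenizer that skips correctOffset() reports offsets into the doubled text instead, which is exactly what the randomized test catches.

      ```java
      import java.util.Map;
      import java.util.TreeMap;

      // Hypothetical sketch of offset correction for a character-doubling filter.
      public class OffsetCorrectionSketch {
          // output offset -> shift to add to recover the original input offset
          private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

          // Double every character, recording a correction at each boundary.
          public String doubleChars(String input) {
              StringBuilder out = new StringBuilder();
              for (int i = 0; i < input.length(); i++) {
                  corrections.put(out.length(), i - out.length());
                  out.append(input.charAt(i)).append(input.charAt(i));
              }
              corrections.put(out.length(), input.length() - out.length());
              return out.toString();
          }

          // Analogous in spirit to CharFilter.correctOffset(): map an offset in
          // the filtered text back to an offset in the original input.
          public int correctOffset(int currentOff) {
              Map.Entry<Integer, Integer> e = corrections.floorEntry(currentOff);
              return e == null ? currentOff : currentOff + e.getValue();
          }
      }
      ```

      For input "ab" the filtered text is "aabb"; a token ending at filtered offset 4 must be corrected back to offset 2 in the original input, and a tokenizer that reports the uncorrected 4 fails the test.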

      1. LUCENE-3717.patch
        14 kB
        Robert Muir
      2. LUCENE-3717_more.patch
        39 kB
        Robert Muir
      3. LUCENE-3717_ngram.patch
        22 kB
        Robert Muir

        Activity

        Robert Muir created issue -
        Robert Muir made changes -
        Field Original Value New Value
        Attachment LUCENE-3717.patch [ 12511456 ]
        Robert Muir added a comment -

        I committed this. I will go through the analyzers and try to make sure they are all using checkRandomData (I think most are), just to see if we have any other bugs sitting out there.

        It would be nice to have these offsets all under control for the next release.

        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Robert Muir added a comment -

        I started adding checkRandomData to more analyzers, and found 5 bugs already:

        • Broken offsets in TrimFilter and WordDelimiterFilter, along the same lines as here.
        • HyphenatedWordsFilter was broken worse: if the text ends with a hyphen, the last token always had an end offset of 0 (because it read uninitialized attributes).
        • PatternAnalyzer is completely broken with charfilters.
        • WikipediaTokenizer is broken in many ways; in general the tokenizer keeps a ton of state variables but never resets this state.

        The patch fixes these, but I'm sure adding more tests to the remaining filters will find more bugs.
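        The WikipediaTokenizer reset bug above follows a common reuse pattern, sketched here with a hypothetical, self-contained stand-in (not the actual tokenizer code): per-document state that is never cleared in reset() leaks into the next document.

        ```java
        // Hypothetical sketch of the tokenizer-reuse bug pattern: per-document
        // state variables must be cleared in reset(), or the second document
        // sees stale values from the first.
        public class StatefulTokenizerSketch {
            private String input;
            private int pos;            // per-document state
            private boolean insideLink; // per-document state; stale values break reuse

            public void setInput(String s) { this.input = s; }

            // The fix: reset() must clear every per-document state variable.
            public void reset() {
                pos = 0;
                insideLink = false;
            }

            // Returns the next char, or -1 at end of input.
            public int read() {
                if (pos >= input.length()) return -1;
                return input.charAt(pos++);
            }

            public int currentOffset() { return pos; }
        }
        ```

        Without the reset() call between documents, pos would keep growing across reuses, producing exactly the kind of garbage offsets the randomized test flags.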

        Robert Muir made changes -
        Attachment LUCENE-3717_more.patch [ 12511473 ]
        Robert Muir added a comment -

        Reopening since we have more work to do / more bugs.

        I'll look at committing/backporting the current patch as a start, but I think we should check every tokenizer/filter/etc. and just clean this up.

        Robert Muir made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Robert Muir added a comment -

        Second patch is committed and backported.

        It just remains to add the random test to all remaining tokenstreams...

        Robert Muir added a comment -

        More bugs in the n-gram tokenizers. They:

        • were wrongly computing end() from the trimmed length
        • were not calling correctOffset()
        • were not checking the return value of Reader.read, causing bugs in some situations (e.g. an empty StringReader)
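        The last bullet is a classic pattern worth spelling out; here is a hypothetical, self-contained sketch (not the actual n-gram tokenizer code) of the correct handling: Reader.read(char[]) returns -1 at end of stream (as an empty StringReader does immediately), so the return value must be checked before being used as a length.

        ```java
        import java.io.IOException;
        import java.io.Reader;
        import java.io.UncheckedIOException;

        // Hypothetical sketch: drain a Reader while checking read()'s return value.
        public class SafeRead {
            public static String readFully(Reader r, int bufSize) {
                char[] buf = new char[bufSize];
                StringBuilder sb = new StringBuilder();
                try {
                    int n;
                    // -1 means end of stream, not "read 0 chars"; using it
                    // unchecked as a length is the bug described above
                    while ((n = r.read(buf)) != -1) {
                        sb.append(buf, 0, n);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return sb.toString();
            }
        }
        ```

        With an empty StringReader the very first read() already returns -1, which is the edge case that tripped the tokenizers.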
        Robert Muir made changes -
        Attachment LUCENE-3717_ngram.patch [ 12511656 ]
        Robert Muir made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition            Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open → Resolved       56m 37s                 1                 Robert Muir     23/Jan/12 00:26
        Resolved → Reopened   3h 48m                  1                 Robert Muir     23/Jan/12 04:14
        Reopened → Resolved   1d 6h 38m               1                 Robert Muir     24/Jan/12 10:52
        Resolved → Closed     471d 23h 51m            1                 Uwe Schindler   10/May/13 11:43

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            0
