Lucene - Core
  1. Lucene - Core
  2. LUCENE-2211

Improve BaseTokenStreamTestCase to uses a fake attribute to check if clearAttributes() was called correctly - found bugs in contrib/analyzers

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9, 2.9.1, 3.0
    • Fix Version/s: 2.9.2, 3.0.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Robert had the idea to use a fake attribute inside BaseTokenStreamTestCase that records if its clear() method was called. If this is not the case after incrementToken(), asserTokenStreamContents fails. It also uses the attribute in TeeSinkTokenFilter, because there a lot of copying, captureState and restoreState() is used. By the attribute, you can track wonderful, if save/restore and clearAttributes is correctly implemented. It also verifies that before a captureState() it was also cleared (as the state will also contain the clear call). Because if you consume tokens in a filter, capture the consumed tokens and insert them, the capturedStates must also be cleared before.

      In contrib analyzers are some test that fail to pass this additional assertion. They are not fixed in the attached patch.

      1. LUCENE-2211-branch29.patch
        40 kB
        Uwe Schindler
      2. LUCENE-2211-branch29.patch
        36 kB
        Uwe Schindler
      3. LUCENE-2211-branch30.patch
        34 kB
        Uwe Schindler
      4. LUCENE-2211.patch
        32 kB
        Uwe Schindler
      5. LUCENE-2211.patch
        32 kB
        Robert Muir
      6. LUCENE-2211.patch
        19 kB
        Uwe Schindler
      7. LUCENE-2211.patch
        19 kB
        Uwe Schindler
      8. LUCENE-2211.patch
        18 kB
        Robert Muir
      9. LUCENE-2211.patch
        12 kB
        Robert Muir
      10. LUCENE-2211.patch
        10 kB
        Robert Muir
      11. LUCENE-2211.patch
        8 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        Committed into 2.9 branch revision: 899681

        Show
        Uwe Schindler added a comment - Committed into 2.9 branch revision: 899681
        Hide
        Uwe Schindler added a comment -

        There was a new bug in the 2.9 ShingleMatrixFilter because of incorrect Token reuse in a call to this.next(reusableToken). This patch now also contains the merged LUCENE-1939, which was missing.

        I commit this now.

        Show
        Uwe Schindler added a comment - There was a new bug in the 2.9 ShingleMatrixFilter because of incorrect Token reuse in a call to this.next(reusableToken). This patch now also contains the merged LUCENE-1939 , which was missing. I commit this now.
        Hide
        Uwe Schindler added a comment -

        Patch for 2.9 branch. Tests are running...

        Show
        Uwe Schindler added a comment - Patch for 2.9 branch. Tests are running...
        Hide
        Uwe Schindler added a comment -

        Fixed in Lucene 3.0 revision: 899639

        Show
        Uwe Schindler added a comment - Fixed in Lucene 3.0 revision: 899639
        Hide
        Uwe Schindler added a comment -

        Patch for 3.0 branch

        Show
        Uwe Schindler added a comment - Patch for 3.0 branch
        Hide
        Uwe Schindler added a comment -

        Fixed in trunk revision: 899627

        Show
        Uwe Schindler added a comment - Fixed in trunk revision: 899627
        Hide
        Uwe Schindler added a comment -

        I'll commit attached patch woith some fixes for typos etc.

        Show
        Uwe Schindler added a comment - I'll commit attached patch woith some fixes for typos etc.
        Hide
        Robert Muir added a comment -

        i reviewed all code with incrementToken() and found 3 more problems, 2 in highlighter, 1 in memory.

        i also fixed all helper tokenstreams in core/contrib tests.

        Show
        Robert Muir added a comment - i reviewed all code with incrementToken() and found 3 more problems, 2 in highlighter, 1 in memory. i also fixed all helper tokenstreams in core/contrib tests.
        Hide
        Uwe Schindler added a comment -

        More updates to TeeSink and also BaseTokenStreamTestCase to still call clearAttributes to force TS to not reuse attributes from previous calls to incrementToken(). This was one of the first traps, Robert investigated in 2.9.

        Show
        Uwe Schindler added a comment - More updates to TeeSink and also BaseTokenStreamTestCase to still call clearAttributes to force TS to not reuse attributes from previous calls to incrementToken(). This was one of the first traps, Robert investigated in 2.9.
        Hide
        Uwe Schindler added a comment -

        Some updates to TeeSink test. Changes.txt.

        Show
        Uwe Schindler added a comment - Some updates to TeeSink test. Changes.txt.
        Hide
        Robert Muir added a comment -

        Hi Uwe, PrefixAwareTokenFilter did not clearAttributes() either. I tried all others i could find but I think the rest are ok.

        Show
        Robert Muir added a comment - Hi Uwe, PrefixAwareTokenFilter did not clearAttributes() either. I tried all others i could find but I think the rest are ok.
        Hide
        Robert Muir added a comment -

        Hello Uwe, i did not get time to review all tokenstreams but I converted a ShingleMatrix test to assertTokenStreamContents and found a clearAttributes() bug in it too, so it is also fixed in this one, more tokenstreams with problems might remain.

        Show
        Robert Muir added a comment - Hello Uwe, i did not get time to review all tokenstreams but I converted a ShingleMatrix test to assertTokenStreamContents and found a clearAttributes() bug in it too, so it is also fixed in this one, more tokenstreams with problems might remain.
        Hide
        Robert Muir added a comment -

        before committing any fix i want to review / add tests for any tokenstreams that do not yet use this BaseTokenStreamTestCase, just to be sure there are no others with this problem.

        it may seem trivial but if this clearing does not take place properly, then things like position increment with stopfilter can grow to very large values, overflow, and cause IndexWriter to throw an exception: http://www.lucidimagination.com/search/document/f649a19901d33c75/illegalargumentexception_when_indexwriter_adddocument

        Show
        Robert Muir added a comment - before committing any fix i want to review / add tests for any tokenstreams that do not yet use this BaseTokenStreamTestCase, just to be sure there are no others with this problem. it may seem trivial but if this clearing does not take place properly, then things like position increment with stopfilter can grow to very large values, overflow, and cause IndexWriter to throw an exception: http://www.lucidimagination.com/search/document/f649a19901d33c75/illegalargumentexception_when_indexwriter_adddocument
        Hide
        Robert Muir added a comment -

        uwe's patch, with the fixes for contrib.

        broken were compounds, n-gram filter, and edge n-gram filter

        Show
        Robert Muir added a comment - uwe's patch, with the fixes for contrib. broken were compounds, n-gram filter, and edge n-gram filter
        Hide
        Uwe Schindler added a comment -

        The ngram things are serious, so also backport.

        We get the non-generic java 1.4 version for solr 1.5 for free.

        Show
        Uwe Schindler added a comment - The ngram things are serious, so also backport. We get the non-generic java 1.4 version for solr 1.5 for free.

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development