Lucene - Core
LUCENE-3042

AttributeSource can have an invalid computed state

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.9.4, 3.0.3, 3.1, 3.2, 4.0-ALPHA
    • Fix Version/s: 2.9.5, 3.0.4, 3.1, 3.2, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If you work with a TokenStream, consume it, then reuse it and add an attribute to it, the computed state is wrong;
      thus, for example, clearAttributes() will not actually clear the newly added attribute.

      So in some situations, addAttribute() does not invalidate the computed state when it should.
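      A minimal sketch of the scenario described above (not part of the original report; it assumes the Lucene 3.1 analysis API, and the class name ComputedStateDemo is invented): after the chain has been consumed once, an attribute added on the outermost filter is missed when the Tokenizer clears attributes, because the Tokenizer still iterates its stale cached state.

      import java.io.StringReader;

      import org.apache.lucene.analysis.LowerCaseFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.WhitespaceTokenizer;
      import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
      import org.apache.lucene.util.Version;

      public class ComputedStateDemo {
        public static void main(String[] args) throws Exception {
          WhitespaceTokenizer tokenizer =
              new WhitespaceTokenizer(Version.LUCENE_31, new StringReader("Foo Bar"));
          TokenStream stream = new LowerCaseFilter(Version.LUCENE_31, tokenizer);
          stream.reset();
          while (stream.incrementToken()) {
            // consuming caches a computed state in each AttributeSource of the chain
          }
          // add a new attribute on the outermost filter only; this invalidates the
          // filter's cached state but, with the bug, not the Tokenizer's
          FlagsAttribute flags = stream.addAttribute(FlagsAttribute.class);
          flags.setFlags(42);
          // Tokenizer implementations call clearAttributes() at the start of
          // incrementToken(); with a stale cached state the new attribute is skipped
          tokenizer.clearAttributes();
          System.out.println(flags.getFlags()); // 42 with the bug, 0 once fixed
        }
      }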

      1. LUCENE-3042.patch (9 kB) by Uwe Schindler
      2. LUCENE-3042.patch (8 kB) by Uwe Schindler

        Activity

        Robert Muir added a comment -

        Trivial test: if you add computeCurrentState() unconditionally as the first line of AttributeSource.clearAttributes, the test will pass.

        import java.io.IOException;
        import java.io.StringReader;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.BaseTokenStreamTestCase;
        import org.apache.lucene.analysis.MockAnalyzer;
        import org.apache.lucene.analysis.TokenStream;

        public class SimpleTest extends BaseTokenStreamTestCase {
          public void testSimple() throws IOException {
            String testString = "t";

            Analyzer analyzer = new MockAnalyzer(random);
            TokenStream stream = analyzer.reusableTokenStream("dummy", new StringReader(testString));
            stream.reset();
            while (stream.incrementToken()) {
              // consume the stream once, caching the computed state
            }

            // reuse the same analyzer; the check attribute added by the test framework
            // exposes that clearAttributes() misses it
            assertAnalyzesToReuse(analyzer, testString, new String[] { "t" });
          }
        }
        
        Uwe Schindler added a comment -

        Now I am totally confused; to my knowledge this should be impossible.

        I will look into it.

        Uwe Schindler added a comment -

        I found the bug. It is totally confusing, but it is a design error in the state caching, originated by Michael Busch (sketched below):

        • the Tokenizer and all Filters share the same Map instances for the attributes
        • BUT: the Tokenizer and every Filter have their own cached state
        • if you call clearAttributes on the Filter, it clears its own state, but not the one of its parent Tokenizer
        • if the Tokenizer calls clearAttributes, nothing is done

        The fix is to make the AttributeSources not only share the maps, but also the state.
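        A hypothetical sketch of the design error (names simplified, not the actual Lucene code): every source in a chain shares the attribute map, but each keeps its own private cached state, so invalidating the cache on one source leaves the others stale.

        import java.util.LinkedHashMap;
        import java.util.Map;

        class BrokenSource {
          // shared between the Tokenizer and all Filters of one chain
          private final Map<Class<?>, Object> attributes;
          // NOT shared: each instance caches its own computed state
          private Object[] currentState;

          BrokenSource() { this.attributes = new LinkedHashMap<Class<?>, Object>(); }
          BrokenSource(BrokenSource input) { this.attributes = input.attributes; }

          void addAttribute(Class<?> attClass, Object impl) {
            attributes.put(attClass, impl);
            currentState = null;   // only THIS instance's cache is invalidated
          }

          void clearAttributes() {
            if (currentState == null) {
              currentState = attributes.values().toArray();   // recompute lazily
            }
            // a parent Tokenizer still holding an old currentState iterates it here
            // and never sees an attribute that was added later on a child Filter
            for (Object att : currentState) {
              // att.clear() in the real code
            }
          }
        }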

        Uwe Schindler added a comment -

        Here is a quick patch using a reference (a one-element array) to the state that is shared between all AttributeSources.
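        The patch itself is attached above; as a rough illustration of the idea (hypothetical names, not the patch code), the root source allocates a single-element array and every wrapping source reuses it, so invalidating the cached state anywhere is visible everywhere:

        class SharedSlotSource {
          final Object[] stateHolder;   // stateHolder[0] is the cached state, null = invalid

          SharedSlotSource() { this.stateHolder = new Object[1]; }      // root, e.g. the Tokenizer
          SharedSlotSource(SharedSlotSource input) { this.stateHolder = input.stateHolder; }  // Filters share it

          void invalidate() { stateHolder[0] = null; }                  // seen by the whole chain
        }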

        Uwe Schindler added a comment -

        New patch that makes the code simpler to read:

        • I replaced computeCurrentState() with a getter method that does the check and returns the state.
        • This makes it easier to use and removes lots of checks, even for hasAttributes().
        • Most methods using State now consist only of the for loop (see the sketch below).
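        A hedged sketch of the getter pattern described above (simplified, invented names, not the patch itself): the getter checks the shared slot and recomputes lazily, so callers such as clearAttributes() reduce to a single for loop over the returned state list.

        class GetterSketch {
          static final class State {
            Att attribute;
            State next;
          }
          interface Att { void clear(); }

          private final State[] sharedState = new State[1];   // single slot shared across the chain, as above

          private State getCurrentState() {
            State state = sharedState[0];
            if (state != null) {
              return state;                                    // cache still valid
            }
            return sharedState[0] = rebuildStateList();        // recompute once, visible to all sources
          }

          public void clearAttributes() {
            // the whole method is now just the loop; the state check lives in the getter
            for (State state = getCurrentState(); state != null; state = state.next) {
              state.attribute.clear();
            }
          }

          private State rebuildStateList() {
            return null;   // placeholder: in AttributeSource this walks the attribute map
          }
        }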
        Uwe Schindler added a comment -

        Reposted the patch; the first one had a logic error.

        Uwe Schindler added a comment -

        Just to conclude:
        This bug is not as serious as it appears (else someone would have noticed it before), as it never happens with ordinary ("0-8-15") TokenStreams used the way IndexWriter uses them.
        This bug only appears if you have TokenFilters and add attributes on the top-level Filter later (after using the TokenStream for the first time). Using the TokenStream means that the states are computed, so every Filter/Tokenizer gets its own cached state. Adding a new attribute on the last Filter will then never invalidate the Tokenizer's cache.

        This bug could affect:

        • Analyzers that partly reuse TokenStreams and plug Filters on top in the reusableTokenStream() method, reusing the partially cached TokenStream, e.g. those that always add a non-cacheable TokenFilter on top of a base TokenStream (see the sketch below).
        • TokenStreams that add attributes on the fly in one of their Filters.

        We should backport this patch to the 3.x, 3.1.1 and maybe even the 2.9.x and 3.0.x branches (if somebody wants to patch 3.0). In general this is a serious issue in the new TokenStream API introduced in 2.9.
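        A hedged sketch (hypothetical classes, assuming the Lucene 3.1 analysis API) of the first affected pattern: the base Tokenizer is cached and reused across calls, while a fresh filter that adds a new attribute type is wrapped around it every time, so with the bug the reused Tokenizer's cached state never learns about that attribute.

        import java.io.IOException;
        import java.io.Reader;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.WhitespaceTokenizer;
        import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
        import org.apache.lucene.util.Version;

        public class PartlyReusingAnalyzer extends Analyzer {

          // hypothetical non-cacheable filter that adds an attribute the base chain lacks
          static final class FlaggingFilter extends TokenFilter {
            private final FlagsAttribute flagsAtt = addAttribute(FlagsAttribute.class);
            FlaggingFilter(TokenStream input) { super(input); }
            @Override
            public boolean incrementToken() throws IOException {
              if (!input.incrementToken()) return false;
              flagsAtt.setFlags(1);
              return true;
            }
          }

          @Override
          public TokenStream tokenStream(String fieldName, Reader reader) {
            return new FlaggingFilter(new WhitespaceTokenizer(Version.LUCENE_31, reader));
          }

          @Override
          public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
            WhitespaceTokenizer tokenizer = (WhitespaceTokenizer) getPreviousTokenStream();
            if (tokenizer == null) {
              tokenizer = new WhitespaceTokenizer(Version.LUCENE_31, reader);
              setPreviousTokenStream(tokenizer);
            } else {
              tokenizer.reset(reader);
            }
            // the fresh filter registers FlagsAttribute on the shared maps, but with the
            // bug the reused Tokenizer's cached state is never invalidated
            return new FlaggingFilter(tokenizer);
          }
        }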

        Uwe Schindler added a comment -

        Committed trunk revision: 1096073, 1096077
        Committed 3.x revision: 1096078

        Please reopen when 3.1.1 or 2.9.5 should be released.

        Uwe Schindler added a comment -

        I thought about it:
        This bug is so serious that it should be fixed in all branches, too (even if they are never released again). This is important, e.g., for 2.9 users who are stuck with that version.

        Committed 3.1 branch revision: 1096127
        Committed 3.0 branch revision: 1096128
        Committed 2.9 branch revision: 1096129

        Robert Muir added a comment -

        Bulk closing for 3.2


          People

          • Assignee:
            Uwe Schindler
          • Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            0
