Lucene - Core
LUCENE-3042

AttributeSource can have an invalid computed state

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.9.4, 3.0.3, 3.1, 3.2, 4.0-ALPHA
    • Fix Version/s: 2.9.5, 3.0.4, 3.1, 3.2, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If you work with a TokenStream, consume it, then reuse it and add an attribute to it, the computed state is wrong.
      Thus, for example, clearAttributes() will not actually clear the newly added attribute.

      So in some situations, addAttribute() does not invalidate the cached computed state when it should.
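
      A minimal sketch of that scenario, assuming Lucene 3.1 and a StandardAnalyzer chain; the field name, input text and the choice of PayloadAttribute are only illustrative:

      import java.io.IOException;
      import java.io.StringReader;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
      import org.apache.lucene.util.Version;

      public class ReuseSketch {
        public static void main(String[] args) throws IOException {
          StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);

          // first pass: consuming the stream caches the computed state per AttributeSource
          TokenStream stream = analyzer.reusableTokenStream("f", new StringReader("first run"));
          stream.reset();
          while (stream.incrementToken()) { /* consume */ }

          // second pass: reuse the stream and add a new attribute on top of the chain
          stream = analyzer.reusableTokenStream("f", new StringReader("second run"));
          stream.addAttribute(PayloadAttribute.class);
          stream.reset();
          while (stream.incrementToken()) {
            // with the bug, clearAttributes() works off the stale cached state,
            // so the attribute added above is never cleared between tokens
          }
        }
      }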

      1. LUCENE-3042.patch
        8 kB
        Uwe Schindler
      2. LUCENE-3042.patch
        9 kB
        Uwe Schindler

        Activity

        Transition               Time In Source Status   Execution Times   Last Executer    Last Execution Date
        Open → In Progress       1h 19m                  1                 Uwe Schindler    22/Apr/11 21:49
        In Progress → Resolved   2h 18m                  1                 Uwe Schindler    23/Apr/11 00:08
        Resolved → Closed        41d 17h 29m             1                 Robert Muir      03/Jun/11 17:37
        Robert Muir made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Robert Muir added a comment -

        Bulk closing for 3.2

        Uwe Schindler made changes -
        Fix Version/s 2.9.5 [ 12315914 ]
        Fix Version/s 3.0.4 [ 12315913 ]
        Fix Version/s 3.1 [ 12314822 ]
        Uwe Schindler added a comment -

        I thought about it:
        This bug is serious enough that it should be fixed in all branches, too (even those that will never see another release). This is important e.g. for 2.9 users who are stuck with that version.

        Committed 3.1 branch revision: 1096127
        Committed 3.0 branch revision: 1096128
        Committed 2.9 branch revision: 1096129

        Uwe Schindler made changes -
        Status In Progress [ 3 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Uwe Schindler added a comment -

        Committed trunk revision: 1096073, 1096077
        Committed 3.x revision: 1096078

        Please reopen when 3.1.1 or 2.9.5 should be released.

        Uwe Schindler made changes -
        Fix Version/s 3.2 [ 12316070 ]
        Fix Version/s 4.0 [ 12314025 ]
        Affects Version/s 3.1 [ 12314822 ]
        Affects Version/s 3.0.3 [ 12315147 ]
        Affects Version/s 2.9.4 [ 12315148 ]
        Affects Version/s 3.2 [ 12316070 ]
        Affects Version/s 4.0 [ 12314025 ]
        Priority Major [ 3 ] Critical [ 2 ]
        Component/s Analysis [ 12310230 ]
        Uwe Schindler added a comment -

        Just to conclude:
        This bug is not as serious as it appears (otherwise someone would have noticed it before), as it never happens with ordinary TokenStreams used the way IndexWriter uses them.
        This bug only appears if you have TokenFilters and you add attributes on the top-level filter later (after using the TokenStream for the first time). Using the TokenStream computes the states, so every Filter/Tokenizer gets its own cached state. Adding a new attribute on the last filter then never invalidates the Tokenizer's cache.

        This bug could affect:

        • Analyzers that partly reuse TokenStreams and plug filters on top in the reusableTokenStream() method, reusing the partially cached token stream. For example, those that always add a non-cacheable TokenFilter on top of a base TokenStream (see the sketch after this comment).
        • TokenStreams that add attributes on the fly in one of their filters.

        We should backport this patch to 3.x, 3.1.1 and maybe even the 2.9.x and 3.0.x branches (if somebody wants to patch 3.0). In general this is a serious issue in the TokenStream API introduced in 2.9.
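
        A minimal sketch of the first affected pattern, assuming the Lucene 3.x Analyzer reuse API; the class names, the filter and its PayloadAttribute are purely hypothetical and not taken from the issue:

        import java.io.IOException;
        import java.io.Reader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.Tokenizer;
        import org.apache.lucene.analysis.WhitespaceTokenizer;
        import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;

        // Hypothetical analyzer: only the Tokenizer is cached for reuse, while a fresh
        // filter is plugged on top for every call to reusableTokenStream().
        public class PartiallyReusingAnalyzer extends Analyzer {

          @Override
          public TokenStream tokenStream(String fieldName, Reader reader) {
            return new PayloadAddingFilter(new WhitespaceTokenizer(reader));
          }

          @Override
          public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
            Tokenizer cached = (Tokenizer) getPreviousTokenStream();
            if (cached == null) {
              cached = new WhitespaceTokenizer(reader);
              setPreviousTokenStream(cached);      // only the base Tokenizer is cached ...
            } else {
              cached.reset(reader);
            }
            // ... while the non-cacheable filter is recreated each time; its addAttribute()
            // call lands in the shared map but never invalidates the Tokenizer's cached state.
            return new PayloadAddingFilter(cached);
          }

          // Hypothetical filter that introduces an attribute of its own.
          private static final class PayloadAddingFilter extends TokenFilter {
            private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

            PayloadAddingFilter(TokenStream in) {
              super(in);
            }

            @Override
            public boolean incrementToken() throws IOException {
              return input.incrementToken();
            }
          }
        }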

        Uwe Schindler made changes -
        Attachment LUCENE-3042.patch [ 12477168 ]
        Uwe Schindler added a comment -

        Reposted patch; the first one had a logic error.

        Uwe Schindler made changes -
        Attachment LUCENE-3042.patch [ 12477167 ]
        Uwe Schindler made changes -
        Attachment LUCENE-3042.patch [ 12477167 ]
        Uwe Schindler added a comment -

        New patch that makes the code simpler to read:

        • I replaced computeCurrentState() with a getter method that does the check and returns the state.
        • This makes it easier to use and removes lots of checks, even for hasAttributes().
        • Most methods using the State now consist only of the for loop (a sketch of the idea follows below).
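
        A small self-contained model of that getter pattern; the names and the String stand-ins are illustrative, not the committed patch:

        import java.util.ArrayList;
        import java.util.List;

        // Simplified model of the refactoring (not the actual AttributeSource code):
        // instead of guarding every call site with "if (currentState == null)
        // computeCurrentState()", callers ask one getter that validates and, if needed,
        // rebuilds the cached state, then just iterate over it.
        class StateCacheSketch {
          static final class State {
            String attribute;   // stand-in for an AttributeImpl
            State next;
          }

          private final List<String> attributes = new ArrayList<String>();
          private State currentState;                   // null means "needs recomputation"

          void addAttribute(String name) {
            attributes.add(name);
            currentState = null;                        // invalidate the cache
          }

          boolean hasAttributes() {
            return !attributes.isEmpty();
          }

          private State getCurrentState() {             // replaces computeCurrentState()
            if (currentState != null || !hasAttributes()) {
              return currentState;                      // cache valid, or nothing to compute
            }
            State head = null;
            for (int i = attributes.size() - 1; i >= 0; i--) {
              State s = new State();                    // rebuild the linked state list
              s.attribute = attributes.get(i);
              s.next = head;
              head = s;
            }
            return currentState = head;
          }

          void clearAttributes() {
            // call sites now consist only of the for loop over the state
            for (State s = getCurrentState(); s != null; s = s.next) {
              System.out.println("clearing " + s.attribute);
            }
          }
        }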
        Uwe Schindler made changes -
        Attachment LUCENE-3042.patch [ 12477162 ]
        Uwe Schindler added a comment -

        Here is a quick patch using a reference (a one-element array) to the state that is shared between all AttributeSources.
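
        The idea, sketched very roughly with illustrative names (not the patch itself): chained sources hand the same one-element array to each other, so clearing the slot in one source invalidates the cached state for the whole chain.

        // Illustrative model of sharing a mutable state slot via a one-element array.
        class SharedStateSketch {
          static final class State { /* cached, linked attribute state */ }

          // The array object itself is shared; its single slot holds the cached state.
          private final State[] currentState;

          SharedStateSketch() {                          // root of the chain (e.g. a Tokenizer)
            this.currentState = new State[1];
          }

          SharedStateSketch(SharedStateSketch input) {   // wrapping source (e.g. a TokenFilter)
            this.currentState = input.currentState;      // share the slot, not only the maps
          }

          void addAttribute() {
            currentState[0] = null;   // invalidates the cache for every source in the chain
          }
        }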

        Uwe Schindler made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Uwe Schindler added a comment -

        I found the bug. It's totally confusing, but it is a design error in the state caching that originated with Michael Busch:

        • The Tokenizer and all Filters share the same Map instances for the attributes.
        • BUT: the Tokenizer and every filter each have their own cached state.
        • If you call clearAttributes() on the filter, it clears its own state, but not the one of its parent Tokenizer.
        • If the Tokenizer calls clearAttributes(), nothing is done.

        The fix is to make the AttributeSources share not only the maps, but also the state (a rough sketch of the broken design follows below).
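
        A rough, self-contained model of that broken design; this is not the actual Lucene code, and the Map type and fields are simplified:

        import java.util.LinkedHashMap;
        import java.util.Map;

        // Simplified model of the pre-fix design: shared attribute map, per-instance state cache.
        class BrokenCachingSketch {
          static final class State { /* cached, linked attribute state */ }

          private final Map<Class<?>, Object> attributes;  // shared by Tokenizer and all filters
          private State currentState;                      // NOT shared: each instance has its own

          BrokenCachingSketch() {                          // the Tokenizer
            this.attributes = new LinkedHashMap<Class<?>, Object>();
          }

          BrokenCachingSketch(BrokenCachingSketch input) { // a TokenFilter wrapping it
            this.attributes = input.attributes;            // the maps are shared ...
            // ... but currentState is not; each instance keeps its own cache
          }

          void addAttribute(Class<?> att, Object impl) {
            attributes.put(att, impl);
            currentState = null;   // only invalidates THIS instance's cache;
                                   // the Tokenizer's cached state stays stale
          }
        }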

        Uwe Schindler added a comment -

        Now I am totally confused; to my knowledge this should be impossible.

        I will look into it.

        Uwe Schindler made changes -
        Field Original Value New Value
        Assignee Uwe Schindler [ thetaphi ]
        Robert Muir added a comment -

        Trivial test: if you add computeCurrentState() unconditionally as the first line of AttributeSource.clearAttributes, the test will pass.

        import java.io.IOException;
        import java.io.StringReader;

        public class SimpleTest extends BaseTokenStreamTestCase {
          public void testSimple() throws IOException {
            String testString = "t";

            Analyzer analyzer = new MockAnalyzer(random);
            TokenStream stream = analyzer.reusableTokenStream("dummy", new StringReader(testString));
            stream.reset();
            while (stream.incrementToken()) {
              // consume the stream once so its computed state gets cached
            }

            // reusing the analyzer exposes the stale cached state: this assert fails without the fix
            assertAnalyzesToReuse(analyzer, testString, new String[] { "t" });
          }
        }
        
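        For illustration, the workaround mentioned above would amount to something like this inside AttributeSource (a hedged sketch with approximated field names, not the committed fix, which shares the state instead):

        // Hypothetical: recompute the state unconditionally before clearing, so a newly
        // added attribute can no longer be missed by a stale cache.
        public void clearAttributes() {
          computeCurrentState();   // unconditional, as suggested above
          for (State state = currentState; state != null; state = state.next) {
            state.attribute.clear();
          }
        }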
        Robert Muir created issue -

          People

          • Assignee:
            Uwe Schindler
          • Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            1
