Lucene - Core
  1. Lucene - Core
  2. LUCENE-3990

TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API

    Details

    • Type: Bug Bug
    • Status: Reopened
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      100% reproduces for me:

      2> NOTE: reproduce with: ant test -Dtests.class=*.TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=88CA02C2BB7B1DA -Dargs="-Dfile.encoding=UTF-8"

      Running org.apache.lucene.analysis.core.TestRandomChains
      FAILURE 7.22s | TestRandomChains.testRandomChains
      > Throwable #1: java.lang.AssertionError: endOffset 1 expected:<7> but was:<8>
      > at __randomizedtesting.SeedInfo.seed([88CA02C2BB7B1DA:356D894D6CA5AC1A]:0)
      > at org.junit.Assert.fail(Assert.java:93)
      > at org.junit.Assert.failNotEquals(Assert.java:647)
      > at org.junit.Assert.assertEquals(Assert.java:128)
      > at org.junit.Assert.assertEquals(Assert.java:472)
      > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:165)
      > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:662)
      > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:486)
      > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:429)
      > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:820)

      The root cause of this is inconsequent override of several Reader methods in subclasses of CharFilter. We should fix this urgently, thanks to the random chains we found this bug.

      1. LUCENE-3990-CharFilterFix.patch
        3 kB
        Uwe Schindler
      2. LUCENE-3990.patch
        6 kB
        Robert Muir
      3. analysis-common.tests-report.txt
        7 kB
        Steve Rowe

        Issue Links

          Activity

          Steve Rowe created issue -
          Steve Rowe made changes -
          Field Original Value New Value
          Attachment analysis-common.tests-report.txt [ 12522712 ]
          Steve Rowe made changes -
          Link This issue relates to LUCENE-3919 [ LUCENE-3919 ]
          Hide
          Robert Muir added a comment -

          Definitely something sneaky going on...

          first i found a minor bug in the test. the logic for determining if we should apply
          additional offsets checks doesn't look at the return value, so if we ever try
          a tokenizer in brokenOffsetsComponents (even if it throws IAE and we decide not to use it)
          then we lose offsets checks for the eventually-working-chain.

          seed still works after fixing this... I'll commit a fix for that (still doesnt help this problem)

          Show
          Robert Muir added a comment - Definitely something sneaky going on... first i found a minor bug in the test. the logic for determining if we should apply additional offsets checks doesn't look at the return value, so if we ever try a tokenizer in brokenOffsetsComponents (even if it throws IAE and we decide not to use it) then we lose offsets checks for the eventually-working-chain. seed still works after fixing this... I'll commit a fix for that (still doesnt help this problem)
          rmuir committed 1326457 (1 file)
          Reviews: none

          LUCENE-3990: only set offsetsAreCorrect=false if we are actually going to use that tokenizer

          Hide
          Robert Muir added a comment -

          The problem is that we are getting different results the first time we create the tokenstream components,
          versus after we reset(Reader) with the same text again.

          The bug was introduced by Uwe Schindler in r1311358: when the reader-wrapper was changed to use CharFilter
          instead. because of crazy CharFilter-Reader delegation.

          http://svn.apache.org/viewvc?view=revision&revision=1311358

          Attached is a patch demonstrating the bug: with a standalone testcase, and backing out that change.
          Seed now passes (in addition to the test.

          Show
          Robert Muir added a comment - The problem is that we are getting different results the first time we create the tokenstream components, versus after we reset(Reader) with the same text again. The bug was introduced by Uwe Schindler in r1311358: when the reader-wrapper was changed to use CharFilter instead. because of crazy CharFilter-Reader delegation. http://svn.apache.org/viewvc?view=revision&revision=1311358 Attached is a patch demonstrating the bug: with a standalone testcase, and backing out that change. Seed now passes (in addition to the test.
          Robert Muir made changes -
          Attachment LUCENE-3990.patch [ 12522734 ]
          Robert Muir made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Uwe Schindler added a comment -

          Hi Robert,

          if this patch fixes the problem we should fix the CharFilter interface. Simply reverting my change hides the bug, sorry.

          Show
          Uwe Schindler added a comment - Hi Robert, if this patch fixes the problem we should fix the CharFilter interface. Simply reverting my change hides the bug, sorry.
          Hide
          Uwe Schindler added a comment -

          We have to fix the underlying bug. Subclassing CharFilter here is correct.

          Show
          Uwe Schindler added a comment - We have to fix the underlying bug. Subclassing CharFilter here is correct.
          Uwe Schindler made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Assignee Uwe Schindler [ thetaphi ]
          Hide
          Uwe Schindler added a comment -

          Hi Robert,

          I investigated the problem, its indeed crazy wrapping: The problem was that CharFilter did not override read() (without char[]). The same applied to CharReader!!! By not overriding that method it was also slowing down all charfilters, because the base class Reader automatically delegates to read(char[]), creating a new char[1] every time.

          The attached patch fixes this to delegate correctly.

          Show
          Uwe Schindler added a comment - Hi Robert, I investigated the problem, its indeed crazy wrapping: The problem was that CharFilter did not override read() (without char[]). The same applied to CharReader!!! By not overriding that method it was also slowing down all charfilters, because the base class Reader automatically delegates to read(char[]), creating a new char [1] every time. The attached patch fixes this to delegate correctly.
          Uwe Schindler made changes -
          Attachment LUCENE-3990-CharFilterFix.patch [ 12522739 ]
          Hide
          Dawid Weiss added a comment - - edited

          I remember looking at the fact that read() is not overridden when I was working on that pattern char filter and was surprised a bit about it, but then thought maybe it's wrapped in a buffered stream someplace else (then it doesn't matter).

          A good way to make sure it does get proper implementation is to make read() abstract again in CharStream...

          Show
          Dawid Weiss added a comment - - edited I remember looking at the fact that read() is not overridden when I was working on that pattern char filter and was surprised a bit about it, but then thought maybe it's wrapped in a buffered stream someplace else (then it doesn't matter). A good way to make sure it does get proper implementation is to make read() abstract again in CharStream...
          Hide
          Uwe Schindler added a comment -

          Committed trunk revision: 1326512

          Show
          Uwe Schindler added a comment - Committed trunk revision: 1326512
          Uwe Schindler made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Fix Version/s 4.0 [ 12314025 ]
          Resolution Fixed [ 1 ]
          Hide
          Uwe Schindler added a comment -

          Dawid: I did not notice, indeed PatternCharReplaceFilter now gets a failure. I'll revert. The problem is exactly as you said. Thats completely broken, we should make that abstract!

          Show
          Uwe Schindler added a comment - Dawid: I did not notice, indeed PatternCharReplaceFilter now gets a failure. I'll revert. The problem is exactly as you said. Thats completely broken, we should make that abstract!
          uschindler committed 1326515 (2 files)
          Reviews: none

          Reverted CharFilter changes. The rest of patch is fine.
          Reverse merged revision(s) 1326512 from lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/charfilter/CharFilter.java:
          LUCENE-3990: Fix CharFilter to delegate correctly (and prevent CharFilter slowdown! - maybe backport!!!)

          Hide
          Uwe Schindler added a comment -

          I reverted the CharFilter changes in revision: 1326515

          We should really fix the broken delegation and make CharFilters require to implement both read() and read(char[], int, int)

          Show
          Uwe Schindler added a comment - I reverted the CharFilter changes in revision: 1326515 We should really fix the broken delegation and make CharFilters require to implement both read() and read(char[], int, int)
          Hide
          Uwe Schindler added a comment -

          We should fix CharFilters in trunk to not have the horrible delegating.

          Show
          Uwe Schindler added a comment - We should fix CharFilters in trunk to not have the horrible delegating.
          Uwe Schindler made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Uwe Schindler made changes -
          Summary TestRandomChains failure TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API
          Description 100% reproduces for me:

          2> NOTE: reproduce with: ant test -Dtests.class=*.TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=88CA02C2BB7B1DA -Dargs="-Dfile.encoding=UTF-8"

          Running org.apache.lucene.analysis.core.TestRandomChains
          FAILURE 7.22s | TestRandomChains.testRandomChains
             > Throwable #1: java.lang.AssertionError: endOffset 1 expected:<7> but was:<8>
             > at __randomizedtesting.SeedInfo.seed([88CA02C2BB7B1DA:356D894D6CA5AC1A]:0)
             > at org.junit.Assert.fail(Assert.java:93)
             > at org.junit.Assert.failNotEquals(Assert.java:647)
             > at org.junit.Assert.assertEquals(Assert.java:128)
             > at org.junit.Assert.assertEquals(Assert.java:472)
             > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:165)
             > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:662)
             > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:486)
             > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:429)
             > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:820)
          100% reproduces for me:

          2> NOTE: reproduce with: ant test -Dtests.class=*.TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=88CA02C2BB7B1DA -Dargs="-Dfile.encoding=UTF-8"

          Running org.apache.lucene.analysis.core.TestRandomChains
          FAILURE 7.22s | TestRandomChains.testRandomChains
             > Throwable #1: java.lang.AssertionError: endOffset 1 expected:<7> but was:<8>
             > at __randomizedtesting.SeedInfo.seed([88CA02C2BB7B1DA:356D894D6CA5AC1A]:0)
             > at org.junit.Assert.fail(Assert.java:93)
             > at org.junit.Assert.failNotEquals(Assert.java:647)
             > at org.junit.Assert.assertEquals(Assert.java:128)
             > at org.junit.Assert.assertEquals(Assert.java:472)
             > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:165)
             > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:662)
             > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:486)
             > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:429)
             > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:820)


          The root cause of this is inconsequent override of several Reader methods in subclasses of CharFilter. We should fix this urgently, thanks to the random chains we found this bug.
          Hide
          Uwe Schindler added a comment -

          Robert: Sorry for the noise with heavy committing, and thanks for pointing to that bug!

          Show
          Uwe Schindler added a comment - Robert: Sorry for the noise with heavy committing, and thanks for pointing to that bug!
          Hide
          Uwe Schindler added a comment -

          I investigated why the original commit lead to the bug:
          The current committed stuff (Robert's revert and also my latest patch), just hide a bug in MappingCharFilter that causes this. Let me explain:

          • With Robert's patch (and also my current fix): read(char[],int,int) and read() both delegate to the underlying charstream.
          • With the original CharFilter approach, int read() was not overridden, so the default method in java.io.Reader did the following: allocate char[1], call read(buffer,0,1) -> MappingCharFilters's buffered read method was called. Unfortunately this method is behaving different than MappingCharFilter's one-character read(). And thats the bug here. So neither my code nor Roberts code is to blame, its MappingCharFilter or MockCharFilter that behave different in the two read methods.
          Show
          Uwe Schindler added a comment - I investigated why the original commit lead to the bug: The current committed stuff (Robert's revert and also my latest patch), just hide a bug in MappingCharFilter that causes this. Let me explain: With Robert's patch (and also my current fix): read(char[],int,int) and read() both delegate to the underlying charstream. With the original CharFilter approach, int read() was not overridden, so the default method in java.io.Reader did the following: allocate char [1] , call read(buffer,0,1) -> MappingCharFilters's buffered read method was called. Unfortunately this method is behaving different than MappingCharFilter's one-character read(). And thats the bug here. So neither my code nor Roberts code is to blame, its MappingCharFilter or MockCharFilter that behave different in the two read methods.
          Hide
          Uwe Schindler added a comment - - edited

          We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow).

          I propose to adda CharFilter consistency method that reads two instances of the same CharFilter, one using read() and in parallel using read(char[]) with varying buffer sizes. It should check offsets (and that is what is heavy buggy in MappingCharOffsetCorrumpter / MockCharCorrumpter).

          I'll prepare a patch with the test method in BaseTokenStreamTestCase.

          Show
          Uwe Schindler added a comment - - edited We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow). I propose to adda CharFilter consistency method that reads two instances of the same CharFilter, one using read() and in parallel using read(char[]) with varying buffer sizes. It should check offsets (and that is what is heavy buggy in MappingCharOffsetCorrumpter / MockCharCorrumpter). I'll prepare a patch with the test method in BaseTokenStreamTestCase.
          Hide
          Robert Muir added a comment -

          Hi Uwe, i tried these same things, I know the interface is horrible but that wasnt really the bug here.

          If we want to fix CharFilter/CharStream i think we should use LUCENE-2788.

          That removes CharFilter/CharStream/CharReader and replaces with just one CharFilter class that extends FilterReader and is reusable. I think its a simpler way to go at least to only have one class...

          Show
          Robert Muir added a comment - Hi Uwe, i tried these same things, I know the interface is horrible but that wasnt really the bug here. If we want to fix CharFilter/CharStream i think we should use LUCENE-2788 . That removes CharFilter/CharStream/CharReader and replaces with just one CharFilter class that extends FilterReader and is reusable. I think its a simpler way to go at least to only have one class...
          Hide
          Robert Muir added a comment -

          We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow).

          We don't need to do all that actually. I think its enough if we have a CharFilter or Tokenizer that
          randomly uses read(CharBuffer)/read(char[])/read()/read(char[], int,int)...

          the other methods should be tested too.

          Show
          Robert Muir added a comment - We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow). We don't need to do all that actually. I think its enough if we have a CharFilter or Tokenizer that randomly uses read(CharBuffer)/read(char[])/read()/read(char[], int,int)... the other methods should be tested too.
          rmuir committed 1326561 (1 file)
          Reviews: none

          LUCENE-3990: revert broken refactoring AGAIN. charfilter does not delegate all read methods. I'm not wasting hours of my life debugging these test fails again to save 3 lines of code

          uschindler committed 1326562 (1 file)
          Reviews: none

          LUCENE-3990: Next time please ask for revert instead of just doing it, the commit was by somebody else and he/she had no chance to respond.

          rmuir committed 1327800 (59 files)
          Reviews: none

          LUCENE-3919, LUCENE-3873, LUCENE-3968, LUCENE-3940, LUCENE-3942, LUCENE-3969, LUCENE-3971, LUCENE-3990: backport analysis bugfixes and improved tests

          Lucene lucene_solr_3_6
          Robert Muir made changes -
          Fix Version/s 3.6.1 [ 12320752 ]
          Hide
          Hoss Man added a comment -

          bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

          Show
          Hoss Man added a comment - bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment
          Hoss Man made changes -
          Fix Version/s 4.0 [ 12322456 ]
          Fix Version/s 4.0-ALPHA [ 12314025 ]
          Fix Version/s 3.6.1 [ 12320752 ]
          Hide
          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Show
          Robert Muir added a comment - rmuir20120906-bulk-40-change
          Robert Muir made changes -
          Fix Version/s 4.0 [ 12322550 ]
          Fix Version/s 4.0-BETA [ 12322456 ]
          Robert Muir made changes -
          Fix Version/s 4.1 [ 12321140 ]
          Fix Version/s 4.0 [ 12322550 ]
          Steve Rowe made changes -
          Fix Version/s 4.2 [ 12323899 ]
          Fix Version/s 4.1 [ 12321140 ]
          Robert Muir made changes -
          Fix Version/s 4.3 [ 12324143 ]
          Fix Version/s 4.2 [ 12323899 ]
          Uwe Schindler made changes -
          Fix Version/s 4.4 [ 12324323 ]
          Fix Version/s 4.3 [ 12324143 ]
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Steve Rowe made changes -
          Fix Version/s 5.0 [ 12321663 ]
          Fix Version/s 4.5 [ 12324742 ]
          Fix Version/s 4.4 [ 12324323 ]
          Adrien Grand made changes -
          Fix Version/s 4.6 [ 12324999 ]
          Fix Version/s 5.0 [ 12321663 ]
          Fix Version/s 4.5 [ 12324742 ]
          Simon Willnauer made changes -
          Fix Version/s 4.7 [ 12325572 ]
          Fix Version/s 4.6 [ 12324999 ]
          David Smiley made changes -
          Fix Version/s 4.8 [ 12326269 ]
          Fix Version/s 4.7 [ 12325572 ]
          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.
          Uwe Schindler made changes -
          Fix Version/s 4.9 [ 12326730 ]
          Fix Version/s 5.0 [ 12321663 ]
          Fix Version/s 4.8 [ 12326269 ]

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Steve Rowe
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development