Lucene - Core
  1. Lucene - Core
  2. LUCENE-3990

TestRandomChains failure caused by incorrect delegation in CharReader/CharFilter/CharStream API

    Details

    • Type: Bug Bug
    • Status: Reopened
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      100% reproduces for me:

      2> NOTE: reproduce with: ant test -Dtests.class=*.TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=88CA02C2BB7B1DA -Dargs="-Dfile.encoding=UTF-8"

      Running org.apache.lucene.analysis.core.TestRandomChains
      FAILURE 7.22s | TestRandomChains.testRandomChains
      > Throwable #1: java.lang.AssertionError: endOffset 1 expected:<7> but was:<8>
      > at __randomizedtesting.SeedInfo.seed([88CA02C2BB7B1DA:356D894D6CA5AC1A]:0)
      > at org.junit.Assert.fail(Assert.java:93)
      > at org.junit.Assert.failNotEquals(Assert.java:647)
      > at org.junit.Assert.assertEquals(Assert.java:128)
      > at org.junit.Assert.assertEquals(Assert.java:472)
      > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:165)
      > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:662)
      > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:486)
      > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:429)
      > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:820)

      The root cause of this is inconsequent override of several Reader methods in subclasses of CharFilter. We should fix this urgently, thanks to the random chains we found this bug.

      1. analysis-common.tests-report.txt
        7 kB
        Steve Rowe
      2. LUCENE-3990.patch
        6 kB
        Robert Muir
      3. LUCENE-3990-CharFilterFix.patch
        3 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          Definitely something sneaky going on...

          first i found a minor bug in the test. the logic for determining if we should apply
          additional offsets checks doesn't look at the return value, so if we ever try
          a tokenizer in brokenOffsetsComponents (even if it throws IAE and we decide not to use it)
          then we lose offsets checks for the eventually-working-chain.

          seed still works after fixing this... I'll commit a fix for that (still doesnt help this problem)

          Show
          Robert Muir added a comment - Definitely something sneaky going on... first i found a minor bug in the test. the logic for determining if we should apply additional offsets checks doesn't look at the return value, so if we ever try a tokenizer in brokenOffsetsComponents (even if it throws IAE and we decide not to use it) then we lose offsets checks for the eventually-working-chain. seed still works after fixing this... I'll commit a fix for that (still doesnt help this problem)
          Hide
          Robert Muir added a comment -

          The problem is that we are getting different results the first time we create the tokenstream components,
          versus after we reset(Reader) with the same text again.

          The bug was introduced by Uwe Schindler in r1311358: when the reader-wrapper was changed to use CharFilter
          instead. because of crazy CharFilter-Reader delegation.

          http://svn.apache.org/viewvc?view=revision&revision=1311358

          Attached is a patch demonstrating the bug: with a standalone testcase, and backing out that change.
          Seed now passes (in addition to the test.

          Show
          Robert Muir added a comment - The problem is that we are getting different results the first time we create the tokenstream components, versus after we reset(Reader) with the same text again. The bug was introduced by Uwe Schindler in r1311358: when the reader-wrapper was changed to use CharFilter instead. because of crazy CharFilter-Reader delegation. http://svn.apache.org/viewvc?view=revision&revision=1311358 Attached is a patch demonstrating the bug: with a standalone testcase, and backing out that change. Seed now passes (in addition to the test.
          Hide
          Uwe Schindler added a comment -

          Hi Robert,

          if this patch fixes the problem we should fix the CharFilter interface. Simply reverting my change hides the bug, sorry.

          Show
          Uwe Schindler added a comment - Hi Robert, if this patch fixes the problem we should fix the CharFilter interface. Simply reverting my change hides the bug, sorry.
          Hide
          Uwe Schindler added a comment -

          We have to fix the underlying bug. Subclassing CharFilter here is correct.

          Show
          Uwe Schindler added a comment - We have to fix the underlying bug. Subclassing CharFilter here is correct.
          Hide
          Uwe Schindler added a comment -

          Hi Robert,

          I investigated the problem, its indeed crazy wrapping: The problem was that CharFilter did not override read() (without char[]). The same applied to CharReader!!! By not overriding that method it was also slowing down all charfilters, because the base class Reader automatically delegates to read(char[]), creating a new char[1] every time.

          The attached patch fixes this to delegate correctly.

          Show
          Uwe Schindler added a comment - Hi Robert, I investigated the problem, its indeed crazy wrapping: The problem was that CharFilter did not override read() (without char[]). The same applied to CharReader!!! By not overriding that method it was also slowing down all charfilters, because the base class Reader automatically delegates to read(char[]), creating a new char [1] every time. The attached patch fixes this to delegate correctly.
          Hide
          Dawid Weiss added a comment - - edited

          I remember looking at the fact that read() is not overridden when I was working on that pattern char filter and was surprised a bit about it, but then thought maybe it's wrapped in a buffered stream someplace else (then it doesn't matter).

          A good way to make sure it does get proper implementation is to make read() abstract again in CharStream...

          Show
          Dawid Weiss added a comment - - edited I remember looking at the fact that read() is not overridden when I was working on that pattern char filter and was surprised a bit about it, but then thought maybe it's wrapped in a buffered stream someplace else (then it doesn't matter). A good way to make sure it does get proper implementation is to make read() abstract again in CharStream...
          Hide
          Uwe Schindler added a comment -

          Committed trunk revision: 1326512

          Show
          Uwe Schindler added a comment - Committed trunk revision: 1326512
          Hide
          Uwe Schindler added a comment -

          Dawid: I did not notice, indeed PatternCharReplaceFilter now gets a failure. I'll revert. The problem is exactly as you said. Thats completely broken, we should make that abstract!

          Show
          Uwe Schindler added a comment - Dawid: I did not notice, indeed PatternCharReplaceFilter now gets a failure. I'll revert. The problem is exactly as you said. Thats completely broken, we should make that abstract!
          Hide
          Uwe Schindler added a comment -

          I reverted the CharFilter changes in revision: 1326515

          We should really fix the broken delegation and make CharFilters require to implement both read() and read(char[], int, int)

          Show
          Uwe Schindler added a comment - I reverted the CharFilter changes in revision: 1326515 We should really fix the broken delegation and make CharFilters require to implement both read() and read(char[], int, int)
          Hide
          Uwe Schindler added a comment -

          We should fix CharFilters in trunk to not have the horrible delegating.

          Show
          Uwe Schindler added a comment - We should fix CharFilters in trunk to not have the horrible delegating.
          Hide
          Uwe Schindler added a comment -

          Robert: Sorry for the noise with heavy committing, and thanks for pointing to that bug!

          Show
          Uwe Schindler added a comment - Robert: Sorry for the noise with heavy committing, and thanks for pointing to that bug!
          Hide
          Uwe Schindler added a comment -

          I investigated why the original commit lead to the bug:
          The current committed stuff (Robert's revert and also my latest patch), just hide a bug in MappingCharFilter that causes this. Let me explain:

          • With Robert's patch (and also my current fix): read(char[],int,int) and read() both delegate to the underlying charstream.
          • With the original CharFilter approach, int read() was not overridden, so the default method in java.io.Reader did the following: allocate char[1], call read(buffer,0,1) -> MappingCharFilters's buffered read method was called. Unfortunately this method is behaving different than MappingCharFilter's one-character read(). And thats the bug here. So neither my code nor Roberts code is to blame, its MappingCharFilter or MockCharFilter that behave different in the two read methods.
          Show
          Uwe Schindler added a comment - I investigated why the original commit lead to the bug: The current committed stuff (Robert's revert and also my latest patch), just hide a bug in MappingCharFilter that causes this. Let me explain: With Robert's patch (and also my current fix): read(char[],int,int) and read() both delegate to the underlying charstream. With the original CharFilter approach, int read() was not overridden, so the default method in java.io.Reader did the following: allocate char [1] , call read(buffer,0,1) -> MappingCharFilters's buffered read method was called. Unfortunately this method is behaving different than MappingCharFilter's one-character read(). And thats the bug here. So neither my code nor Roberts code is to blame, its MappingCharFilter or MockCharFilter that behave different in the two read methods.
          Hide
          Uwe Schindler added a comment - - edited

          We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow).

          I propose to adda CharFilter consistency method that reads two instances of the same CharFilter, one using read() and in parallel using read(char[]) with varying buffer sizes. It should check offsets (and that is what is heavy buggy in MappingCharOffsetCorrumpter / MockCharCorrumpter).

          I'll prepare a patch with the test method in BaseTokenStreamTestCase.

          Show
          Uwe Schindler added a comment - - edited We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow). I propose to adda CharFilter consistency method that reads two instances of the same CharFilter, one using read() and in parallel using read(char[]) with varying buffer sizes. It should check offsets (and that is what is heavy buggy in MappingCharOffsetCorrumpter / MockCharCorrumpter). I'll prepare a patch with the test method in BaseTokenStreamTestCase.
          Hide
          Robert Muir added a comment -

          Hi Uwe, i tried these same things, I know the interface is horrible but that wasnt really the bug here.

          If we want to fix CharFilter/CharStream i think we should use LUCENE-2788.

          That removes CharFilter/CharStream/CharReader and replaces with just one CharFilter class that extends FilterReader and is reusable. I think its a simpler way to go at least to only have one class...

          Show
          Robert Muir added a comment - Hi Uwe, i tried these same things, I know the interface is horrible but that wasnt really the bug here. If we want to fix CharFilter/CharStream i think we should use LUCENE-2788 . That removes CharFilter/CharStream/CharReader and replaces with just one CharFilter class that extends FilterReader and is reusable. I think its a simpler way to go at least to only have one class...
          Hide
          Robert Muir added a comment -

          We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow).

          We don't need to do all that actually. I think its enough if we have a CharFilter or Tokenizer that
          randomly uses read(CharBuffer)/read(char[])/read()/read(char[], int,int)...

          the other methods should be tested too.

          Show
          Robert Muir added a comment - We should have a asserting method for CharFilter consistency. Indeed the read(char[],int,int) method in MappingCharFilter is failing horribly (which is caused by the underlying MockCharFilter somehow). We don't need to do all that actually. I think its enough if we have a CharFilter or Tokenizer that randomly uses read(CharBuffer)/read(char[])/read()/read(char[], int,int)... the other methods should be tested too.
          Hide
          Hoss Man added a comment -

          bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

          Show
          Hoss Man added a comment - bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment
          Hide
          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Show
          Robert Muir added a comment - rmuir20120906-bulk-40-change
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Steve Rowe
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development