Lucene - Core
  1. Lucene - Core
  2. LUCENE-2972

Intermittent failure in TestFieldCacheTermsFilter.testMissingTerms

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Running tests in while(1) I hit this:

      NOTE: reproduce with: ant test -Dtestcase=TestFieldCacheTermsFilter -Dtestmethod=testMissingTerms -Dtests.seed=-1046382732738729184:5855929314778232889
      
      1) testMissingTerms(org.apache.lucene.search.TestFieldCacheTermsFilter)
      java.lang.AssertionError: Must match 1 expected:<1> but was:<0>
      	at org.junit.Assert.fail(Assert.java:91)
      	at org.junit.Assert.failNotEquals(Assert.java:645)
      	at org.junit.Assert.assertEquals(Assert.java:126)
      	at org.junit.Assert.assertEquals(Assert.java:470)
      	at org.apache.lucene.search.TestFieldCacheTermsFilter.testMissingTerms(TestFieldCacheTermsFilter.java:63)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
      	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
      	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
      	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
      	at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:48)
      	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
      	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
      	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1214)
      	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1146)
      	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
      	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
      	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
      	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
      	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
      	at org.junit.runners.Suite.runChild(Suite.java:128)
      	at org.junit.runners.Suite.runChild(Suite.java:24)
      	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
      	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
      	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
      	at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
      	at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
      	at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
      	at org.junit.runner.JUnitCore.runMain(JUnitCore.java:98)
      	at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:53)
      	at org.junit.runner.JUnitCore.main(JUnitCore.java:45)
      

      Unfortunately the seed doesn't [consistently] repro for me...

      1. LUCENE-2972.patch
        5 kB
        Michael McCandless

        Activity

        Hide
        Robert Muir added a comment -

        Bulk closing for 3.2

        Show
        Robert Muir added a comment - Bulk closing for 3.2
        Hide
        Michael McCandless added a comment -

        Good idea Yonik, I'll do that...

        Show
        Michael McCandless added a comment - Good idea Yonik, I'll do that...
        Hide
        Yonik Seeley added a comment -

        To avoid slowing anything down, can we also maintain numBits in an assert?
        Perhaps something like
        assert (numBits = Math.max(other.numBits, numBits)) > 0;

        Show
        Yonik Seeley added a comment - To avoid slowing anything down, can we also maintain numBits in an assert? Perhaps something like assert (numBits = Math.max(other.numBits, numBits)) > 0;
        Hide
        Michael McCandless added a comment -

        Patch, tracking numBits and adding asserts on all methods where the "index" param is supposed to be within bounds.

        With this patch and without the core fix I committed the test case quickly fails (trips the new asserts...).

        Show
        Michael McCandless added a comment - Patch, tracking numBits and adding asserts on all methods where the "index" param is supposed to be within bounds. With this patch and without the core fix I committed the test case quickly fails (trips the new asserts...).
        Hide
        Michael McCandless added a comment -

        Note that 3.x doesn't have the bug; it was already correctly using numOrd...

        Show
        Michael McCandless added a comment - Note that 3.x doesn't have the bug; it was already correctly using numOrd...
        Hide
        Michael McCandless added a comment -

        I just committed the fix:

        Index: src/java/org/apache/lucene/search/FieldCacheTermsFilter.java
        ===================================================================
        --- src/java/org/apache/lucene/search/FieldCacheTermsFilter.java	(revision 1125975)
        +++ src/java/org/apache/lucene/search/FieldCacheTermsFilter.java	(working copy)
        @@ -127,7 +127,7 @@
         
             public FieldCacheTermsFilterDocIdSet(FieldCache.DocTermsIndex fcsi) {
               this.fcsi = fcsi;
        -      openBitSet = new OpenBitSet(this.fcsi.size());
        +      openBitSet = new OpenBitSet(this.fcsi.numOrd());
               final BytesRef spare = new BytesRef();
               for (int i=0;i<terms.length;i++) {
                 int termNumber = this.fcsi.binarySearchLookup(terms[i], spare);
        

        We were using number-of-docs instead of number-of-ords to allocate the
        OBS. This only causes a problem if 1) every doc has a unique term
        (thus number-of-ords is 1 + number-of-docs since 0 is reserved to mean
        unset), and 2) the number of docs is 0 mod 64, because this would then
        cause an AIOOBE coming out of the OBS (not the fcsi.getOrd!), which
        the disi impl interprets as no-more-docs. Phew!

        I think likely we couldn't repro because ConccurrentMergeScheduler was
        involved... this was able to repro for me:

        ant test-core -Dtestcase=TestFieldCacheTermsFilter -Dtestmethod=testMissingTerms -Dtests.seed=-7172644887386150665:-4028247385708880228a
        

        It uses SMS.

        I'd like to add assertions to OBS to better catch bugs like this in
        the future... I'll work up a patch.

        Show
        Michael McCandless added a comment - I just committed the fix: Index: src/java/org/apache/lucene/search/FieldCacheTermsFilter.java =================================================================== --- src/java/org/apache/lucene/search/FieldCacheTermsFilter.java (revision 1125975) +++ src/java/org/apache/lucene/search/FieldCacheTermsFilter.java (working copy) @@ -127,7 +127,7 @@ public FieldCacheTermsFilterDocIdSet(FieldCache.DocTermsIndex fcsi) { this.fcsi = fcsi; - openBitSet = new OpenBitSet(this.fcsi.size()); + openBitSet = new OpenBitSet(this.fcsi.numOrd()); final BytesRef spare = new BytesRef(); for (int i=0;i<terms.length;i++) { int termNumber = this.fcsi.binarySearchLookup(terms[i], spare); We were using number-of-docs instead of number-of-ords to allocate the OBS. This only causes a problem if 1) every doc has a unique term (thus number-of-ords is 1 + number-of-docs since 0 is reserved to mean unset), and 2) the number of docs is 0 mod 64, because this would then cause an AIOOBE coming out of the OBS (not the fcsi.getOrd!), which the disi impl interprets as no-more-docs. Phew! I think likely we couldn't repro because ConccurrentMergeScheduler was involved... this was able to repro for me: ant test-core -Dtestcase=TestFieldCacheTermsFilter -Dtestmethod=testMissingTerms -Dtests.seed=-7172644887386150665:-4028247385708880228a It uses SMS. I'd like to add assertions to OBS to better catch bugs like this in the future... I'll work up a patch.
        Hide
        Michael McCandless added a comment -

        The test still fails if I force LTC to never use an ES in newSearcher...

        Show
        Michael McCandless added a comment - The test still fails if I force LTC to never use an ES in newSearcher...
        Hide
        Michael McCandless added a comment -

        Yes, we are... lesser evil I think, but still does not make one happy when that lesser evil attacks you!

        I'll re-run w/ verbose... could cause but not to happen though.

        Show
        Michael McCandless added a comment - Yes, we are... lesser evil I think, but still does not make one happy when that lesser evil attacks you! I'll re-run w/ verbose... could cause but not to happen though.
        Hide
        Robert Muir added a comment -

        would really like to know if the one that failed is using an ExecutorService.

        Remember in that case we are trading reproducability for more coverage at the moment

        Show
        Robert Muir added a comment - would really like to know if the one that failed is using an ExecutorService. Remember in that case we are trading reproducability for more coverage at the moment
        Hide
        Michael McCandless added a comment -

        Woops, the while(true) tester just hit it... took 136 iterations @ -Dtests.iter=100. It got this seed:

        NOTE: reproduce with: ant test -Dtestcase=TestFieldCacheTermsFilter -Dtestmethod=testMissingTerms -Dtests.seed=3673074264729476620:7782176163702481120
        

        But, that does not repro for me...

        Show
        Michael McCandless added a comment - Woops, the while(true) tester just hit it... took 136 iterations @ -Dtests.iter=100. It got this seed: NOTE: reproduce with: ant test -Dtestcase=TestFieldCacheTermsFilter -Dtestmethod=testMissingTerms -Dtests.seed=3673074264729476620:7782176163702481120 But, that does not repro for me...
        Hide
        Simon Willnauer added a comment -

        I think we should close - can't reproduce - we can still reopen...

        Show
        Simon Willnauer added a comment - I think we should close - can't reproduce - we can still reopen...
        Hide
        Michael McCandless added a comment -

        I can't repro it either... and the while(1) tester (w/ no seed) also cannot so far...

        Show
        Michael McCandless added a comment - I can't repro it either... and the while(1) tester (w/ no seed) also cannot so far...
        Hide
        Simon Willnauer added a comment -

        mike this does not reproduce for me.. can you still trigger this one?

        Show
        Simon Willnauer added a comment - mike this does not reproduce for me.. can you still trigger this one?

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development