Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 5.0
    • Component/s: general/test
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This will be duplicated in the runner and in LuceneTestCase; try to consolidate.

      1. LUCENE-3985.patch
        56 kB
        Dawid Weiss
      2. LUCENE-3985.patch
        56 kB
        Dawid Weiss
      3. LUCENE-3985.patch
        56 kB
        Dawid Weiss
      4. LUCENE-3985.patch
        52 kB
        Dawid Weiss

        Issue Links

          Activity

          Hide
          Hoss Man added a comment -

          bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

          Show
          Hoss Man added a comment - bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment
          Hide
          Dawid Weiss added a comment -

          I have a branch on which tests pass. There are some threads that I ignore in Solr because there is no way to shut them down cleanly. I'll provide a patch soon.

          Show
          Dawid Weiss added a comment - I have a branch on which tests pass. There are some threads that I ignore in Solr because there is no way to shut them down cleanly. I'll provide a patch soon.
          Hide
          Dawid Weiss added a comment -

          A patch that adds refactored thread leak detection (at the suite level), improved reporting, suite timeouts and other whistles.

          Check out TestWorstCaseTestBehavior and uncomment methods demonstrating these features if you'd like to see what it looks like.

          Solr tests pass because I added this:

          +    // THESE ARE LIKELY BUGS - these threads should be closed!
          +    if (threadName.startsWith("Overseer-") ||
          +        threadName.startsWith("aliveCheckExecutor-") ||
          +        threadName.startsWith("concurrentUpdateScheduler-")) {
          +      return true;
          +    } 
          

          These should be fixed but I have too little Solr knowledge to do it myself. Also, one entire test suite is commented out – FullSolrCloudTest. This one leaks threads that don't look safe to me.

          Show
          Dawid Weiss added a comment - A patch that adds refactored thread leak detection (at the suite level), improved reporting, suite timeouts and other whistles. Check out TestWorstCaseTestBehavior and uncomment methods demonstrating these features if you'd like to see what it looks like. Solr tests pass because I added this: + // THESE ARE LIKELY BUGS - these threads should be closed! + if (threadName.startsWith( "Overseer-" ) || + threadName.startsWith( "aliveCheckExecutor-" ) || + threadName.startsWith( "concurrentUpdateScheduler-" )) { + return true ; + } These should be fixed but I have too little Solr knowledge to do it myself. Also, one entire test suite is commented out – FullSolrCloudTest. This one leaks threads that don't look safe to me.
          Hide
          Mark Miller added a comment -

          This one leaks threads that don't look safe to me.

          What kind of threads are you considering not safe?

          Show
          Mark Miller added a comment - This one leaks threads that don't look safe to me. What kind of threads are you considering not safe?
          Hide
          Dawid Weiss added a comment -

          I'd say no threads should be left after a suite completes, really, but I understand the pattern of having a static executor to which tasks are dispatched. Shutting it down will be difficult because it's effectively a one-time initialization singleton. I consider this "safe" because these threads in general just wait idly for jobs. I cannot rule out that they are still processing something but checking this adds more overhead than necessary. We could shorten the expiration time on these executors and increase lingering beyond this expiration time so theoretically it's possible to wait for a clean state but I think it'll increase tests time to an unacceptable level (didn't check though).

          That one particular test suite leaks threads that have stack traces reaching into solr though – if you care to take a look at see what they are and if some kind of shutdown is missing, I'd appreciate.

          Show
          Dawid Weiss added a comment - I'd say no threads should be left after a suite completes, really, but I understand the pattern of having a static executor to which tasks are dispatched. Shutting it down will be difficult because it's effectively a one-time initialization singleton. I consider this "safe" because these threads in general just wait idly for jobs. I cannot rule out that they are still processing something but checking this adds more overhead than necessary. We could shorten the expiration time on these executors and increase lingering beyond this expiration time so theoretically it's possible to wait for a clean state but I think it'll increase tests time to an unacceptable level (didn't check though). That one particular test suite leaks threads that have stack traces reaching into solr though – if you care to take a look at see what they are and if some kind of shutdown is missing, I'd appreciate.
          Hide
          Dawid Weiss added a comment -

          FullSolrCloudTest leaves the following:

          
          68621 T11 ccr.ThreadLeakControl.checkThreadLeaks SEVERE 3 threads leaked from SUITE scope at org.apache.solr.cloud.FullSolrCloudTest: 
          	   1) Thread[id=606, name=Thread-97, state=WAITING, group=Overseer state updater.]
          	        at java.lang.Object.wait(Native Method)
          	        at java.lang.Object.wait(Object.java:503)
          	        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1345)
          	        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:925)
          	        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:289)
          	        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:1)
          	        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:67)
          	        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:286)
          	        at org.apache.solr.cloud.Overseer$CloudStateUpdater.amILeader(Overseer.java:187)
          	        at org.apache.solr.cloud.Overseer$CloudStateUpdater.run(Overseer.java:113)
          	        at java.lang.Thread.run(Thread.java:722)
          	   2) Thread[id=600, name=TEST-FullSolrCloudTest.testDistribSearch-seed#[535509622737983D]-EventThread, state=WAITING, group=TGRP-FullSolrCloudTest]
          	        at sun.misc.Unsafe.park(Native Method)
          	        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
          	        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
          	        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
          	        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
          	   3) Thread[id=599, name=TEST-FullSolrCloudTest.testDistribSearch-seed#[535509622737983D]-SendThread(127.0.0.1:57474), state=RUNNABLE, group=TGRP-FullSolrCloudTest]
          	        at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
          	        at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(WindowsSelectorImpl.java:295)
          	        at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(WindowsSelectorImpl.java:277)
          	        at sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:158)
          	        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
          	        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
          	        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1131)
          
          Show
          Dawid Weiss added a comment - FullSolrCloudTest leaves the following: 68621 T11 ccr.ThreadLeakControl.checkThreadLeaks SEVERE 3 threads leaked from SUITE scope at org.apache.solr.cloud.FullSolrCloudTest: 1) Thread [id=606, name= Thread -97, state=WAITING, group=Overseer state updater.] at java.lang. Object .wait(Native Method) at java.lang. Object .wait( Object .java:503) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1345) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:925) at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:289) at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:1) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:67) at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:286) at org.apache.solr.cloud.Overseer$CloudStateUpdater.amILeader(Overseer.java:187) at org.apache.solr.cloud.Overseer$CloudStateUpdater.run(Overseer.java:113) at java.lang. Thread .run( Thread .java:722) 2) Thread [id=600, name=TEST-FullSolrCloudTest.testDistribSearch-seed#[535509622737983D]-EventThread, state=WAITING, group=TGRP-FullSolrCloudTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 3) Thread [id=599, name=TEST-FullSolrCloudTest.testDistribSearch-seed#[535509622737983D]-SendThread(127.0.0.1:57474), state=RUNNABLE, group=TGRP-FullSolrCloudTest] at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method) at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(WindowsSelectorImpl.java:295) at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(WindowsSelectorImpl.java:277) at sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:158) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1131)
          Hide
          Mark Miller added a comment -

          heh - unfortunately, for me, only sometimes does it fail with one thread - which is different than the 3 you see. I took a try at improving the case I can see.

          Show
          Mark Miller added a comment - heh - unfortunately, for me, only sometimes does it fail with one thread - which is different than the 3 you see. I took a try at improving the case I can see.
          Hide
          Dawid Weiss added a comment -

          Yeah, it will vary – the lingering time is 2 seconds but it's still not enough and I see some failures depending on the machine I use. My build server failed with this one, for example:

          build	31-Jul-2012 15:01:23	[junit4:junit4] Suite: org.apache.solr.cloud.ZkCLITest
          ...
          build	31-Jul-2012 15:01:23	[junit4:junit4]    > Throwable #1: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.cloud.ZkCLITest: 
          build	31-Jul-2012 15:01:23	[junit4:junit4]    >    1) Thread[id=20260, name=SessionTracker, state=TIMED_WAITING, group=TGRP-ZkCLITest]
          build	31-Jul-2012 15:01:23	[junit4:junit4]    >         at java.lang.Object.wait(Native Method)
          build	31-Jul-2012 15:01:23	[junit4:junit4]    >         at org.apache.zookeeper.server.SessionTrackerImpl.run(SessionTrackerImpl.java:145)
          build	31-Jul-2012 15:01:23	[junit4:junit4]    > 	at __randomizedtesting.SeedInfo.seed([B90AA507E190F87]:0)
          build	31-Jul-2012 15:01:23	[junit4:junit4]    >
          

          We should try to nail these because they probably contribute to those thread limits that you've been commenting on. I also see some crazy thread IDs in logs like: T20000 which seems suspicious – they should be reused after a while.

          Show
          Dawid Weiss added a comment - Yeah, it will vary – the lingering time is 2 seconds but it's still not enough and I see some failures depending on the machine I use. My build server failed with this one, for example: build 31-Jul-2012 15:01:23 [junit4:junit4] Suite: org.apache.solr.cloud.ZkCLITest ... build 31-Jul-2012 15:01:23 [junit4:junit4] > Throwable #1: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.cloud.ZkCLITest: build 31-Jul-2012 15:01:23 [junit4:junit4] > 1) Thread [id=20260, name=SessionTracker, state=TIMED_WAITING, group=TGRP-ZkCLITest] build 31-Jul-2012 15:01:23 [junit4:junit4] > at java.lang. Object .wait(Native Method) build 31-Jul-2012 15:01:23 [junit4:junit4] > at org.apache.zookeeper.server.SessionTrackerImpl.run(SessionTrackerImpl.java:145) build 31-Jul-2012 15:01:23 [junit4:junit4] > at __randomizedtesting.SeedInfo.seed([B90AA507E190F87]:0) build 31-Jul-2012 15:01:23 [junit4:junit4] > We should try to nail these because they probably contribute to those thread limits that you've been commenting on. I also see some crazy thread IDs in logs like: T20000 which seems suspicious – they should be reused after a while.
          Hide
          Dawid Weiss added a comment -

          they should be reused after a while.

          Ok, from the JDK source code it seems they won't be.

          Show
          Dawid Weiss added a comment - they should be reused after a while. Ok, from the JDK source code it seems they won't be.
          Hide
          Mark Miller added a comment -

          I saw another:

          2) Thread[id=13, name=SessionTracker, state=TIMED_WAITING, group=TGRP-FullSolrCloudTest]
          at java.lang.Object.wait(Native Method)
          at org.apache.zookeeper.server.SessionTrackerImpl.run(SessionTrackerImpl.java:145)
          at __randomizedtesting.SeedInfo.seed([657241244CA545BB]:0)

          This, like I think some of the others, is stopping - it just has not stopped yet I guess. In this case, ZooKeeper tries to shutdown this thread simply by setting running to false - on its next loop it will notice and bail. ZooKeeper doesn't wait to make sure that has happened before returning from shutdown though.

          Show
          Mark Miller added a comment - I saw another: 2) Thread [id=13, name=SessionTracker, state=TIMED_WAITING, group=TGRP-FullSolrCloudTest] at java.lang.Object.wait(Native Method) at org.apache.zookeeper.server.SessionTrackerImpl.run(SessionTrackerImpl.java:145) at __randomizedtesting.SeedInfo.seed( [657241244CA545BB] :0) This, like I think some of the others, is stopping - it just has not stopped yet I guess. In this case, ZooKeeper tries to shutdown this thread simply by setting running to false - on its next loop it will notice and bail. ZooKeeper doesn't wait to make sure that has happened before returning from shutdown though.
          Hide
          Dawid Weiss added a comment -

          You're right. I'll probably add a hacky way to terminate these threads earlier instead of enforcing a long lingering time. Thanks Mark.

          Show
          Dawid Weiss added a comment - You're right. I'll probably add a hacky way to terminate these threads earlier instead of enforcing a long lingering time. Thanks Mark.
          Hide
          Dawid Weiss added a comment -

          Reiterated patch improving logging on suite timeouts among other things.

          I've added @BadApples test group (enabled by default) which includes two zk tests that I just have no idea how to fix and that fail for me frequently (and with such a long log that I can't really tell what's a result of the test itself and what's an unexpected behavior). I needed this to run full solr/lucene tests to ensure other things were working.

          Help to fix these two remaining tests would be great, I don't want to throw everything on Mark

          This also brings a question if I should commit in a patch that will probably break all existing builds (those bad apples above). I think it's worth it, even at the price of temporarily ignoring those two tests.

          Show
          Dawid Weiss added a comment - Reiterated patch improving logging on suite timeouts among other things. I've added @BadApples test group (enabled by default) which includes two zk tests that I just have no idea how to fix and that fail for me frequently (and with such a long log that I can't really tell what's a result of the test itself and what's an unexpected behavior). I needed this to run full solr/lucene tests to ensure other things were working. Help to fix these two remaining tests would be great, I don't want to throw everything on Mark This also brings a question if I should commit in a patch that will probably break all existing builds (those bad apples above). I think it's worth it, even at the price of temporarily ignoring those two tests.
          Hide
          Dawid Weiss added a comment -

          Updated patch passing -Dtests.badapples to the spawned jvm (missed that).

          Show
          Dawid Weiss added a comment - Updated patch passing -Dtests.badapples to the spawned jvm (missed that).
          Hide
          Robert Muir added a comment -

          Did the "fail tests on uncaught exceptions from other threads" get factored into the new randomizedrunner?

          Show
          Robert Muir added a comment - Did the "fail tests on uncaught exceptions from other threads" get factored into the new randomizedrunner?
          Hide
          Dawid Weiss added a comment -

          It was actually there from the very beginning, only initially it was at the thread group level (not at the global level). Normally this wouldn't matter as most threads would inherit the test group but I changed it so that it's now on both and additionally verifies if the handles hasn't been changed along the way. ZooKeeper does substitute it so there is a hack aimed at eagerly initializing that part of ZooKeeper and restoring the handler.

          Show
          Dawid Weiss added a comment - It was actually there from the very beginning, only initially it was at the thread group level (not at the global level). Normally this wouldn't matter as most threads would inherit the test group but I changed it so that it's now on both and additionally verifies if the handles hasn't been changed along the way. ZooKeeper does substitute it so there is a hack aimed at eagerly initializing that part of ZooKeeper and restoring the handler.
          Hide
          Dawid Weiss added a comment -

          Build server bug feedback – zookeeper init should be a class rule, not a test rule. Corrected.

          Show
          Dawid Weiss added a comment - Build server bug feedback – zookeeper init should be a class rule, not a test rule. Corrected.
          Hide
          Uwe Schindler added a comment -

          You can add the signatures to the forbidden API list under jdk.txt (with comment) or a new file (but don't forget to place this new signature file in Lucene and Solr's filesets).

          I think, to not complicate the filesets, we should use for this case simply jdk.txt and not a separate file (as all signatures refer to JDK. Otherwise we must rename jdk.txt to defaultCharsJdk.txt or whatever). Just place a comment in the introduction and add the signatures to jdk.txt. The other txt files in banned methods are more for other parts of lucene code-base (like test-only), or like commons-io, refer to a solr-only lib.

          Show
          Uwe Schindler added a comment - You can add the signatures to the forbidden API list under jdk.txt (with comment) or a new file (but don't forget to place this new signature file in Lucene and Solr's filesets). I think, to not complicate the filesets, we should use for this case simply jdk.txt and not a separate file (as all signatures refer to JDK. Otherwise we must rename jdk.txt to defaultCharsJdk.txt or whatever). Just place a comment in the introduction and add the signatures to jdk.txt. The other txt files in banned methods are more for other parts of lucene code-base (like test-only), or like commons-io, refer to a solr-only lib.
          Hide
          Dawid Weiss added a comment - - edited

          It is a separate file, I wanted it to be somewhat explicit. We can merge in later on, not a problem.

          Show
          Dawid Weiss added a comment - - edited It is a separate file, I wanted it to be somewhat explicit. We can merge in later on, not a problem.
          Hide
          Uwe Schindler added a comment -

          OK, doesnt matter. I would just prefer to have it merged in - or we should rename the other files, too.

          Show
          Uwe Schindler added a comment - OK, doesnt matter. I would just prefer to have it merged in - or we should rename the other files, too.
          Hide
          Mark Miller added a comment -

          Hopefully I can look at my piece of this today or tomorrow.

          Show
          Mark Miller added a comment - Hopefully I can look at my piece of this today or tomorrow.
          Hide
          Mark Miller added a comment -

          I added a close to Overseer and ConnectionManager - at least for me that FullSolrCloudTest seems to pass consistently. I only ran it like 15-20 times, but it passed each time and was failing 50% or more previously.

          Show
          Mark Miller added a comment - I added a close to Overseer and ConnectionManager - at least for me that FullSolrCloudTest seems to pass consistently. I only ran it like 15-20 times, but it passed each time and was failing 50% or more previously.
          Hide
          Dawid Weiss added a comment -

          Thanks Mark. I'll merge with the trunk and let the tests run for a while.

          Show
          Dawid Weiss added a comment - Thanks Mark. I'll merge with the trunk and let the tests run for a while.
          Hide
          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Show
          Robert Muir added a comment - rmuir20120906-bulk-40-change
          Hide
          Dawid Weiss added a comment -

          Applies on trunk. Will backport to 4.x tomorrow if nothing pops up.

          Show
          Dawid Weiss added a comment - Applies on trunk. Will backport to 4.x tomorrow if nothing pops up.
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Dawid Weiss
              Reporter:
              Dawid Weiss
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development