Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      Handle intermittent failures in the ChaosMonkey tests.

      Attachments

      1. CMSL_fail1.log
        4.05 MB
        Yonik Seeley
      2. CMSL_hang_2.txt
        132 kB
        Yonik Seeley
      3. CMSL_hang.txt
        129 kB
        Yonik Seeley
      4. fail.130101_034142.txt
        4.54 MB
        Yonik Seeley
      5. fail.130102_020942.txt
        7.13 MB
        Yonik Seeley
      6. fail.130103_105104.txt
        9.75 MB
        Yonik Seeley
      7. fail.130103_193722.txt
        8.44 MB
        Yonik Seeley
      8. fail.inconsistent.txt
        9.58 MB
        Yonik Seeley
      9. test_report_1.txt
        6.74 MB
        Yonik Seeley

        Issue Links

          Activity

          Yonik Seeley added a comment -

          We've come a long way: the monkey has uncovered a number of bugs that we've fixed, and it's helping us build a really solid solution.

          I just uncovered another one having to do with races on shutdown.
          When we kill the Jetty instance, it can cause an interrupted exception that closes the underlying NIO files under Lucene.
          If a commit is happening concurrently, we can end up with more than one unfinished transaction log.

          We call preCommit, which moves the current tlog to prevTlog.
          The commit fails, but other updates are coming in concurrently and cause a new tlog to be created.
          Updates coming in after this point can also succeed, since they are simply buffered in memory by the IW.
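
          To make the race concrete, here is a minimal, self-contained sketch of the sequence above. The class and field names are hypothetical stand-ins, not Solr's actual UpdateLog API: preCommit sets the current tlog aside, the commit fails because the interrupt closed the underlying files, and a concurrent update opens a fresh tlog, leaving two unfinished logs behind.

            import java.util.ArrayList;
            import java.util.List;

            public class TlogShutdownRaceSketch {
                static final List<String> unfinishedTlogs = new ArrayList<>();
                static String currentTlog = "tlog.0000000001";
                static String prevTlog = null;

                // preCommit: set the current tlog aside so a successful commit can cap it afterwards.
                static synchronized void preCommit() {
                    prevTlog = currentTlog;
                    currentTlog = null;
                }

                // An update arriving while there is no current tlog opens a new one.
                static synchronized void add(String id) {
                    if (currentTlog == null) {
                        currentTlog = "tlog.0000000002";
                        unfinishedTlogs.add(currentTlog);
                    }
                    // the IW simply buffers the document in memory here
                }

                public static void main(String[] args) {
                    unfinishedTlogs.add(currentTlog);
                    preCommit();                      // tlog.0000000001 becomes prevTlog
                    boolean commitSucceeded = false;  // the interrupt closed the NIO files, so the commit failed
                    add("doc1");                      // concurrent update opens tlog.0000000002
                    if (!commitSucceeded) {
                        // prevTlog was never capped, so two unfinished tlogs now exist
                        System.out.println("unfinished tlogs: " + unfinishedTlogs);
                    }
                }
            }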

          Yonik Seeley added a comment -

          Just checked in a fix for this as well as a test that recovers from more than one tlog at startup.

          Yonik Seeley added a comment -

          Failures are much less frequent, but I still got one after about 7 hours I think.
          I saw a commit fail (due to the interrupted exception), but then I later saw the IW.close() succeed (which caused Solr to cap the log file, assuming that everything was in the index).

          As a result, I just committed a change to the shutdown code to do an explicit commit.
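
          A rough sketch of that shutdown ordering follows. The Writer and TransactionLog interfaces are hypothetical stand-ins (this is not the actual Solr shutdown code); the point is simply to cap the tlog only after an explicit commit is observed to succeed, rather than trusting a quiet IW.close().

            import java.io.IOException;

            public class ShutdownCommitSketch {
                interface Writer { void commit() throws IOException; void close() throws IOException; }
                interface TransactionLog { void cap(); }

                // Only cap the transaction log once an explicit commit has succeeded, instead of
                // assuming everything reached the index just because close() did not throw.
                static void shutdown(Writer iw, TransactionLog tlog) {
                    boolean committed = false;
                    try {
                        iw.commit();   // may fail if the thread was interrupted during shutdown
                        committed = true;
                    } catch (IOException e) {
                        // leave the tlog uncapped so it can be replayed at startup
                    }
                    try {
                        iw.close();
                    } catch (IOException ignored) {
                    }
                    if (committed) {
                        tlog.cap();    // safe: the commit we observed actually succeeded
                    }
                }
            }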

          Yonik Seeley added a comment - edited

          Unfortunately, the chaos monkey tests have gathered some rust due to not being run regularly.

          I've added some instrumentation for one particular type of fail (number of docs mismatch between control and collection) and ran into something peculiar.

          selected lines from test_report_1.txt

          6952 T39 oasc.ZkController.register Register shard - core:collection1 address:http://127.0.0.1:60440/ij/pf shardId:shard1

          9734 T56 oasc.ZkController.register Register shard - core:collection1 address:http://127.0.0.1:48048/ij/pf shardId:shard1

          12184 T71 oasc.ZkController.register Register shard - core:collection1 address:http://127.0.0.1:33264/ij/pf shardId:shard2
          
          59069 T27 C12 P60440 /update {wt=javabin&version=2} {add=[50067 (1421643212854919168)]} 0 3
          
          89123 T48 C1 P48048 /update {wt=javabin&version=2} {add=[50067 (1421643244360433664)]} 0 10
          
          89354 T25 C12 P60440 /update {wt=javabin&version=2} {delete=[50067 (-1421643244613140480)]} 0 1
          
          89364 T47 C1 P48048 /update {wt=javabin&version=2} {delete=[50067 (-1421643244618383360)]} 0 6
          
          90114 T63 C3 P33264 /update {wt=javabin&version=2} {add=[50067]} 0 31041
          
          217860 T10 oasc.AbstractFullDistribZkTestBase.checkShardConsistency SEVERE document count mismatch.  control=1460 sum(shards)=1461 cloudClient=1461
          [junit4:junit4]
          
          ###### Only in cloudDocList: [{id=50067}]
          

          The only updates we see from the logs for id 50067 are to collection1 (none to the control shard).

          Further, grepping for "Register shard" shows that they are all for collection1 and none for the control collection. Not sure how that's possible while only being off by 1, so I'm probably mis-interpreting the logging somehow...

          Commit Tag Bot added a comment -

          [trunk commit] Yonik Seeley
          http://svn.apache.org/viewvc?view=revision&revision=1423275

          SOLR-3180: improve logging

          Commit Tag Bot added a comment -

          [branch_4x commit] Yonik Seeley
          http://svn.apache.org/viewvc?view=revision&revision=1423276

          SOLR-3180: improve logging

          Commit Tag Bot added a comment -

          [trunk commit] Yonik Seeley
          http://svn.apache.org/viewvc?view=revision&revision=1423591

          SOLR-3180: improve logging

          Commit Tag Bot added a comment -

          [branch_4x commit] Yonik Seeley
          http://svn.apache.org/viewvc?view=revision&revision=1423597

          SOLR-3180: improve logging

          Yonik Seeley added a comment -

          Uploading fail.inconsistent.txt - I had to truncate the start of the log file to get it under the limit for JIRA.

          Analysis:

          
            2> ASYNC  NEW_CORE C6 name=collection1 org.apache.solr.core.SolrCore@eaecb09 url=http://127.0.0.1:58270/collection1 node=127.0.0.1:58270_ C6_STATE=coll:control_collection core:collection1 props:{shard=shard1, roles=null, state=active, core=collection1, collection=control_collection, node_name=127.0.0.1:58270_, base_url=http://127.0.0.1:58270, leader=true}
          
            2> ASYNC  NEW_CORE C5 name=collection1 org.apache.solr.core.SolrCore@54eeabe8 url=http://127.0.0.1:37198/collection1 node=127.0.0.1:37198_ C5_STATE=coll:collection1 core:collection1 props:{shard=shard3, roles=null, state=active, core=collection1, collection=collection1, node_name=127.0.0.1:37198_, base_url=http://127.0.0.1:37198, leader=true}
            2> 25510 T80 C5 P37198 REQ /get {distrib=false&qt=/get&wt=javabin&version=2&getVersions=100} status=0 QTime=0 
          
          
            2> 187637 T669 C21 P39620 oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:39620 DONE. sync succeeded
            2> 188653 T669 C21 P39620 oasc.RecoveryStrategy.doRecovery PeerSync Recovery was successful - registering as Active. core=collection1
          
          #
          # C21 (the replica) is recovering around the same time that the update for id:52720 comes in (we only
          # see when the update finishes below, not when it starts).
          #
          
          # update control finished
            2> 187923 T24 C6 P58270 /update {wt=javabin&version=2} {add=[52720 (1422073056556220416)]} 0 1
          # update leader for shard3 finished
            2> 187927 T77 C5 P37198 /update {wt=javabin&version=2} {add=[52720 (1422073056559366144)]} 0 1
          # these are the only adds for id:52720 in the logs...
          # TODO: verify that there was no replica for C5 to forward to?
          
          
          --------------------
            2> 225993 T77 C5 P37198 REQ /select {tests=checkShardConsistency&q=*:*&distrib=false&wt=javabin&rows=0&version=2} hits=835 status=0 QTime=1 
          # Note that C5 is still the leader - this means that C21 recovered from it at some point?
          
            2> 225997 T658 C21 P39620 REQ /select {tests=checkShardConsistency&q=*:*&distrib=false&wt=javabin&rows=0&version=2} hits=833 status=0 QTime=1 
            2>  live:true
            2>  num:833
            2> 
            2> ######shard3 is not consistent.  Got 835 from http://127.0.0.1:37198/collection1lastClient and got 833 from http://127.0.0.1:39620/collection1
           
            2> ###### sizes=835,833
            2> ###### Only in http://127.0.0.1:37198/collection1: [{id=52720, _version_=1422073056559366144}, {id=52710, _version_=1422073056325533696}, {id=52717, _version_=1422073056485965825}, {id=2225, _version_=1422073056602357760}, {id=52709, _version_=1422073056298270720}, {id=2226, _version_=1422073056612843520}, {id=2219, _version_=1422073056477577216}, {id=52723, _version_=1422073056605503488}]
            2> ###### Only in http://127.0.0.1:39620/collection1: [{id=52680, _version_=1422073042480136192}, {id=52669, _version_=1422073042276712448}, {id=52676, _version_=1422073042420367360}, {id=2204, _version_=1422073042912149504}, {id=2198, _version_=1422073042778980352}, {id=2207, _version_=1422073053454532608}]
          

          So what looks to be the case is that we came up and did a peersync with the leader and it succeeded, but the leader hadn't noticed us yet and so didn't forward us a new update in the meantime.

          If we don't already, we need to do some of the same things we do for replication recovery: the replica needs to ensure that the leader sees it (and hence will forward future updates) before it peersyncs with the leader.
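
          A minimal sketch of that ordering, using hypothetical helper interfaces rather than Solr's actual RecoveryStrategy/ZkController API: publish the recovering state, wait until the leader has observed it (so it will forward new updates to us), and only then run the peer sync.

            public class PeerSyncOrderingSketch {
                interface ZkController { void publishRecovering(); boolean leaderSeesUsRecovering(); }
                interface PeerSync { boolean sync(); }

                // Ensure the leader will forward future updates to this replica before syncing,
                // closing the window where an update lands on the leader but never reaches us.
                static boolean recover(ZkController zk, PeerSync peerSync) throws InterruptedException {
                    zk.publishRecovering();
                    while (!zk.leaderSeesUsRecovering()) {
                        Thread.sleep(100);  // poll until the leader has seen our recovering state
                    }
                    return peerSync.sync();
                }
            }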

          Commit Tag Bot added a comment -

          [trunk commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1425342

          SOLR-3180: wait for leader to see our recovering state before peer sync as well

          Mark Miller added a comment -

          I've got fullmetaljenkins.org on chaos monkey duty: http://fullmetaljenkins.org/view/Chaos/

          I'm running the test with a variety of situations to help narrow in on the problem.

          I've also started checking old checkouts - did that some last night, but with somewhat inconclusive results. Hope to make more progress.

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1425343

          SOLR-3180: wait for leader to see our recovering state before peer sync as well

          Commit Tag Bot added a comment -

          [trunk commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1425554

          SOLR-3180: remove bad assumption

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1425555

          SOLR-3180: remove bad assumption

          Commit Tag Bot added a comment -

          [trunk commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1425561

          LUCENE-4638, SOLR-3180: try using the IW's writeLock to unlock

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1425563

          LUCENE-4638, SOLR-3180: try using the IW's writeLock to unlock

          Commit Tag Bot added a comment -

          [trunk commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1425574

          LUCENE-4638, SOLR-3180: revert for now (try using the IW's writeLock to unlock)

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1425576

          LUCENE-4638, SOLR-3180: revert for now (try using the IW's writeLock to unlock)

          Mark Miller added a comment -

          The "remove bad assumption" commit takes care of the final shard consistency issues. Those were two holes that I was somehow not seeing when running this test previously. We should probably spin them out into their own issues for CHANGES.

          There is still a very rare issue to resolve where the cloud doc count is off by one from the control; it's still unclear whether it's a real issue or a test issue.

          The last set of commits was to try to take care of a somewhat unrelated issue around index locks. It didn't work; it just moved the blow-up point. It's only related to this issue in that I suspected it as part of an IndexSearcher-not-being-closed fail that made 'false' fails annoying to wade through.

          Yonik Seeley added a comment -

          I got a hang running the CMSL test.
          Here's a thread dump.

          I took another thread dump a few minutes later to see what changed, and it wasn't much:

          /opt/code/misc$ diff -u CMSL_hang.txt CMSL_hang2.txt
          --- CMSL_hang.txt	2012-12-28 12:21:47.000000000 -0500
          +++ CMSL_hang2.txt	2012-12-28 12:24:37.000000000 -0500
          @@ -1,4 +1,4 @@
          -2012-12-28 12:21:47
          +2012-12-28 12:24:38
           Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.4-b02 mixed mode):
           
           "Attach Listener" daemon prio=10 tid=0x0000000042790000 nid=0x1df4 waiting on condition [0x0000000000000000]
          @@ -1559,7 +1559,7 @@
              java.lang.Thread.State: TIMED_WAITING (on object monitor)
           	at java.lang.Object.wait(Native Method)
           	at org.apache.solr.cloud.DistributedQueue$LatchChildWatcher.await(DistributedQueue.java:182)
          -	- locked <0x00000000fd60df70> (a java.lang.Object)
          +	- locked <0x00000000f7ed6b70> (a java.lang.Object)
           	at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:281)
           	at org.apache.solr.cloud.OverseerCollectionProcessor.run(OverseerCollectionProcessor.java:100)
           	at java.lang.Thread.run(Thread.java:662)
          
          Yonik Seeley added a comment -

          Here's another hang.
          Diff between 2 snapshots at different times is:

              java.lang.Thread.State: TIMED_WAITING (on object monitor)
           	at java.lang.Object.wait(Native Method)
           	at org.apache.solr.cloud.DistributedQueue$LatchChildWatcher.await(DistributedQueue.java:182)
          -	- locked <0x00000000f5821d30> (a java.lang.Object)
          +	- locked <0x00000000f5f2c7c0> (a java.lang.Object)
           	at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:281)
           	at org.apache.solr.cloud.OverseerCollectionProcessor.run(OverseerCollectionProcessor.java:100)
           	at java.lang.Thread.run(Thread.java:662)
          
          Yonik Seeley added a comment - edited

          Progress!
          Here's my analysis of one of the failures where the control doesn't match the cloud:

          
            2> 115826 T24 C12 P44177 /update {wt=javabin&version=2} {add=[50247 (1422479033427296256)]} 0 11
          
            2> 145877 T61 C4 P38042 /update {wt=javabin&version=2} {add=[50247 (1422479064932810752)]} 0 17
          
            2> 146263 T26 C12 P44177 /update {wt=javabin&version=2} {delete=[50247 (-1422479065354338304)]} 0 0
          
            2> 146268 T63 C4 P38042 /update {wt=javabin&version=2} {delete=[50247 (-1422479065356435456)]} 0 3
          
            2> 146867 T46 C1 P60593 /update {wt=javabin&version=2} {add=[50247]} 0 31033
          
            2> ###### Only in cloudDocList: [{id=50247}]
            2> 190158 T10 oasc.AbstractFullDistribZkTestBase.checkShardConsistency SEVERE controlClient :{numFound=0,start=0,docs=[]}
            2> 		cloudClient :{numFound=1,start=0,docs=[SolrDocument{id=50247, _version_=1422479065981386752}]}
           
            NOTE: The version we found does not match any version previously added, and it does come logically after the last delete!
          
          

          edit: the upshot is that it looks like there's a real bug here somewhere - this is not just a test issue.

          Yonik Seeley added a comment -

          With the logging fixed, I can see more of what the problem is. It appears as if two updates for document 50030 are happening simultaneously, which shouldn't happen since the tests don't do this.

            2> 65234 T28 C12 P56307 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948614358958080
            2> 65235 T28 C12 P56307 /update {wt=javabin&version=2} {add=[50030 (1422948614358958080)]} 0 2
           
            2> 65242 T61 C4 P44328 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948614366298112
            2> 65245 T103 C1 P59742 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948614366298112
            2> 65246 T160 C7 P51177 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948614366298112
            2> 65247 T103 C1 P59742 /update {distrib.from=http://127.0.0.1:44328/fa/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[50030 (1422948614366298112)]} 0 2
            2> 65247 T160 C7 P51177 /update {distrib.from=http://127.0.0.1:44328/fa/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[50030 (1422948614366298112)]} 0 2
           
            2> 80763 T59 C4 P44328 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948630641246208
            2> 80773 T158 C7 P51177 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948630641246208
            2> 80776 T158 C7 P51177 /update {distrib.from=http://127.0.0.1:44328/fa/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[50030 (1422948630641246208)]} 0 5
           
            2> 85279 T61 C4 P44328 /update {distrib.from=http://127.0.0.1:36869/fa/collection1/&update.distrib=TOLEADER&wt=javabin&version=2} {add=[50030 (1422948614366298112)]} 0 20038
          
            2> 87884 T213 C10 P42076 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948630641246208
            2> 87885 T213 C10 P42076 /update {distrib.from=http://127.0.0.1:44328/fa/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[50030 (1422948630641246208)]} 0 7116
          
            2> 87973 T216 C10 P42076 /update {distrib.from=http://127.0.0.1:44328/fa/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[50030 (1422948614366298112)]} 0 22728
          
            2> 95272 T61 C4 P44328 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948645855035392
            2> 95277 T158 C7 P51177 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948645855035392
            2> 95281 T158 C7 P51177 /update {distrib.from=http://127.0.0.1:44328/fa/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[50030 (1422948645855035392)]} 0 5
            2> 95283 T61 C4 P44328 /update {wt=javabin&version=2} {add=[50030 (1422948645855035392)]} 0 12
           
          
            2> 95326 T27 C12 P56307 /update {wt=javabin&version=2} {delete=[50030 (-1422948645912707072)]} 0 0
            2> 95336 T159 C7 P51177 /update {distrib.from=http://127.0.0.1:44328/fa/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {delete=[50030 (-1422948645918998528)]} 0 1
            2> 95338 T61 C4 P44328 /update {wt=javabin&version=2} {delete=[50030 (-1422948645918998528)]} 0 7
          
          
            2> 96276 T475 C4 P44328 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948646907805696
            2> 96286 T156 C7 P51177 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948646907805696
            2> 96290 T156 C7 P51177 /update {distrib.from=http://127.0.0.1:44328/fa/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[50030 (1422948646907805696)]} 0 5
            2> 96292 T475 C4 P44328 /update {distrib.from=http://127.0.0.1:36869/fa/collection1/&update.distrib=TOLEADER&wt=javabin&version=2} {add=[50030 (1422948646907805696)]} 0 18
            2> 96293 T77 C6 P36869 /update {wt=javabin&version=2} {add=[50030]} 0 31054
          
          
            2> 97738 T59 C4 P44328 /update {distrib.from=http://127.0.0.1:36869/fa/collection1/&update.distrib=TOLEADER&wt=javabin&version=2} {add=[50030 (1422948630641246208)]} 0 16977
          
          
            2> 98396 T103 C1 P59742 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948630641246208
            2> 98397 T103 C1 P59742 /update {distrib.from=http://127.0.0.1:44328/fa/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[50030 (1422948630641246208)]} 0 17627
          
            2> 105349 T488 C1 P59742 oasu.PeerSync.handleUpdates FINE PeerSync: core=collection1 url=http://127.0.0.1:59742/fa raw update record [1, 1422948645855035392, SolrInputDocument[id=50030, a_si=50, other_tl1=50, a_t=to come to the aid of their country., rnd_b=true, _version_=1422948645855035392]]
            2> 105350 T488 C1 P59742 oasu.PeerSync.handleUpdates FINE PeerSync: core=collection1 url=http://127.0.0.1:59742/fa add add{flags=12,_version_=1422948645855035392,id=50030}
            2> 105351 T488 C1 P59742 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=50030 version=1422948645855035392
           
          
            2> 105356 T488 C1 P59742 oasu.PeerSync.handleUpdates FINE PeerSync: core=collection1 url=http://127.0.0.1:59742/fa raw update record [2, -1422948645918998528, [B@3b76982e]
            2> 105356 T488 C1 P59742 oasu.PeerSync.handleUpdates FINE PeerSync: core=collection1 url=http://127.0.0.1:59742/fa delete delete{flags=12,_version_=-1422948645918998528,indexedId=50030,commitWithin=-1}
            2> 105356 T488 C1 P59742 oasu.PeerSync.handleUpdates FINE PeerSync: core=collection1 url=http://127.0.0.1:59742/fa raw update record [1, 1422948646907805696, SolrInputDocument[id=50030, a_si=50, other_tl1=50, a_t=to come to the aid of their country., rnd_b=true, _version_=1422948646907805696]]
            2> 105356 T488 C1 P59742 oasu.PeerSync.handleUpdates FINE PeerSync: core=collection1 url=http://127.0.0.1:59742/fa add add{flags=12,_version_=1422948646907805696,id=50030}
          
          
          
            2> ###### Only in cloudDocList: [{id=50030}]
            2> 203443 T10 oasc.AbstractFullDistribZkTestBase.checkShardConsistency SEVERE controlClient :{numFound=0,start=0,docs=[]}
            2> 		cloudClient :{numFound=1,start=0,docs=[SolrDocument{id=50030, _version_=1422948646907805696}]}
          

          As usual C12 is the control.
          You can see an add complete on C4 and on C6:

            2> 95283 T61 C4 P44328 /update {wt=javabin&version=2} {add=[50030 (1422948645855035392)]} 0 12
            2> 96293 T77 C6 P36869 /update {wt=javabin&version=2} {add=[50030]} 0 31054
          
          Commit Tag Bot added a comment -

          [trunk commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1427589

          SOLR-3180: remove bad method call

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1427590

          SOLR-3180: remove bad method call

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1427658

          SOLR-3180: update node props for retry node, not just url

          Commit Tag Bot added a comment -

          [trunk commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1427657

          SOLR-3180: update node props for retry node, not just url

          Yonik Seeley added a comment -

          OK, here's a fail with more logging.

          What finally manifests as an error that the test detects seems to be caused by retries (but there are still threads hanging around from the first attempt that eventually execute).

          I didn't follow this fail all the way through to the end since I saw enough potentially wonky stuff earlier.

          C14 tries to publish that it is active, fails because the ChaosMonkey caused a connection loss, and then future updates to C14 hang for a long time.

          
          ###  Since we know C14 (P56842) seems problematic we look back in time before the problematic update.
            2> ASYNC  NEW_CORE C14 name=collection1 org.apache.solr.core.SolrCore@15c617e0 url=http://127.0.0.1:56842/collection1 node=127.0.0.1:56842_ C14_STATE=coll:collection1 core:collection1 props:{shard=shard1, roles=null, state=active, core=collection1, collection=collection1, node_name=127.0.0.1:56842_, base_url=http://127.0.0.1:56842}
            2> 73226 T323 C14 P56842 oasc.RecoveryStrategy.run Starting recovery process.  core=collection1 recoveringAfterStartup=true
           
            2> 73233 T323 C14 P56842 oasc.ZkController.publish publishing core=collection1 state=recovering
          
            2> 75058 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 56842
            2> 75059 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 56842 !
            2> 75059 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
           
            2>  C14_STATE=coll:collection1 core:collection1 props:{shard=shard1, roles=null, state=recovering, core=collection1, collection=collection1, node_name=127.0.0.1:56842_, base_url=http://127.0.0.1:56842}
            2> 77316 T323 C14 P56842 oasc.RecoveryStrategy.doRecovery Attempting to PeerSync from http://127.0.0.1:41533/collection1/ core=collection1 - recoveringAfterStartup=true
          
          
            2> 77328 T323 C14 P56842 oasu.PeerSync.handleVersions PeerSync: core=collection1 url=http://127.0.0.1:56842  Our versions are newer. ourLowThreshold=1423033399394697216 otherHigh=1423033403051081728
            2> 77329 T323 C14 P56842 oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:56842 DONE. sync succeeded
           
          ##### OK, perhaps the real problem starts here, when we try to publish as "active" but
          ##### we have issues talking to ZK (because CM caused a connection loss previously?)
            2> 77478 T323 C14 P56842 oasc.ZkController.publish publishing core=collection1 state=active
            2> 77480 T319 oaz.ClientCnxn$SendThread.run WARNING Session 0x13bfa14a03d0020 for server null, unexpected error, closing socket connection and attempting reconnect java.nio.channels.CancelledKeyException
          
          ##### The first request that comes in for us is on T314, but this is the last we hear from this thread until time 109571 (same as the problematic add)
            2> 82034 T314 C14 P56842 oasup.LogUpdateProcessor.processDelete PRE_UPDATE DELETE delete{flags=0,_version_=-1423033432318935040,id=60072,commitWithin=-1} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
          
           
          ############## start of troublesome update
          
            2> 82115 T28 C12 P49621 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {wt=javabin&version=2}
            2> 82116 T28 C12 P49621 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033432421695488
            2> 82117 T28 C12 P49621 PRE_UPDATE FINISH  {wt=javabin&version=2}
            2> 82118 T28 C12 P49621 /update {wt=javabin&version=2} {add=[10079 (1423033432421695488)]} 0 3
          
          
          ##### Below, C4 receives the update, forwards it to the leader C2, which forwards it to C14 and C0.  Looking ahead, it looks like C14 is a problem node - this update takes a long time to complete.
          
            2> 82121 T59 C4 P48210 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {wt=javabin&version=2}
            2> 82121 T59 C4 P48210 PRE_UPDATE FINISH  {wt=javabin&version=2}
            2> 82124 T44 C2 P41533 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
            2> 82124 T44 C2 P41533 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033432430084096
            2> 82125 T44 C2 P41533 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
            2> 82127 T315 C14 P56842 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 82128 T89 C0 P56785 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 82128 T89 C0 P56785 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033432430084096
            2> 82129 T89 C0 P56785 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 82129 T89 C0 P56785 /update {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10079 (1423033432430084096)]} 0 1
          
            2> 82247 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 56785
            2> 82247 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 56785 !
            2> 82248 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
          ##### 56785 is C0, but it looks like we already completed our update forwarded from C2 (although it's always possible the response was not completely written yet)
          
          
          
          ##### So C14 hung for some reason, but it doesn't look like the chaos monkey touched him!
            2> 97029 T46 C2 P41533 oasc.SolrException.log SEVERE shard update error StdNode: http://127.0.0.1:56842/collection1/:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:56842/collection1
            2> 97031 T46 C2 P41533 oasup.DistributedUpdateProcessor.doFinish try and ask http://127.0.0.1:56842 to recover
          
          ##### I presume this is C14, and it got the request to recover
            2> 97044 T316 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been requested that we recover
          
          ##### It looks like C2 finally returned failure to C4, which retries sending to the leader
            2> 97133 T59 C4 P48210 oasc.SolrException.log SEVERE forwarding update to http://127.0.0.1:41533/collection1/ failed - retrying ... 
          
          ##### Another timeout from the leader to that node - not sure what request this was...
            2> 97135 T44 C2 P41533 oasc.SolrException.log SEVERE shard update error StdNode: http://127.0.0.1:56842/collection1/:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:56842/collection1
          
          ##### But we ask the node to recover again, and it looks like the node gets that request again
            2> 97137 T44 C2 P41533 oasup.DistributedUpdateProcessor.doFinish try and ask http://127.0.0.1:56842 to recover
            2> 97138 T44 C2 P41533 oascsi.HttpClientUtil.createClient Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
            2> 97148 T311 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been requested that we recover
           
          
          ##### Below, the request was re-sent to the leader C2 from C4
          ##### Either the top level cloud client timed things out and re-sent, or the command distributor in C4 re-sent.
            2> 97639 T47 C2 P41533 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
            2> 97641 T47 C2 P41533 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033448699789312
            2> 97643 T47 C2 P41533 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
            2> 97648 T313 C14 P56842 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 97648 T93 C0 P56785 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
           
          
          ##### Why did this time out?  We saw C14 get the request... perhaps more needs to be done asynchronously on the receiving end of such a request? 
          ##### Also, what's happening on C14 and C0 right now???
             2> 102044 T46 C2 P41533 oasup.DistributedUpdateProcessor.doFinish Could not tell a replica to recover org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:56842
          
          
          ##### The original hung update to C14 finally finished after 27 sec.
            2> 109516 T419 oascc.ConnectionManager.process Watcher org.apache.solr.common.cloud.ConnectionManager@58046530 name:ZooKeeperConnection Watcher:127.0.0.1:40151/solr got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
            2> 109518 T320 oascc.ConnectionManager.waitForConnected Client is connected to ZooKeeper
            2> 109518 T320 oascc.ConnectionManager$1.update Connection with ZooKeeper reestablished.
            2> 109519 T320 oasc.RecoveryStrategy.close WARNING Stopping recovery for zkNodeName=127.0.0.1:56842__collection1core=collection1
           
            2> 109564 T313 C14 P56842 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033448699789312
            2> 109571 T314 C14 P56842 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 109571 T313 C14 P56842 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 109571 T315 C14 P56842 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 109572 T314 C14 P56842 /update {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {delete=[60072 (-1423033432318935040)]} 0 27538
            2> 109572 T313 C14 P56842 /update {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10079 (1423033448699789312)]} 0 11925
            2> 109573 T315 C14 P56842 /update {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10079 (1423033432430084096)]} 0 27446
           
          
            2> ###### Only in cloudDocList: [{id=10079}]
          
            2> 207369 T545 C22 P57160 REQ /select {fl=id,_version_&shard.url=127.0.0.1:41533/collection1/|127.0.0.1:57160/collection1/&NOW=1357110561123&tests=checkShardConsistency(vsControl)/getVers&q=id:(+10079)&ids=10079&distrib=false&isShard=true&wt=javabin&rows=100000&version=2} status=0 QTime=1 
            2> 207372 T10 oasc.AbstractFullDistribZkTestBase.checkShardConsistency SEVERE controlClient :{numFound=0,start=0,docs=[]}
            2> 		cloudClient :{numFound=1,start=0,docs=[SolrDocument{id=10079, _version_=1423033464957960192}]}
           
          
          Yonik Seeley added a comment -

          I've added a testing hook for failures in the command distributor, and hooked it up for the chaos monkey safe leader test to do a thread dump on socket timeout. Hopefully this will help nail down why some of the requests are taking so long.
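          For context, a rough sketch of what such a hook could look like is below. It is illustrative only: the names (TimeoutHook, threadDumpHandler) are hypothetical and not the actual Solr test API; only the ThreadMXBean usage is standard JDK.

            import java.lang.management.ManagementFactory;
            import java.lang.management.ThreadInfo;
            import java.lang.management.ThreadMXBean;

            // Hypothetical test-only hook: the command distributor would call
            // TimeoutHook.onSocketTimeout(...) when a request to another node times
            // out, and the test installs a handler that dumps all live threads.
            public class TimeoutHook {

              public interface Handler {
                void onTimeout(String message);
              }

              // Installed by the test; stays null in normal code paths.
              public static volatile Handler handler;

              public static void onSocketTimeout(String message) {
                Handler h = handler;
                if (h != null) h.onTimeout(message);
              }

              // A handler that logs every live thread, similar in spirit to the
              // "REQUESTING THREAD DUMP DUE TO TIMEOUT" output in the logs below.
              public static Handler threadDumpHandler() {
                return message -> {
                  StringBuilder sb = new StringBuilder("THREAD DUMP DUE TO TIMEOUT: " + message + "\n");
                  ThreadMXBean bean = ManagementFactory.getThreadMXBean();
                  for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
                    sb.append(info.toString());
                  }
                  System.err.println(sb);
                };
              }
            }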

          Yonik Seeley added a comment -

          Here's a log with some notes on the first timeout. Chaos monkey decides to cause connection loss to ZK at 48 sec. At 60 sec, two requests start that don't seem to finish until 83 sec into the test. They seem to be blocked in zkCheck().

            2> 48972 T260 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 42854
            2> 48973 T260 oasc.ChaosMonkey.monkeyLog monkey: expire session for 42854 !
            2> 48975 T260 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
          
           
            2> 52127 T122 C3 P42854 /update {distrib.from=http://127.0.0.1:51342/r_/f/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {delete=[60058 (-1423156803769729024)]} 0 1
            2> 60518 T120 C3 P42854 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10063} {distrib.from=http://127.0.0.1:51342/r_/f/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 60786 T124 C3 P42854 oasup.LogUpdateProcessor.processDelete PRE_UPDATE DELETE delete{flags=0,_version_=-1423156812849348608,id=60059,commitWithin=-1} {distrib.from=http://127.0.0.1:51342/r_/f/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
          
            2> 75667 T74 C6 P51342 oasc.Diagnostics.logThreadDumps SEVERE REQUESTING THREAD DUMP DUE TO TIMEOUT: Timeout occured while waiting response from server at: http://127.0.0.1:42854/r_/f/collection1
          [...]
            2> 	"qtp1333272771-124" Id=124 TIMED_WAITING
            2> 		at java.lang.Thread.sleep(Native Method)
            2> 		at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:925)
            2> 		at org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:699)
            2> 		at org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessor.java:97)
            2> 		at org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:346)
            2> 		at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:277)
            2> 		at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
            2> 		at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
          [...]
            2> 	"qtp1333272771-120" Id=120 TIMED_WAITING
            2> 		at java.lang.Thread.sleep(Native Method)
            2> 		at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:925)
            2> 		at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:330)
            2> 		at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessor.java:76)
            2> 		at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
            2> 		at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
            2> 		at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
            2> 		at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
          
            2> 83174 T124 C3 P42854 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:51342/r_/f/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 83174 T120 C3 P42854 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:51342/r_/f/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
          
          
          Mark Miller added a comment -

          zkCheck currently waits up to the length of the client timeout to see if the connection comes back before expiration...
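          In other words, the update thread sits in a loop shaped roughly like the sketch below (a simplified sketch, not the actual DistributedUpdateProcessor code; isConnected() stands in for the real ZooKeeper connection-state check). This is exactly the Thread.sleep frame showing up in the dumps above.

            // Simplified sketch, not the actual DistributedUpdateProcessor.zkCheck():
            // poll the ZooKeeper connection state, sleeping briefly, until it comes
            // back or the client timeout elapses, then fail the request.
            class ZkCheckSketch {
              private final long clientTimeoutMs;

              ZkCheckSketch(long clientTimeoutMs) {
                this.clientTimeoutMs = clientTimeoutMs;
              }

              void zkCheck() {
                long deadline = System.nanoTime() + clientTimeoutMs * 1_000_000L;
                while (!isConnected()) {
                  if (System.nanoTime() > deadline) {
                    throw new IllegalStateException("Cannot talk to ZooKeeper - updates are disabled.");
                  }
                  try {
                    Thread.sleep(100); // the TIMED_WAITING sleep seen in the thread dumps
                  } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                  }
                }
              }

              // Stand-in for the real ZooKeeper connection-state check.
              boolean isConnected() {
                return true;
              }
            }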

          Yonik Seeley added a comment - - edited

          Here's another interesting tidbit:

            2>    "recoveryExecutor-140-thread-1" Id=320 TIMED_WAITING
            2>            at java.lang.Thread.sleep(Native Method)
            2>            at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:925)
            2>            at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:330)
            2>            at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1268)
            2>            at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1159)
            2>            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
            2>            at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
            2>            at java.util.concurrent.FutureTask.run(FutureTask.java:138)
            2>            ...
            2>    
            2>            Number of locked synchronizers = 1
            2>            - java.util.concurrent.locks.ReentrantLock$NonfairSync@1f2eb71e
          

          Seems like we shouldn't check ZK if the update is from log replay or from peersync (basically if distrib=false?)

          edit: I just committed SOLR-4257, which should handle these cases.
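          A minimal sketch of that kind of guard is below (illustrative only, not the exact SOLR-4257 change; the flag constants are made up): the ZooKeeper check is skipped for updates that are being applied locally from log replay or peer sync.

            // Illustrative only - not the exact SOLR-4257 patch; flag values are made up.
            // Updates applied locally from log replay or peer sync never need to be
            // re-distributed, so a lost ZK connection should not block them in zkCheck().
            class UpdateGuardSketch {
              static final int REPLAY = 1;     // hypothetical flag bits
              static final int PEER_SYNC = 2;

              boolean zkEnabled = true;

              void processAdd(int flags) {
                boolean isReplayOrPeersync = (flags & (REPLAY | PEER_SYNC)) != 0;
                if (zkEnabled && !isReplayOrPeersync) {
                  zkCheck();
                }
                // ... versioning and the local add would follow here ...
              }

              void zkCheck() {
                // waits for the ZK connection, as in the sketch above
              }
            }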

          Yonik Seeley added a comment - - edited

          Here's an analyzed log that I traced all the way to the end.
          The issues involved are all timeout related (socket timeouts).
          Timing out an update request in general is bad, since the request itself normally continues on and can finish at some point in the future.
          We should strive to time out only requests that are truly / hopelessly hung.

          
            2> 54461 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 35930
            2> 54462 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 35930 !
            2> 54473 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
          
          # NOTE: for some reason, this didn't seem to slow 35930 down... it's still accepting and processing updates at this point!
          
            2> 59545 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 35930
            2> 59545 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 35930 !
            2> 59546 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
           
          
            2> 65367 T48 C2 P47815 PRE_UPDATE FINISH  {wt=javabin&version=2}
            2> 65367 T48 C2 P47815 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {wt=javabin&version=2}
          # C2 is forwarding to the leader, C6 (P59996)
          # NOTE: this is the same thread that finally times out
          
            2> 65372 T75 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
            2> 65374 T75 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189915618770944
          
            2> 65384 T179 C8 P40531 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 65384 T229 C11 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 65384 T124 C3 P35930 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 65385 T179 C8 P40531 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189915618770944
            2> 65385 T229 C11 P44046 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 65387 T229 C11 P44046 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10070 (1423189915618770944)]} 0 3
            2> 65445 T179 C8 P40531 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10070 (1423189915618770944)]} 0 62
          # C8 and C11 finish, but C3 (P35930) does not
          
          
            2> 80583 T48 C2 P47815 oasc.Diagnostics.logThreadDumps SEVERE REQUESTING THREAD DUMP DUE TO TIMEOUT: Timeout occured while waiting response from server at: http://127.0.0.1:59996/to_/y/collection1
          
            2> 	"qtp330001436-124" Id=124 TIMED_WAITING
            2> 		at java.lang.Thread.sleep(Native Method)
            2> 		at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:944)
            2> 		at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:342)
            2> 		at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessor.java:76)
            2> 		at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
            2> 		at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
            2> 		at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
            2> 		at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
           
          # C3 is waiting for the ZK connection
          
            2> 80926 T75 C6 P59996 oasc.SolrException.log SEVERE shard update error StdNode: http://127.0.0.1:35930/to_/y/collection1/:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:35930/to_/y/collection1
            2> 80632 T73 C6 P59996 oasc.SolrException.log SEVERE shard update error StdNode: http://127.0.0.1:35930/to_/y/collection1/:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:35930/to_/y/collection1
          
            2> 80603 T48 C2 P47815 oasc.SolrException.log SEVERE forwarding update to http://127.0.0.1:59996/to_/y/collection1/ failed - retrying ... 
            2> 80955 T75 C6 P59996 oasup.DistributedUpdateProcessor.doFinish try and ask http://127.0.0.1:35930/to_/y to recover
            2> 80956 T73 C6 P59996 oasup.DistributedUpdateProcessor.doFinish try and ask http://127.0.0.1:35930/to_/y to recover
           
          # there are actually 2 updates failing, hence the 2 requests to recover
          # It looks like C2 retries forwarding to the leader just at about the same time that the leader times out the requests to the replicas and asks them to recover
          
            2> 80965 T120 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been requested that we recover
            2> 80971 T121 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been requested that we recover
          
          
            2> 80972 T75 C6 P59996 /update {distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2} {add=[10070 (1423189915618770944)]} 0 15600
          ######### This is success because we added the doc to multiple shards and asked the shard that failed to recover?
          
            2> 81461 T76 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
          ######### Now this is the retry from C2 sending to the leader, but we've already counted the previous update a success on C6 (but C2 had already timed it out)
          
            2> 81463 T76 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189932489310208
            2> 81465 T76 C6 P59996 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
          
            2> 81471 T122 C3 P35930 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 81486 T324 C15 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 81489 T324 C15 P44046 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189932489310208
            2> 81497 T324 C15 P44046 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10070 (1423189932489310208)]} 0 11
          
          
          ############# The following update doesn't have "forward to leader" on it!  This must be a retry from
          ############# the cloud client?
            2> 95400 T74 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {wt=javabin&version=2}
            2> 95402 T74 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189947105411072
            2> 95404 T74 C6 P59996 PRE_UPDATE FINISH  {wt=javabin&version=2}
          
            2> 95410 T327 C15 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 95411 T327 C15 P44046 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189947105411072
            2> 95414 T327 C15 P44046 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 95415 T327 C15 P44046 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10070 (1423189947105411072)]} 0 6
           
            2>  C16_STATE=coll:collection1 core:collection1 props:{shard=shard3, roles=null, state=active, core=collection1, collection=collection1, node_name=127.0.0.1:40531_to_%2Fy, base_url=http://127.0.0.1:40531/to_/y}
            2> 95420 T358 C16 P40531 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 95424 T358 C16 P40531 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189947105411072
            2> 95432 T358 C16 P40531 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 95433 T358 C16 P40531 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10070 (1423189947105411072)]} 0 13
          
            2> 95436 T74 C6 P59996 /update {wt=javabin&version=2} {add=[10070 (1423189947105411072)]} 0 36
          ############## The update is a success because apparently only C15 and C16 are active replicas of the shard?  
          
          
            2> 95465 T124 C3 P35930 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
          
          ############# Here comes the delete for the document, since the second add finished! (even though the first is still hung up)
          ############# This must mean that the cloud client timed out the request?
          
            2> 96296 T25 C12 P52041 oasup.LogUpdateProcessor.processDelete PRE_UPDATE DELETE delete{flags=0,_version_=0,id=10070,commitWithin=-1} {wt=javabin&version=2}
            2> 96297 T25 C12 P52041 /update {wt=javabin&version=2} {delete=[10070 (-1423189948044935168)]} 0 1
            2> 96300 T75 C6 P59996 oasup.LogUpdateProcessor.processDelete PRE_UPDATE DELETE delete{flags=0,_version_=0,id=10070,commitWithin=-1} {wt=javabin&version=2}
          
          
            2> 96757 T48 C2 P47815 oasc.SolrException.log SEVERE forwarding update to http://127.0.0.1:59996/to_/y/collection1/ failed - retrying ... 
          #### try to forward to the leader again, and this is the update that finally sticks
          
            2> 97306 T77 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
            2> 97308 T77 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189949105045504
          
            2> 97315 T326 C15 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 97314 T359 C16 P40531 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070} {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 97321 T359 C16 P40531 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189949105045504
            2> 97323 T359 C16 P40531 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 97323 T359 C16 P40531 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10070 (1423189949105045504)]} 0 10
            2> 97321 T326 C15 P44046 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189949105045504
            2> 97325 T326 C15 P44046 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
            2> 97325 T326 C15 P44046 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10070 (1423189949105045504)]} 0 10
           
            2> 97326 T77 C6 P59996 /update {distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2} {add=[10070 (1423189949105045504)]} 0 20
          ##### And now this update succeeds (which was a retry between non-leader and leader of the original update)
          
          
            2> 97328 T48 C2 P47815 /update {wt=javabin&version=2} {add=[10070]} 0 31962
          #### This is the first (and only) time that the original update succeeded!
          
            2> ###### Only in cloudDocList: [{id=60073}, {id=10070}]
            2> 209316 T76 C6 P59996 REQ /select {fl=id,_version_&shard.url=127.0.0.1:59996/to_/y/collection1/|127.0.0.1:35930/to_/y/collection1/&NOW=1357259813791&tests=checkShardConsistency(vsControl)/getVers&q=id:(+60073+10070)&ids=60073,10070&distrib=false&isShard=true&wt=javabin&rows=100000&version=2} status=0 QTime=1 
            2> 209318 T10 oasc.AbstractFullDistribZkTestBase.checkShardConsistency SEVERE controlClient :{numFound=0,start=0,docs=[]}
            2> 		cloudClient :{numFound=2,start=0,docs=[SolrDocument{id=10070, _version_=1423189949105045504}, SolrDocument{id=60073, _version_=1423189949103996928}]}
          

          There was a lot of timeout / retry activity that could cause problems for other tests / scenarios, but this test is simpler
          because it waits for a response to the add before moving on to possibly delete that add. For this scenario, the
          retry that caused the issue was from the cloud client. It timed out its original update and retried it. The retry completed. Then the test deleted that document. Then the original update succeeded and added the doc back.

          Having the same timeouts on forwards to leaders as on forwards from leaders has turned out to be not so good. Because the former happens before the latter, if a replica update hangs, the to_leader update will time out and retry slightly before the from_leader request times out to the replica (and maybe succeeds by asking that replica to recover!).

          Q) A replica receiving a forward from a leader - do we really need to have a ZK connection to accept that update?
          Maybe so for defensive check reasons?

          Here's how I think we need to fix this:
          A) We need to figure out how long an update to a replica forwarded by the leader can reasonably take. Then we need to make the socket timeout be greater than that.
          B) We need to figure out how long an update to a leader can take (taking into account (A)), and make the socket timeout to the leader greater than that.
          C) We need to figure out how long an update to a non-leader (which is then forwarded to a leader) can take, and make the socket timeout from SolrJ servers to be greater than that.
          D) Make sure that the generic Jetty socket timeouts are greater than all of the above?

          If it's too hard to separate all these different socket timeouts now, then the best approximation
          would be to try and minimize the time any update can take, and raise all of the timeouts up high enough
          such that we should never see them.
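
          To make the ordering in (A)-(D) concrete, the budgets would have to nest roughly like this (all values made up purely for illustration):

            // Made-up values, purely to illustrate the required nesting of timeouts:
            // each hop's socket timeout must exceed the worst case of the hop below it.
            class TimeoutBudgetSketch {
              static final long REPLICA_UPDATE_MS = 30_000;                     // (A) leader -> replica
              static final long LEADER_UPDATE_MS  = REPLICA_UPDATE_MS + 15_000; // (B) node -> leader
              static final long CLIENT_UPDATE_MS  = LEADER_UPDATE_MS + 15_000;  // (C) SolrJ client -> node
              static final long JETTY_IDLE_MS     = CLIENT_UPDATE_MS + 15_000;  // (D) container-level timeout

              public static void main(String[] args) {
                if (!(REPLICA_UPDATE_MS < LEADER_UPDATE_MS
                    && LEADER_UPDATE_MS < CLIENT_UPDATE_MS
                    && CLIENT_UPDATE_MS < JETTY_IDLE_MS)) {
                  throw new AssertionError("timeout budgets do not nest");
                }
                System.out.println("timeout budgets nest correctly");
              }
            }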

          We should probably also take care to retry only in certain scenarios, for instance when we try to forward to a leader but can't reach it at all. We should retry on connect timeout, but never on socket timeout.
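
          Expressed as code, the distinction might look like the sketch below (an assumption-laden sketch, not the actual Solr/SolrJ retry logic; it inspects only the standard JDK exception types):

            import java.net.ConnectException;
            import java.net.SocketTimeoutException;

            // Sketch of the proposed retry policy, not the actual Solr/SolrJ code.
            // A connect failure means the request never reached the other node, so a
            // retry is safe; a socket (read) timeout means the node may still be
            // processing the update, so a retry risks applying it twice.
            class RetryPolicySketch {
              static boolean shouldRetry(Throwable t) {
                while (t != null) {
                  if (t instanceof ConnectException) {
                    return true;   // never connected - safe to retry
                  }
                  if (t instanceof SocketTimeoutException) {
                    return false;  // request may still be in flight - do not retry
                  }
                  t = t.getCause();
                }
                return false;
              }
            }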

          Mark Miller added a comment -

          I just noticed that the update socket ('so') and connection ('conn') timeouts being set for tests were reversed, and fixed that.
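
          In SolrJ these timeouts are typically supplied when building the HttpClient, roughly as in the sketch below (the millisecond values are arbitrary examples; confirm the property names against the HttpClientUtil version in use):

            import org.apache.http.client.HttpClient;
            import org.apache.solr.client.solrj.impl.HttpClientUtil;
            import org.apache.solr.common.params.ModifiableSolrParams;

            // Rough sketch of setting distinct socket ("so") and connection ("conn")
            // timeouts when creating an HttpClient via SolrJ's HttpClientUtil.
            public class UpdateClientTimeouts {
              public static HttpClient createUpdateClient() {
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 15000); // conn timeout
                params.set(HttpClientUtil.PROP_SO_TIMEOUT, 60000);         // socket (read) timeout
                return HttpClientUtil.createClient(params);
              }
            }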

          Mark Miller added a comment -

          Since making the above change, fullmetaljenkins in 'chaos duty mode' seems to be as happy as it's ever been: http://fullmetaljenkins.org/view/Chaos/

          I've also independently run 45 runs with an infinite timeout with no failures.

          Looks like we can move on to the leader kill test...

          Mark Miller added a comment -

          FYI, I've also raised the timeout from 30s to 60s just to be safe - we will see if that makes FreeBSD cry or not.

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1429029

          SOLR-3180: tests: timeouts from 30 sec to 60 sec

          Commit Tag Bot added a comment -

          [trunk commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1429027

          SOLR-3180: tests: timeouts from 30 sec to 60 sec

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0


            People

            • Assignee: Yonik Seeley
            • Reporter: Yonik Seeley
            • Votes: 0
            • Watchers: 2
