SOLR-7819

ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.2, 5.2.1
    • Fix Version/s: 5.4, 6.0
    • Component/s: SolrCloud
    • Labels:

      Description

      SOLR-7245 added a retryOnConnLoss parameter to ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads do not hang on ZK operations during a partition. However, some of those changes were unintentionally reverted by SOLR-7336 in 5.2.

      I found this while running Jepsen tests on 5.2.1 where a hung update managed to put a leader into a 'down' state (I'm still investigating and will open a separate issue about this problem).
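
      For context, the intent of the retryOnConnLoss flag is roughly the following. This is a minimal, hedged sketch against a bare ZooKeeper client; ZkWriteHelper, maxRetries and retrySleepMs are illustrative assumptions and not Solr's SolrZkClient API. A caller that passes retryOnConnLoss=false should fail fast on connection loss instead of blocking.

          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          final class ZkWriteHelper {
            private final ZooKeeper zk;
            private final int maxRetries;     // assumed bound; Solr derives its own from the client timeout
            private final long retrySleepMs;  // assumed back-off between retries

            ZkWriteHelper(ZooKeeper zk, int maxRetries, long retrySleepMs) {
              this.zk = zk;
              this.maxRetries = maxRetries;
              this.retrySleepMs = retrySleepMs;
            }

            /** Writes data at path; retries through connection loss only when retryOnConnLoss is true. */
            Stat setData(String path, byte[] data, boolean retryOnConnLoss)
                throws KeeperException, InterruptedException {
              int attempt = 0;
              while (true) {
                try {
                  return zk.setData(path, data, -1); // -1 = any version
                } catch (KeeperException.ConnectionLossException e) {
                  if (!retryOnConnLoss || ++attempt >= maxRetries) {
                    throw e; // fail fast so a calling indexing thread does not hang
                  }
                  Thread.sleep(retrySleepMs);
                }
              }
            }
          }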

      1. SOLR-7819.patch
        43 kB
        Shalin Shekhar Mangar
      2. SOLR-7819.patch
        44 kB
        Shalin Shekhar Mangar
      3. SOLR-7819.patch
        30 kB
        Shalin Shekhar Mangar
      4. SOLR-7819.patch
        30 kB
        Shalin Shekhar Mangar
      5. SOLR-7819.patch
        30 kB
        Shalin Shekhar Mangar
      6. SOLR-7819.patch
        9 kB
        Shalin Shekhar Mangar

        Issue Links

          Activity

          Shalin Shekhar Mangar added a comment -

          Ramkumar Aiyengar - It looks like the commits for SOLR-7245 only added a retryOnConnLoss parameter but it was never used inside the ZkController.updateLeaderInitiatedRecoveryState method?

          Also, now that I think about this change, is it really safe? For example, if a leader was not able to write to a 'live' replica, and during the LIR process the leader couldn't complete a ZK operation (because retryOnConnLoss=false), then LIR won't be set and updates can be missed. Also, the code as currently written bails on a ConnectionLossException and doesn't even start a LIR thread, which is bad.

          I think not having a thread wait for LIR-related activity is a noble cause, but we should move the entire LIR logic to a background thread which must retry on connection loss until it either succeeds or a session expired exception is thrown.

          Thoughts?
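
          A minimal sketch of the background-thread idea above, with hypothetical names; publishLirDownState() stands in for the real LIR znode write and is not the actual Solr code.

              import org.apache.zookeeper.KeeperException;

              // Sketch only: publishLirDownState() is an assumed stand-in for the real LIR znode write.
              abstract class LirPublisher implements Runnable {

                /** Writes the leader-initiated-recovery 'down' marker for one replica. */
                protected abstract void publishLirDownState() throws KeeperException, InterruptedException;

                @Override
                public void run() {
                  while (!Thread.currentThread().isInterrupted()) {
                    try {
                      publishLirDownState();
                      return; // success
                    } catch (KeeperException.ConnectionLossException e) {
                      try {
                        Thread.sleep(1000L); // transient blip: back off and retry
                      } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                      }
                    } catch (KeeperException.SessionExpiredException e) {
                      return; // session gone: a new leader election supersedes this attempt
                    } catch (KeeperException | InterruptedException e) {
                      return; // any other failure: give up rather than loop forever
                    }
                  }
                }
              }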

          Ramkumar Aiyengar added a comment -

          Duh, this is why we need a good test for this (I gave up after trying a bit in the original ticket), and I need to pay more attention to automated merges. It looks like my initial patch had the change, but when I merged with your changes for SOLR-7109, the use of the local variable got removed.

          I get your concern. I think we already do this; look at DistributedUpdateProcessor.java around line 883: if we are unable to set the LIR node, we start a thread to keep retrying to set it. We just need to return false in the connection loss case as well; we currently do it only if the node is not live (and hence we didn't even bother setting the node).

          Shalin Shekhar Mangar added a comment -

          I think we already do this; look at DistributedUpdateProcessor.java around line 883: if we are unable to set the LIR node, we start a thread to keep retrying to set it.

          Umm, it looks like the reverse to me. If we are unable to set the LIR node, or if there is an exception, then sendRecoveryCommand=false and we do not create the LeaderInitiatedRecoveryThread at all?

          Shalin Shekhar Mangar added a comment -

          Here's a patch which:

          1. Adds retryOnConnLoss to ZkController's ensureReplicaInLeaderInitiatedRecovery, updateLeaderInitiatedRecoveryState and markShardAsDownIfLeader methods.
          2. Starts a LIR thread if the leader cannot mark the replica as down because of connection loss. Earlier, both a session loss and a connection loss would skip starting the LIR thread.

          I'm still running Solr's integration and jepsen tests.

          This causes a subtle change in behavior which is best analyzed with two different scenarios:

          1. Leader fails to send an update to a replica but also suffers a temporary blip in its ZK connection during the DistributedUpdateProcessor's doFinish method.
            1. Currently, a few indexing threads will hang but eventually succeed in marking the replica as 'down', and the leader will start a new LIR thread to ask the replica to recover.
            2. With this patch, the indexing threads do not hang but a connection loss exception is thrown. At this point, we start a new LIR thread to ask the replica to recover. Although this removes the safety of explicitly marking the replica as 'down', the LIR thread does provide timeout-based safety by making sure that the replica recovers from the leader.
          2. Leader fails to send an update to a replica but also suffers a long network partition between itself and the ZK server during the DUP.doFinish method.
            1. Currently, a few indexing threads will hang in ZkController.ensureReplicaInLeaderInitiatedRecovery until the ZK operations time out because of connection loss or session loss, and no LIR thread will be created. This seems okay because the current connection loss timeout is higher than the ZK session expiration time, and session loss means that ZK has determined that our session has expired already. In both cases, a new leader election should have happened and there's no need to put the replica 'down'.
            2. With this patch, the difference is that the indexing threads do not hang and ensureReplicaInLeaderInitiatedRecovery returns immediately with a connection loss exception. A new LIR thread is started in this scenario. This is also fine because we were not able to mark the replica as 'down' and we aren't sure that the session has expired, so it is important that we start the LIR thread to ask the replica to recover. Even if a new leader has been elected, there's no major harm in asking the replica to recover.

          So, net-net this patch doesn't seem to introduce any new problems in the system.
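
          The net effect can be summarized in a small decision sketch; markReplicaDown(...) and startLirThread(...) are hypothetical stand-ins, not the actual ZkController / DistributedUpdateProcessor calls. Connection loss no longer suppresses the recovery request, while session expiry still does.

              import org.apache.zookeeper.KeeperException;

              // Sketch only; helper methods below are assumed, not Solr API.
              abstract class RecoveryDecision {

                protected abstract void markReplicaDown(String coreNodeName, boolean retryOnConnLoss)
                    throws KeeperException, InterruptedException;

                protected abstract void startLirThread(String coreNodeName);

                void onFailedUpdate(String coreNodeName) throws InterruptedException {
                  boolean sendRecoveryCommand = true;
                  try {
                    markReplicaDown(coreNodeName, false); // fail fast; never hang the indexing thread
                  } catch (KeeperException.SessionExpiredException e) {
                    sendRecoveryCommand = false; // a new leader election supersedes this attempt
                  } catch (KeeperException.ConnectionLossException e) {
                    // keep sendRecoveryCommand = true: the LIR thread's timeout loop is the safety net
                  } catch (KeeperException e) {
                    sendRecoveryCommand = false;
                  }
                  if (sendRecoveryCommand) {
                    startLirThread(coreNodeName);
                  }
                }
              }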

          Shalin Shekhar Mangar added a comment -

          Hmm, this last patch isn't quite right because it can create multiple LIR threads for the same replica on connection loss.

          For example, I found the following in the logs of one of the nodes. Here, 4 LIR threads were created to ask the same replica to recover:

          2015-07-29 13:21:24.629 INFO  (updateExecutor-2-thread-18-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 0 secs
          2015-07-29 13:21:24.978 INFO  (updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
          2015-07-29 13:21:24.978 WARN  (updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on n1:8983_solr because core_node1 is no longer the leader! New leader is core_node2
          2015-07-29 13:21:24.978 INFO  (updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 39 secs
          2015-07-29 13:21:24.979 INFO  (updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
          2015-07-29 13:21:24.979 WARN  (updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on n1:8983_solr because core_node1 is no longer the leader! New leader is core_node2
          2015-07-29 13:21:24.979 INFO  (updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 28 secs
          2015-07-29 13:21:24.981 INFO  (updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
          2015-07-29 13:21:24.981 WARN  (updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on n1:8983_solr because core_node1 is no longer the leader! New leader is core_node2
          2015-07-29 13:21:24.981 INFO  (updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 33 secs
          
          Shalin Shekhar Mangar added a comment -

          This patch moves all LIR related activity inside the LIR thread. The LIR thread now publishes LIR state, publishes node state and then starts a recovery loop depending on whether LIR state was published successfully or if it failed because of session expiry or connection loss. The indexing thread only consults the local replica map to ensure that only 1 LIR thread is started for any given replica. This ensures that the indexing thread never needs to wait for ZK operations needed for LIR. All tests pass except for HttpPartitionTest.testLeaderInitiatedRecoveryCRUD whose assumptions about the LIR workflow are no longer correct.

          Still running more tests.
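
          A minimal sketch of the "local replica map" guard described above, assuming the indexing thread only touches in-memory state; LirThreadGuard is an illustrative name, not the actual ZkController field.

              import java.util.concurrent.ConcurrentHashMap;
              import java.util.concurrent.ConcurrentMap;

              // Sketch only: at most one LIR thread per replica, with no ZK call on the indexing path.
              final class LirThreadGuard {
                private final ConcurrentMap<String, Boolean> replicasInLir = new ConcurrentHashMap<>();

                /** Returns true only for the first caller for a given replica, so at most one LIR thread starts. */
                boolean tryStartLir(String coreNodeName) {
                  return replicasInLir.putIfAbsent(coreNodeName, Boolean.TRUE) == null;
                }

                /** Called by the LIR thread when it finishes, successfully or not. */
                void lirFinished(String coreNodeName) {
                  replicasInLir.remove(coreNodeName);
                }
              }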

          Ramkumar Aiyengar added a comment -

          A couple of comments; looks sensible overall.

                log.info("Node " + replicaNodeName +
                        " is not live, so skipping leader-initiated recovery for replica: core={} coreNodeName={}",
                    replicaCoreName, replicaCoreNodeName);
                // publishDownState will be false to avoid publishing the "down" state too many times
                // as many errors can occur together and will each call into this method (SOLR-6189)
          

          It goes ahead and does `publishDownState` still if `forcePublishState` is true, is that intentional? The caller does check if the replica is live, but there could be a race. Similarly, if our state is suspect due to zk disconnect/session (the block before this), should the force be respected?

                // if the replica's state is not DOWN right now, make it so ...
                // we only really need to try to send the recovery command if the node itself is "live"
                if (getZkStateReader().getClusterState().liveNodesContain(replicaNodeName)) {
          
                  LeaderInitiatedRecoveryThread lirThread =
          

          The comment doesn't make sense as the code has moved to LIRT.

          Shalin Shekhar Mangar added a comment -

          Bulk move to 5.4 after 5.3 release.

          Shalin Shekhar Mangar added a comment -

          Patch updated to trunk.

          Thanks for the review Ramkumar Aiyengar and sorry for the delay in getting back to you.

          It goes ahead and does `publishDownState` still if `forcePublishState` is true, is that intentional?

          Yes, because if the replica somehow became 'active' when the LIR state is still 'down', we want to force publish its state again. The forcePublishState=true is only set in this one scenario.

          The caller does check if the replica is live, but there could be a race. Similarly, if our state is suspect due to zk disconnect/session (the block before this), should the force be respected?

          I think you're right. We should short-circuit the publishing part completely if the replica is not live or if our state is suspect.

          This patch incorporates both of your review comments.
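
          For clarity, the short-circuit discussed above could look roughly like this; a hedged sketch with hypothetical names, not the actual ZkController code.

              // Sketch only: a possible shape for the agreed short-circuit.
              final class LirPublishPolicy {
                private LirPublishPolicy() {}

                static boolean shouldPublishDownState(boolean replicaIsLive,
                                                      boolean zkStateSuspect,
                                                      boolean forcePublishState) {
                  if (!replicaIsLive || zkStateSuspect) {
                    return false; // never publish in these cases, even when the caller forces it
                  }
                  // Otherwise publish only when forced, e.g. the replica reports itself 'active'
                  // while its LIR state is still 'down'.
                  return forcePublishState;
                }
              }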

          Shalin Shekhar Mangar added a comment -

          My last patch had a merge error which was causing chaos monkey test failures. This patch fixes them.

          This still has a few nocommits: the ZkControllerTest leaks threads, which is due to LeaderInitiatedRecoveryThread taking on some of the responsibilities of ZkController. I think the LeaderInitiatedRecoveryThread has enough serious functionality that it should have its own test and not piggyback on ZkController. I'll attempt to refactor the class to make it more testable and write a unit test for it.

          Shalin Shekhar Mangar added a comment -
          1. Adds a new test: TestLeaderInitiatedRecoveryThread
          2. Removes ZkControllerTest.testEnsureReplicaInLeaderInitiatedRecovery which is no longer correct
          3. Removes portions of HttpPartitionTest.testLeaderInitiatedRecoveryCRUD which are no longer relevant to the new code
          4. Fixes a bug in LeaderInitiatedRecoveryThread which would send recovery messages even when a node was not live. This is tested in the new test.
          Shalin Shekhar Mangar added a comment -
          1. Removed some debug logging in ExecutorUtil that I accidentally left behind.
          2. Removed the nocommit in DistributedUpdateProcessor which had a comment to the effect of "not a StdNode, recovery command still gets sent once", which isn't true because we short-circuit errors on RetryNode at the beginning of the loop.

          I think this is ready.

          ASF subversion and git services added a comment -

          Commit 1702067 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1702067 ]

          SOLR-7819: ZK connection loss or session timeout do not stall indexing threads anymore and LIR activity is moved to a background thread

          ASF subversion and git services added a comment -

          Commit 1702213 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1702213 ]

          SOLR-7819: ZK connection loss or session timeout do not stall indexing threads anymore and LIR activity is moved to a background thread

          Shalin Shekhar Mangar added a comment -

          Thanks Ramkumar for the review.


            People

            • Assignee:
              Shalin Shekhar Mangar
            • Reporter:
              Shalin Shekhar Mangar
            • Votes:
              0
            • Watchers:
              4
