SOLR-5495: Recovery strategy for leader partitioned from replica case

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9
    • Component/s: None
    • Labels:
      None

      Description

      We need to work out a strategy for the case of:

      Leader and replicas can still talk to ZooKeeper, Leader cannot talk to replica.

      We punted on this in initial design, but I'd like to get something in.

      Attachments

      1. SOLR-5495.patch (75 kB, Timothy Potter)
      2. SOLR-5495.patch (75 kB, Timothy Potter)
      3. SOLR-5495.patch (47 kB, Timothy Potter)


          Activity

          Mark Miller added a comment -

          In this case, the only real communication path is through zookeeper.

          Couple ideas I have thought about - seems tricky in general though:

           • have the leader publish the replica as down - this is tricky because the replica may be publishing its own state - perhaps we publish it with a special marker, and the overseer will not write a new state for that replica until one is published acking that it has seen the marker and acted accordingly?
          • have a zk queue that leaders can publish to, asking for a recovery - replicas monitor the queue and check it at startup - if they are in it, they enter recovery and remove the queue entry - Overseer could also periodically clear the queue

          Just a quick 30 sec dump of some initial thoughts...looking for other ideas and may offer some more myself.
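
           A minimal sketch of the second idea (a ZK recovery-request queue), assuming the plain ZooKeeper client API; the queue path and method names here are hypothetical illustrations, not anything that exists in Solr:

           import java.nio.charset.StandardCharsets;
           import java.util.List;
           import org.apache.zookeeper.CreateMode;
           import org.apache.zookeeper.KeeperException;
           import org.apache.zookeeper.ZooDefs;
           import org.apache.zookeeper.ZooKeeper;

           public class RecoveryQueueSketch {
             // hypothetical queue path; assumes this znode already exists
             private static final String QUEUE = "/recovery_requests";

             // Leader side: ask a replica to recover by appending a sequential entry.
             static void requestRecovery(ZooKeeper zk, String replicaCoreName)
                 throws KeeperException, InterruptedException {
               zk.create(QUEUE + "/req-", replicaCoreName.getBytes(StandardCharsets.UTF_8),
                   ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
             }

             // Replica side: check the queue at startup (or on a watch); if an entry names
             // this core, remove it and tell the caller to enter recovery.
             static boolean shouldEnterRecovery(ZooKeeper zk, String myCoreName)
                 throws KeeperException, InterruptedException {
               List<String> entries = zk.getChildren(QUEUE, false);
               for (String entry : entries) {
                 byte[] data = zk.getData(QUEUE + "/" + entry, false, null);
                 if (myCoreName.equals(new String(data, StandardCharsets.UTF_8))) {
                   zk.delete(QUEUE + "/" + entry, -1);
                   return true;
                 }
               }
               return false;
             }
           }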

          Timothy Potter added a comment -

          Hi Mark,

          I'm finally starting work on this ... I like both of your ideas above; of the two, I like the latter a little more so will start there and see what shakes out.

          Timothy Potter added a comment -

          Here is a first patch that shows the direction I'm heading with this. A few things to note about this patch that are worth discussing (sorry it's a bit long winded but want to be sure we're all on the same page about this solution):

           1) I found a handy class named SocketProxy in the ActiveMQ project and "borrowed" it here to help simulate network partitions, such as between the leader and replica, while keeping the ZooKeeper connection alive. I'm aware of the IpTables class (from SOLR-5482), but the advantage of SocketProxy is that it is plain Java, so it runs without sudo and on platforms that don't have iptables. This of course requires a little trickery when setting up your test, as you need to insert the proxy in front of the Jetty nodes, which is accomplished by setting a proxyPort on the JettySolrRunner; see HttpPartitionTest.createJetty. I'm guessing we can build this into a test base class if we like this approach and think it will be useful for other tests.
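
           For illustration, a stripped-down stand-in for that kind of proxy (this is not the ActiveMQ SocketProxy itself): a plain-Java TCP relay a test can put in front of a Jetty port and then close to sever leader-to-replica traffic while the ZooKeeper connections stay up. This minimal sketch only refuses new connections when closed; the real class also tears down established ones:

           import java.io.IOException;
           import java.io.InputStream;
           import java.io.OutputStream;
           import java.net.ServerSocket;
           import java.net.Socket;

           public class TinySocketProxy {
             private final ServerSocket listener;
             private final String targetHost;
             private final int targetPort;
             private volatile boolean open = true;

             public TinySocketProxy(int listenPort, String targetHost, int targetPort) throws IOException {
               this.listener = new ServerSocket(listenPort);
               this.targetHost = targetHost;
               this.targetPort = targetPort;
               new Thread(this::acceptLoop, "proxy-accept").start();
             }

             private void acceptLoop() {
               while (open) {
                 try {
                   Socket client = listener.accept();
                   Socket upstream = new Socket(targetHost, targetPort);
                   pump(client.getInputStream(), upstream.getOutputStream());
                   pump(upstream.getInputStream(), client.getOutputStream());
                 } catch (IOException e) {
                   if (listener.isClosed()) {
                     return; // close() was called: simulated partition, stop accepting
                   }
                 }
               }
             }

             // copy bytes on a background thread until either side closes
             private void pump(final InputStream in, final OutputStream out) {
               new Thread(() -> {
                 byte[] buf = new byte[8192];
                 try {
                   int n;
                   while ((n = in.read(buf)) != -1) {
                     out.write(buf, 0, n);
                     out.flush();
                   }
                 } catch (IOException ignored) {
                 }
               }).start();
             }

             // simulate the partition: refuse any new connections to the proxied port
             public void close() throws IOException {
               open = false;
               listener.close();
             }
           }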

           2) I ended up going with Mark's idea #1, except I don't see where / why we need to worry about the replica publishing its own state. In other words, what really matters is that the leader cannot send a request to the replica, so to me the leader's view of the replica is what matters. In my patch, the leader will publish the state of the replica as "down" when it encounters a communication error while trying to send a request to a replica. See the ZkController.ensureReplicaInLeaderInitiatedRecovery() method, which is called from the DistributedUpdateProcessor.doFinish() method.

           So I've thought about this in some detail and I think it will work itself out without us having to coordinate state changes. Let's say the leader sets the state to "down" and, for some reason I can't really see happening, the replica resets its state to "active". That would make the replica a candidate for receiving requests again, which would just lead to another error, leading to the leader re-setting the state to "down". In a nutshell, if the leader can't talk to the replica over HTTP, its state gets set to "down".

          One idea I did have for this is to have the leader pass the ClusterState.zkClusterStateVersion along in every request, thus allowing the replica to compare the version it is working with and if they are different, then have the replica force a state update from ZK and act accordingly. It shouldn't be too bad to implement this if we think it will be useful? Version would be passed along like the distrib.update param is today.
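
           A hedged sketch of that compare-and-refresh step; the parameter name and refresh hook below are hypothetical stand-ins (the actual proposal is just to pass ClusterState.zkClusterStateVersion along with each update, the way distrib.update is passed today):

           /** Replica side: force a re-read from ZK if the leader is on a newer cluster state. */
           static void maybeRefreshClusterState(int versionFromLeader, int localVersion,
                                                Runnable forceRefreshFromZk) {
             if (versionFromLeader > localVersion) {
               // the leader acted on a newer clusterstate.json than our cached copy,
               // so refresh before trusting our view of which replicas are active/down
               forceRefreshFromZk.run();
             }
           }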

           3) Even if more coordination is needed for #2 above, at some point the replica gets marked as being in the down state. This ensures the leader stops trying to send requests to that replica (via the DOWN filter in the call to getReplicaProps to determine the set of nodes to forward requests to). The leader also needs to determine if it should send the CoreAdminAction.REQUESTRECOVERY command to the downed replica's core. This occurs over HTTP, which I think is correct because if the leader can't send the recover command to the replica, then sending docs is futile as well. What I've done here is build upon the existing code in DistributedUpdateProcessor's doFinish method to attempt sending that command every 5 secs for up to 10 minutes, so long as the node is still listed under /live_nodes in ZK. If that changes, I stop trying to hit that node from the leader, since a node that is no longer live will do a full recovery when it comes back.
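
           A hedged sketch of that retry loop; isNodeLive() and sendRequestRecovery() are hypothetical placeholders for the /live_nodes check in ZK and the CoreAdminAction.REQUESTRECOVERY call over HTTP, not the actual patch code:

           public abstract class RecoveryNagLoopSketch {
             private static final long MAX_WAIT_MS = 10 * 60 * 1000L; // give up after 10 minutes
             private static final long RETRY_DELAY_MS = 5 * 1000L;    // retry every 5 seconds

             void nagReplicaToRecover(String replicaNodeName, String replicaCoreUrl)
                 throws InterruptedException {
               long stopAt = System.currentTimeMillis() + MAX_WAIT_MS;
               while (System.currentTimeMillis() < stopAt) {
                 if (!isNodeLive(replicaNodeName)) {
                   return; // node left /live_nodes; it will do a full recovery when it comes back
                 }
                 try {
                   sendRequestRecovery(replicaCoreUrl); // REQUESTRECOVERY over HTTP
                   return;                              // command delivered, stop nagging
                 } catch (Exception e) {
                   Thread.sleep(RETRY_DELAY_MS);        // partition likely still in place, retry
                 }
               }
               // timed out: replica stays "down" until something else intervenes
             }

             // hypothetical placeholders
             abstract boolean isNodeLive(String nodeName);
             abstract void sendRequestRecovery(String coreUrl) throws Exception;
           }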

          I like this leader-initiated recovery approach because the leader's view of the replica is what matters, so I felt creating a self-initiating recovery process by which the replica realizes its state got changed by the leader doesn't do much if the HTTP connection between the leader and replica is still down.

           4) Of course, there's no guarantee that connectivity will be restored within 10 minutes, so the retry loop described in #3 above will time out and the leader will stop trying to tell the replica to recover. At this point, the replica should be marked down, so at least the leader is no longer trying to send requests to it; I think the shard is in a safe state wrt consistency, but after the 10 minutes there's nothing to tell the replica to recover from the down state. Do we want the leader to just try forever? Seems like not ... Maybe this is where an ops alert could be inserted to have someone go investigate why the partition has lasted longer than 10 minutes. Appreciate any advice on how to handle this better.

           5) You'll notice that I'm using a HashSet containing replicaUrls in ZkController to keep track of replicas that are in the "leader-initiated" recovery process, see: ZkController.replicasInLeaderInitiatedRecoveryHandling. This approach is needed because many DistributedUpdateProcessors may be receiving a flood of errors concurrently when connectivity to a replica is lost. I didn't want the leader trying to set the state to DOWN more than once when it sees a bunch of errors, or to have more than one thread per replica trying to send the recovery command. There might be a better location for this code than ZkController (maybe ZkStateReader).
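
           A minimal sketch of that guard, assuming a simple synchronized set keyed by replica URL (the real bookkeeping lives in ZkController; class and method names here are illustrative):

           import java.util.HashSet;
           import java.util.Set;

           public class LeaderInitiatedRecoveryGuard {
             // replicas currently being handled; many DistributedUpdateProcessor instances
             // may report errors for the same replica at once, so access is synchronized
             private final Set<String> replicasBeingHandled = new HashSet<String>();

             /** @return true only for the caller that wins the race and should publish DOWN
              *  and start the single recovery-command thread for this replica */
             public boolean tryStartHandling(String replicaUrl) {
               synchronized (replicasBeingHandled) {
                 return replicasBeingHandled.add(replicaUrl);
               }
             }

             /** called when the recovery-command loop finishes (success, timeout, or node gone) */
             public void doneHandling(String replicaUrl) {
               synchronized (replicasBeingHandled) {
                 replicasBeingHandled.remove(replicaUrl);
               }
             }
           }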

          As for testing, I think the unit test (HttpPartitionTest) is pretty close to the scenario we're trying to capture in this ticket.

          Specifically, it tests the following process:

           a. set up proxies in front of all Jetty servers (3 in this test) by overriding the createJetty method.
          b. create a collection with 1 shard and rf=2
          c. send doc 1 to leader, which gets forwarded to replica successfully
           d. partition occurs (using SocketProxy); however, the ZK connection between the replica and ZooKeeper remains intact (which is the crux of this issue); the leader remains the same throughout this test
          e. send doc 2 to leader
           f. the leader's attempt to send doc 2 to the replica fails due to a comm error; the asynchronous call to doFinish starts the leader-initiated recovery process
          g. leader marks replica as being down, which means it will stop trying to send requests to the replica until the situation improves as the ZkStateReader.getReplicaProps() filters out "downed" nodes. At this point, the leader is also trying to tell the replica to recover from a background thread.
          h. partition is restored
          i. send doc 3
          j. replica recovery succeeds asynchronously, test waits until it sees the replica in the "active" state
          k. test verifies both the leader and the replica have docs 1,2,3 using requests to the /get handler

          Next, the test performs the same basic process but for 1000 docs while dropping and restoring the connectivity between the leader and replica every 100 docs.

          I should mention that without the code in this patch, the replica will most certainly be out of sync and not know it, which of course is a no-no for a CP system (btw: I used real test-driven development methodology here by writing the test first and then implementing until the test passes).

           The one minor concern I have with this test right now is the Thread.sleep(2000) before restoring connectivity with the proxy. I had to introduce this because the test was progressing too fast for the recovery process to kick in, leading to test failures. I think it is OK to wait a little bit because that is more reflective of a running cluster, where things do take a little time to propagate. Just wanted to draw attention to this so you're clear it was intentional to give things time to work.
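
           For reference, a hedged SolrJ-level sketch of steps c, e, and k above (index a doc through the leader, then read it back from a specific replica via the /get handler with distrib=false); the URLs and ids are placeholders, and the actual test goes through the cloud test framework rather than raw HttpSolrServer calls:

           import org.apache.solr.client.solrj.SolrQuery;
           import org.apache.solr.client.solrj.impl.HttpSolrServer;
           import org.apache.solr.client.solrj.response.QueryResponse;
           import org.apache.solr.common.SolrInputDocument;

           public class RealTimeGetCheckSketch {
             public static void main(String[] args) throws Exception {
               // placeholder URLs; in the test these point at the proxied Jetty instances
               HttpSolrServer leader = new HttpSolrServer("http://127.0.0.1:8983/solr/collection1");
               HttpSolrServer replica = new HttpSolrServer("http://127.0.0.1:7574/solr/collection1");

               SolrInputDocument doc = new SolrInputDocument();
               doc.addField("id", "1");
               leader.add(doc);    // the leader forwards the update to the replica if it can reach it
               leader.commit();

               // ask the replica directly, bypassing distributed search, via real-time get
               SolrQuery q = new SolrQuery();
               q.setRequestHandler("/get");
               q.set("id", "1");
               q.set("distrib", "false");
               QueryResponse rsp = replica.query(q);
               System.out.println("replica has doc 1: " + (rsp.getResponse().get("doc") != null));

               leader.shutdown();
               replica.shutdown();
             }
           }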

          Mark Miller added a comment -

          which I can't really see how it would happen

           Consider that we might have other states in the future as well - and / or there can be backed-up requests. Or what about races with the replica doing its own recovery and having a down tossed in there by the leader? It's tricky to promise anything.

          This would make the replica a candidate for receiving requests again, which would just lead to another error

          Is that guaranteed though? What if a connection recovers and it doesn't lead to another error?

          If it all bears out in testing though, perhaps that's all fine.

          I found a handy class named SocketProxy

          +1 - it's great to be able to do this as part of the standard tests in Java.

          Mark Miller added a comment -

          I'll try and do a code review soon by the way.

          Mark Miller added a comment -

          Or what about races

           To expand, I guess my main worry is around a leader telling a replica it's down right before the replica publishes that it's active, so the down is basically ignored. I have not thought it through fully, but that is the type of thing I'm worried about. We don't want any races around a replica realizing it's been marked as down - we have to make sure it receives that message and doesn't ever think it properly recovered when it's missing a doc it would have buffered or something.

          Anshum Gupta added a comment -

          Just to be sure, I'd want to add that SOLR-5991 might intersect (not overlap) with the work on this one. I'll put a note in there (as well) about this one so whoever plans to work on that knows about changes happening as a part of this issue.

          Timothy Potter added a comment -

          Thanks for the feedback! Mainly I just wanted clarification on that issue and if your intuition tells you it may be an issue, then that's sufficient for me to think harder and come up with something more robust around the leader marking the replica as "down".

          Timothy Potter added a comment -

          Here's an updated patch that builds upon the previous one, same basic approach of leader-initiated recovery but with some added coordination between the leader and partitioned replica using a znode: /collections/<collection>/leader_initiated_recovery/<shard>/<replicaCoreName> (see ZkController.getLeaderInitiatedRecoveryZnodePath).

          The basic idea here is in addition to the leader marking the replica down, a separate znode is used to track the replica's transition from down -> recovering -> active. So the leader marks the replica as down (which removes it from participating in queries and update requests) and also creates this special znode. When the replica finally gets the "you need to recover" command from the leader, it changes the value of this znode to "recovering". When recovery succeeds, the replica deletes the znode as it's no longer needed. If the leader, while trying to send the recovery command (see LeaderInitiatedRecoveryThread), sees the replica as being "active" but the znode wasn't ack'd, then the leader can set the state to down again. As stated before, I don't see where the replica would do this, but if it happens, we now have a better way to handle it. Bottom line is with this special znode, the replica cannot stay in the "active" state until it acks the leader's command to recover by transitioning the znode appropriately.

          The special znode is also useful if the nagging leader fails before the bad replica receives the message. The idea here is that the new leader can query ZK for any of these "leader-initiated-recovery" znodes for its shard and if there are any in the "down" state, then it can start up the nag loop for each bad replica; a znode in the down state means the replica hasn't received the recovery command yet (see: ElectionContext$ShardLeaderElectionContext.startLeaderInitiatedRecoveryOnReplicas).
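
           A hedged sketch of that znode handshake using the raw ZooKeeper API (the patch itself goes through Solr's ZK plumbing, e.g. SolrZkClient); the path layout follows the comment above and the method names are illustrative only:

           import java.nio.charset.StandardCharsets;
           import org.apache.zookeeper.CreateMode;
           import org.apache.zookeeper.KeeperException;
           import org.apache.zookeeper.ZooDefs;
           import org.apache.zookeeper.ZooKeeper;

           public class LirZnodeSketch {

             static String lirPath(String collection, String shard, String replicaCoreName) {
               return "/collections/" + collection + "/leader_initiated_recovery/"
                   + shard + "/" + replicaCoreName;
             }

             /** Leader: mark the replica as needing recovery ("down" until it acks).
              *  Assumes the parent znodes already exist (Solr would create them recursively). */
             static void markDown(ZooKeeper zk, String path)
                 throws KeeperException, InterruptedException {
               byte[] down = "down".getBytes(StandardCharsets.UTF_8);
               if (zk.exists(path, false) == null) {
                 zk.create(path, down, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
               } else {
                 zk.setData(path, down, -1);
               }
             }

             /** Replica: ack the leader's recovery command by moving the znode to "recovering". */
             static void ackRecovering(ZooKeeper zk, String path)
                 throws KeeperException, InterruptedException {
               zk.setData(path, "recovering".getBytes(StandardCharsets.UTF_8), -1);
             }

             /** Replica: recovery succeeded, so the marker is no longer needed. */
             static void clear(ZooKeeper zk, String path)
                 throws KeeperException, InterruptedException {
               zk.delete(path, -1);
             }
           }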

          There is a unit test that covers the leader failing over to a new leader and resuming the "nag" loop on the downed replica. There's one area where I'm not 100% sure if it is correct yet ... in the shouldIBeLeader method in ShardLeaderElectionContext, I check to see if a previous leader marked this core "down" and if so, return false to indicate this node should not be the leader. I think this works OK for RF=3 but I'm worried about RF=2 situations where this check prevents a leader from being elected. The main idea behind this check is that if the leader forces the shard state to "down", the core.getCoreDescriptor().getCloudDescriptor().getLastPublished() method can still return active so I needed this additional check on the znode. I suppose we could try to update the lastPublished state when it changes but didn't see how to go about that? (or if that was even a good idea).

          Another area where I'm not 100% sold on is the 10-minute max wait and then timeout loop in the LeaderInitiatedRecoveryThread. 10 mins is arbitrary but it seems like it shouldn't just run forever. One idea I had was to use JMX to raise some event / notification to allow monitoring tools to alert ops team of this issue. Curious if there's anything else in SolrCloud related to notifying of issues that need ops attention?

           Lastly, I did give some thought to a self-initiating recovery approach where the "trying to recover" loop runs on the replica itself, as that is more immune to leader changes and there's already a recovery retry loop in place via the RecoveryStrategy thread. As I understand it, a self-initiating approach would work something like:

          1) leader receives error when trying to forward update request to replica
          2) leader marks replica as down in ZK
          3) replica receives state change notification (at some point), replica must iterate over all cores hosted on that node looking for cores marked as down
          4) for each "down" core on the node found in step 3, try recovering in a loop

           This is all straightforward to implement. However, the main problem with this approach is in step 4: when starting recovery, the replica updates its state to "recovering" in ZK immediately. When a replica is "recovering", the leader still tries to forward updates to it (the updates get stashed in the tlog until recovery is complete). This works in normal circumstances because the replica assumes there is no partition between it and the leader, so it's OK to go into the recovering state and continue receiving updates. The problem here, though, is that the network may still be partitioned, so the leader keeps trying to forward docs and receiving errors. From the leader's perspective, we're right back at step 1 above.

          Of course, it would be possible to introduce a new state that would prevent the leader from sending updates while the replica sorted itself out, but I'm hesitant to introduce a new state as that has broader repercussions in the code base. I'm mentioning this here in case someone else has some better ideas around this self-initiating approach.

          Mark Miller added a comment -

          Awesome! Hope to read that closely this weekend.

          Otis Gospodnetic added a comment -

          One idea I had was to use JMX to raise some event / notification to allow monitoring tools to alert ops team of this issue. Curious if there's anything else in SolrCloud related to notifying of issues that need ops attention?

          +1 for bringing this up - not that I'm aware of, but I also didn't look very closely. This may deserve a standalone JIRA! Perhaps relying on ZK for notifications would work?

          Timothy Potter added a comment -

          Updated patch to fix compilation error after backing out SOLR-5473 and tighten up the unit test, such as reducing the amount of time it waits before healing a partition. Also cleaned up a few minor issues in the LeaderInitiatedRecoveryThread loop.

          Mark Miller added a comment -

          I'd like to dig into this more, but on a glance this morning, this looks like great stuff - thanks Tim!

          Timothy Potter added a comment -

           Thanks for the review Mark! I think there are still some weird interactions going on with this code and the waitForLeaderToSeeDownState stuff, as I'm seeing exceptions like the following in a good-sized cluster when I knock over replicas during heavy indexing. The leader doesn't see the down state; it sees the "recovering" state.

          2014-05-07 02:34:03,112 [Thread-3531] ERROR solr.cloud.ZkController - There was a problem making a request to the leader:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://host:8985/solr
          at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:562)
          at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
          at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
          at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1528)
          at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:372)
          at org.apache.solr.cloud.ZkController.access$000(ZkController.java:87)
          at org.apache.solr.cloud.ZkController$1.command(ZkController.java:229)
          at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)

           In short, there are still a few little issues that didn't show up in unit testing. So I'm going to flog this area of the code a bit more tomorrow morning!

          ASF subversion and git services added a comment -

          Commit 1593312 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1593312 ]

          SOLR-5495: Hardening recovery scenarios after the leader receives an error trying to forward an update request to a replica.

          ASF subversion and git services added a comment -

          Commit 1593791 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1593791 ]

          SOLR-5495: Fix HttpPartitionTest to dynamically select the port the Jetty and the SocketProxy binds to, was causing Jenkins failures.

          ASF subversion and git services added a comment -

          Commit 1596103 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1596103 ]

          SOLR-5495: Raise the amount of time the test waits for replicas to become active after partitions are healed (to address intermittent Jenkins failures)

          ASF subversion and git services added a comment -

          Commit 1596107 from Timothy Potter in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1596107 ]

          SOLR-5495: Port over from trunk.

          Anshum Gupta added a comment -

           The CHANGES.txt entries for trunk and 4x are in different sections:
          trunk: 5.0 section
          4x: 4.9 section

          ASF subversion and git services added a comment -

          Commit 1596315 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1596315 ]

          SOLR-5495: Re-arrange location of SOLR-5495 and SOLR-5468 in CHANGES.txt

          ASF subversion and git services added a comment -

          Commit 1596636 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1596636 ]

          SOLR-5495: Print cluster state in assertion failure messages if a leader cannot be found to determine root cause of HttpPartitionTest failures in Jenkins.

          ASF subversion and git services added a comment -

          Commit 1596637 from Timothy Potter in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1596637 ]

          SOLR-5495: Print cluster state in assertion failure messages if a leader cannot be found to determine root cause of HttpPartitionTest failures in Jenkins

          Mark Miller added a comment -

          What's the current status of this? I see a lot of commits, but still in progress. I actually still hope to review it, but who knows. I have not touched anything in like a month or more now so I expect a huge backlog of crap will stand before me.

          Timothy Potter added a comment -

          Hi Mark,

          I think it is ready to go, but was hoping for a review from you before closing. It's been committed to trunk and branch_4x. Tests seem to be passing on Jenkins and I've seen it work in some large clusters. If you can give it a quick once-over, I'd appreciate it.

          Tim

          Timothy Potter added a comment -

          Marking this as resolved as it's included in the 4.9 release. Would definitely appreciate a review from Mark and others when convenient ... can open a new JIRA for any issues found with this implementation going forward.

          Mark Miller added a comment -

          I did a quick review of the code and read your comments above more thoroughly. I did not do a low level review. From that mid level though, this looks like a great change and even if there are any issues, the changes look like good improvements and we should just work through anything that comes up as a result of them.

          As I work on anything in that area, I'll look at some parts more closely.

          Timothy Potter added a comment -

          Hi Mark,

          Awesome, thanks for the review ... there's one area in the CoreAdminHandler waitForState that could use your review.

           // TODO: This is funky but I've seen this in testing where the replica asks the
           // leader to be in recovery? Need to track down how that happens ... in the meantime,
           // this is a safeguard
           boolean leaderDoesNotNeedRecovery = (onlyIfLeader != null &&
               onlyIfLeader &&
               core.getName().equals(nodeProps.getStr("core")) &&
               ZkStateReader.RECOVERING.equals(waitForState) &&
               ZkStateReader.ACTIVE.equals(localState) &&
               ZkStateReader.ACTIVE.equals(state));

           Basically, at some point, I was seeing replicas ask active leaders to recover, which I didn't think was a valid thing to do. I actually haven't seen this occur in any of my testing, so maybe I was just confused. We can definitely remove that code if it's not valid, but I wanted to make you aware that I had it in there.


            People

             • Assignee: Timothy Potter
             • Reporter: Mark Miller
             • Votes: 0
             • Watchers: 14
