SOLR-8069

Ensure that only the valid ZooKeeper registered leader can put a replica into Leader Initiated Recovery.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None

      Description

      I've seen this twice now. Need to work on a test.

      When a problem hits all the replicas at once, you can end up in a situation where the rightful leader was put, or put itself, into LIR. Even on restart, this rightful leader won't take leadership, and you have to manually clear the LIR nodes.

      It seems that if all the replicas participate in election on startup, LIR should just be cleared.

      Attachments

      1. SOLR-8069.patch (21 kB, Mark Miller)
      2. SOLR-8069.patch (19 kB, Mark Miller)


          Activity

          Mark Miller added a comment -

          Timothy Potter, any immediate thoughts on this?

          Jessica Cheng Mallet added a comment -

          We have definitely seen this as well, even after the commit for SOLR-7109 added a zookeeper multi transaction to ZkController.markShardAsDownIfLeader, which is supposed to predicate setting the LiR node on the setter still having the same election znode it thinks it has when it's the leader.

          Hmmm, reading the code now I'm not sure it's doing exactly the right thing since it calls getLeaderSeqPath, which just takes the current ElectionContext from electionContexts, which isn't necessarily the one the node had when it decided to mark someone else down, right? Shalin Shekhar Mangar thoughts?
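
          For reference, the kind of conditional multi SOLR-7109 relies on looks roughly like the sketch below when written against the raw ZooKeeper client; the path and variable names are illustrative placeholders, not Solr's actual fields, and the real code goes through SolrZkClient rather than ZooKeeper directly.

            import java.util.Arrays;
            import org.apache.zookeeper.CreateMode;
            import org.apache.zookeeper.KeeperException;
            import org.apache.zookeeper.Op;
            import org.apache.zookeeper.ZooDefs;
            import org.apache.zookeeper.ZooKeeper;

            class ConditionalLirWrite {
              // Write the LIR marker only if our ephemeral election znode still exists;
              // both ops succeed or fail together.
              static boolean markReplicaDownIfStillLeader(ZooKeeper zk, String myElectionNodePath,
                                                          String lirPath, byte[] lirData)
                  throws InterruptedException {
                try {
                  zk.multi(Arrays.asList(
                      // version -1 means "any version", so this op only checks existence;
                      // if our election node is gone (lost session/leadership), the multi aborts.
                      Op.check(myElectionNodePath, -1),
                      Op.create(lirPath, lirData, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
                  return true;
                } catch (KeeperException e) {
                  // Our election node no longer exists, or the LIR node is already there:
                  // either way we must not (or need not) put the replica into LIR here.
                  return false;
                }
              }
            }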

          Timothy Potter added a comment -

          Immediate thought is ugh! I'm surprised to hear this is still happening after 7109. I'd like to dig in a bit more, but agreed on:

          It seems that if all the replicas participate in election on startup, LIR should just be cleared.

          Mark Miller added a comment -

          Not quite working yet, but what about this approach instead?

          Jessica Cheng Mallet added a comment -

          I still struggle with the safety of getting the ElectionContext from electionContexts, because what's mapped there could change from under this thread. What if we write down the election node path (e.g. 238121947050958365-core_node2-n_0000000006) into the leader znode as a leader prop, so that whenever we're actually checking that we're the leader, we can get that election node path back and do the zk multi checking for that particular election node path?

          Ugh, but then I guess lots of places are actually looking at the cluster state's leader instead of the leader node. >_< Why are there separate places for marking the leader? I don't know how to reason about the asynchronous nature of cluster state updates wrt actual leader election...

          Jessica Cheng Mallet added a comment -

          Actually, thinking about it – why do we have the leader property in cluster state at all? If it's simply to publish leadership to solrj, it seems that on the server side we should still use the leader znode as the "source of truth" so that we can have guarantees of a consistent view along with the zk transactions. If solrj's view falls behind due to the asynchronous nature of having the Overseer update the state, at least on the server side we can check the leader znode.

          Any historical reason why leadership information is in two places?

          Mark Miller added a comment -

          I still struggle with the safety of getting the ElectionContext from electionContexts, because what's mapped there could change from under this thread.

          That is why I check before and after we get the context that we locally think we are the leader. The idea is, if we are locally connected to zk and think we are the leader before and after getting the latest context, we have near real confidence that we are the leader and can still do as we please.

          There really is nothing tricky about the leader being advertised in clusterstate - it's simply slightly stale state that is updated by Overseer. I don't see how it complicates an approach to this?

          Mark Miller added a comment -

          Well here is one approach.

          In either case, I'd like to add the checks against CloudDescriptor#isLeader - this is our most up to date and real time information about whether or not we are connected to zk and think we are the leader - if this is false, we don't want to do anything that a leader should do. If we are going to have best effort checks against zk itself to ensure we are still the leader, this important defensive check should also generally be included.

          Otherwise, I think the current code does not really work - it's really easy to grab the latest context and the latest context should usually look right? This code addresses that rather large hole. I'm open to doing more, but I have not grasped the full implementation of it yet given the state available to the various LIR methods.

          Timothy Potter added a comment -

          Quick pass over the patch looks good to me (a few unrelated changes in HdfsCollectionsAPIDistributedZkTest.java leaked into this patch). I'm focused on another unrelated issue at the moment so will take a closer look in the AM when I'm fresh, but I like the approach.

          Jessica Cheng Mallet added a comment -

          Yes, I think this is definitely an improvement. I'm just not sure if it gets everything covered. I suppose "we have near real confidence that we are the leader and can do still do as we please" is probably good enough – though I haven't convinced myself yet through playing with complex scenarios of repeated leadership changes – thus I prefer the simple logic of "do this action only if our zookeeper session state is exactly what it was when we decided to do it". Anyhow, this is probably beyond the scope of this JIRA.

          BTW, we tend to see this most when a "bad" query is issued (e.g. doing non-cursorMark deep paging of page 50,000). Presumably it causes GC pauses on each replica it hits (since the request is retried) and a series of leadership changes happen. Along with the complication of GC pauses, the states are quite difficult to reason through.

          Mark Miller added a comment -

          I think the thought game comes down to:

          We check whether we locally think we are the leader (which requires being connected to zk).

          We get the current leader context.

          We check again whether we locally think we are the leader.

          If all that passes, we assume we have the context from when we were the leader. Now publishing only works if that same leader is registered.

          So where are the holes?

          There does not seem to be a lot of room to get the wrong context? In what scenario could we think we are the leader before and after the getContext call and end up with the wrong context?

          And if we have the leader's context, the multi update ensures the update only happens if that context is still the leader.
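
          A minimal sketch of that sequence, with hypothetical helper names standing in for the Solr internals (CloudDescriptor#isLeader, the electionContexts map, and the conditional multi); this is an illustration of the flow, not the actual patch:

            // Sketch only: the helpers here are illustrative, not Solr's real API.
            abstract class LirGate {
              static final class LeaderCtx {
                String leaderParentPath;   // parent of the leader znode this context registered
                int leaderParentVersion;   // its version as recorded when leadership was won
              }

              abstract boolean amILeaderLocally();        // connected to ZK and CloudDescriptor#isLeader
              abstract LeaderCtx currentLeaderContext();  // latest context from the election map
              abstract boolean conditionalLirWrite(LeaderCtx ctx, String replica); // ZK multi

              boolean tryMarkReplicaDown(String replica) {
                if (!amILeaderLocally()) return false;    // 1. local check before
                LeaderCtx ctx = currentLeaderContext();   // 2. grab the current leader context
                if (!amILeaderLocally()) return false;    // 3. local check after
                // 4. the write succeeds only if ctx still matches the leader registered in ZK
                return conditionalLirWrite(ctx, replica);
              }
            }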

          Jessica Cheng Mallet added a comment (edited) -

          The scenario that I have in mind is if somehow we're switching leadership back and forth due to nodes going into GC after receiving retries of an expensive query: what if a node is the leader at time T1, decides to set another node in LiR but goes into GC before it does, so that it loses the leadership. Then the other node briefly gains leadership at T2 and maybe processes an update or two, but then also goes into GC and loses its leadership. Then the first node wakes up from GC and becomes the leader once more at T3--and then this code executes. My question is whether it's absolutely safe for this node to set the other node in LiR simply because it's the leader now, even though when it decided to set the LiR, it was the leader at T1.

          Shalin Shekhar Mangar added a comment -

          Hmmm, reading the code now I'm not sure it's doing exactly the right thing since it calls getLeaderSeqPath, which just takes the current ElectionContext from electionContexts, which isn't necessarily the one the node had when it decided to mark someone else down, right? Shalin Shekhar Mangar thoughts?

          Right, we should acquire the leader sequence path at the beginning of the update instead of so late in the game. I believe Mark's patch has the same problem but it is somewhat diluted by checking against CloudDescriptor.isLeader.

          Mark Miller added a comment -

          The difference is that this patch ensures we are still the leader when we get the context - rather than blindly getting the current context.

          is somewhat diluted

          I think it goes from still being a large hole to essentially closed. Someone might have another idea for an improvement, but I don't see the scenario that really sneaks by this yet.

          My question is if it's absolutely safe for this node to set the other node in LiR simply because it's the leader now,

          I think of course it is. It's valid for the leader and only the leader to set anyone as down.

          Mark Miller added a comment (edited) -

          thus I prefer the simple logic of "do this action only if our zookeeper session state is exactly what it was when we decided to do it". Anyhow, this is probably beyond the scope of this JIRA.

          I don't see an easy way to do that in this case. Almost all the solutions that fit with the code have the exact same holes / races. I think the local leader check around getting the leader context is the strongest thing I can think of so far other than adding further defensive checks.

          I don't know that much more is needed though. If the context returned is from the leader, great, its zkparentversion will match. If the context is somehow not the right one, it won't match. We get a context and only if it's the context for the leader in ZK do we do anything, rather than just if the context has a node in line. I'd say that is a pretty strong improvement.

          This should only work if the node is a valid leader by its local state and by ZooKeeper.

          Jessica Cheng Mallet added a comment -

          I think of course it is. It's valid for the leader and only the leader to set anyone as down.

          It's definitely only valid for the leader to set anyone down, but it doesn't mean that the leader should set someone down based on an old leadership decision. This is the only place I'm unsure about.

          I don't see an easy way to do that in this case. Almost all the solutions that fit with the code have the exact same holes / races.

          If we're willing to make more changes, one way I see this work is to write down the election node path as a prop in the leader znode (this is now written via zk transaction from your other commit). Then, have the isLeader logic in DistributedUpdateProcessor be based on reading the leader znode, and at that point record down the election node path as well. Then, when setting LiR, predicate the ZK transaction on the election node path read in the beginning of DistributedUpdateProcessor.
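
          A hedged sketch of that alternative, against the raw ZooKeeper client; the znode layout and the idea of storing the bare election path in the leader znode's data are simplifications for illustration, not Solr's actual leader props format:

            import java.nio.charset.StandardCharsets;
            import java.util.Arrays;
            import org.apache.zookeeper.KeeperException;
            import org.apache.zookeeper.Op;
            import org.apache.zookeeper.ZooKeeper;
            import org.apache.zookeeper.data.Stat;

            class ElectionPathPredicate {
              // On winning the election: record our election node path in the leader znode.
              static void publishLeader(ZooKeeper zk, String leaderZnode, String electionNodePath)
                  throws KeeperException, InterruptedException {
                zk.setData(leaderZnode, electionNodePath.getBytes(StandardCharsets.UTF_8), -1);
              }

              // At the start of the update request: remember which election node backed our leadership.
              static String readLeaderElectionPath(ZooKeeper zk, String leaderZnode)
                  throws KeeperException, InterruptedException {
                return new String(zk.getData(leaderZnode, false, new Stat()), StandardCharsets.UTF_8);
              }

              // When setting LIR: only proceed if that same election node still exists
              // (assumes the LIR znode itself already exists and is updated in place).
              static void setLirIfSameElection(ZooKeeper zk, String electionNodePath,
                                               String lirPath, byte[] lirData)
                  throws KeeperException, InterruptedException {
                zk.multi(Arrays.asList(
                    Op.check(electionNodePath, -1),
                    Op.setData(lirPath, lirData, -1)));
              }
            }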

          Mark Miller added a comment -

          but it doesn't mean that the leader should set someone down based on old leadership decision.

          I think it does. A leader can do this. It doesn't matter if it had a valid reason to do it or not.

          Mark Miller added a comment -

          one way I see this

          That really seems the same as just getting the context earlier in the request.

          Given the different ways LIR might be started and used, it really seemed simpler to try and localize the changes rather than tie them more into the request lifecycle.

          Mark Miller added a comment (edited) -

          predicate the ZK transaction on the election node

          As I think about this, I think I really prefer the approach in the patch - with that, we use ZK to ensure ONLY the leader can put a replica into LIR. It doesn't matter what clumsy things happen elsewhere in the code: with this multi, only one replica in the shard, the leader as currently enforced by ZK, will be able to put a replica into LIR. I like that property vs a multi on election nodes.

          Ramkumar Aiyengar added a comment -

          Late to the party here... We experienced the same issue, and Christine Poerschke was trying to create a test case for this. My initial thought was: why do we even check LIR when we are about to become the leader? Shouldn't the two-way sync cover us even if we are behind due to losing documents?

          Ramkumar Aiyengar added a comment -

          The case we hit was when we cold stopped/started the cloud. This was on 4.10.4, so may not be valid now. Let's say you have R1 and R2.

          • R1 is the leader and both R1 and R2 are stopped at the same time.
          • R2 stops accepting requests but hasn't updated ZK yet; when R1 sends an update to R2, the update fails and R1 puts R2 in LIR.
          • R2 shuts down first, then R1.
          • R1 starts up first, finds it should be the leader.
          • R2 decides it should follow and tries to recover.
          • R1 decides it can't be leader due to LIR and steps down. But by then R2 is in recovery, doesn't step up, and we have no one stepping forward.
          Mark Miller added a comment -

          initial thought was why we even check LIR when we are about to become the leader?

          I think if everyone participates in the election that makes sense. I've started working on that as a separate patch.

          I still like the idea of making it so that by zk decree only the current leader can put a replica into LIR as one of two improvements.

          Ramkumar Aiyengar added a comment -

          That makes sense, but for my understanding, why is it a bad idea even if not everyone is participating?

          Mark Miller added a comment -

          I guess I worry about cases where a bad replica was marked as LIR by the leader and the shard goes down. It comes back with two nodes that were LIR but not the good replicas - do we want one of them to become the leader and lose data? We know they are probably not good actual leader candidates and the best way to prevent data loss is manual intervention if possible.

          Ramkumar Aiyengar added a comment -

          Got it. You'd have to include those in recovery along with the participants, since the ones which have gone into recovery are not going to help in any way (an alternative would be for them to abort recovery and rejoin if no one is around). But in your case, if I understand it right, detecting that there are down replicas (which might come back as good leaders) would certainly be a good idea.

          Timothy Potter added a comment -

          Hi Ram,

          In your scenario, why would R1 be in LIR? What put it there?

          Ramkumar Aiyengar added a comment -

          R1 was not in LIR, but it came up while R2 was still the leader and decided to recover, before R2 stepped down due to being in LIR.

          Jessica Cheng Mallet added a comment -

          I think it does. A leader can do this. It doesn't matter if it had a valid reason to do it or not.

          If you believe that this is true, I do agree that your patch will accomplish the check that at the moment you're setting someone else down, you're the leader. If we're going with this policy though, I think if at this moment it realizes that it's not the leader, it should actually fail the request because it shouldn't accept it on the real leader's behalf. E.g. if it's a node that was a leader but has just been network-partitioned off (but clusterstate change hasn't been made since it's asynchronous) and wasn't able to actually forward the request to the real leader.

          Mark Miller added a comment -

          If you believe that this is true, I do agree that your patch will accomplish the check that at the moment you're setting someone else down, you're the leader.

          If the leader cannot set a replica into LIR at any time for any reason, I think we have trouble in general.

          I'm not sure I fully follow the rest. I can't wrap my head around LIR causing requests to fail or not...that doesn't make a lot of sense to me.

          Mark Miller added a comment -

          I think if everyone participates in the election that makes sense. I've started working on that as a separate patch.

          I've spun off this part of the issue to SOLR-8075.

          Mark Miller added a comment -

          Having had some time to consider this patch, I think this is the right place to commit for now. I think further improvements should be spun out into other JIRA issues.

          Anshum Gupta added a comment -

          This makes sense and it's also pretty contained. Here are a few suggestions:

           • That should be CoreDescriptor in the comment.
             ZkController.java
             +                leaderCd); // core node name of current leader

           • Unused import MockCoreContainer in HttpPartitionTest
           • In ZkController.markShardAsDownIfLeader(), was the move from using getLeaderSeqPath to new org.apache.hadoop.fs.Path(((ShardLeaderElectionContextBase)context).leaderPath).getParent().toString() intentional?
          Mark Miller added a comment -

          Thanks for taking a look Anshum.

          intentional ?

          Yup, that's really all of the magic. The CloudDescriptor#isLeader stuff is really just a little extra sugar on top.

          Anshum Gupta added a comment -

          Just looked at it again, it indeed is.

          In that case let's yank out getLeaderSeqPath. It's not needed elsewhere anyway.

          Mark Miller added a comment -

          if it's a node that was a leader but has just been network-partitioned off (but clusterstate change hasn't been made since it's asynchronous) and wasn't able to actually forward the request to the real leader.

          I don't think we really protect against such cases where there is only a single leader that can accept an update because all its replicas go bad and then it goes away but the replicas come back. That is what min replication factor on the request is meant to handle. For full data promises, you want to use it - an achieved replication factor of 1 is not going to be fault tolerant.

          Jessica Cheng Mallet added a comment -

          I don't think we really protect against such cases where there is only a single leader that can accept an update

          This is not the scenario I'm describing. If you have 3 replicas and one that was the leader gets partitioned off, one of the other 2 will get elected and they can carry on. However, during this transition time, because the cluster state update hasn't been completed or propagated through watches, the old leader can still get trailing updates from the client. In a normal case where the updates are successfully forwarded to all replicas, no one cares. But in this case, the old leader cannot forward the update to others (because it's partitioned off), so it should not reply success to the client because that would be wrong (it is not the leader and it does not have the right to tell the others to recover).

          Mark Miller added a comment -

          The old leader has defensive checks that should make that pretty unlikely.

          But yes, it's the same thing. Only one node ack'd your update. We don't protect against that and you have to use min rep factor.

          Gregory Chanan added a comment -

          Most of this seems like just adding defensive checks, which seem reasonable.

          List<Op> ops = new ArrayList<>(2);

          nit: this should be 3 in two places.

          +      ops.add(Op.check(new org.apache.hadoop.fs.Path(((ShardLeaderElectionContextBase)context).leaderPath).getParent().toString(), leaderZkNodeParentVersion));
                 ops.add(Op.setData(znodePath, znodeData, -1));
          

          What happens if the leaderZkNodeParentVersion doesn't match? Presumably that's a possibility or else why add the check. We don't want to loop and see if we get an updated version in electionContexts? I'm certainly not well versed in this area of the code but checking isLeader seems a little roundabout – isn't the leaderZkNodeParentVersion what we actually care about? What happens if we think we are the leader but the version doesn't match? What does that mean? Certainly we can optimistically try whatever we pulled out of electionContexts the first time, as you've done here to avoid a zk trip.

          Mark Miller added a comment -

          The defensive checks are just sugar like I said. If we are going to check zk to see if we are still the leader, it makes sense to check our local reckoning first.

          The meat of the change is the parent version check. If it fails, we don't care. The leader has moved on - we don't care about retries.

          Mark Miller added a comment -

          I was out on the phone last night - a fuller reply:

          What happens if the leaderZkNodeParentVersion doesn't match?

          The leader cannot update the zk node as we want.

          Presumably that's a possibility or else why add the check.

          It's the whole point of the patch?

          I'm certainly not well versed in this area of the code but checking isLeader seems a little roundabout

          There is no reason to go to zk if we already know we are not the leader locally - what is roundabout about it?

          What does that mean?

          That the fix worked??
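
          In other words, something along these lines; the variable names are illustrative and the real patch goes through SolrZkClient rather than the raw client, but the shape of the failure handling is the point:

            import java.util.Arrays;
            import org.apache.zookeeper.KeeperException;
            import org.apache.zookeeper.Op;
            import org.apache.zookeeper.ZooKeeper;

            class LirParentVersionCheck {
              // Attempt the LIR write predicated on the leader znode's parent version;
              // a failed check simply means we no longer hold leadership, so do nothing.
              static boolean setLirState(ZooKeeper zk, String leaderParentPath, int leaderParentVersion,
                                         String lirZnodePath, byte[] lirData) throws InterruptedException {
                try {
                  zk.multi(Arrays.asList(
                      // Aborts the whole multi if a different leader has since been registered.
                      Op.check(leaderParentPath, leaderParentVersion),
                      Op.setData(lirZnodePath, lirData, -1)));
                  return true;
                } catch (KeeperException e) {
                  // e.g. BadVersionException: the leader moved on between deciding to mark
                  // the replica down and the write. Per the discussion above, no retry.
                  return false;
                }
              }
            }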

          ASF subversion and git services added a comment -

          Commit 1704836 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1704836 ]

          SOLR-8069: Ensure that only the valid ZooKeeper registered leader can put a replica into Leader Initiated Recovery.

          Mark Miller added a comment -

          So this adds sensible local isLeader checks where we were already checking ZK, it passes the core descriptor instead of just a name to LIR so it has a lot more context to work with, and it ensures that only the registered ZK leader can put a replica into LIR.

          Barring any bugs in the current code, let's open further issues for other changes / improvements.

          ASF subversion and git services added a comment -

          Commit 1704837 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1704837 ]

          SOLR-8069: Ensure that only the valid ZooKeeper registered leader can put a replica into Leader Initiated Recovery.

          Shalin Shekhar Mangar added a comment -

          There's a reproducible failure in the test added by SOLR-8075 caused by an assertion error on asserts added in this issue.

          1 tests failed.
          FAILED:  org.apache.solr.cloud.LeaderInitiatedRecoveryOnShardRestartTest.testRestartWithAllInLIR
          
          Error Message:
          Captured an uncaught exception in thread: Thread[id=43491, name=coreZkRegister-5997-thread-1, state=RUNNABLE, group=TGRP-LeaderInitiatedRecoveryOnShardRestartTest]
          
          Stack Trace:
          com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=43491, name=coreZkRegister-5997-thread-1, state=RUNNABLE, group=TGRP-LeaderInitiatedRecoveryOnShardRestartTest]
          Caused by: java.lang.AssertionError
                  at __randomizedtesting.SeedInfo.seed([7F78F76DDF75FAD1]:0)
                  at org.apache.solr.cloud.ZkController.updateLeaderInitiatedRecoveryState(ZkController.java:2133)
                  at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:434)
                  at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:197)
                  at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:157)
                  at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:346)
                  at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:1113)
                  at org.apache.solr.cloud.ZkController.register(ZkController.java:926)
                  at org.apache.solr.cloud.ZkController.register(ZkController.java:881)
                  at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:183)
                  at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:745)
          

          The assertion that leaderCd != null fails because ShardLeaderElectionContext.runLeaderProcess calls ZkController.updateLeaderInitiatedRecoveryState with a null core descriptor, which is by design because if you are marking a replica as 'active' then you don't necessarily need to be the leader.

          Mark Miller added a comment -

          The problem is unrelated to this issue. That assert is correct and it's catching a bug with SOLR-8075 or something. It's passing null when it should pass the core descriptor.


            People

            • Assignee: Mark Miller
            • Reporter: Mark Miller
            • Votes: 1
            • Watchers: 9
