Hadoop HDFS › HDFS-3042 (Automatic failover support for NN HA) › HDFS-3192

Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ha, namenode
    • Labels: None

      Activity

      Todd Lipcon added a comment -

      This discussion trailed off, and in some offline discussion with Hari it seems like everyone is on board with the current mechanisms, so I'm resolving this as Won't Fix. If we change our minds in the future, we can re-open this.

      Todd Lipcon added a comment -

      The state diagram is included in the design doc attached to HDFS-2185. Please comment with an example scenario in which you think there is incorrect behavior - I don't know of any aside from HADOOP-8217, but if you know of some I'd be really happy to address them rather than find out about them from a broken customer.

      Hari Mankude added a comment -

      Would it be possible to post a state transition diagram for the failover controller and its interactions with the NN? There are concerns about the correctness of situations where zkfc2 is directing nn1 to change states and vice versa.

      Hari Mankude added a comment -

      2a) If the local node is still accessible, then the other node will have already called transitionToStandby(), in which case our own call will be a no-op (since we're already in standby state). Everything is correct, because the transitionToStandby() call flushes everything and gracefully closes its edit log writer.

      This is very dangerous and can result in all sorts of races.

      1. ZKFC2 initiates a transitionToStandby() RPC to NN1.
      2. Meanwhile, the RPC has not yet started executing on NN1.
      3. ZKFC2 loses the znode.
      4. ZKFC1 now takes over.
      5. ZKFC1 does a becomeActive() on NN1.
      6. The delayed transitionToStandby() now starts executing, converting NN1 to standby. There are no active NNs in the cluster now.

      ZKFC should communicate ONLY with its local NN. Otherwise, it will result in all sorts of messy race conditions. Communication between NNs should be via zookeeper znodes, editlogs and datanodes.

      1) NN1 writing to edits log
      2) ZKFC1 loses lease, but doesn't know about it yet
      3) ZKFC2 gets lease
      4) NN2 becomes active, starts writing logs
      5) NN1 writes some edits. World explodes.
      6) ZKFC1 gets asynchronous notification from ZK that it lost its session. Anything you do at this point is too late.

      Doing RPC to your own NN is subject to way more race conditions because we have no way of enforcing an ordering between NN1 going standby and NN2 becoming active. NN2 has to verify that NN1 is either standby or effectively dead before becoming active. The only way to do that is to first (a) ask it to be standby, or (b) fence.

      I disagree. Doing RPC to your own NN is the safest mechanism available in the HA environment. It is definitely safer than doing the RPC to a remote NN. Do you agree?
      To be clear, I consider fencing required as well, and I am not suggesting this method as an alternative to fencing. Instead, this method ensures there are fewer situations where the complicated fencing algorithm has to be used, and reduces the probability of error.

      The "self-resign" in step 6 is insufficient. We have to fence between step 3 and step 4. Whatever NN1 happens to do after that point doesn't help anything because it's too late.

      I am not talking about self-resign in this situation. Self-resign as per this jira will happen only if ZKFC1 is dead. In the above example, ZKFC1 is not dead.
      For the above example, ZKFC1 should abort NN1 when znode state change has happened and restart NN1.
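The "ZKFC acts only on its local NN" policy argued for above can be sketched as follows. This is an illustration of the proposal, not Hadoop code: `LocalNameNode`, `LocalOnlyZkfc`, and `onZnodeLost` are all hypothetical stand-ins.

```java
// Sketch of the behaviour argued for in this comment: on losing its znode,
// a ZKFC acts only on its *local* NN (abort, then restart) instead of
// issuing RPCs to the remote NN. All names here are hypothetical.

interface LocalNameNode {
    void abort();    // kill the local NN process
    void restart();  // restart it; it will rejoin the cluster as a standby
}

class LocalOnlyZkfc {
    private final LocalNameNode nn;
    private boolean holdsZnode = true;
    final java.util.List<String> actions = new java.util.ArrayList<>();

    LocalOnlyZkfc(LocalNameNode nn) { this.nn = nn; }

    boolean holdsZnode() { return holdsZnode; }

    // Callback invoked when ZooKeeper reports our ephemeral znode is gone.
    void onZnodeLost() {
        holdsZnode = false;
        nn.abort();     // never RPC the remote NN from here
        actions.add("abort");
        nn.restart();   // come back as a standby, never as active
        actions.add("restart");
    }
}
```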

      Todd Lipcon added a comment -

      Maybe the confusion is about readers? The one "hole" we have now is that, if there are no writes happening on the active, then it may happily continue to think it's active until the next write. I'd be in favor of solving this by adding code which writes a "no-op" edit to the edit log once every second or so. This bounds the amount of time in which it may think it's active and respond to read requests – since the no-op edit will cause an abort if it loses its write access. Does that satisfy the issue you're raising?
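The periodic "no-op" edit proposed above can be sketched as a small probe loop. This is a sketch of the idea only; `EditLog` and `NoOpEditPinger` are hypothetical names, and the real FSEditLog API differs.

```java
// Sketch of the proposal: periodically write a no-op edit and sync it.
// If the sync fails (e.g. the storage has been fenced off), the NN must
// abort. This bounds how long a fenced-off active can serve stale reads.

interface EditLog {
    void logNoOp();                              // append a no-op transaction
    void logSync() throws java.io.IOException;   // flush; throws if fenced
}

class NoOpEditPinger implements Runnable {
    private final EditLog log;
    private final long intervalMs;
    volatile boolean aborted = false;

    NoOpEditPinger(EditLog log, long intervalMs) {
        this.log = log;
        this.intervalMs = intervalMs;
    }

    // One probe: true if the edits are still writable; false (and marks
    // aborted) once write access has been fenced away.
    boolean probeOnce() {
        try {
            log.logNoOp();
            log.logSync();
            return true;
        } catch (java.io.IOException e) {
            aborted = true;   // the real NN would terminate the process here
            return false;
        }
    }

    @Override
    public void run() {
        while (!aborted) {
            if (!probeOnce()) break;
            try { Thread.sleep(intervalMs); } catch (InterruptedException e) { return; }
        }
    }
}
```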

      Todd Lipcon added a comment -

      Are you suggesting that ZKFC1 does transitionToStandby() when it loses znode?

      It currently does, but I don't think it has to – since ZKFC2 will call NN1.transitionToStandby (see below).

      On an active NN, there is a high probability that it might abort

      There are two possible scenarios:
      1) the local node finds out about the session expiration before the standby. In this case, it will call transitionToStandby, the local node will flush and close its edit logs, and gracefully transition.
      2) the local node finds out after the other node. In this case, the other node will have already initiated the fencing process.
      2a) If the local node is still accessible, then the other node will have already called transitionToStandby(), in which case our own call will be a no-op (since we're already in standby state). Everything is correct, because the transitionToStandby() call flushes everything and gracefully closes its edit log writer.
      2b) If the local node is inaccessible (e.g. network down) then the other node initiates non-graceful fencing. If it does STONITH, then our node will go down, and the discussion is moot. If it does storage fencing, then our node no longer has access to write to storage. This will prevent transitionToStandby() from succeeding, since it will try to finalize its current edit log segment (which involves mutating the fenced-off storage). So, it will correctly abort.

      I don't think that doing tryGracefulFence() from NN2 to NN1 is safe. First of all, this is opening up one more channel of communication between NN1 and NN2, and this is subject to various race sequences, split-brain, etc.

      Doing RPC to your own NN is subject to way more race conditions because we have no way of enforcing an ordering between NN1 going standby and NN2 becoming active. NN2 has to verify that NN1 is either standby or effectively dead before becoming active. The only way to do that is to first (a) ask it to be standby, or (b) fence.

      The lack of correctness in relying on self-resign is the example I gave above:

      1) NN1 writing to edits log
      2) ZKFC1 loses lease, but doesn't know about it yet
      3) ZKFC2 gets lease
      4) NN2 becomes active, starts writing logs
      5) NN1 writes some edits. World explodes.
      6) ZKFC1 gets asynchronous notification from ZK that it lost its session. Anything you do at this point is too late.

      The "self-resign" in step 6 is insufficient. We have to fence between step 3 and step 4. Whatever NN1 happens to do after that point doesn't help anything because it's too late.
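The ordering argued for above, that the new active must first ask the old active to resign and only fall back to forced fencing if that fails, can be sketched as follows. This is a simplified illustration: `HAService`, `Fencer`, and `fenceOldActive` are hypothetical names; the real FailoverController/NodeFencer APIs differ in detail.

```java
// Sketch of "graceful fence first, forced fence on failure". Only after one
// of the two succeeds may the new NN safely begin writing edits.

interface HAService {
    void transitionToStandby() throws java.io.IOException; // RPC to old active
}

class Fencer {
    // Returns which mechanism succeeded ("graceful" or "forced").
    static String fenceOldActive(HAService oldActive, Runnable forcedFence) {
        try {
            // (a) ask the old active to resign; a successful return means it
            // flushed and gracefully closed its edit log.
            oldActive.transitionToStandby();
            return "graceful";
        } catch (java.io.IOException e) {
            // (b) old active unreachable: fall back to STONITH / storage fencing
            // so its next write is guaranteed to fail.
            forcedFence.run();
            return "forced";
        }
    }
}
```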

      Hari Mankude added a comment -

      Can you explain why it has to restart, instead of just transitioning to standby? What do you mean by "in limbo" here?

      "in limbo" implies that NN1 thinks it is active even though NN2 has taken over, since NN1 has not tried to access the editlogs. So it is not behaving as a standby and keeping up with the active. Are you suggesting that ZKFC1 does transitionToStandby() when it loses the znode? On an active NN, there is a high probability that it might abort. Also, does transitionToStandby() guarantee that all the active-state threads have quiesced?

      Before issuing an "uncontrolled abort", the ZKFC2 will always try to do a "graceful fence" – ie ask it to self-resign via an RPC. See the tryGracefulFence function in the FailoverController class.

      I don't think that doing tryGracefulFence() from NN2 to NN1 is safe. First of all, this opens up one more channel of communication between NN1 and NN2, which is subject to various race sequences, split-brain, etc. I think self-resign is much safer than tryGracefulFence(). So far, I don't see a lack-of-correctness argument in our discussion. Is my description correct here?

      Todd Lipcon added a comment -

      So in step #6, irrespective of when ZKFC1 gets the notification, ZKFC1 has to restart NN1. Otherwise, we don't know as to how long NN1 will stay in limbo.

      Can you explain why it has to restart, instead of just transitioning to standby? What do you mean by "in limbo" here?

      Also, NN1 could resign much earlier without having to go through an uncontrolled abort via fencing

      Before issuing an "uncontrolled abort", the ZKFC2 will always try to do a "graceful fence" – ie ask it to self-resign via an RPC. See the tryGracefulFence function in the FailoverController class.

      Having the other node ask it to resign is better than having it ask itself to resign – the reason being that this is the only way the other node can be sure it's "in the clear" to start writing to the logs (a "self-resignation" might come too late). Since the other node always has to verify the resignation before it starts to write, there's nothing extra gained by having it resign itself first. It's just redundant.

      Hari Mankude added a comment -

      Excellent comment regarding quorum and the active aborting when it cannot write to (n/2 + 1) editlog entries.

      Should say editlog locations.

      Hari Mankude added a comment -

      Excellent comment regarding quorum and the active aborting when it cannot write to (n/2 + 1) editlog entries. I was getting worried that transitionToStandby() without an NN restart was being suggested in this scenario also. Thanks for clarifying this for me.

      Before we get back to the "ZKFC dead implies NN should die" situation, let us understand a specific scenario here:

      1) NN1 writing to edits log
      2) ZKFC1 loses lease, but doesn't know about it yet
      3) ZKFC2 gets lease
      4) NN2 becomes active, starts writing logs
      5) NN1 writes some edits. World explodes.
      6) ZKFC1 gets asynchronous notification from ZK that it lost its session. Anything you do at this point is too late.

      Let us assume that NN1 has no edit logs to write. The reason could be that we also implement IP failover, and DFSClients automatically start talking to NN2 after the IP failover. There are no edit log entries being created at NN1. So NN1 stays active and never behaves as a standby.

      So in step #6, irrespective of when ZKFC1 gets the notification, ZKFC1 has to restart NN1. Otherwise, we don't know how long NN1 will stay in limbo.

      Now coming to the scenario that I was referring to: the ZKFC on NN1 is either dead or in a GC pause. NN1 could similarly stay immobile. Also, NN1 could resign much earlier without having to go through an uncontrolled abort via fencing. It is always a much stronger correctness argument when NN1 can self-resign with the information that is available rather than being forced to resign.

      Todd Lipcon added a comment -

      I think there is confusion here over the terminology "loses quorum".

      I agree completely about the following: if any NN fails to sync its edit logs, it needs to abort. This is already what it does today - no changes necessary.
      If the edit log happens to be implemented using a quorum protocol (HDFS-3092 or HDFS-3077 for example), then that behavior should be maintained. The JournalManager implementation needs to throw an exception in response to logSync(). That will cause the NN to abort.

      That's all that's necessary for correctness - an NN won't ack "success" to a write unless it successfully syncs it, and will abort rather than rollback, since we have no rollback capability.

      In the above sense, "loses quorum" really means "loses write access to the edit logs".
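The quorum rule described above, that the NN must abort rather than ack an edit a majority of journals did not persist, can be sketched as follows. `Journal` and `QuorumEditLog` are illustrative stand-ins, not the real Hadoop JournalManager interface.

```java
// Sketch: a logSync that throws -- and hence forces an NN abort -- unless a
// majority (n/2 + 1) of editlog locations acknowledge the write. The NN has
// no rollback, so the caller must abort rather than ack an unreplicated edit.

import java.io.IOException;
import java.util.List;

interface Journal {
    boolean write(byte[] edits); // true if this journal persisted the batch
}

class QuorumEditLog {
    private final List<Journal> journals;

    QuorumEditLog(List<Journal> journals) { this.journals = journals; }

    void logSync(byte[] edits) throws IOException {
        int acks = 0;
        for (Journal j : journals) {
            if (j.write(edits)) acks++;
        }
        int quorum = journals.size() / 2 + 1;  // majority of editlog locations
        if (acks < quorum) {
            throw new IOException("only " + acks + "/" + journals.size()
                    + " journals acked; quorum is " + quorum);
        }
    }
}
```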

      If instead you're talking about "loses quorum" as "loses its ZK session", then no abort is necessary, because it may still be able to write to its edits. So long as it's getting "success" back from editLog.logSync(), then the edits are being persisted. It is the responsibility of the next active to fence access to the shared edits. It may do so in one of two ways:
      1) Edits fencing: ensure that the next write to the edits mechanism throws IOE. In the case of FileJournalManager on NAS, this is done via an RPC to the NAS system to fence the given export.
      2) STONITH: ensure that the next write fails because power has been yanked from the machine.

      Alternatively, the new active may first try a "graceful transition":
      3) Gracefully ask the prior active to stop writing. The prior active flushes anything buffered, successfully syncs, and then enters standby mode.
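For reference, fencing mechanisms of this kind are configured in hdfs-site.xml via dfs.ha.fencing.methods, tried in order. A sketch, in which the storage-fencing script path and name are placeholders:

```xml
<!-- Try a graceful/ssh fence first, then fall back to a custom
     storage-fencing script (hypothetical path). -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>
    sshfence
    shell(/path/to/fence-storage.sh $target_host)
  </value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```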

      Notably, "self-stonith upon losing the ZK lease" is not an option, because it may take arbitrarily long before the node notices. E.g.:
      1) NN1 writing to edits log
      2) ZKFC1 loses lease, but doesn't know about it yet
      3) ZKFC2 gets lease
      4) NN2 becomes active, starts writing logs
      5) NN1 writes some edits. World explodes.
      6) ZKFC1 gets asynchronous notification from ZK that it lost its session. Anything you do at this point is too late.

      Before step 4, NN2 must use a fencing mechanism, regardless of whatever steps NN1 or ZKFC1 might take in step 6.

      Hari Mankude added a comment -

      Why is the behaviour different from what happens when the zkfc loses the ephemeral node? Currently the zkfc, when it loses the ephemeral node, will shut down the active NN

      No, it doesn't - it will transition it to standby. But, as I commented elsewhere, this is redundant, because the new active is actually going to fence it anyway before taking over.

      Well, this is incorrect behaviour. It does not handle the situation that I mentioned earlier about requiring rollbacks. The transition to standby will result in the old active having incorrect in-memory state. The only way around this is to shut down the active. The reason is that as soon as the zkfc on NN1 has lost the ephemeral znode, it is possible that the zkfc on NN2 has taken over the znode and NN2 has started fencing the journals. There is no way to gracefully coordinate this with NN1. This would result in NN1 getting quorum loss, which in turn could leave the in-memory state in NN1 in an inconsistent shape. Do you agree again that the in-memory state of NN1 is inconsistent with the editlogs?

      Similarly if active NN does not hear from zkfc, it implies that zkfc is dead, going through gc pause essentially resulting in loss of ephemeral node.

      But this can reduce uptime. For example, imagine an administrator accidentally changes the ACL on zookeeper. This causes both ZKFCs to get an authentication error and crash at the same time. With your design, both NNs will then commit suicide. With the existing implementation, the system will continue to run in its existing state – i.e. no new failovers will occur, but whoever is active will remain active.

      Firstly, how often does someone change ACLs in zookeeper? Secondly, why is the ZKFC dying when this happens? The ZKFC must be more robust than the NN. The NN is a resource that is controlled by the ZKFC. We should make the zkfc more robust to handle zookeeper ACL changes if this is a common occurrence.

      If active NN loses quorum, it has to shutdown

      Yes, it has to shut down before it does any edits, or it has to be fenced by the next active. Notification of session loss is asynchronous. The same is true of your proposal. In either case it can take arbitrarily long before it "notices" that it should not be active. So we still require that the new active fence it before it becomes active. So, this proposal doesn't solve any problems.

      My proposal was not meant to handle the active NN losing quorum. My proposal is to shut down the NN when its ZKFC has died or is in a GC pause.

      My comment was with regard to the earlier comment about doing a transitionToStandby(). Do you agree that the active NN has invalid in-memory state and cannot go through transitionToStandby() when it loses quorum? There seem to be two solutions:
      1. Implement rollback for the various types of editlog entries and then do transitionToStandby(), OR
      2. Shut down the NN when it loses quorum.
      Does this sound right?

      Show
      Hari Mankude added a comment - bq. Why is the behaviour different from what happens when zkfc loses the ephemeral node? Currently zkfc when it loses the ephemeral node will shutdown the active NN No, it doesn't - it will transition it to standby. But, as I commented elsewhere, this is redundant, because the new active is actually going to fence it anyway before taking over. Well this is incorrect behaviour. It does not handle the situation that I mentioned earlier about requiring rollbacks. The transition to standby will have result in old active having incorrect in-memory state. The only way around this is to shutdown the active. The reason is that as soon zkfc on NN1 has lost the ephemeral znode, it is possible that zkfc on NN2 has taken over the znode and NN2 has started fencing the journals. There is no way to gracefully coordinate this with NN1. This would result in NN1 getting quorum loss which in turn could leave the in-memory state in NN1 in an inconsistent shape. Do you agree again that in-memory state of NN1 is inconsistent with the editlogs? Similarly if active NN does not hear from zkfc, it implies that zkfc is dead, going through gc pause essentially resulting in loss of ephemeral node. But this can reduce uptime. For example, imagine an administrator accidentally changes the ACL on zookeeper. This causes both ZKFCs to get an authentication error and crash at the same time. With your design, both NNs will then commit suicide. With the existing implementation, the system will continue to run in its existing state – i.e no new failovers will occur, but whoever is active will remain active. Firstly, how often does some change ACLs in zookeeper? Secondly, why is ZKFC dying when this happens? ZKFC must be more robust than NN. NN is a resource that is controlled by ZKFC. We should make zkfc more robust to handle zookeeper acl changes if this is a common occurance. 
If active NN loses quorum, it has to shutdown Yes, it has to shut down before it does any edits, or it has to be fenced by the next active. Notification of session loss is asynchronous. The same is true of your proposal. In either case it can take arbitrarily long before it "notices" that it should not be active. So we still require that the new active fence it before it becomes active. So, this proposal doesn't solve any problems. My proposal was not meant to handle active NN losing quorum. My proposal is shutdown NN when ZKFC has died or is in a gc pause. My comment was with regards to earlier comment regarding doing a transitionToStandby(). Do you agree that active NN has invalid in-memory state and cannot go through transitionToStandby() when it loses quorum? There seems to be two solutions. 1. Implement rollback for various types of editlog entries and then do transitionToStandby() OR 2. Shutdown NN when it loses quorum Does this sound right?
      Todd Lipcon added a comment -

      bq. I thought we are not going to have external stonith using special devices and that is mainly the reason why we are going through hoops to implement fencing in journal daemons.

      In the current design, which uses a filer, we require external stonith devices. There is no correct way of doing it without either stonith or storage fencing.

      The proposal with the journal-daemon based fencing is essentially the same as storage fencing - just that we do it with our own software storage instead of a NAS/SAN.

      bq. Why is the behaviour different from what happens when zkfc loses the ephemeral node? Currently zkfc when it loses the ephemeral node will shutdown the active NN

      No, it doesn't - it will transition it to standby. But, as I commented elsewhere, this is redundant, because the new active is actually going to fence it anyway before taking over.

      bq. Similarly if active NN does not hear from zkfc, it implies that zkfc is dead, going through gc pause essentially resulting in loss of ephemeral node.

      But this can reduce uptime. For example, imagine an administrator accidentally changes the ACL on zookeeper. This causes both ZKFCs to get an authentication error and crash at the same time. With your design, both NNs will then commit suicide. With the existing implementation, the system will continue to run in its existing state – i.e. no new failovers will occur, but whoever is active will remain active.

      bq. If active NN loses quorum, it has to shutdown

      Yes, it has to shut down before it does any edits, or it has to be fenced by the next active. Notification of session loss is asynchronous. The same is true of your proposal. In either case it can take arbitrarily long before it "notices" that it should not be active. So we still require that the new active fence it before it becomes active. So, this proposal doesn't solve any problems.

      bq. In fact, one of the most difficult APIs to implement correctly would be transitionToStandby() from active state.

      We already have that implemented. It syncs any existing edits, and then stops allowing new ones. We allow failover from one node to another without aborting, so long as it's graceful. This is perfectly correct. If we need to do a non-graceful failover, we fence the node by STONITH or by disallowing further access to the edit logs (which indirectly causes the node to abort, since logSync() fails).

      It seems you're trying to solve problems we've already solved.
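      The graceful handoff described here - sync any existing edits, then stop allowing new ones - can be sketched in miniature (plain illustrative Java, not the actual FSNamesystem/FSEditLog code; all names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class GracefulTransition {
    private final List<String> pending = new ArrayList<>(); // buffered edits
    private final List<String> durable = new ArrayList<>(); // synced edits
    private boolean active = true;

    void logEdit(String edit) {
        if (!active) {
            // After the handoff, no edit may be written by the old active.
            throw new IllegalStateException("not active: edit rejected");
        }
        pending.add(edit);
    }

    // Analogue of logSync(): make buffered edits durable.
    void sync() {
        durable.addAll(pending);
        pending.clear();
    }

    // Graceful handoff: sync existing edits, then stop allowing new ones.
    void transitionToStandby() {
        sync();
        active = false;
    }

    public static void main(String[] args) {
        GracefulTransition nn = new GracefulTransition();
        nn.logEdit("mkdir /a");
        nn.transitionToStandby();
        System.out.println(nn.durable);       // the edit was synced first
        try {
            nn.logEdit("mkdir /b");           // must fail after the handoff
        } catch (IllegalStateException e) {
            System.out.println("rejected");
        }
    }
}
```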

      Hari Mankude added a comment -

      bq. Why add multiple stonith paths, given we need external stonith anyway? It just adds to the complexity by increasing the number of scenarios we have to debug, etc.

      I thought we are not going to have external stonith using special devices, and that is mainly the reason why we are going through hoops to implement fencing in journal daemons.

      bq. That is to say: if the ZKFC dies, then it will lose its lock, and the other node will stonith this one when it takes over. What's the benefit of having it abort itself at the same time? In fact, it seems to be detrimental, because if it stays up, the other node can do a graceful transitionToStandby() call rather than having to do something more drastic like a full abort.

      I disagree about two items here.

      1. Why is the behaviour different from what happens when zkfc loses the ephemeral node? Currently zkfc, when it loses the ephemeral node, will shut down the active NN. Similarly, if the active NN does not hear from zkfc, it implies that zkfc is dead or going through a gc pause, essentially resulting in loss of the ephemeral node.

      2. If the active NN loses quorum, it has to shut down. There is no way to do a transitionToStandby(), especially since the log is updated after the NN metadata is updated and there is no way to roll back the last update. This is just one of the issues that we are aware of where a rollback would be necessary. There might be other situations where rollback is required. In fact, one of the most difficult APIs to implement correctly would be transitionToStandby() from active state.

      Todd Lipcon added a comment -

      Why add multiple stonith paths, given we need external stonith anyway? It just adds to the complexity by increasing the number of scenarios we have to debug, etc.

      That is to say: if the ZKFC dies, then it will lose its lock, and the other node will stonith this one when it takes over. What's the benefit of having it abort itself at the same time? In fact, it seems to be detrimental, because if it stays up, the other node can do a graceful transitionToStandby() call rather than having to do something more drastic like a full abort.

      Hari Mankude added a comment -

      bq. Why?

      bq. I think it's an advantage that the FC may die and come back, or that you may start the FCs after the NNs.

      Well, if the FC restarts health monitoring within the timeout period, then the NN will not die. However, if the FC is having a gc pause or is not restarting, then the NN should die. This is the first level of protection, wherein a healthy NN can stonith itself.

      Suresh Srinivas added a comment -

      Hari, agree with Aaron that this should not be a subtask of HDFS-3092.

      Aaron T. Myers added a comment -

      Also, why is this a sub-task of HDFS-3092? Seems like it should be a sub-task of the auto-failover JIRA.

      Todd Lipcon added a comment -

      Why?

      I think it's an advantage that the FC may die and come back, or that you may start the FCs after the NNs.

      Hari Mankude added a comment -

      The FC, via the health monitor, will periodically poll the status of the NN via getServiceStatus(). If the active NN has not heard from the FC for a configured timeout, it should exit. This timeout has to be tied to the timeout of the ephemeral node in zookeeper and should be a fraction of the ephemeral-node timeout.

      This ensures that if the FC has died, the active NN also fails fast.
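      The proposed watchdog could be sketched as follows (illustrative Java only, not an actual Hadoop API; the class, the method names, and the one-half fraction are all hypothetical):

```java
// Tracks the last getServiceStatus() poll from the ZKFC; once the ZKFC has
// been silent for longer than a fraction of the ZK session timeout, the
// active NN should exit (fail fast), since the ephemeral node may be gone.
public class HealthCheckWatchdog {
    private final long timeoutMs; // fraction of the ZK ephemeral-node timeout
    private long lastCheckMs;

    public HealthCheckWatchdog(long zkSessionTimeoutMs, long startMs) {
        this.timeoutMs = zkSessionTimeoutMs / 2; // hypothetical fraction
        this.lastCheckMs = startMs;
    }

    // Called whenever the ZKFC's getServiceStatus() poll arrives.
    public void recordHealthCheck(long nowMs) {
        lastCheckMs = nowMs;
    }

    // True once the ZKFC has been silent past the deadline.
    public boolean shouldExit(long nowMs) {
        return nowMs - lastCheckMs > timeoutMs;
    }

    public static void main(String[] args) {
        // 10s ZK session timeout -> 5s silence budget
        HealthCheckWatchdog w = new HealthCheckWatchdog(10_000, 0);
        w.recordHealthCheck(2_000);
        System.out.println(w.shouldExit(6_000)); // 4s since last poll: stay up
        System.out.println(w.shouldExit(8_000)); // 6s of silence: exit
    }
}
```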


        People

        • Assignee:
          Hari Mankude
          Reporter:
          Hari Mankude
        • Votes:
          0
          Watchers:
          4
