Details

      Description

      This JIRA is for a ZK-based FailoverController daemon. The FailoverController is a separate daemon from the NN that does the following:

      • Initiates leader election (via ZK) when necessary
      • Performs health monitoring (aka failure detection)
      • Performs fail-over (standby to active and active to standby transitions)
      • Heartbeats to ensure liveness

      It should have the same or a similar interface as the Linux HA RM to aid pluggability.

      Attachments

      1. zkfc-design.tex
        15 kB
        Todd Lipcon
      2. zkfc-design.pdf
        165 kB
        Todd Lipcon
      3. zkfc-design.pdf
        232 kB
        Todd Lipcon
      4. zkfc-design.pdf
        234 kB
        Todd Lipcon
      5. zkfc-design.pdf
        243 kB
        Todd Lipcon
      6. hdfs-2185.txt
        62 kB
        Todd Lipcon
      7. hdfs-2185.txt
        15 kB
        Todd Lipcon
      8. hdfs-2185.txt
        16 kB
        Todd Lipcon
      9. hdfs-2185.txt
        17 kB
        Todd Lipcon
      10. hdfs-2185.txt
        18 kB
        Todd Lipcon
      11. Failover_Controller.jpg
        138 kB
        Bikas Saha


          Activity

          Florian Haas added a comment -

          Allow me to comment that the Pacemaker stack already fulfills all of the above.

          • Initiates leader election (via ZK) when necessary – Pacemaker calls this a Designated Coordinator, which is elected automatically.
          • Performs health monitoring (aka failure detection) – Pacemaker does this via the monitor action of resource agents, which follow the Open Cluster Framework (OCF) standard
          • Performs fail-over (standby to active and active to standby transitions) – Pacemaker does this automatically, including fencing, quorum and other vital concepts
          • Heartbeats to ensure liveness – Pacemaker does this over one of the two cluster communication layers it supports, Heartbeat and Corosync.

          Why reinvent the wheel?

          Aaron T. Myers added a comment -

          Hey Florian, I don't think anyone would disagree with you that Pacemaker already provides much of this functionality. The design document in HDFS-1623 discusses both using Pacemaker or a Hadoop-specific failover controller. The intention of HDFS-1623 is absolutely to provide the necessary hooks to be able to support either one.

          Uma Maheswara Rao G added a comment -

          Hi Eli,

          Are you planning to post the design document for this?

          Small question.

          •Performs health monitoring (aka failure detection)

          Here, do we need separate monitoring logic? As far as I know, ZK client watchers can give callbacks if we register the watchers, right?
          Or are you planning something alternative here?

          If you post the design doc, it would be great.

          -thanks
          Uma

          Todd Lipcon added a comment -

          Here's a design sketch – I have only done a little bit of implementation but nothing really fleshed out yet. So, it might change a bit during the course of implementation. But feedback on the general approach would be appreciated!

          Goals

          • Ensure that only a single NN can be active at a time.
            • Use ZK as a lock manager to satisfy this requirement.
          • Perform health monitoring of the active NN to trigger a fail-over should it become unhealthy.
          • Automatically fail over in the case that one of the hosts fails (e.g. power/network outage)
          • Allow manual (administratively initiated) graceful failover
          • Initiate fencing of the previously active NN in the case of non-graceful failovers.

          Overall design

          The ZooKeeper FailoverController (ZKFC) is a separate process/program which runs next to each of the HA NameNodes in the cluster. It does not directly spawn/supervise the NN JVM process, but rather runs on the same machine and communicates with it via localhost RPC.

          The ZKFC is designed to be as simple as possible to decrease the likelihood of bugs which might trigger a false fail-over. It is also designed to use only a very small amount of memory, so that it will never have lengthy GC pauses. This allows us to set a fairly low time-out on the ZK session in order to detect machine failures quickly.

          Configuration

          The ZKFC needs the following pieces of configuration:

          • list of ZooKeeper servers making up the ZK quorum (fail to start if this is not provided)
          • host/port for the HAServiceProtocol of the local NN (defaults to localhost:<well-known port>)
          • "base znode" at which to root all of the znodes used by the process

          Nodes in ZK:

          Everything should be rooted at the configured base znode. Within that, there should be a znode per nameservice ID. Within this /base/nameserviceId/ directory, there are the following znodes:

          • activeLock - an ephemeral node taken by the ZKFC before it asks its local NN to become active. This acts as a mutex on the active state and also as a failure detector.
          • activeNodeInfo - a non-ephemeral node written by the ZKFC after it succeeds in taking activeLock. This should have data like the IPC address, HTTP address, etc of the NN.

          The activeNodeInfo is non-ephemeral so that, when a new NN takes over from a failed one, it has enough information to fence the previous active in case it's still actually running.
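
          As a rough illustration of this layout (a sketch only; the class and method names below are placeholders, and only the two znode names come from the design above), the ZooKeeper client calls would look roughly like this:

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooDefs.Ids;
          import org.apache.zookeeper.ZooKeeper;

          // Illustrative sketch of creating the two znodes described above.
          public class ZkfcZnodeSketch {
            static void takeActiveLock(ZooKeeper zk, String baseZnode, byte[] myNodeInfo)
                throws KeeperException, InterruptedException {
              // Ephemeral lock: released automatically if this ZKFC's ZK session dies,
              // which is what lets it double as a failure detector.
              zk.create(baseZnode + "/activeLock", new byte[0],
                  Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

              // Info node: deliberately NOT ephemeral, so that a successor still has
              // enough information (IPC address, etc.) to fence us after our session expires.
              zk.create(baseZnode + "/activeNodeInfo", myNodeInfo,
                  Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
          }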

          Runtime operation states

          For simplicity of testing, we can model the ZKFC as a state machine.

          LOCAL_NOT_READY

          The NN on the local host is down or not responding to RPCs. We start in this state.

          LOCAL_STANDBY

          The NN on the local host is in standby mode and ready to automatically transition to active if the former active dies.

          LOCAL_ACTIVE

          The NN on the local host is running and performing active duty.
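
          One way to write these states down in code (purely illustrative; the real code may not use an explicit enum):

          public enum ZkfcState {
            LOCAL_NOT_READY,  // local NN down or not responding to RPCs; initial state
            LOCAL_STANDBY,    // local NN healthy and in standby, ready to take over
            LOCAL_ACTIVE      // local NN healthy and currently active
          }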

          Inputs into state machine

          Three other classes interact with the state machine:

          ZK Controller

          A ZK thread connects to ZK and watches for the following events:

          • The previously active master has lost its ephemeral node
          • The ZK session is lost

          User-initiated failover controller

          By some means (RPC/signal/HTTP/etc) the user can request that the active NN's FC gracefully turn over the active state to a different NN.

          Health monitor

          A HealthMonitor thread heartbeats continuously to the local NN. It provides an event whenever the health state of the NN changes. For example:

          • NN has become unhealthy
          • Lost contact with NN
          • NN is now healthy
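
          A minimal sketch of what such an event interface could look like (illustrative only; the interface and event names below are assumptions, not an existing Hadoop API):

          // Hypothetical callback interface the ZKFC state machine could implement.
          public interface HealthCallback {
            enum HealthEvent { NN_UNHEALTHY, NN_UNREACHABLE, NN_HEALTHY }

            // Invoked by the HealthMonitor thread whenever the local NN's
            // health state changes.
            void onHealthEvent(HealthEvent event);
          }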

          Behavior of state machine

          LOCAL_NOT_READY state

          • System starts here
          • When HealthMonitor indicates the local NN is healthy:
            • Transition to LOCAL_STANDBY mode

          LOCAL_STANDBY

          • On health state change:
            • Transition to LOCAL_NOT_READY state if the local NN goes down
          • On ZK state change:
            • If the old ZK "active" node was deleted, try to initiate automatic failover
            • If our own ZK session died, reconnect to ZK

          Failover process (sketched in code after this list):

          • Try to create the "activeLock" ephemeral node in ZK
          • If we are unsuccessful, return to LOCAL_STANDBY
          • See if there is an "activeNodeInfo" node in ZK. If so:
            • The old NN may still be running (it didn't gracefully shut down).
            • Initiate the fencing process.
            • If successful, delete the "activeNodeInfo" node in ZK.
          • Create an "activeNodeInfo" node with our own information (i.e. NN IPC address, etc.)
          • Send an IPC to the local NN to transitionToActive. If successful, go to LOCAL_ACTIVE
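
          Here is a sketch of that sequence (assumptions: fenceOldActive() and transitionLocalNNToActive() are hypothetical stand-ins for the configured fencing methods and the HAServiceProtocol IPC; retry and error handling are elided):

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooDefs.Ids;
          import org.apache.zookeeper.ZooKeeper;

          // Illustrative sketch of the failover steps above; not the real implementation.
          public abstract class FailoverSketch {
            // Hypothetical hooks: the configured fencing methods, and the
            // transitionToActive() IPC to the local NN.
            protected abstract boolean fenceOldActive(byte[] oldActiveInfo);
            protected abstract boolean transitionLocalNNToActive();

            // Returns true if we end up in LOCAL_ACTIVE, false if we stay in LOCAL_STANDBY.
            boolean tryFailover(ZooKeeper zk, String base, byte[] myInfo)
                throws KeeperException, InterruptedException {
              try {
                zk.create(base + "/activeLock", new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
              } catch (KeeperException.NodeExistsException e) {
                return false;  // lost the race for the lock; remain in LOCAL_STANDBY
              }

              // If activeNodeInfo is still present, the old active did not shut down
              // gracefully and may still be running: fence it before taking over.
              byte[] oldInfo = null;
              try {
                oldInfo = zk.getData(base + "/activeNodeInfo", false, null);
              } catch (KeeperException.NoNodeException e) {
                // previous active exited gracefully; nothing to fence
              }
              if (oldInfo != null) {
                if (!fenceOldActive(oldInfo)) {
                  return false;  // refuse to become active if fencing failed
                }
                zk.delete(base + "/activeNodeInfo", -1);
              }

              // Record our own info, then ask the local NN to become active.
              zk.create(base + "/activeNodeInfo", myInfo,
                  Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
              return transitionLocalNNToActive();
            }
          }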

          LOCAL_ACTIVE

          • On health state change to unhealthy:
            • delete our active lock znode, go to LOCAL_NOT_READY. Another node will fence us.
          • On ZK connection loss or notice our znode got deleted:
            • another process is probably about to fence us... unless all nodes lost their connection, in which case we should "stay the course" and stay active here. We still need to figure this out.
          • On administrative request to failover:
            • IPC to local node to transition to standby
            • Delete our active lock znode
            • Transition to LOCAL_STANDBY with some timeout set, allowing another failover controller to take the lock.
          Todd Lipcon added a comment -

          BTW, should add that another goal is to implement a client failover solution which uses the activeNodeInfo information to locate the active NN. We can probably borrow some code from Dhruba's AvatarNode patch for this.

          Uma Maheswara Rao G added a comment -

          Hi Todd,

          Small question before going through the proposal in detail.

          I think ZooKeeper already has in-built "leader election recipe" implementations ready, right? Are we going to reuse those implementations?
          It seems to me that we are trying to implement leader election again here.

          A couple of JIRAs from ZooKeeper: ZOOKEEPER-1209, ZOOKEEPER-1095, ZOOKEEPER-1080

          Thanks
          Uma

          Todd Lipcon added a comment -

          Yea, this is very similar to the leader election recipe - I planned to base the code somewhat on that code for best practices. But the major difference is that we need to do fencing as well, which requires that we leave a non-ephemeral node behind when our ephemeral node expires, so the new NN can fence the old.

          Aaron T. Myers added a comment -

          Note also that the recipes included in ZK aren't actually built/packaged, so we'll need to copy/paste the code somewhere into Hadoop and build it ourselves anyway, even if we used the recipe as-is.

          Aaron T. Myers added a comment -

          Per a recommendation from Patrick Hunt, we might also consider taking a look at the Netflix Curator, which includes a leader election recipe as well. It's Apache-licensed.

          Todd Lipcon added a comment -

          Twitter's also got a nice library of ZK stuff. But I think copy-paste is probably easier so we can customize it to our needs and not have to pull in lots of transitive dependencies

          Uma Maheswara Rao G added a comment -

          Ok, Todd, thanks for the clarification.
          ZOOKEEPER-1080 is the one we used for our internal HA implementation. Many cases have been handled based on our experience, testing, and running in production for the last 6 months.
          It also has a state machine implementation like the one you proposed.
          If you have some free time, please go through it once and, if you find it reasonable, we can take some code from there as well.
          Also, I can help in preparing some of the patches.

          Thanks
          Uma

          Todd Lipcon added a comment -

          Great, thanks for the link, Uma. I will be sure to take a look.

          My plan is to finish off the checkpointing work next (HDFS-2291) and then go into a testing cycle for manual failover to make sure everything's robust. Unless we have a robust functional manual failover, automatic failover is just going to add some complication. After we're reasonably confident in the manual operation, we can start in earnest on the ZK-based automatic work. Do you agree?

          (of course it's good to start discussing design for the automatic one in parallel)

          Uma Maheswara Rao G added a comment -

          That's great!
          I completely agree with you on completing manual failover first.
          Ok, let's continue the design discussions in parallel whenever we find the time.

          Suresh Srinivas added a comment -

          Todd, instead of incorporating HDFS-2681 into this, can we finish the ZK library as a part of that JIRA and focus this JIRA on the FailoverController?

          Todd Lipcon added a comment -

          Sure, that makes sense. I'm a little skeptical that the ZK library can be done well entirely in isolation from having something to plug it into... but if it can be, that would certainly work.

          Bikas Saha added a comment -

          I have attached a state diagram for some ideas I had on how this could work. Think of the rectangles as the primary states of the controller. The ovals are actions that need to be taken before changing states. The black arrows are results of those actions, and the blue arrows are external events: notifications that can be received from the ZK leader election library added in HADOOP-7992 and health notifications from the HAServiceProtocol.
          This expects one change in the HAServiceProtocol: split becomeActive() into prepareToBecomeActive() and becomeActive(). prepareToBecomeActive() does the time-consuming heavy lifting, and the world might change by the time it completes. At that point, if the node is still the leader, it can quickly becomeActive(); else it can becomeStandby().
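
          A minimal sketch of the split being proposed (illustrative only; prepareToBecomeActive() is not part of the existing HAServiceProtocol, and the signatures and exception types here are assumptions):

          import java.io.IOException;

          // Hypothetical interface showing the proposed split; not the actual HAServiceProtocol.
          public interface ProposedHAService {
            // Expensive preparation (e.g. catching up state) without yet taking over
            // active duty; the world may change while this runs.
            void prepareToBecomeActive() throws IOException;

            // Cheap final step, called only if this node is still the elected leader
            // once preparation has finished.
            void becomeActive() throws IOException;

            // Fallback if leadership was lost while preparing.
            void becomeStandby() throws IOException;
          }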

          Todd Lipcon added a comment -

          Here's an initial patch on this, with basic functionality working in the context of unit tests. This is dependent on HADOOP-8193. I will also split this into a Common patch and HDFS patch. Just wanted to post something so people could take a look early if they wanted.

          Bikas Saha added a comment -

          It would be really great if a design document were posted that explains the details. That's usually a lot easier to understand (aside from actual white-boarding) than real code. It helps in reading the code if the mental model of the design is built via a document, especially since this is a new component altogether.

          Todd Lipcon added a comment -

          Hi Bikas. The important bits of the code are only ~200 lines. Is there really much value in a detailed design doc? In my opinion, if the code itself isn't clear and self-documenting enough to make the design obvious, then the code needs to be better. If there's anything unclear in the code, please let me know and I'll improve the javadocs and inline comments. A general overview of the design is posted above, though the code has less of a formal state machine approach.

          Todd Lipcon added a comment -

          Renaming JIRA – this will be used for the HDFS portion. I filed HADOOP-8206 for the common portion.

          Todd Lipcon added a comment -

          Attached patch is for the HDFS side after splitting out the common components. It includes a simple unit test which makes sure failover occurs when the NNs shut themselves down.

          I'll continue to add test cases and also run some cluster tests while this (and its prereqs HADOOP-8206 and HADOOP-8212) are under review.

          Todd Lipcon added a comment -

          Attached a brief design doc that explains this from an overall perspective (including the HealthMonitor and ActiveStandbyElector work already committed)

          Bikas Saha added a comment -

          Thanks! I will look at the doc.
          <rambling>I agree that the best documentation is the code. But that is for the developers. Putting up a design document helps people who are not familiar with the code, or who don't have the time to extract the design from the code, to contribute potentially valuable input.
          This is especially true for a new piece of code like this one, where there is not much previous background to rely on.
          Once this is committed, then the code would be the best documentation. Updating docs would be good but is usually not practical.
          I have learned that the more I do to help my reviewers, the better reviews I get. E.g. if I were doing it, in this case, I would also include a diagram that briefly sketches the code/object organization/relationships, with tips on where to start, so that my review process is made easy.</rambling>

          Aaron T. Myers added a comment -

          Patch looks pretty good, Todd. A few small comments:

          1. Is it intentional that you call "super.setConf(...)" in DFSZKFailoverController#setConf with the passed-conf as-is, instead of the slightly mutated localNNConf? If it is intentional, then please move the super.setConf(...) call higher in the method to make that clear.
          2. Is it necessary for DFSZKFailoverController#getLocalTarget to return a new NNHAServiceTarget object each time it is called? If so, please add a comment explaining why. If not, then let's just store a single instance of NNHAServiceTarget instead of the separate conf/nsId/nnId instance variables.
          3. I think you might end up with a race condition if you don't add a "waitForHAState(1, HAServiceState.STANDBY)" in TestDFSZKFailoverController#setup after starting the second ZKFCThread.
          4. A little too much indentation at the end of TestDFSZKFailoverController#setup.
          5. After the call to "cluster.restartNameNode(0)", I think we should assert that the second NN is still active. As-written, the test could succeed if the system failed back to the first NN upon restarting it. This assertion is sort of more a test of the common side of this patch, though, so feel free to punt.
          6. "Test-thread which runs a ZK Failover Controller corresponding to a given dummy service" - it's not a dummy service in this case.
          Todd Lipcon added a comment -

          New patch addresses all of the above except for #3. I don't think there's a race condition here, since the NN itself always starts in standby mode, and the minicluster startup waits until it is fully active. Its associated ZKFC might be in the process of starting, but that's OK, because all of our assertions involve wait-loops until they reach the desired state.

          Aaron T. Myers added a comment -

          I don't think there's a race condition here, since the NN itself always starts in standby mode, and the minicluster startup waits until it is fully active.

          Ah, I thought that MiniDFSCluster might return from waitActive while still in the INITIALIZING state. My mistake. I agree with you, then, that there's no race condition.

          Latest patch looks good to me. +1 pending Jenkins and the commit of the Common patch.

          Todd Lipcon added a comment -

          Tested this briefly on a cluster. In doing so, I made the following changes:

          • had to instantiate HdfsConfiguration rather than Configuration, to ensure hdfs-site.xml got loaded
          • added a "bin/hdfs zkfc" command to start the failover controller.
          Sanjay Radia added a comment -

          Todd, thanks for posting the design doc. I am reading it. I noticed that there isn't a state transition diagram in it. Bikas has posted a state transition diagram - not sure if you are using the same one or another. If not, could you add one please? This part of the system is fairly subtle and a state transition diagram would help.

          Todd Lipcon added a comment -

          Hi Sanjay. I'll work on adding a state machine diagram to the design doc. My graphviz skills are rusty, but should have something by tonight.

          Aaron T. Myers added a comment -

          The latest changes look good to me. +1 pending Jenkins.

          It also occurs to me that this change warrants some updates to some user docs, notably ClusterSetup.apt.vm (to mention HADOOP_ZKFC_OPTS) and HDFSHighAvailability.apt.vm. These changes should be a separate JIRA, though, since those docs are not under hadoop-hdfs-project.

          Todd Lipcon added a comment -

          New revision of the design doc:

          • includes a diagram of the state machine
          • amends the description of the 'ZooKeeper crash' scenario. I had a mistake in my understanding of ZK's behavior before - now clarified.
          • adds mention of the INITIALIZING state per atm's feedback in another JIRA

          I also pushed the latex source for the design doc here:
          https://github.com/toddlipcon/zkfc-design

          Todd Lipcon added a comment -

          Meant to attach the pdf instead of the .tex file.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12520036/hdfs-2185.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.ha.TestDFSZKFailoverController
          org.apache.hadoop.hdfs.TestGetBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2103//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2103//artifact/trunk/hadoop-hdfs-project/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2103//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Canceling patch to address test failure. Looks like it may be a race. I'll loop the test to try to uncover it.

          Todd Lipcon added a comment -

          I think the race we hit was due to HADOOP-8220. Let's get that one committed and then try resubmitting this one. With that bug fixed in my local tree, I was able to loop the new test for a number of iterations without any failures.

          Sanjay Radia added a comment -

          Todd, I am even okay with a scanned diagram of a hand-drawn state machine. Right now I can't get a full understanding of the design without more details.

          Todd Lipcon added a comment -

          Hi Sanjay. There is a state machine in the PDF I uploaded yesterday. Please see section 2.6 of the latest attached PDF.

          Bikas Saha added a comment -

          I am surprised that only an inElection state suffices instead of an inElectionActive and inElectionStandby. Aren't there actions that need to be performed differently when the FC is active or standby?

          There is no state transition dependent on the result of transitionToActive(). If transitionToActive on the NN fails, then the FC should quitElection, IMO. Currently, it quits the election only on HM events. The same goes for transitionToStandby on the NN. If that fails, should we not do something?

          Which state is performing fencing? The state machine does not show fencing. Is it happening in the HAService?

          IMO this state diagram has to be clearer about handling the success/failure of each operation. That is key to determining the robustness of the FC. The FC needs to be super robust by design, right?

          Todd Lipcon added a comment -

          Attached new rev of the design document, which includes some more detail on failure cases.

          I did not include the whole fencing process in the state machine diagram, as it makes it too complex and hard to read. I did include the basic flow of it, though. Please refer to the text in sections 2.4 and 2.5 for more details.

          Todd Lipcon added a comment -

          I am surprised that only an inElection state suffices instead of an inElectionActive and inElectionStandby. Arent there actions that need to be performed differently when the FC is active or standby?

          It's somewhat surprising, but it seems to work fine this way. Do you see any issues?

          There is no state transition dependent on the result of transitionToActive(). If transitionToActive on NN fails then FC should quitElection IMO. Currently, it quits election only on HM events. Same for transitionToStandby on NN. If that fails then should we not do something?

          The new rev of the document adds more explanation of this.

          Which state is performing fencing? The state machine does not show fencing? Is it happening in the the HAService?

          It's a callback from the elector. New rev of the document shows the callback.

          Todd Lipcon added a comment -

          Rebased patch on the tip of HDFS-3042 branch.

          I've been looping the test case for the last 30 minutes with no failures, so I think the earlier failure on QA-bot was due to the HADOOP-8220 issue.

          The only differences in the patch are:

          • addition of the generated protobuf to findbugs exclude
          • addition of a setup() function for the ZK test to mkdir the test directory (necessary on our build environment)

          Since those changes are trivial, and this is now on a branch, I'll commit this to the branch based on Aaron's earlier +1. This will allow us to make continued progress in testing and bugfixing.

          Todd Lipcon added a comment -

          Committed to auto-failover branch (HDFS-3042). Please feel free to continue commenting on the design – either here, or if you prefer, we can move the discussion to the umbrella task HDFS-3042. Apologies for having uploaded the document to this subtask instead of the supertask.

          Bikas Saha added a comment -

          I think you are missing the failure arc when transitionToStandby is called in the InElection state.

          Is there any scope for admin operations in the ZKFC? Will the ZKFC receive and accept a signal (manual admin/auto machine reboot) to stop services? At that point, in the InElection state, how will it know whether it needs to send transitionToStandby or not (based on whether it is active or not)?

          Todd Lipcon added a comment -

          I think you are missing the failure arc when transitionToStandby is called in InElection state.

          This failure handling isn't necessary, since any failure to transition will be shortly followed by the health monitor entering a bad state. This is because a failure in state transition causes the NN to abort.

          Is there any scope for admin operations in ZKFC. Will ZKFC receive and accept a signal (manual admin/auto machine reboot) to stop services? At that point, in InElection state, how will it know that it needs to send transitionToStandby or not (based on whether it is active or not)?

          I just added a section regarding manual failover operation in conjunction with automatic. I was hoping we could merge the automatic work back to trunk prior to adding this feature, though, treating it as an improvement.

          Mingjie Lai added a comment -

          (still posting comments here since the design doc is attached here)

          @todd

          Thanks for adding the manual failover section 2.7 in the design doc.

          However, I have some questions about what you described in 2.7.2:

          • HAAdmin makes an RPC failoverToYou() to the target ZKFC
          • target ZKFC sends an RPC concedeLock() to the currently active ZKFC.
          • the active sends a transitionToStandby() RPC to its local node

          IMO the chain of RPCs is quite complicated and not easy to debug and troubleshoot in operation, because you're trying to solve two problems, auto and manual failover, in one place – the ZKFC.

          How about separating the two cases:

          • add commands at haadmin to start/stop autofailover
          • stop-autofailover requests all ZKFCs to exitElection
          • start-autofailover requests all ZKFCs to enterElection
          • haadmin is responsible for handling manual failover (as in the current implementation)
          • admins can only perform manual failover when autofailover is stopped
          • can be used to specify one particular active NN

          Pros:

          • existing manual fo code can be kept mostly
          • although a new RPC is added to the ZKFC, we don't need the ZKFCs to talk to each other; the manual failover logic is all handled at the client – haadmin.
          • easier to extend to the case of multiple standby NNs

          cons:

          • the administrator needs to explicitly start/stop autofailover, in addition to the ZKFC process.
          Todd Lipcon added a comment -

          Hi Mingjie. Thanks for taking a look.

          The idea for the chain of RPCs is from talking with some folks here who work on Hadoop deployment. Their opinion was the following: currently, most of the Hadoop client tools are too "thick". For example, in the current manual failover implementation, the fencing is run on the admin client. This means that you have to run the haadmin command from a machine that has access to all of the necessary fencing scripts, key files, etc. That's a little bizarre – you would expect to configure these kinds of things only in a central location, not on the client.

          So, we decided that it makes sense to push the management of the whole failover process into the FCs themselves, and just use a single RPC to kick off the whole failover process. This keeps the client "thin".

          As for your proposed alternative, here are a few thoughts:

          existing manual fo code can be kept mostly

          We actually share much of the code already. But, the problem with using the existing code exactly as is, is that the failover controllers always expect to have complete "control" over the system. If the state of the NNs changes underneath the ZKFC, then the state in ZK will become inconsistent with the actual state of the system, and it's very easy to get into split brain scenarios. So, the idea is that, when auto-failover is enabled, all decisions must be made by ZKFCs. That way we can make sure the ZK state doesn't get out of sync.

          although a new RPC is added to the ZKFC, the ZKFCs don't need to talk to each other; the manual failover logic is all handled at the client – haadmin.

          As noted above, I think this is a con, not a pro, because it requires configuring fencing scripts at the client, and likely requires that the client have read-write access to ZK.

          easier to extend to the case of multiple standby NNs

          I think the extension path to multiple standbys is actually equally easy with both approaches. The solution in the ZKFC-managed implementation is to add a new znode like "PreferredActive" and have nodes avoid becoming active unless they're listed as preferred. The target node of the failover can just set itself to be preferred before asking the other node to cede the lock.
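
          A minimal sketch of that idea against the ZooKeeper Java API (the znode path, the data format, and the class/method names here are assumptions; only the "PreferredActive" concept comes from the paragraph above):

          {code:java}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/**
 * Sketch of the "PreferredActive" extension for multiple standbys.
 * The znode path, the data format (plain node id), and the class name are
 * assumptions; only the concept is from the comment above.
 */
class PreferredActiveSketch {
  private static final String PREFERRED_ACTIVE_PATH = "/hadoop-ha/mycluster/PreferredActive";
  private final ZooKeeper zk;      // assumed to be an already-connected session
  private final String myNodeId;   // e.g. "nn1"

  PreferredActiveSketch(ZooKeeper zk, String myNodeId) {
    this.zk = zk;
    this.myNodeId = myNodeId;
  }

  /** Called by a ZKFC after winning the election, before making its local NN active. */
  boolean mayBecomeActive() throws KeeperException, InterruptedException {
    try {
      String preferred = new String(
          zk.getData(PREFERRED_ACTIVE_PATH, false, null), StandardCharsets.UTF_8);
      return myNodeId.equals(preferred);
    } catch (KeeperException.NoNodeException e) {
      return true;  // no preference recorded: any election winner may become active
    }
  }

  /** Called by the failover target before asking the current active to cede the lock. */
  void markSelfPreferred() throws KeeperException, InterruptedException {
    byte[] data = myNodeId.getBytes(StandardCharsets.UTF_8);
    try {
      zk.create(PREFERRED_ACTIVE_PATH, data,
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      zk.setData(PREFERRED_ACTIVE_PATH, data, -1);  // overwrite any existing preference
    }
  }
}
          {code}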

          Some other advantages that I probably didn't explain well in the design doc:

          • this design is fault tolerant. If the "target" node crashes in the middle of the process, then the old active will automatically regain the active state after its "rejoin" timeout elapses. With a client-managed setup, a well-meaning admin may ^C the process in the middle and leave the system with no active at all. (A rough sketch of this rejoin-timeout behavior follows after this list.)
          • no need to introduce "disable/enable" to auto-failover. Just having both nodes quit the election wouldn't work, since one would end up quitting "before" the other, causing a blip where an unnecessary (random) failover occurred. We could carefully orchestrate the order of quitting, so the active quits last, but I think it still gets complicated.
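
          A rough sketch of that rejoin-timeout safety net inside a ZKFC, assuming hypothetical quitElection()/rejoinElection() internals (only the behavior described in the first bullet, i.e. the old active automatically rejoining the election after a fixed timeout, is from the comment above):

          {code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the "cede the lock, but automatically rejoin after a timeout" behavior.
 * Class and method names are hypothetical; the real elector logic is elided.
 */
class CedeWithTimeoutSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  /** Handler for the concedeLock()-style RPC from the failover target. */
  void concedeLock(long rejoinTimeoutMs) {
    quitElection();  // drop the ZK lock so the target ZKFC can grab it
    // Safety net: if the target crashes (or an admin ^C's haadmin) before its NN
    // becomes active, this node simply rejoins the election and can win it again,
    // so the cluster is not left without an active NN indefinitely.
    scheduler.schedule(this::rejoinElection, rejoinTimeoutMs, TimeUnit.MILLISECONDS);
  }

  private void quitElection() {
    // ... delete our ephemeral lock znode / close the elector session ...
  }

  private void rejoinElection() {
    // ... re-create the elector and attempt to reacquire the lock ...
  }
}
          {code}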
          Mingjie Lai added a comment -

          Hi Todd.

          You raised good points regarding deployment.

          My proposal was to reduce the complexity of the ZKFC. Right now a ZKFC needs to monitor the health of a NN and watch ZooKeeper, and in addition it will respond to RPCs from the client and peer ZKFCs. I thought it made sense to shift some responsibilities to the client. However, I'm also okay with your design.

          Please help clarify:

          So, the idea is that, when auto-failover is enabled, all decisions must be made by ZKFCs.

          (from the doc) The NameNodes enter a mode such that they only accept mutative HAServiceProtocol RPCs from ZKFCs

          I'm not quite sure how this can be guaranteed. The NN cannot be aware of who issues a transition, right?

          no need to introduce "disable/enable" to auto-failover...

          I still think it makes sense for ops to have an option to turn auto-failover on/off on demand. In case of ZKFC issues, we would still have an alternative way to bypass it. However, I'm not sure whether it would help ops or confuse them.

          Todd Lipcon added a comment -

          I'm not quite sure how this can be guaranteed. The NN cannot be aware of who issues a transition, right?

          My plan was to add an enum flag to the RPCs like transitionToActive and transitionToStandby that would indicate who sent it: for example, "CLI_FAILOVER", "ZKFC_FAILOVER", or "FORCE". The force option would be there so that if the admin really knows what he/she is doing, they could override the safety check. Otherwise the haadmin commands can prevent users from accidentally shooting themselves in the foot.
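
          For illustration, a sketch of what that check might look like on the NN side (the enum values are the ones mentioned above; the RequestSource name, the modified RPC signatures, and the guard itself are assumptions, not the committed API):

          {code:java}
import java.io.IOException;

/** Who initiated an HA state transition (values from the comment above; name assumed). */
enum RequestSource {
  CLI_FAILOVER,   // manual failover driven by haadmin
  ZKFC_FAILOVER,  // automatic failover driven by a ZKFC
  FORCE           // admin explicitly overriding the safety check
}

/** Sketch of the NN-side guard when auto-failover is enabled. */
class HAStateGuardSketch {
  private final boolean autoFailoverEnabled;

  HAStateGuardSketch(boolean autoFailoverEnabled) {
    this.autoFailoverEnabled = autoFailoverEnabled;
  }

  void transitionToActive(RequestSource source) throws IOException {
    checkSourceAllowed(source);
    // ... actual state transition ...
  }

  void transitionToStandby(RequestSource source) throws IOException {
    checkSourceAllowed(source);
    // ... actual state transition ...
  }

  private void checkSourceAllowed(RequestSource source) throws IOException {
    // With auto-failover on, only the ZKFC (or an explicit FORCE) may change state,
    // so haadmin can't silently make the NN state diverge from what's in ZK.
    if (autoFailoverEnabled
        && source != RequestSource.ZKFC_FAILOVER
        && source != RequestSource.FORCE) {
      throw new IOException(
          "Manual HA state management is disallowed while automatic failover is enabled");
    }
  }
}
          {code}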

          I still think it makes sense for ops to have an option to turn auto-failover on/off on demand. In case of ZKFC issues, we would still have an alternative way to bypass it. However, I'm not sure whether it would help ops or confuse them.

          That's a good point - it's useful for emergency situations. I think we can solve this with docs, though: if you want to stop automatic failovers, you need to first shut down the standby ZKFCs, then the active ZKFC. If you bring them down in the other order, it won't break things, but you might get a failover in the process. I think adding a programmatic way to do this is a future improvement.


            People

            • Assignee: Todd Lipcon
            • Reporter: Eli Collins
            • Votes: 0
            • Watchers: 23
