Hadoop Common
HADOOP-8279

Auto-HA: Allow manual failover to be invoked from zkfc.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Auto Failover (HDFS-3042)
    • Fix Version/s: Auto Failover (HDFS-3042)
    • Component/s: auto-failover, ha
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      HADOOP-8247 introduced a configuration flag to prevent potential state inconsistency between the ZKFC and the NameNode, by making automatic and manual failover mutually exclusive.

      However, as described in section 2.7.2 of the design doc at HDFS-2185, we should allow manual and automatic failover to coexist by:

      • adding RPC interfaces to the ZKFC
      • having manual failover triggered by haadmin and handled by the ZKFC when automatic failover is enabled
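      The two bullets above amount to a small admin-facing RPC surface on the ZKFC. The committed interface is ZKFCProtocol; the sketch below uses method names matching the discussion in this JIRA, but the signatures are illustrative, not authoritative.

```java
// Hedged sketch of the ZKFC-side RPC surface the description calls for.
// Method names follow this JIRA's discussion; signatures are assumptions.
public class ZkfcRpcSketch {
    interface ZkfcAdmin {
        // Ask this ZKFC to quit the leader election for roughly the given
        // period, so a manual failover can proceed.
        void cedeActive(int millisToCede);

        // Ask this ZKFC to coordinate a graceful failover to its own node.
        void gracefulFailover();
    }

    public static void main(String[] args) {
        // Trivial in-memory stand-in, just to show the call shape that
        // haadmin would drive over RPC when auto failover is enabled.
        ZkfcAdmin zkfc = new ZkfcAdmin() {
            public void cedeActive(int millisToCede) {
                System.out.println("ceding active role for " + millisToCede + " ms");
            }
            public void gracefulFailover() {
                System.out.println("initiating graceful failover");
            }
        };
        zkfc.cedeActive(5000);
        zkfc.gracefulFailover();
    }
}
```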
      Attachments

      1. hadoop-8279.txt
        82 kB
        Todd Lipcon
      2. hadoop-8279.txt
        82 kB
        Todd Lipcon
      3. hadoop-8279.txt
        81 kB
        Todd Lipcon
      4. hadoop-8279.txt
        76 kB
        Todd Lipcon
      5. hadoop-8279.txt
        68 kB
        Todd Lipcon

        Issue Links

          Activity

          Todd Lipcon added a comment -

          Committed to branch, thanks Aaron.

          Aaron T. Myers added a comment -

          +1, the updated patch looks good to me. Thanks a lot for addressing my feedback, Todd.

          Todd Lipcon added a comment -

          "-forceFence doesn't seem to have any real use cases with auto-HA so it isn't implemented." - I don't follow the reasoning. Seems like it should be just as applicable to auto-HA as manual, no?

          I chatted with Eli about this, since he's the one who originally added the "-forceFence" option. The original motivation was to test the fencing script, but with this "manual failover" that's probably not the best way to test it. Better would be to do something like kill -STOP the active NN, which will both trigger a failover and trigger fencing. Another option might be to create a new command like "-testFencer" which would (after requiring confirmation) shoot down the active. But since it's a corner case let's address as a follow-up improvement.

          "If the attempt to transition to standby succeeds, then the ZKFC will delete the breadcrumb node in ZooKeeper" - might want to specify which ZKFC will do the deletion.

          Changed to:

             * If the attempt to transition to standby succeeds, then the ZKFC receiving
             * this RPC will delete its own breadcrumb node in ZooKeeper. Thus, the
             * next node to become active will not run any fencing process. Otherwise,
             * the breadcrumb will be left, such that the next active will fence this
             * node.
          

          "If the node is healthy and not active, it sends an RPC to the current active, asking it to yield from the election." - it actually sends an RPC to the ZKFC associated with the current active.

          I actually removed the details here in ZKFCProtocol.java, electing instead to refer the reader to the implementation. I think it's better for the ZKFCProtocol javadocs to explain the "outward" behavior, and explain the actual implementation in the design doc and the inline comments in ZKFailoverController. It now reads:

             * If the node is healthy and not active, it will try to initiate a graceful
             * failover to become active, returning only when it has successfully become
             * active. See {@link ZKFailoverController#gracefulFailoverToYou()} for the
             * implementation details.
          

          "if the current active does not respond to the graceful request, throws an exception indicating the reason for failure." - I recommend you make it explicit which graceful request this is referring to. In fact, if the active NN fails to respond to the graceful request to transition to standby, it will be fenced. It's the failure of the active ZKFC to respond to the cedeActive calls that results in a failure of gracefulFailover.

          Per above, I changed this to only reference what a caller needs to know, instead of the underlying implementation.

             * If the node fails to successfully coordinate the failover, throws an
             * exception indicating the reason for failure.
          

          I think you need interface annotations on ZKFCRpcServer, or perhaps it can be made package-private?

          Good catch. It can't be package-private because DFSZKFailoverController is in an HDFS package. I annotated it LimitedPrivate to HDFS.

          In ZKFCProtocol#cedeActive you declare the parameter to be in millis, but in the ZKFCRpcServer#cedeActive implementation, you say the period is in seconds.

          Another good catch - I changed this late in the development of the patch and missed a spot. Fixed.
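          The fix above is a classic unit-mismatch bug. One defensive sketch (hypothetical names; not the actual ZKFCProtocol signature) is to normalize every caller-supplied period through an explicit TimeUnit so the unit is visible at the call site:

```java
import java.util.concurrent.TimeUnit;

public class CedeActiveUnits {
    // Illustrative only: making the unit explicit avoids the
    // millis-vs-seconds mismatch caught in review above.
    static long normalizeToMillis(long period, TimeUnit unit) {
        return unit.toMillis(period);
    }

    public static void main(String[] args) {
        // A caller thinking in seconds and one thinking in millis agree
        // once both normalize explicitly.
        System.out.println(normalizeToMillis(10, TimeUnit.SECONDS));        // 10000
        System.out.println(normalizeToMillis(10_000, TimeUnit.MILLISECONDS)); // 10000
    }
}
```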

          I don't see much point in having both ZKFCRpcServer#stop and ZKFCRpcServer#join. Why not just call this.server.join in ZKFCRpcServer#stop?

          Combined the two into a single stopAndJoin method.
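          The merged method follows a common shutdown pattern: one call that both signals the server to stop and waits for it to exit, so callers cannot forget the join. A minimal stand-alone sketch (a plain Thread stands in for the real Hadoop ipc.Server that ZKFCRpcServer wraps):

```java
public class StopAndJoinSketch {
    // Illustrative stand-in for the RPC server thread.
    private final Thread serverThread = new Thread(() -> {
        try {
            Thread.sleep(Long.MAX_VALUE); // "serve" until asked to stop
        } catch (InterruptedException ignored) {
            // fall through and exit
        }
    });

    void start() {
        serverThread.start();
    }

    // Signal shutdown and wait for completion in one call.
    void stopAndJoin() throws InterruptedException {
        serverThread.interrupt(); // ask the server to stop
        serverThread.join();      // wait until it has fully exited
    }

    public static void main(String[] args) throws InterruptedException {
        StopAndJoinSketch s = new StopAndJoinSketch();
        s.start();
        s.stopAndJoin();
        System.out.println("stopped: " + !s.serverThread.isAlive());
    }
}
```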

          "periodically check health state since, because entering an" - doesn't quite parse.

          Fixed.

          I think the log message about the timeout elapsing in ZKFailoverController#waitForActiveAttempt should probably be at least at WARN level instead of INFO.

          Fixed.

          "It's possible that it's in standby but just about to go into active, no? Is there some race here?" - should this comment now be removed?

          This comment is basically about the situation described in HADOOP-8217, so it's still relevant.

          I recommend you change the value of DFS_HA_ZKFC_PORT_DEFAULT to something other than 8021. I've seen a lot of JTs in the wild with their default port set to 8021.

          Good point... I changed it to 8019.
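          For operators who need a different port, the value can be overridden in hdfs-site.xml. The key shown here, dfs.ha.zkfc.port, is inferred from the DFS_HA_ZKFC_PORT_DEFAULT constant discussed above; treat the exact name as an assumption and check the constant if in doubt.

```xml
<!-- hdfs-site.xml: override the ZKFC RPC port if 8019 conflicts locally -->
<property>
  <name>dfs.ha.zkfc.port</name>
  <value>8019</value>
</property>
```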

          The design in the document posted to HDFS-2185 mentions introducing "-to" and "-from" parameters to the `haadmin -failover' command, but this implementation doesn't do that. That seems fine by me, but I'm curious why you chose to do it this way.

          I ended up not changing it just to keep the syntax consistent with what we've already got and avoid making this patch even longer. Let's discuss in a followup JIRA if we want to change the syntax for this command.

          Todd Lipcon added a comment -

          Hari: yes, it gracefully transitions the NN to standby state, and doesn't cause fencing. Fencing only results if the previous active has crashed (i.e. is not responding to the request). Please refer to the design doc referenced in the description of the JIRA.

          Hari Mankude added a comment -

          Todd,

          Is the feature going to switch the active NN to standby state or will it result in active NN getting fenced and hence going away?

          thanks

          Aaron T. Myers added a comment -

          Patch looks pretty good to me, Todd. A few little comments:

          1. "-forceFence doesn't seem to have any real use cases with auto-HA so it isn't implemented." - I don't follow the reasoning. Seems like it should be just as applicable to auto-HA as manual, no?
          2. "If the attempt to transition to standby succeeds, then the ZKFC will delete the breadcrumb node in ZooKeeper" - might want to specify which ZKFC will do the deletion.
          3. "If the node is healthy and not active, it sends an RPC to the current active, asking it to yield from the election." - it actually sends an RPC to the ZKFC associated with the current active.
          4. "if the current active does not respond to the graceful request, throws an exception indicating the reason for failure." - I recommend you make it explicit which graceful request this is referring to. In fact, if the active NN fails to respond to the graceful request to transition to standby, it will be fenced. It's the failure of the active ZKFC to respond to the cedeActive calls that results in a failure of gracefulFailover.
          5. I think you need interface annotations on ZKFCRpcServer, or perhaps it can be made package-private?
          6. In ZKFCProtocol#cedeActive you declare the parameter to be in millis, but in the ZKFCRpcServer#cedeActive implementation, you say the period is in seconds.
          7. I don't see much point in having both ZKFCRpcServer#stop and ZKFCRpcServer#join. Why not just call this.server.join in ZKFCRpcServer#stop?
          8. "periodically check health state since, because entering an" - doesn't quite parse.
          9. I think the log message about the timeout elapsing in ZKFailoverController#waitForActiveAttempt should probably be at least at WARN level instead of INFO.
          10. "It's possible that it's in standby but just about to go into active, no? Is there some race here?" - should this comment now be removed?
          11. I recommend you change the value of DFS_HA_ZKFC_PORT_DEFAULT to something other than 8021. I've seen a lot of JTs in the wild with their default port set to 8021.
          12. The design in the document posted to HDFS-2185 mentions introducing "-to" and "-from" parameters to the `haadmin -failover' command, but this implementation doesn't do that. That seems fine by me, but I'm curious why you chose to do it this way.
          Todd Lipcon added a comment -

          Updated patch against tip of branch. Also amended the docs to reflect this improvement.

          Todd Lipcon added a comment -

          This rev should be ready for review. I did a bunch more testing and cleaned up the code with javadocs, comments, etc. I also manually tested with security, ACLs, etc., and verified everything worked as expected. I also ran the zkfc manually through jcarder while doing these tests and verified no lock cycles.

          Todd Lipcon added a comment -

          New rev fixes a number of issues:

          • when the ZKFC makes RPCs to the NN, it has to switch to its own UGI, so it has the right krb credentials
          • add a service ACL for ZKFCProtocol
          • fix a potential deadlock with lock inversion in waitForActiveAttempt
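          One common way to avoid that kind of lock inversion (a hedged sketch, with hypothetical names; the real waitForActiveAttempt lives in ZKFailoverController) is to give the wait its own monitor object, so the waiting thread never blocks while holding the controller's main lock:

```java
public class WaitSketch {
    // Dedicated monitor: waiting here cannot deadlock against a thread
    // that holds some other, unrelated lock in the controller.
    private final Object activeAttemptMonitor = new Object();
    private boolean attemptRecorded = false;

    void recordActiveAttempt() {
        synchronized (activeAttemptMonitor) {
            attemptRecorded = true;
            activeAttemptMonitor.notifyAll();
        }
    }

    boolean waitForActiveAttempt(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (activeAttemptMonitor) {
            while (!attemptRecorded) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // timed out; caller should log at WARN
                }
                activeAttemptMonitor.wait(remaining);
            }
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        WaitSketch s = new WaitSketch();
        new Thread(s::recordActiveAttempt).start();
        System.out.println(s.waitForActiveAttempt(5000));
    }
}
```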

          I did some testing by running a pseudo-distributed secure cluster, and doing a failover back and forth in a while loop from another shell. Ran a couple hundred of these successfully. Also tested behavior when one of the NNs was down, etc.

          I also ran one of the zkfcs under jcarder to look for potential deadlocks (it only found the one mentioned above).

          Will continue testing this tomorrow, and also do some code cleanup. Mingjie, if you have a chance to test this, would love to hear how it works for you.

          Todd Lipcon added a comment -

          Here's a preliminary patch for this, implementing the design as described in section 2.7.2 of the design doc.

          I'll do another rev hopefully tomorrow which includes better javadoc, some more tests, and some manual verification.

          The patch looks long but a lot of it is boilerplate (protocol translators) and test code.

          Todd Lipcon added a comment -

          Thanks for filing this, Mingjie. I plan to work on it in the coming weeks.


            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Mingjie Lai
    • Votes:
      0
      Watchers:
      5

              Dates

              • Created:
                Updated:
                Resolved:

                Development