Hadoop HDFS / HDFS-2949

HA: Add check to active state transition to prevent operator-induced split brain

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.24.0
    • Fix Version/s: None
    • Component/s: ha, namenode
    • Labels:
      None
    • Target Version/s:

      Description

      Currently, if the administrator mistakenly calls "-transitionToActive" on one NN while the other one is still active, all hell will break loose. We can add a simple check by having the NN make a getServiceState() RPC to its peer with a short (~1 second?) timeout. If the RPC succeeds and indicates the other node is active, it should refuse to enter active mode. If the RPC fails or indicates standby, it can proceed.

      This is just meant as a preventative safety check - we still expect users to use the "-failover" command which has other checks plus fencing built in.
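
The check described above could be sketched roughly as follows. This is a minimal Python illustration of the decision logic only, not NameNode code: `ServiceState`, `may_become_active`, and the `get_peer_state` callable are hypothetical names standing in for the getServiceState() RPC to the peer NN, and the 1-second default simply mirrors the "~1 second?" suggestion.

```python
import concurrent.futures
import enum


class ServiceState(enum.Enum):
    """Hypothetical stand-in for the HA states a peer NN can report."""
    ACTIVE = "active"
    STANDBY = "standby"


def may_become_active(get_peer_state, timeout_s=1.0):
    """Return True if this node may proceed with the transition to active.

    get_peer_state is a hypothetical zero-argument callable that stands in
    for a getServiceState() RPC to the peer and returns a ServiceState.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(get_peer_state)
        try:
            # Bound the wait so an unreachable peer cannot block the operator.
            peer_state = future.result(timeout=timeout_s)
        except Exception:
            # RPC failed or timed out: the peer state is unknown, so proceed.
            return True
    finally:
        # Do not wait for a hung call; let the worker thread finish on its own.
        pool.shutdown(wait=False)
    # Refuse only when the peer positively reports that it is active.
    return peer_state is not ServiceState.ACTIVE
```

Because a failed or timed-out call lets the transition proceed, a check like this only guards against the operator mistake; it does not replace fencing when the peer is unreachable.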

        Issue Links

          Activity

          Kihwal Lee made changes -
          Target Version/s: 0.24.0 [ 12317653 ] -> 2.5.0 [ 12326264 ]
          Kihwal Lee made changes -
          Assignee: (none) -> Kihwal Lee [ kihwal ]
          Kihwal Lee made changes -
          Link: (none) -> This issue duplicates HDFS-6203 [ HDFS-6203 ]
          Todd Lipcon added a comment -

          Another safety check here is to make sure that the transaction IDs match between the nodes before going active.

          Aaron T. Myers made changes -
          Parent: HDFS-1623 [ 12498318 ] -> (none)
          Issue Type: Sub-task [ 7 ] -> Improvement [ 4 ]
          Aaron T. Myers made changes -
          Field: Original Value -> New Value
          Affects Version/s: (none) -> 0.24.0 [ 12317653 ]
          Affects Version/s: HA branch (HDFS-1623) [ 12317568 ] -> (none)
          Target Version/s: HA branch (HDFS-1623) [ 12317568 ] -> 0.24.0 [ 12317653 ]
          Aaron T. Myers added a comment -

          Converting to top-level issue with commit of HDFS-1623.

          Todd Lipcon added a comment -

          Yep, this is not supposed to solve issues, just to prevent a mistake in the common case. Fencing is the correct answer to prevent split brain in the general case.

          Asking for confirmation might be a nice improvement as well, so long as there's a --force option.

          Uma Maheswara Rao G added a comment -

          "That said, having the safety check described in this JIRA is still valuable,"

          I agree with adding safety checks. But this cannot prevent 100% of split-brain scenarios, right? (For example: a brief network break between the active and standby, and the admin accidentally runs -transitionToActive on the standby.) I think this will be addressed in the future as part of automatic failover and shared storage fencing. But when admins work directly with the command line for maintenance purposes, this case may still occur, right?
          Also, for the transitionTo* APIs, do we need to ask the user for confirmation before actually transitioning? That would make the admin pay more attention before proceeding.

          Todd Lipcon added a comment -

          I think we should probably un-document the transitionTo* commands, but leave them as a safety valve. It's nice to have direct access to these RPCs just in case there's some problem with one of the safer methods and you need a workaround without recompiling the client.

          That said, having the safety check described in this JIRA is still valuable, even when using haadmin -failover, in case the admin has a messed-up configuration in some way (e.g. the fencing script returns true but did not in fact fence the standby correctly).

          Hari Mankude added a comment -

          If the -failover command can handle this situation and other situations correctly, why not deprecate -transitionToActive entirely?

          Todd Lipcon created issue -

            People

            • Assignee:
              Kihwal Lee
              Reporter:
              Todd Lipcon
            • Votes:
              0
              Watchers:
              6

              Dates

              • Created:
                Updated:
