Hadoop Common / HADOOP-8217

Edge case split-brain race in ZK-based auto-failover


    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.24.0
    • Fix Version/s: None
    • Component/s: auto-failover, ha
    • Labels:


      As discussed in HADOOP-8206, the current design for automatic failover has the following race:

      • ZKFC1 acquires the active lock in ZooKeeper.
      • ZKFC1 is about to send transitionToActive() to NN1 when its machine freezes (e.g. a GC pause combined with swapping).
      • ZKFC1's ZK session expires and it loses the lock; ZKFC2 acquires the lock.
      • ZKFC2 calls transitionToStandby() on NN1 and transitions NN2 to active.
      • ZKFC1 wakes up from the pause and calls transitionToActive() on NN1. Both NN1 and NN2 are now active (split brain).

      The race is rare, since it requires ZKFC1 to freeze for longer than its ZK session timeout, but it is worth fixing because the results can be disastrous.
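
      The interleaving above can be replayed deterministically in a few lines. The following is only an illustrative sketch, not code from Hadoop or from the attached test case: MockNameNode, MockLock, and replayRace() are hypothetical names, and the ZooKeeper lock is reduced to a single field recording the current holder. It shows that a controller which acts on a decision made before the pause, without re-checking lock ownership, leaves both NameNodes active.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBrainRaceSketch {

    /** Hypothetical stand-in for a NameNode's HA state; not a Hadoop class. */
    static class MockNameNode {
        final String name;
        boolean active = false;

        MockNameNode(String name) {
            this.name = name;
        }

        void transitionToActive() {
            active = true;
        }

        void transitionToStandby() {
            active = false;
        }
    }

    /** Hypothetical stand-in for the ZK active lock: only records the holder. */
    static class MockLock {
        String holder;
    }

    /** Replays the five steps in order and returns the NameNodes left active. */
    static List<String> replayRace() {
        MockNameNode nn1 = new MockNameNode("NN1");
        MockNameNode nn2 = new MockNameNode("NN2");
        MockLock lock = new MockLock();

        // Step 1: ZKFC1 acquires the active lock.
        lock.holder = "ZKFC1";

        // Step 2: ZKFC1 decides to activate NN1, then freezes before sending the call.
        boolean zkfc1PendingActivation = true;

        // Step 3: ZKFC1's session times out during the freeze; ZKFC2 takes the lock.
        lock.holder = "ZKFC2";

        // Step 4: ZKFC2 demotes NN1 and promotes NN2.
        nn1.transitionToStandby();
        nn2.transitionToActive();

        // Step 5: ZKFC1 wakes up and sends the stale call without re-checking
        // lock.holder, which is the bug this sketch illustrates.
        if (zkfc1PendingActivation) {
            nn1.transitionToActive();
        }

        List<String> activeNodes = new ArrayList<>();
        if (nn1.active) {
            activeNodes.add(nn1.name);
        }
        if (nn2.active) {
            activeNodes.add(nn2.name);
        }
        return activeNodes;
    }

    public static void main(String[] args) {
        // Prints "Active NameNodes: [NN1, NN2]"
        System.out.println("Active NameNodes: " + replayRace());
    }
}
```

      Note that re-checking lock.holder in step 5 would narrow the window in this sketch but would not close it in a real deployment, because the freeze could just as well occur between the check and the RPC.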



          Todd Lipcon created issue
          Todd Lipcon made changes
          Field | Original Value | New Value
          Link | | This issue relates to HADOOP-8206 [ HADOOP-8206 ]
          Todd Lipcon made changes
          Field | Original Value | New Value
          Link | | This issue relates to HDFS-3042 [ HDFS-3042 ]
          Todd Lipcon made changes
          Field | Original Value | New Value
          Target Version/s | 0.24.0, 0.23.3 [ 12317652, 12320059 ] | Auto Failover (HDFS-3042) [ 12320350 ]
          Component/s | | auto-failover [ 12317908 ]
          Todd Lipcon made changes
          Field | Original Value | New Value
          Attachment | | hadoop-8217-testcase.txt [ 12520570 ]


            • Assignee: Todd Lipcon
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 9


              • Created: