  1. Hadoop Common
  2. HADOOP-8217

Edge case split-brain race in ZK-based auto-failover

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.24.0
    • Fix Version/s: None
    • Component/s: auto-failover, ha
    • Labels:
      None

      Description

      As discussed in HADOOP-8206, the current design for automatic failover has the following race:

      • ZKFC1 gets the active lock
      • ZKFC1 is about to send transitionToActive() to NN1 when its machine freezes (e.g., GC pause plus swapping)
      • ZKFC1's ZK session expires and it loses the lock; ZKFC2 gets the lock
      • ZKFC2 calls transitionToStandby() on NN1 and transitions NN2 to active
      • ZKFC1 wakes up from the pause and calls transitionToActive() on NN1, so both NN1 and NN2 are now active (split brain)

      This is rare, since it requires ZKFC1 to freeze for longer than its ZK session timeout, but it is worth fixing because the results can be disastrous.
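
The interleaving above can be replayed deterministically. The following is a minimal sketch, not Hadoop code: `SplitBrainRaceSketch`, `NameNode`, `ActiveLock`, and `replay` are hypothetical stand-ins invented for illustration, and the "freeze" is modelled simply by deferring ZKFC1's pending call until after ZKFC2 has acted. It assumes, as the description does, that a NameNode obeys whatever transition request it receives.

```java
/**
 * Deterministic replay of the race described in this issue.
 * Every class and method here is a hypothetical stand-in, not a Hadoop API.
 */
public class SplitBrainRaceSketch {

    enum HAState { STANDBY, ACTIVE }

    /** Stand-in NameNode that obeys any transition request it receives. */
    static final class NameNode {
        final String name;
        HAState state = HAState.STANDBY;

        NameNode(String name) { this.name = name; }

        void transitionToActive()  { state = HAState.ACTIVE; }
        void transitionToStandby() { state = HAState.STANDBY; }
    }

    /** Stand-in for the ZK active lock: at most one holder at a time. */
    static final class ActiveLock {
        String holder; // null while nobody holds the lock

        boolean tryAcquire(String who) {
            if (holder != null) {
                return false;
            }
            holder = who;
            return true;
        }

        /** Models ZK dropping the lock when the holder's session times out. */
        void expireSession(String who) {
            if (who.equals(holder)) {
                holder = null;
            }
        }
    }

    /** Replays the five steps in order and returns the final lock holder. */
    static String replay(NameNode nn1, NameNode nn2) {
        ActiveLock lock = new ActiveLock();

        // 1. ZKFC1 gets the active lock.
        boolean zkfc1Won = lock.tryAcquire("ZKFC1");

        // 2. ZKFC1 has decided to activate NN1 but freezes before sending
        //    the call; the decision is stale from this point on.
        boolean zkfc1PendingActivate = zkfc1Won;

        // 3. ZKFC1's session expires during the freeze; ZKFC2 gets the lock.
        lock.expireSession("ZKFC1");
        boolean zkfc2Won = lock.tryAcquire("ZKFC2");

        // 4. ZKFC2 puts NN1 in standby and makes NN2 active.
        if (zkfc2Won) {
            nn1.transitionToStandby();
            nn2.transitionToActive();
        }

        // 5. ZKFC1 wakes up and sends the call it decided on in step 2,
        //    without noticing that it no longer holds the lock.
        if (zkfc1PendingActivate) {
            nn1.transitionToActive();
        }
        return lock.holder;
    }

    public static void main(String[] args) {
        NameNode nn1 = new NameNode("NN1");
        NameNode nn2 = new NameNode("NN2");
        String holder = replay(nn1, nn2);
        System.out.println("lock holder: " + holder);
        System.out.println(nn1.name + " = " + nn1.state);
        System.out.println(nn2.name + " = " + nn2.state);
    }
}
```

In this replay the lock itself is never held by two parties at once; the problem is that ZKFC1 acts on a decision made while it held the lock, after the lock has already moved to ZKFC2, which leaves both NameNodes in the ACTIVE state.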

              People

              • Assignee: Todd Lipcon (tlipcon)
              • Reporter: Todd Lipcon (tlipcon)
              • Votes: 0
              • Watchers: 10
