Here's a design sketch – I have done only a little implementation so far, nothing really fleshed out, so it may change a bit during the course of implementation. But feedback on the general approach would be appreciated!
- Ensure that only a single NN can be active at a time.
- Use ZK as a lock manager to satisfy this requirement.
- Perform health monitoring of the active NN to trigger a fail-over should it become unhealthy.
- Automatically fail over in the case that one of the hosts fails (e.g. a power or network outage)
- Allow manual (administratively initiated) graceful failover
- Initiate fencing of the previously active NN in the case of non-graceful failovers.
The ZooKeeper FailoverController (ZKFC) is a separate process/program which runs next to each of the HA NameNodes in the cluster. It does not directly spawn/supervise the NN JVM process, but rather runs on the same machine and communicates with it via localhost RPC.
The ZKFC is designed to be as simple as possible, to decrease the likelihood of bugs which might trigger a false fail-over. It is also designed to use only a very small amount of memory, so that it never incurs lengthy GC pauses. This allows us to set a fairly low timeout on the ZK session in order to detect machine failures quickly.
The ZKFC needs the following pieces of configuration:
- list of zookeeper servers making up the ZK quorum (fail to start if this is not provided)
- host/port for the HAServiceProtocol of the local NN (defaults to localhost:<well-known port>)
- "base znode" at which to root all of the znodes used by the process
Nodes in ZK:
Everything should be rooted at the configured base znode. Within that, there should be a znode per nameservice ID. Within this /base/nameserviceId/ directory, there are the following znodes:
- activeLock - an ephemeral node taken by the ZKFC before it asks its local NN to become active. This acts as a mutex on the active state and also as a failure detector.
- activeNodeInfo - a non-ephemeral node written by the ZKFC after it succeeds in taking activeLock. This should have data like the IPC address, HTTP address, etc of the NN.
The activeNodeInfo is non-ephemeral so that, when a new NN takes over from a failed one, it has enough information to fence the previous active in case it's still actually running.
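To make the intended use of these two znodes concrete, here is a rough Python sketch of the become-active sequence. The real implementation would be Java against the ZooKeeper client API; `FakeZK`, `try_become_active`, and `fence` are invented here purely for illustration, with an in-memory dict standing in for the ZK ensemble:

```python
class FakeZK:
    """Toy in-memory stand-in for a ZK ensemble: path -> (data, ephemeral)."""
    def __init__(self):
        self.nodes = {}

    def create(self, path, data, ephemeral=False):
        if path in self.nodes:
            # analogous to KeeperException.NodeExists
            raise KeyError("node exists: " + path)
        self.nodes[path] = (data, ephemeral)

    def exists(self, path):
        return path in self.nodes

    def get(self, path):
        return self.nodes[path][0]

    def delete(self, path):
        del self.nodes[path]


def try_become_active(zk, base, my_info, fence):
    """Attempt the lock/fence/record sequence described above.

    Returns True if we may now tell the local NN to transitionToActive."""
    lock = base + "/activeLock"
    info = base + "/activeNodeInfo"
    try:
        zk.create(lock, my_info, ephemeral=True)  # mutex on active state
    except KeyError:
        return False                              # someone else holds the lock
    if zk.exists(info):
        # The previous active didn't shut down gracefully; fence it first,
        # using the contact info (IPC address, etc.) that it recorded.
        if not fence(zk.get(info)):
            zk.delete(lock)
            return False
        zk.delete(info)
    zk.create(info, my_info)  # non-ephemeral record of who is active
    return True
```

A second controller calling `try_become_active` while the lock is held simply gets `False` back and remains in standby; a controller that wins the lock after a crash finds the stale activeNodeInfo node and must fence before overwriting it.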
Runtime operation states
For simplicity of testing, we can model the ZKFC as a state machine.
- LOCAL_NOT_READY: The NN on the local host is down or not responding to RPCs. We start in this state.
- LOCAL_STANDBY: The NN on the local host is in standby mode and ready to automatically transition to active if the current active dies.
- LOCAL_ACTIVE: The NN on the local host is running and performing active duty.
Inputs into state machine
Three other classes interact with the state machine:
A ZK thread connects to ZK and watches for the following events:
- The previously active master has lost its ephemeral node
- The ZK session is lost
A user-initiated failover request: by some means (RPC/signal/HTTP/etc.) the user can request that the active NN's FC gracefully turn over the active state to a different NN.
A HealthMonitor thread heartbeats continuously to the local NN. It provides an event whenever the health state of the NN changes. For example:
- NN has become unhealthy
- Lost contact with NN
- NN is now healthy
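The inputs above form a small, closed set of events fed into the state machine. A sketch of how they might be modeled (the event names are invented here, not proposed identifiers):

```python
from enum import Enum, auto

class Event(Enum):
    # From the HealthMonitor thread
    HEALTH_OK = auto()          # NN is (or became) healthy
    HEALTH_BAD = auto()         # NN reported itself unhealthy
    HEALTH_LOST = auto()        # lost contact with the local NN
    # From the ZK thread
    ZK_ACTIVE_DELETED = auto()  # previous active's ephemeral node disappeared
    ZK_SESSION_LOST = auto()    # our own ZK session expired
    # From the administrator
    ADMIN_FAILOVER = auto()     # graceful failover requested
```

Keeping the event set this small is part of what makes the ZKFC easy to test: each (state, event) pair can be enumerated and checked exhaustively.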
Behavior of state machine
In the LOCAL_NOT_READY state:
- System starts here.
- When the HealthMonitor indicates the local NN is healthy:
  - Transition to LOCAL_STANDBY.

In the LOCAL_STANDBY state:
- On health state change:
  - Transition to LOCAL_NOT_READY if the local NN goes down.
- On ZK state change:
  - If the old ZK "active" node was deleted, try to initiate automatic failover.
  - If our own ZK session died, reconnect to ZK.
- To initiate failover:
  - Try to create the "activeLock" ephemeral node in ZK.
  - If we are unsuccessful, return to LOCAL_STANDBY.
  - See if there is an "activeNodeInfo" node in ZK. If so:
    - The old NN may still be running (it didn't gracefully shut down).
    - Initiate the fencing process.
    - If fencing is successful, delete the "activeNodeInfo" node in ZK.
  - Create an "activeNodeInfo" node with our own information (i.e. NN IPC address, etc.).
  - Send an IPC to the local NN to transitionToActive. If successful, go to LOCAL_ACTIVE.

In the LOCAL_ACTIVE state:
- On health state change to unhealthy:
  - Delete our activeLock znode and go to LOCAL_NOT_READY. Another node will fence us.
- On ZK connection loss, or on noticing our znode was deleted:
  - Another process is probably about to fence us... unless all of the nodes lost their connections, in which case we should "stay the course" and remain active here. This case still needs to be figured out.
- On an administrative request to fail over:
  - IPC to the local NN to transition to standby.
  - Delete our activeLock znode.
  - Transition to LOCAL_STANDBY with a timeout set, allowing another failover controller to take the lock.
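Putting the behavior above together, the core transitions can be sketched as a lookup table. This is illustrative pseudologic only: ZK side effects (taking/releasing the lock, fencing, NN IPCs) are elided, and the lock-contention outcome is collapsed into invented "won/lost" events:

```python
# States and events as strings to keep the sketch self-contained.
# Only the state transitions from the lists above are modeled; the
# side effects on ZK and the local NN are deliberately omitted.
TRANSITIONS = {
    ("LOCAL_NOT_READY", "health_ok"):        "LOCAL_STANDBY",
    ("LOCAL_STANDBY",   "health_lost"):      "LOCAL_NOT_READY",
    # "active node deleted" triggers a failover attempt; the outcome
    # depends on whether we win the activeLock and fencing succeeds.
    ("LOCAL_STANDBY",   "won_active_lock"):  "LOCAL_ACTIVE",
    ("LOCAL_STANDBY",   "lost_active_lock"): "LOCAL_STANDBY",
    ("LOCAL_ACTIVE",    "health_bad"):       "LOCAL_NOT_READY",
    ("LOCAL_ACTIVE",    "admin_failover"):   "LOCAL_STANDBY",
}

def step(state, event):
    """Advance the sketch state machine; unhandled events leave state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

A table like this is what makes the "model the ZKFC as a state machine" goal pay off in testing: each transition is a single entry that can be asserted on directly.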