The neutral state is not an entirely new way of handling network partitions. For example, consider the quorum concept that Pacemaker supports for handling split-brain scenarios.
Whenever quorum is present, Pacemaker goes with the majority vote on important decisions. When quorum is lost (i.e. the cluster is separated into groups where no group has a majority of votes), the behavior of the cluster is determined by the no-quorum-policy property:
- ignore - Do nothing when quorum is lost.
- stop (default) - stop all resources in the affected cluster partition.
- freeze - continue running existing resources, but don’t start any stopped ones.
- suicide - fence all nodes in the affected partition.
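To make the "majority of votes" condition concrete, here is a minimal sketch of a strict-majority check. The class and method names are hypothetical and are not part of Pacemaker; the sketch also assumes every node carries exactly one vote and ignores any special handling a real cluster manager may apply.

```java
// Hypothetical helper for illustration only; not a Pacemaker API.
final class QuorumCheck {
    private QuorumCheck() {}

    // A partition has quorum only if it holds strictly more than half of all votes.
    static boolean hasQuorum(int partitionVotes, int totalVotes) {
        if (totalVotes <= 0 || partitionVotes < 0 || partitionVotes > totalVotes) {
            throw new IllegalArgumentException("invalid vote counts");
        }
        return 2 * partitionVotes > totalVotes;
    }

    public static void main(String[] args) {
        // 5 votes split 3/2: only the larger partition keeps quorum.
        System.out.println(hasQuorum(3, 5)); // true
        System.out.println(hasQuorum(2, 5)); // false
        // 4 votes split 2/2: no partition has a majority, so quorum is lost everywhere.
        System.out.println(hasQuorum(2, 4)); // false
    }
}
```

The even split is the case where the no-quorum-policy applies to every partition at once, since neither side can claim a majority.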
Neutral state is a modification of the stop option in the list above. When quorum is lost, the Namenode is not stopped; instead, it is moved into a state where it is neither Active nor Standby. When quorum is available again, it moves to either the Active or the Standby state. The advantage of keeping the process in neutral state, compared to stopping it, is the time needed to start servicing requests after the network connection is restored: the process is still running, so downtime is much shorter.
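The transitions described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the type and method names are hypothetical and are not Hadoop classes, and the sketch assumes a node starts out neutral until a quorum has decided its role.

```java
// Hypothetical names for illustration; these are not Hadoop classes.
enum HaState { ACTIVE, STANDBY, NEUTRAL }

final class NeutralStateMachine {
    // Assumption: a node holds no role until a quorum has decided one.
    private HaState state = HaState.NEUTRAL;

    HaState state() {
        return state;
    }

    // Quorum lost: give up the current role but keep the process running.
    void onQuorumLost() {
        state = HaState.NEUTRAL;
    }

    // Quorum available again: the election result decides the role.
    void onQuorumRestored(boolean electedActive) {
        state = electedActive ? HaState.ACTIVE : HaState.STANDBY;
    }

    // Only an Active node services client requests.
    boolean servesRequests() {
        return state == HaState.ACTIVE;
    }

    public static void main(String[] args) {
        NeutralStateMachine node = new NeutralStateMachine();
        node.onQuorumRestored(true);
        System.out.println(node.state()); // ACTIVE
        node.onQuorumLost();
        System.out.println(node.state()); // NEUTRAL
        node.onQuorumRestored(false);
        System.out.println(node.state()); // STANDBY
    }
}
```

Because onQuorumLost only changes a field and never stops the process, returning to Active later is a state change rather than a process start.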
Another approach is to turn the Namenode to Standby state when quorum is lost. In case of a network partition, this may not help, since the Standby Namenode won’t be able to ping the Active Namenode and will just keep retrying. After the network is restored, if this Namenode (which was turned to Standby when quorum was lost) is elected as Active again, the transition from Standby to Active will take longer, depending on the timeout configured for RPC calls.
When Zookeeper is used as the distributed coordinator, Neutral state is an effective way of handling network partitions. In the future, it may be supported by Pacemaker as an option for no-quorum-policy.
I hope the approach is simple and clear. No Namenode gets elected to Active state or continues to be in Active state unless a quorum decides so. If quorum is not available, the Namenode relinquishes the role and remains neutral. Being in neutral state helps it come back to Active state faster once quorum is available again.