Here's a preliminary patch for this issue. It still needs a little cleanup/javadoc/etc, but wanted to make sure people agree this is the right direction before I finish it up.
Here's a summary of the change:
- Add a new flag dfs.ha.automatic-failover.enabled, which is set per-nameservice or globally
- Add a new RequestInfo structure as a parameter to all the HAServiceProtocol methods. This currently just has one field, which indicates what type of client the request is on behalf of. It can either be a user (manual CLI failover), ZKFC (auto failover), or USER_FORCE – indicating that it's a user who wants to avoid this safety check.
- In the NN, if auto-failover is enabled, disallow HA requests from users. If it's not enabled, disallow HA requests from ZKFCs.
- In the ZKFC, disallow startup if auto-failover is disabled
In addition to the unit tests, I ran the following manual tests, on a secure cluster.
1) did not enable auto failover config
2) ran failovers using haadmin command, succesfully
3) Tried to run bin/hdfs zkfc, got expected error:
12/04/05 20:53:38 INFO tools.DFSZKFailoverController: Failover controller configured for NameNode nameserviceId1.nn1
12/04/05 20:53:38 FATAL ha.ZKFailoverController: Automatic failover is not enabled for NameNode at todd-w510/127.0.0.1:8021. Please ensure that automatic failover is enabled in the configuration before running the ZK failover controller.
4) Enabled auto-failover in my config, but left NNs running. Got error when the ZKFC tried to make the local node active. TODO in future JIRA: it could abort at this point, when it sees an AccessControlException, since it's indicative of misconfiguration.
5) Restarted NNs, so they picked up the new config.
6) Ran ZKFC, it successfully made one of the NNs active. Verified automatic failover behavior by killing one of the NNs.
7) Ran manual failover command, got expected error:
12/04/05 20:58:31 ERROR ha.FailoverController: Unable to get service state for NameNode at todd-w510/127.0.0.1:8022: Manual HA control for this NameNode is disallowed, because automatic HA is enabled.
Open questions: should we allow the non-mutative commands like monitorHealth and getServiceState to run when auto-failover is configured? My thinking is probably. If so, should we keep around the RequestInfo parameter on those calls? Or only include RequestInfo for the calls that trigger transitions?