On HDFS-2920 (where this patch was originally posted as part of a larger patch), Eli reviewed the patch and had the following feedback:
Wrt delayed shutdown, we likely have (or should have) similar code elsewhere, right, since there's nothing HA-specific?
Not sure quite what you mean by this. Like where?
Why is the shutdown delayed rather than immediate?
The shutdown is delayed because, with an immediate shutdown, the state transition RPCs would never throw an error: they would either succeed or the NN would shut down. That seems unfortunate. A delayed shutdown is reasonable here because the FailoverController won't continue with the failover if an error is thrown, so it won't tell the other node to transition to the active state unless some other fencing mechanism succeeds, in which case the point is moot.
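To make the ordering concrete, here is a rough sketch of the idea. It is not the code from the patch: the class name, the wrapping exception type, and the injected shutdown action are all made up for illustration. The point is only that the error is rethrown to the RPC caller first, and the shutdown fires a little later on a separate thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the actual NameNode code.
public class DelayedShutdownSketch {
    // Daemon thread so the scheduler itself never keeps the JVM alive.
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "delayed-shutdown");
            t.setDaemon(true);
            return t;
        });

    private final Runnable shutdownAction; // assumption: stands in for the real process exit
    private final long delayMillis;        // assumption: the real delay value is not given here

    public DelayedShutdownSketch(Runnable shutdownAction, long delayMillis) {
        this.shutdownAction = shutdownAction;
        this.delayMillis = delayMillis;
    }

    // Runs a state transition. On any failure (including errors such as OOM
    // or an NPE from a bug), schedule the shutdown for later and rethrow, so
    // the caller sees the error before the process goes away.
    public void doTransition(Runnable transition) {
        try {
            transition.run();
        } catch (Throwable t) {
            scheduler.schedule(shutdownAction, delayMillis, TimeUnit.MILLISECONDS);
            // Assumption: the real code may use a different exception type.
            throw new IllegalStateException(
                "Error encountered during state transition.", t);
        }
    }
}
```

With an immediate shutdown, the `throw` above would never reach the caller; with the delay, the FailoverController gets the error and can stop the failover.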
Wrt "Error encountered during state transition.", isn't the error most likely due to a failure to start a service?
Likely, but not necessarily. It could be a failure to stop a service, a failure to open an edit log for write, an OOM, or even an NPE due to a bug.