I thought about this a bit. It might get tricky to check for resources when starting active services, because at that point the namenode is still in standby. If it enters safemode and the transition then fails, we must take care to bring it back out of safemode. I also suspect that if it enters safemode, some active services may not start just because of the safemode, which would mean loss of service. For the same reason, we cannot throw an exception if resources are low either.
Hmmmmm, I don't think this should be a problem. We currently support transitioning to the active state while the NN is in safemode, so I don't see why any services would fail to start if we were to enter safemode while transitioning to the active state.
Regardless, even if it is possible, I think you've convinced me that it's not actually necessary.
I am leaning towards separating the two scenarios (although low resources is not really a failure), i.e. the standby transitions to active irrespective of its resource status, and the resource check is done independently once the transition to active has completed successfully. This is consistent with the fact that low resources is not a failure; the cluster is still available in read-only mode.
OK, that seems fine. Perhaps we could also have FSNS#startActiveServices interrupt the NameNodeResourceMonitor thread? That would guarantee that a resource check would happen promptly after transitioning to active.
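To make the interrupt idea concrete, here is a minimal, self-contained sketch. It is not the actual FSNamesystem code: `Monitor`, `demo`, and the field names are hypothetical stand-ins, and it assumes the monitor thread sleeps between checks and treats an interrupt as "re-check now" rather than "exit".

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class ResourceMonitorSketch {
    /** Hypothetical stand-in for a periodic resource monitor (not the HDFS class). */
    static final class Monitor implements Runnable {
        private final BooleanSupplier hasResources;
        private final long recheckIntervalMs;
        private volatile boolean running = true;
        volatile boolean safeMode = false;
        final AtomicInteger checks = new AtomicInteger();

        Monitor(BooleanSupplier hasResources, long recheckIntervalMs) {
            this.hasResources = hasResources;
            this.recheckIntervalMs = recheckIntervalMs;
        }

        @Override
        public void run() {
            while (running) {
                checks.incrementAndGet();
                if (!hasResources.getAsBoolean()) {
                    safeMode = true; // low resources: go read-only, do not fail
                }
                try {
                    Thread.sleep(recheckIntervalMs);
                } catch (InterruptedException e) {
                    // Woken early: fall through and re-check immediately.
                }
            }
        }

        void stop(Thread t) {
            running = false;
            t.interrupt();
        }
    }

    /** Returns true if an interrupt triggers a prompt re-check that enters safe mode. */
    static boolean demo() {
        AtomicBoolean resourcesOk = new AtomicBoolean(true);
        Monitor monitor = new Monitor(resourcesOk::get, 60_000); // long interval on purpose
        Thread t = new Thread(monitor, "resource-monitor");
        t.setDaemon(true);
        t.start();
        try {
            while (monitor.checks.get() == 0) {
                Thread.sleep(5); // wait for the first periodic check
            }
            resourcesOk.set(false); // resources drop while the monitor is idle
            t.interrupt();          // what the transition to active could do
            long deadline = System.currentTimeMillis() + 2_000;
            while (!monitor.safeMode && System.currentTimeMillis() < deadline) {
                Thread.sleep(5);
            }
            return monitor.safeMode;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            monitor.stop(t);
        }
    }

    public static void main(String[] args) {
        System.out.println("safe mode entered promptly: " + demo());
    }
}
```

The point of the sketch is only that the transition itself never blocks or throws on low resources; the interrupt just shortens the wait until the next independent check.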
Offline, Todd pointed out to me that another thing we could do is check for available resources in the monitorHealth RPC call, which the failover controller can call before initiating a failover, to make sure the NN we want to fail over to has resources available. That should probably be done in a separate JIRA, though.
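A rough sketch of how that pre-failover check might look, for whoever picks up the follow-up JIRA. Everything here is a hypothetical stand-in (`Target`, `NameNodeStub`, `shouldFailoverTo`, and the locally defined exception); it only assumes that the health check signals an unhealthy node by throwing.

```java
public class HealthCheckSketch {
    /** Hypothetical exception type, defined locally for this sketch. */
    static class HealthCheckFailedException extends Exception {
        HealthCheckFailedException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical view of the node the failover controller talks to. */
    interface Target {
        void monitorHealth() throws HealthCheckFailedException;
    }

    /** Stub namenode whose health depends only on resource availability. */
    static final class NameNodeStub implements Target {
        private final boolean hasAvailableResources;

        NameNodeStub(boolean hasAvailableResources) {
            this.hasAvailableResources = hasAvailableResources;
        }

        @Override
        public void monitorHealth() throws HealthCheckFailedException {
            if (!hasAvailableResources) {
                throw new HealthCheckFailedException("NN is low on resources");
            }
        }
    }

    /** Controller-side check: only fail over to a node that reports healthy. */
    static boolean shouldFailoverTo(Target candidate) {
        try {
            candidate.monitorHealth();
            return true;
        } catch (HealthCheckFailedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldFailoverTo(new NameNodeStub(true)));
        System.out.println(shouldFailoverTo(new NameNodeStub(false)));
    }
}
```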