In an HA deployment, clients are configured with the hostnames of both the Active and Standby Namenodes. A client first tries one of the NNs (chosen non-deterministically); if that NN is a Standby, it tells the client to retry the request on the other Namenode.
If the client happens to talk to the Standby first, and the Standby is in a GC pause or otherwise busy, the client might not get a response soon enough to try the other NN.
Proposed approach to solve this:
1) Use hedged RPCs to simultaneously call multiple configured NNs and determine which one is the active Namenode.
2) Subsequent calls will invoke the previously successful NN.
3) On failover of the currently active NN, the remaining NNs will be invoked to decide which is the new active.
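
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual HDFS client code: the class name HedgedNamenodeSelector, its methods, and the Predicate-based probe (a stand-in for a real RPC that returns true only when the contacted NN is active) are all hypothetical. The hedging itself relies on the standard ExecutorService.invokeAny, which returns the result of the first task that completes without throwing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from the Hadoop code base.
class HedgedNamenodeSelector {
    private final List<String> namenodes;
    // Assumption: stands in for an RPC; returns true if the NN is active, may block.
    private final Predicate<String> probe;
    private final long timeoutMs;
    private volatile String lastActive; // NN that answered successfully last time

    HedgedNamenodeSelector(List<String> namenodes, Predicate<String> probe, long timeoutMs) {
        this.namenodes = new ArrayList<>(namenodes);
        this.probe = probe;
        this.timeoutMs = timeoutMs;
    }

    // Step 2: reuse the previously successful NN; hedge only when none is known.
    String currentActive() {
        String cached = lastActive;
        return cached != null ? cached : hedge(namenodes);
    }

    // Step 3: on failover, probe only the remaining NNs.
    String onFailover() {
        List<String> remaining = new ArrayList<>(namenodes);
        remaining.remove(lastActive);
        lastActive = null;
        return hedge(remaining);
    }

    // Step 1: call all candidates at once and take the first that reports active.
    private String hedge(List<String> candidates) {
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no Namenodes left to probe");
        }
        ExecutorService pool = Executors.newFixedThreadPool(candidates.size());
        try {
            List<Callable<String>> calls = new ArrayList<>();
            for (String nn : candidates) {
                calls.add(() -> {
                    // A standby answer is modelled as a failure so invokeAny skips it.
                    if (!probe.test(nn)) {
                        throw new IllegalStateException(nn + " is standby");
                    }
                    return nn;
                });
            }
            lastActive = pool.invokeAny(calls, timeoutMs, TimeUnit.MILLISECONDS);
            return lastActive;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while probing", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("no active Namenode found", e);
        } finally {
            pool.shutdownNow(); // interrupt probes that are still waiting on a slow NN
        }
    }
}
```

With this shape, a Standby stuck in GC only delays its own probe; the client proceeds as soon as the active NN answers. A real implementation would presumably hedge the actual client request rather than a separate probe, and would reuse a shared thread pool instead of creating one per call.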