Thanks, Steve Leland, for clarifying the potential issues that arise from setting a higher retry frequency.
The context for this is indeed YARN-1028 (ConfiguredFailoverProxy for RM failover). In an HA setting where the second RM is the active one, with the current default for ipc.client.connect.max.retries (10), clients, AMs, and NMs retry the first RM for 10 seconds before trying the second RM. This is a significant performance hit. The delay in failing over can be mitigated by setting ipc.client.connect.max.retries to 1, but I thought there might be merit in connecting to the same RM more than once before trying the other one. Hence the proposal to allow a shorter retry interval: try connecting to the same RM twice, half a second apart, before failing over.
If it really is NM->RM calls you are worried about, then perhaps rather than make changes to the general IPC client, this is a good time to impose a better retry policy here, where exponential backoff with jitter is what I'd propose.
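For reference, here is a minimal sketch of what exponential backoff with jitter could look like. It is not taken from Hadoop: the class name JitteredBackoff, its methods, and the 500 ms base / 10 s cap are all made up for illustration, and it uses the "full jitter" variant (a uniform random delay between 0 and the exponential ceiling), which is only one of several ways to add jitter.

```java
import java.util.Random;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class JitteredBackoff {
    private final long baseMillis;
    private final long capMillis;
    private final Random random;

    JitteredBackoff(long baseMillis, long capMillis, Random random) {
        if (baseMillis <= 0 || capMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= capMillis");
        }
        this.baseMillis = baseMillis;
        this.capMillis = capMillis;
        this.random = random;
    }

    /** Upper bound on the delay for a 0-based attempt: min(cap, base * 2^attempt). */
    long ceilingMillis(int attempt) {
        long ceiling = baseMillis;
        // Stop doubling once the cap is reached so large attempt counts stay cheap.
        for (int i = 0; i < attempt && ceiling < capMillis; i++) {
            ceiling *= 2;
        }
        return Math.min(ceiling, capMillis);
    }

    /** Full jitter: a uniformly random delay in [0, ceilingMillis(attempt)]. */
    long nextDelayMillis(int attempt) {
        long ceiling = ceilingMillis(attempt);
        return (long) (random.nextDouble() * (ceiling + 1));
    }

    public static void main(String[] args) {
        // Assumed values: 500 ms base, 10 s cap.
        JitteredBackoff backoff = new JitteredBackoff(500, 10_000, new Random());
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt
                + ": ceiling=" + backoff.ceilingMillis(attempt) + " ms"
                + ", delay=" + backoff.nextDelayMillis(attempt) + " ms");
        }
    }
}
```

The jitter matters most when many NMs lose the RM at the same moment: without it they would all retry on the same schedule and hit the RM in synchronized waves.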
Even if we improve the retry policy in *RMProxy, the ipc.Client delay of 10 seconds before failover still exists. What do you think of making the general Client dumb enough to try connecting only once, and letting the higher layers take care of the actual retry policies? I know that would be a significant change, but is it worth making?
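To make the split concrete, here is a rough sketch of the layering I have in mind. Nothing in it is existing Hadoop code: FailoverSketch, connectWithFailover, and the connectOnce callback are hypothetical names, and the RM endpoints are plain strings. The low-level client is modelled as a single connection attempt that either succeeds or fails, and the caller decides how many times to try each RM and how long to wait before moving on.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch for illustration only; not part of Hadoop. */
final class FailoverSketch {

    /**
     * Tries each endpoint in order, up to attemptsPerEndpoint times, sleeping
     * delayMillis between attempts on the same endpoint. connectOnce stands in
     * for a low-level client that makes exactly one connection attempt.
     *
     * @return the first endpoint that accepted a connection, or null if none did
     *         (or if the thread was interrupted while waiting)
     */
    static String connectWithFailover(List<String> endpoints,
                                      Predicate<String> connectOnce,
                                      int attemptsPerEndpoint,
                                      long delayMillis) {
        for (String endpoint : endpoints) {
            for (int attempt = 0; attempt < attemptsPerEndpoint; attempt++) {
                if (connectOnce.test(endpoint)) {
                    return endpoint;
                }
                // Wait only if another attempt on this same endpoint follows.
                if (attempt + 1 < attemptsPerEndpoint) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return null;
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed setup: rm1 is down, rm2 is the active RM.
        List<String> rms = Arrays.asList("rm1", "rm2");
        String active = connectWithFailover(rms, "rm2"::equals, 2, 500);
        System.out.println("connected to " + active);
    }
}
```

With two attempts per RM and a half-second delay, the caller spends roughly half a second on the dead RM before failing over, instead of the 10 seconds imposed by the lower layer today.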