Thanks Vinod Kumar Vavilapalli for the comments.
Why isn't existing RMProxy framework taking care of this?
RMProxy is supposed to take care of this. However, RMProxy retries only on specific (known) exceptions and fails immediately on any other exception. In this case, the IOException that gets thrown fails immediately without any retry (in the non-HA case). That leaves us exposed if other, unanticipated exceptions can be thrown while the RM is down. For this particular case, I can add IOException (other than RemoteException) to the exceptions that get retried, which sounds like an easy fix.
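To make the whitelist behavior concrete, here is a minimal sketch in plain Java. It is not the actual RMProxy code: the class name, the method name, and the two whitelisted exception types are assumptions chosen for illustration only.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.ConnectException;

public class WhitelistRetrySketch {
    // Hypothetical whitelist: only these exception types trigger a retry.
    // The real list in RMProxy may differ.
    static boolean shouldRetry(Exception e) {
        return e instanceof ConnectException || e instanceof EOFException;
    }

    public static void main(String[] args) {
        // A whitelisted exception is retried.
        System.out.println(shouldRetry(new ConnectException("RM down")));      // prints "true"
        // A plain IOException is not on the list, so the call fails immediately.
        System.out.println(shouldRetry(new IOException("unexpected failure"))); // prints "false"
    }
}
```

With a policy of this shape, any exception type nobody thought to list behaves like the plain IOException here: no retry at all.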
Why are we putting special code in NodeStatusUpdater? Shouldn't we use something in the RMProxy framework? See ServerProxy for example that gets used by NMClients.
As I mentioned above, a whitelist of exceptions to retry doesn't sound robust enough: if we hit an exception we haven't seen before, we could skip the retry unintentionally, couldn't we? Anyway, I can fix the problem by following the existing retry policy framework, but hopefully we can improve the framework in another JIRA.
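For contrast, one possible shape of a more robust policy is to retry by default and fail fast only on exceptions known to be fatal. The sketch below is an assumption about what such an improvement could look like, not something taken from the existing framework or from a patch: `FatalException`, `callWithRetry`, and `maxAttempts` are hypothetical names, and backoff between attempts is left out for brevity.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class DefaultRetrySketch {
    // Hypothetical marker for failures that should never be retried.
    static class FatalException extends Exception {
        FatalException(String msg) { super(msg); }
    }

    // Retry every exception except FatalException, for at most maxAttempts calls.
    static <T> T callWithRetry(Callable<T> call, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (FatalException e) {
                throw e;  // known-fatal: fail immediately
            } catch (Exception e) {
                last = e; // anything else, including unseen types, is retried
            }
        }
        throw last;       // attempts exhausted: surface the last failure
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("RM not reachable yet");
            }
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " calls"); // prints "ok after 3 calls"
    }
}
```

The trade-off is the mirror image of the whitelist: an unexpected exception costs some extra attempts instead of an unintended immediate failure.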
Just looked at YARN-4132 too, we should definitely see if we can merge these two together.
This JIRA is about a bug: the NM doesn't retry in some cases.
YARN-4132 talks about a different problem, namely that the NM should retry for longer than a general RMProxy client, which is a more general improvement. I think we'd better keep them separate. Thoughts?