Details
- Bug
- Status: Resolved
- Critical
- Resolution: Fixed
- 2.9.0, 3.0.0-alpha1
- None
- Reviewed
Description
When the RM fails over, it does not automatically re-register running apps, so each AM needs to re-register when it reconnects to the new primary. This is done by catching ApplicationMasterNotRegisteredException in allocate calls and re-registering. However, RequestHedgingRMFailoverProxyProvider does not propagate YarnException, because the actual invocation is done asynchronously on separate threads, so AMs cannot reconnect to the RM after failover.
This JIRA proposes that the RequestHedgingRMFailoverProxyProvider propagate any YarnException that it encounters.
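To illustrate the proposal, here is a minimal, self-contained sketch of a hedged call that rethrows the underlying exception instead of swallowing it. It is not the Hadoop implementation: SketchYarnException, HedgedCallSketch and invokeHedged are hypothetical names, and preferring an application-level exception over other failures when no call succeeds is an assumption made for this example.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for YarnException; not the real Hadoop class.
class SketchYarnException extends Exception {
    SketchYarnException(String message) {
        super(message);
    }
}

class HedgedCallSketch {
    // Runs every call on its own thread and returns the first successful result.
    // If no call succeeds, the original exception is rethrown on the calling
    // thread, preferring a SketchYarnException over any other failure.
    // Assumes 'calls' is non-empty.
    static <T> T invokeHedged(List<Callable<T>> calls) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(calls.size());
        try {
            CompletionService<T> done = new ExecutorCompletionService<>(pool);
            for (Callable<T> call : calls) {
                done.submit(call);
            }
            Throwable appFailure = null;
            Throwable lastFailure = null;
            for (int i = 0; i < calls.size(); i++) {
                try {
                    return done.take().get();
                } catch (ExecutionException e) {
                    // The worker thread's exception arrives wrapped; unwrap it.
                    Throwable cause = e.getCause();
                    lastFailure = cause;
                    if (appFailure == null && cause instanceof SketchYarnException) {
                        appFailure = cause;
                    }
                }
            }
            Throwable toThrow = (appFailure != null) ? appFailure : lastFailure;
            if (toThrow instanceof Exception) {
                throw (Exception) toThrow;
            }
            if (toThrow instanceof Error) {
                throw (Error) toThrow;
            }
            throw new IllegalStateException("hedged call failed", toThrow);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With this shape, a caller that wraps invokeHedged in a try/catch for the application-level exception sees it directly, which is what an AM needs in order to detect that it must re-register.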
Attachments
Issue Links
- is related to YARN-4496 Improve HA ResourceManager Failover detection on the client (Resolved)
Looking at the code, one fix I can think of is to refactor the invoke method so that it only identifies the RM_IDs. The actual connection to the selected RM_ID (the current primary) can then be made directly on the main thread, as is done presently.
jianhe, thoughts/suggestions?
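The refactoring suggested in the comment above could look roughly like the following sketch: the concurrent step only identifies the active RM ID, and the real request is then made on the calling thread, so its exceptions propagate without any unwrapping. All names here (RmStub, isActive, AppNotRegisteredException, SelectThenInvokeSketch) are hypothetical and do not correspond to Hadoop APIs; the cheap isActive probe in particular is an assumption.

```java
import java.util.Map;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for ApplicationMasterNotRegisteredException.
class AppNotRegisteredException extends Exception {
    AppNotRegisteredException(String message) {
        super(message);
    }
}

// Hypothetical client-side view of one ResourceManager.
interface RmStub {
    boolean isActive() throws Exception;                 // assumed cheap probe
    String allocate() throws AppNotRegisteredException;  // the real request
}

class SelectThenInvokeSketch {
    // Step 1: probe all RMs concurrently and return the ID of the first one
    // that reports itself active. Assumes 'rms' is non-empty.
    static String findActiveRmId(Map<String, RmStub> rms) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(rms.size());
        try {
            CompletionService<String> done = new ExecutorCompletionService<>(pool);
            for (Map.Entry<String, RmStub> e : rms.entrySet()) {
                done.submit(() -> e.getValue().isActive() ? e.getKey() : null);
            }
            for (int i = 0; i < rms.size(); i++) {
                try {
                    String id = done.take().get();
                    if (id != null) {
                        return id;
                    }
                } catch (ExecutionException ignored) {
                    // A failed probe only means this RM is not usable right now.
                }
            }
            throw new IllegalStateException("no active RM found");
        } finally {
            pool.shutdownNow();
        }
    }

    // Step 2: make the real call on the calling thread, so an
    // AppNotRegisteredException reaches the caller as-is.
    static String allocateOnActive(Map<String, RmStub> rms) throws Exception {
        String activeId = findActiveRmId(rms);
        return rms.get(activeId).allocate();
    }
}
```

The trade-off in this sketch is one extra round trip for the probe in exchange for ordinary exception semantics on the real call.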