Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
A number of jobs failed or restarted when we lost a couple of hosts in the cluster.
The working theory is that the AppMaster detects the failed container before YARN
detects that the NodeManager (NM) on that host is gone. The AppMaster therefore tries
to run the container on the same host again, but it does not handle the resulting
connection errors from the NM properly. Switching from a synchronous NM client model
to an asynchronous one is expected to help, but we need to discuss this.
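To make the proposal concrete, here is a minimal sketch of an asynchronous launch that treats an NM connection error as a signal to drop the host preference instead of failing the job. It is an illustration under stated assumptions, not the actual fix: `NodeManagerClient`, `launch`, and `ANY_HOST` are hypothetical names, and the sketch uses plain `CompletableFuture` rather than Hadoop's `NMClientAsync`, whose callback handler reports start failures through `onStartContainerError`.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncLaunchSketch {

    /** Hypothetical stand-in for the call that starts a container via a host's NM. */
    public interface NodeManagerClient {
        void startContainer(String host, String containerId) throws IOException;
    }

    /** Hypothetical marker meaning "re-request this container without a host preference". */
    public static final String ANY_HOST = "ANY_HOST";

    /**
     * Starts the container on the preferred host without blocking the caller.
     * The future completes with the preferred host on success, or with ANY_HOST
     * if the NM call failed (for example, because the host has been lost).
     */
    public static CompletableFuture<String> launch(
            NodeManagerClient nm, String preferredHost, String containerId, Executor executor) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                nm.startContainer(preferredHost, containerId);
                return preferredHost;
            } catch (IOException e) {
                // Surface the connection error to the error-handling stage below.
                throw new CompletionException(e);
            }
        }, executor).exceptionally(error -> ANY_HOST);
    }

    public static void main(String[] args) {
        // Fake NM client: any call to "lost-host" fails like an unreachable NM would.
        NodeManagerClient nm = (host, id) -> {
            if (host.equals("lost-host")) {
                throw new IOException("Connection refused: " + host);
            }
        };
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            System.out.println(launch(nm, "healthy-host", "container-1", executor).join());
            System.out.println(launch(nm, "lost-host", "container-2", executor).join());
        } finally {
            executor.shutdown();
        }
    }
}
```

In a real AppMaster the fallback would presumably release the allocated container and ask the ResourceManager for a new one; the sketch only shows where the connection error gets intercepted.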
Attachments
Issue Links
- is cloned by SAMZA-893: Fix a bug with host affinity request expiration introduced in SAMZA-867 (Resolved)