Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
If an NM shuts down, the RM does not mark it as LOST until the liveness monitor detects that it has timed out. Until then, the RM may keep allocating AMs on that NM.
We hit this case in our cluster: the RM repeatedly allocated the same AM on a lost NM before it detected the loss, and AMLauncher failed every time because it could not connect to that NM. To solve the problem, we could add the NM to the AM blacklist whenever the RM fails to launch an AM on it.
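The proposal above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not YARN code: `AmPlacement`, `pick_node`, `on_launch_failure`, and `launch_am` are hypothetical names, and the real scheduler's placement logic is far more involved. It shows how blacklisting a node on launch failure lets the next attempt land elsewhere without waiting for the liveness monitor.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AmPlacement:
    """Toy model of AM placement for one application (hypothetical, not YARN code)."""

    nodes: list[str]
    blacklist: set[str] = field(default_factory=set)

    def pick_node(self) -> str | None:
        # Choose the first node that is not blacklisted for this application's AM.
        for node in self.nodes:
            if node not in self.blacklist:
                return node
        return None

    def on_launch_failure(self, node: str) -> None:
        # Proposed behaviour: a failed launch blacklists the node right away,
        # instead of waiting for the liveness monitor to mark it LOST.
        self.blacklist.add(node)


def launch_am(placement: AmPlacement, reachable: set[str], max_attempts: int = 3):
    """Try to launch the AM; return (node, attempts used), node is None on failure."""
    for attempt in range(1, max_attempts + 1):
        node = placement.pick_node()
        if node is None:
            # Every known node is blacklisted; nothing left to try.
            return None, attempt - 1
        if node in reachable:
            return node, attempt
        # The launcher could not connect to the node: record it and retry.
        placement.on_launch_failure(node)
    return None, max_attempts
```

In this model, if `on_launch_failure` did nothing, `pick_node` would return the same unreachable node on every attempt, which mirrors the behaviour reported in the description.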
Issue Links
- is related to YARN-4837 User facing aspects of 'AM blacklisting' feature need fixing (Resolved)