Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
2.8.0
None
Reviewed
Description
Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the next app attempt can be assigned to a different node.
However, the only condition that currently causes a node to be blacklisted is DISKS_FAILED. Many other problems can cause the AM to fail and would warrant placing it elsewhere, e.g. full disks, JVM crashes, memory issues, etc.
Since AM blacklisting is per-app, there is little practical downside to blacklisting a node on any failure (although it might blacklist nodes more aggressively than necessary). I propose placing the next app attempt on a different node after any failure.
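To make the proposed change concrete, here is a minimal, self-contained sketch contrasting the current condition with the proposed one. It is not the actual ResourceManager code: the class `AmBlacklistSketch`, the `ExitCause` enum, and all method names are hypothetical and exist only for illustration. Only the DISKS_FAILED condition comes from the description above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names are YARN APIs.
public class AmBlacklistSketch {

    // Assumed, simplified set of reasons an AM attempt can end.
    enum ExitCause { SUCCESS, DISKS_FAILED, DISK_FULL, JVM_CRASH, OUT_OF_MEMORY }

    // Current behavior as described: only DISKS_FAILED blacklists the node.
    static boolean shouldBlacklistCurrent(ExitCause cause) {
        return cause == ExitCause.DISKS_FAILED;
    }

    // Proposed behavior: any failed attempt blacklists the node.
    static boolean shouldBlacklistProposed(ExitCause cause) {
        return cause != ExitCause.SUCCESS;
    }

    // Per-app blacklist, so one app's failures do not affect other apps.
    private final Set<String> blacklistedNodes = new HashSet<>();

    // Record the outcome of an AM attempt using the proposed policy.
    void onAttemptFinished(String node, ExitCause cause) {
        if (shouldBlacklistProposed(cause)) {
            blacklistedNodes.add(node);
        }
    }

    // True if the next AM attempt for this app should avoid the node.
    boolean isBlacklisted(String node) {
        return blacklistedNodes.contains(node);
    }

    public static void main(String[] args) {
        AmBlacklistSketch app = new AmBlacklistSketch();
        app.onAttemptFinished("node-1", ExitCause.JVM_CRASH);
        System.out.println("node-1 blacklisted: " + app.isBlacklisted("node-1"));
        System.out.println("node-2 blacklisted: " + app.isBlacklisted("node-2"));
    }
}
```

Under the current condition a JVM crash on node-1 would leave that node eligible for the next attempt; under the proposed one it would not. Because the set is kept per app, the more aggressive policy only affects where that app's next attempt is placed.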