[YARN-4576] Enhancement for tracking Blacklist in AM Launching - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: resourcemanager
Labels: None

Description

Before YARN-2005, the YARN blacklist mechanism tracked bad nodes from the AM side: if an AM's attempts to launch containers on a specific node failed several times, the AM would blacklist that node in its future resource requests. This mechanism works fine for normal containers. However, from what we have observed on several clusters, if the AM itself fails to launch on a problematic node, the RM may pick the same node for the next AM attempts again and again, which causes the application to fail when the other, functional nodes are busy. Normally, the custom health checker script cannot be sensitive enough to mark a node as unhealthy after only one or two container launch failures.

After YARN-2005, each RMApp can have a BlacklistManager, so nodes on which AM attempts for that application have previously failed get blacklisted. To avoid the risk of the BlacklistManager blacklisting all nodes, a disable-failure-threshold is introduced that stops adding nodes to the blacklist once a certain ratio has been reached.
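The threshold behaviour described above can be illustrated with a minimal sketch. It is not the actual BlacklistManager code: the class `ThresholdBlacklistSketch`, its methods, and the interpretation of the threshold as a fraction of the cluster's node count are assumptions made for illustration only.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Hypothetical per-application blacklist (not the real YARN class).
 * Nodes where an AM attempt failed are blacklisted until the blacklist
 * reaches disableFailureThreshold * clusterNodeCount entries.
 */
class ThresholdBlacklistSketch {
    private final Set<String> blacklistedNodes = new LinkedHashSet<>();
    private final int clusterNodeCount;
    private final double disableFailureThreshold; // assumed to be a ratio in [0, 1]

    ThresholdBlacklistSketch(int clusterNodeCount, double disableFailureThreshold) {
        this.clusterNodeCount = clusterNodeCount;
        this.disableFailureThreshold = disableFailureThreshold;
    }

    /**
     * Records an AM launch failure on the given node.
     * Returns true if the node is blacklisted after the call, false if the
     * threshold was already reached and the node was therefore not added.
     */
    boolean recordAmFailure(String node) {
        if (blacklistedNodes.contains(node)) {
            return true;
        }
        double limit = disableFailureThreshold * clusterNodeCount;
        if (blacklistedNodes.size() >= limit) {
            return false; // ratio already hit: stop growing the blacklist
        }
        blacklistedNodes.add(node);
        return true;
    }

    Set<String> getBlacklistedNodes() {
        return Collections.unmodifiableSet(blacklistedNodes);
    }
}
```

With four nodes and a threshold of 0.5, for example, the first two failing nodes would be blacklisted and a third would be refused, so at least half of the cluster stays eligible for AM placement.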

There are already some enhancements to this AM blacklist mechanism: YARN-4284 addresses the wider set of cases in which an AM container fails to launch, and YARN-4389 makes the configuration settings changeable per application to meet app-specific requirements. However, several gaps remain before more scenarios are covered:
1. We may need a global blacklist instead of each app maintaining a separate one. The reason is that an AM is more likely to fail on a node where other AMs have already failed. A quick example: in a busy cluster, all nodes are busy except two problematic nodes, node a and node b. app1 has already been submitted and has failed two AM attempts, on a and b. app2 and other apps should wait for the busy nodes rather than waste attempts on these two problematic nodes.
2. If an AM container failure is treated as a global event rather than the app's own issue, the blacklist should not be permanent but should apply for a specific time window.
3. User-defined blacklist policies could address more cases and scenarios, so it is reasonable to make the blacklist policy pluggable.
4. For some test scenarios, we could have a whitelist mechanism for AM launching.
5. Some minor issues: it appears that an NM reconnect does not currently refresh the blacklist.
Will try to address all issues here.
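As a rough illustration of how items 1 to 3 could fit together, the sketch below combines a cluster-wide blacklist, a time window after which entries expire, and a pluggable policy. None of these names come from YARN or from the attached design document: `AmBlacklistPolicy`, `GlobalAmBlacklistSketch`, and the injected clock are hypothetical, and the actual design may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical plug-in point: decides whether a node should be blacklisted. */
interface AmBlacklistPolicy {
    boolean shouldBlacklist(String node, int recentFailures);
}

/**
 * Hypothetical cluster-wide blacklist for AM launches (not a real YARN class).
 * AM failures from all applications are counted per node, and a blacklist
 * entry expires once windowMillis has elapsed since it was last set.
 */
class GlobalAmBlacklistSketch {
    private final Map<String, Long> blacklistedAt = new HashMap<>();
    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final AmBlacklistPolicy policy;
    private final long windowMillis;
    private final LongSupplier clock; // injected so that expiry is easy to test

    GlobalAmBlacklistSketch(AmBlacklistPolicy policy, long windowMillis, LongSupplier clock) {
        this.policy = policy;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** Called when any application's AM attempt fails on the given node. */
    void recordAmFailure(String node) {
        purgeExpired();
        int failures = failureCounts.merge(node, 1, Integer::sum);
        if (policy.shouldBlacklist(node, failures)) {
            blacklistedAt.put(node, clock.getAsLong());
        }
    }

    /** Returns true if the node is currently excluded from AM placement. */
    boolean isBlacklisted(String node) {
        purgeExpired();
        return blacklistedAt.containsKey(node);
    }

    /** Drops blacklist entries older than the window and resets their failure counts. */
    private void purgeExpired() {
        long now = clock.getAsLong();
        blacklistedAt.entrySet().removeIf(entry -> {
            boolean expired = now - entry.getValue() >= windowMillis;
            if (expired) {
                failureCounts.remove(entry.getKey());
            }
            return expired;
        });
    }
}
```

A policy such as `(node, failures) -> failures >= 2` would then blacklist node a for every application after two AM failures on it, regardless of which apps those failures came from, and release it again once the window has passed.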

Attachments

EnhancementAMLaunchingBlacklist.pdf	21/Jan/16 17:15	128 kB	Junping Du

Issue Links

is related to

YARN-4685 Disable AM blacklisting by default to mitigate situations that application get hanged	Resolved
YARN-4837 User facing aspects of 'AM blacklisting' feature need fixing	Resolved
YARN-6409 RM does not blacklist node for AM launch failures	Patch Available

relates to

YARN-4389 "yarn.am.blacklisting.enabled" and "yarn.am.blacklisting.disable-failure-threshold" should be app specific rather than a setting for whole YARN cluster	Resolved
YARN-4284 condition for AM blacklisting is too narrow	Resolved
YARN-2005 Blacklisting support for scheduling AMs	Resolved

Sub-Tasks

1.	Add global blacklist tracking for AM container failure.	Open	Junping Du
2.	Make blacklist tracking policy pluggable for more extensions.	Open	Sunil G
3.	AM launching blacklist purge mechanism (time based)	Open	Sunil G
4.	Node whitelist support for AM launching	Open	Junping Du
5.	AM blacklisting to consider node label partition	Resolved	Bibin Chundatt

People

Assignee: Junping Du
Reporter: Junping Du
Votes: 0
Watchers: 28

Dates

Created: 11/Jan/16 15:59
Updated: 25/Oct/19 20:27