[YARN-7790] Improve Capacity Scheduler Async Scheduling to better handle node failures - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.1.0, 3.0.1
Component/s: None
Labels: None
Target Version/s: 3.1.0
Hadoop Flags: Reviewed

Description

This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the scheduler assumes the node has just sent a heartbeat to the RM, and the AM launcher then connects to the NM to launch the container. It is still possible for the NM to crash right after the heartbeat, which causes the AM to hang for a while, but this is relatively rare.

With async scheduling, multiple AM containers can be placed on a problematic NM, which can easily cause applications to hang. After discussion with sunilg and jianhe, we need one fix:

When async scheduling is enabled:

Skip any node that has missed X node heartbeats.
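
The skip rule above can be sketched as follows. This is a minimal illustration, not the code from the attached patch: `NodeInfo`, `HeartbeatFilter`, and their fields are hypothetical names, and the threshold X is modeled as a count of heartbeat intervals.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a scheduler node (not a YARN class). */
final class NodeInfo {
    final String id;
    final long lastHeartbeatMs; // monotonic time of the last heartbeat received

    NodeInfo(String id, long lastHeartbeatMs) {
        this.id = id;
        this.lastHeartbeatMs = lastHeartbeatMs;
    }
}

/** Hypothetical helper that filters out nodes silent for too many intervals. */
final class HeartbeatFilter {
    private final long heartbeatIntervalMs;
    private final int maxMissedHeartbeats; // the "X" from the description

    HeartbeatFilter(long heartbeatIntervalMs, int maxMissedHeartbeats) {
        if (heartbeatIntervalMs <= 0 || maxMissedHeartbeats <= 0) {
            throw new IllegalArgumentException("interval and threshold must be positive");
        }
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.maxMissedHeartbeats = maxMissedHeartbeats;
    }

    /** True when the node has been silent for more than X heartbeat intervals. */
    boolean shouldSkip(NodeInfo node, long nowMs) {
        long silentForMs = nowMs - node.lastHeartbeatMs;
        return silentForMs > heartbeatIntervalMs * maxMissedHeartbeats;
    }

    /** Returns only the nodes that are still eligible for allocation. */
    List<NodeInfo> candidates(List<NodeInfo> nodes, long nowMs) {
        List<NodeInfo> result = new ArrayList<>();
        for (NodeInfo node : nodes) {
            if (!shouldSkip(node, nowMs)) {
                result.add(node);
            }
        }
        return result;
    }
}
```

An async scheduling thread would call `shouldSkip` before considering a node, so a node that stopped heartbeating is passed over instead of collecting several AM containers.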

In addition, it is better to reduce the wait time by tuning the following configs, so that a container being launched on an NM with connectivity issues fails earlier.

RetryPolicy retryPolicy =
    createRetryPolicy(conf,
      YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
      YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
      YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
      YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);

The second part is not covered by the patch.
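
To see how lowering these two settings shortens the wait, here is a rough, self-contained model. It assumes the constants above map to the property names `yarn.client.nodemanager-connect.max-wait-ms` and `yarn.client.nodemanager-connect.retry-interval-ms`, and that the retry policy allows roughly max-wait divided by interval retries; check both assumptions against your Hadoop version. `NmConnectTuning` and `approxMaxRetries` are hypothetical names, and the values in the usage note are illustrative only.

```java
import java.util.Properties;

/** Hypothetical helper; models the retry budget without depending on Hadoop. */
final class NmConnectTuning {
    // Assumed property names behind the YarnConfiguration constants.
    static final String MAX_WAIT_KEY =
        "yarn.client.nodemanager-connect.max-wait-ms";
    static final String RETRY_INTERVAL_KEY =
        "yarn.client.nodemanager-connect.retry-interval-ms";

    /**
     * Approximate number of retries before the launch attempt is failed,
     * assuming a fixed sleep between retries (integer division).
     */
    static long approxMaxRetries(Properties conf,
                                 long defaultMaxWaitMs,
                                 long defaultIntervalMs) {
        long maxWaitMs = Long.parseLong(
            conf.getProperty(MAX_WAIT_KEY, Long.toString(defaultMaxWaitMs)));
        long intervalMs = Long.parseLong(
            conf.getProperty(RETRY_INTERVAL_KEY, Long.toString(defaultIntervalMs)));
        if (maxWaitMs < 0 || intervalMs <= 0) {
            throw new IllegalArgumentException("wait must be >= 0 and interval > 0");
        }
        return maxWaitMs / intervalMs;
    }
}
```

For example, with a max wait of 30000 ms and a retry interval of 5000 ms, the model gives 6 retries, so a launch against an unreachable NM would be given up after about 30 seconds.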

Attachments

YARN-7790.001.patch, 23/Jan/18 12:19, 24 kB, Wangda Tan
YARN-7790.002.patch, 24/Jan/18 02:19, 23 kB, Wangda Tan
YARN-7790.003.patch, 25/Jan/18 14:24, 24 kB, Wangda Tan

People

Assignee: Wangda Tan

Reporter: Sumana Sathish

Votes: 0

Watchers: 6

Dates

Created: 23/Jan/18 07:13

Updated: 30/Jan/18 02:03

Resolved: 29/Jan/18 15:22