[YARN-8575] Avoid committing allocation proposal to unavailable nodes in async scheduling - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.2.0, 3.1.2
Component/s: capacityscheduler
Labels: None
Hadoop Flags: Reviewed

Description

Recently we found a new error as follows:

ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: host1:45454

Steps to reproduce:
(1) Create a reserve proposal for app1 on node1.
(2) node1 is successfully decommissioned and removed from the node tracker.
(3) Try to commit the now outdated reserve proposal; it is accepted and applied.
This error may occur after some NMs are decommissioned. The application that prints the error log will always hold a reserved container on the nonexistent (decommissioned) NM, and its pending request will never be satisfied.
To solve this problem, the scheduler should check the node state in FiCaSchedulerApp#accept to avoid committing outdated proposals on unusable nodes.
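
The race and the proposed check can be illustrated with a small standalone sketch. It does not use the real CapacityScheduler classes: StaleProposalSketch, NodeTracker, ReserveProposal, and this accept method are all hypothetical stand-ins, and the actual patch may structure the check differently. The sketch only shows the idea of re-validating the target node at commit time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-ins for illustration; not the actual YARN scheduler classes. */
public class StaleProposalSketch {

    /** Simplified node tracker: holds the ids of nodes that are currently usable. */
    static final class NodeTracker {
        private final Map<String, Boolean> nodes = new ConcurrentHashMap<>();

        void addNode(String nodeId) {
            nodes.put(nodeId, Boolean.TRUE);
        }

        /** Models a node being decommissioned and removed from the tracker. */
        void removeNode(String nodeId) {
            nodes.remove(nodeId);
        }

        boolean isUsable(String nodeId) {
            return nodes.getOrDefault(nodeId, Boolean.FALSE);
        }
    }

    /** A reserve proposal produced asynchronously, possibly against stale cluster state. */
    static final class ReserveProposal {
        final String appId;
        final String nodeId;

        ReserveProposal(String appId, String nodeId) {
            this.appId = appId;
            this.nodeId = nodeId;
        }
    }

    /**
     * Commit-time validation (assumed shape): reject the proposal when its
     * target node is no longer known to the tracker.
     */
    static boolean accept(NodeTracker tracker, ReserveProposal proposal) {
        if (!tracker.isUsable(proposal.nodeId)) {
            // The node was removed after the proposal was created, so the proposal is outdated.
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        NodeTracker tracker = new NodeTracker();
        tracker.addNode("host1:45454");

        // (1) A reserve proposal is created for app1 on the node.
        ReserveProposal proposal = new ReserveProposal("app1", "host1:45454");
        System.out.println("accepted before decommission: " + accept(tracker, proposal));

        // (2) The node is decommissioned and removed from the tracker.
        tracker.removeNode("host1:45454");

        // (3) The outdated proposal is now rejected instead of being applied.
        System.out.println("accepted after decommission: " + accept(tracker, proposal));
    }
}
```

Running main prints true for the first check and false for the second. Without the isUsable check, step (3) would be accepted, which corresponds to the stuck reservation described above.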

Attachments

- YARN-8575.001.patch (13 kB, uploaded 25/Jul/18 08:41 by Tao Yang)
- YARN-8575.002.patch (13 kB, uploaded 10/Aug/18 02:51 by Tao Yang)

Issue Links

relates to: YARN-5139 [Umbrella] Move YARN scheduler towards global scheduler (Open)

People

Assignee: Tao Yang

Reporter: Tao Yang

Votes: 0

Watchers: 4

Dates

Created: 25/Jul/18 08:39

Updated: 30/Oct/18 13:44

Resolved: 10/Aug/18 07:16