[YARN-7249] Fix CapacityScheduler NPE issue when a container preempted while the node is being removed - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.8.1, 2.7.5
Fix Version/s: 2.8.2, 2.7.6
Component/s: None
Labels: None
Target Version/s: 2.8.2, 2.7.6
Hadoop Flags: Reviewed

Description

This issue can occur when all three of the following conditions hold:

1) A node is being removed from the scheduler.
2) A container running on that node is being preempted.
3) A rare race condition causes the scheduler to pass a null node to the leaf queue.

The fix is to add a null-node check inside CapacityScheduler.
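
The patch itself is not reproduced in this description. As a rough illustration of the kind of guard described above, the sketch below uses simplified, hypothetical stand-in classes (Node, Queue, Scheduler); they are not the real Hadoop classes, and the actual change in CapacityScheduler may look different. It assumes the scheduler looks the node up by ID and that the lookup returns null once the node has been removed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-ins for illustration; NOT the real Hadoop classes.
public class NullNodeCheckSketch {

    /** Stand-in for a scheduler node that tracks its running containers. */
    static final class Node {
        int runningContainers = 1;

        void releaseContainer() {
            runningContainers--;
        }
    }

    /** Stand-in for the leaf queue, which dereferences the node it is given. */
    static final class Queue {
        int completed = 0;

        void completedContainer(Node node) {
            node.releaseContainer(); // would throw NullPointerException on a null node
            completed++;
        }
    }

    /** Stand-in for the scheduler that owns the node map and the queue. */
    static final class Scheduler {
        final Map<String, Node> nodes = new ConcurrentHashMap<>();
        final Queue queue = new Queue();

        /**
         * Returns true if the completion was passed to the queue, or false if
         * it was skipped because the node had already been removed.
         */
        boolean completedContainerInternal(String nodeId) {
            Node node = nodes.get(nodeId);
            if (node == null) {
                // Assumed guard: node removal won the race, so do not call the queue.
                return false;
            }
            queue.completedContainer(node);
            return true;
        }
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        scheduler.nodes.put("node-1", new Node());
        System.out.println(scheduler.completedContainerInternal("node-1")); // node present
        scheduler.nodes.remove("node-1"); // simulate node removal
        System.out.println(scheduler.completedContainerInternal("node-1")); // skipped, no NPE
    }
}
```

Without the null check in completedContainerInternal, the second call would reach Queue.completedContainer with a null node and fail in the same way as the stack trace below.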

Stack trace:

2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(714)) - Error in handling event type KILL_RESERVED_CONTAINER to the scheduler 
java.lang.NullPointerException 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308) 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505) 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341) 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) 
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705)

This issue exists only in 2.8.x.

Attachments

YARN-7249.branch-2.8.001.patch
25/Sep/17 17:13
1 kB
Wangda Tan

People

Assignee: Wangda Tan

Reporter: Wangda Tan

Votes: 0

Watchers: 5

Dates

Created: 25/Sep/17 16:49

Updated: 14/Oct/19 15:38

Resolved: 31/Mar/18 01:06
