[YARN-8451] Multiple NM heartbeat thread created when a slow NM resync with RM - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.4
Component/s: nodemanager
Labels: None
Hadoop Flags: Reviewed

Description

During an NM resync with the RM (say, after the RM did a master/slave switch), if the NM is running slowly, the existing heartbeat thread may put more than one RESYNC event into the NM dispatcher before any of them is processed. As a result, multiple new heartbeat threads are later created and start to heartbeat to the RM concurrently, each with its own responseId. If at some point one thread falls more than one step behind the others, the RM sends back a resync signal in that heartbeat response, killing all containers on this NM.

See comments below for details on how this can happen.
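
The duplicate-event part of the description can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not the code from the attached patches: the class `ResyncDedupSketch`, its methods, and the `resyncPending` flag are all hypothetical, and the model assumes that each processed RESYNC event starts exactly one new heartbeat loop. It shows how several queued RESYNC events turn into several new loops, and how one possible guard (refusing to enqueue a second RESYNC while one is pending) keeps the count at one.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Single-threaded model of the race; every name here is hypothetical. */
public class ResyncDedupSketch {
    /** Stand-in for the NM dispatcher's event queue. */
    private final Queue<String> dispatcherQueue = new ArrayDeque<>();
    private final boolean dedup;
    private boolean resyncPending = false;
    private int newHeartbeatLoops = 0;

    public ResyncDedupSketch(boolean dedup) {
        this.dedup = dedup;
    }

    /** The existing heartbeat loop received a resync signal from the RM. */
    public void onResyncSignal() {
        if (dedup && resyncPending) {
            return; // a RESYNC is already waiting in the dispatcher
        }
        resyncPending = true;
        dispatcherQueue.add("RESYNC");
    }

    /** The slow dispatcher finally processes everything that was queued. */
    public void drainDispatcher() {
        while (dispatcherQueue.poll() != null) {
            newHeartbeatLoops++;   // assumption: one new loop per RESYNC event
            resyncPending = false; // the next resync signal may enqueue again
        }
    }

    public int newHeartbeatLoops() {
        return newHeartbeatLoops;
    }

    public static void main(String[] args) {
        for (boolean dedup : new boolean[] {false, true}) {
            ResyncDedupSketch nm = new ResyncDedupSketch(dedup);
            // Three resync signals arrive before the dispatcher runs.
            nm.onResyncSignal();
            nm.onResyncSignal();
            nm.onResyncSignal();
            nm.drainDispatcher();
            System.out.println("dedup=" + dedup
                + " newHeartbeatLoops=" + nm.newHeartbeatLoops());
        }
    }
}
```

Run with `dedup=false`, the model ends up with three new heartbeat loops for a single resync; with `dedup=true` it ends up with one. A real fix would also have to handle the actual thread lifecycle and locking, which this model deliberately leaves out.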

Attachments

- YARN-8451.v1.patch, 22/Jun/18 19:55, 9 kB, Botong Huang
- YARN-8451.v2.patch, 28/Jun/18 18:36, 9 kB, Botong Huang

People

Assignee: Botong Huang
Reporter: Botong Huang
Votes: 0
Watchers: 4

Dates

Created: 22/Jun/18 19:47
Updated: 29/Jun/18 18:30
Resolved: 29/Jun/18 18:24