Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Version/s: 3.3.0, 3.4.0
Description
When NodeManager (NM) recovery is enabled, stopping an NM does not unregister it from the ResourceManager (RM). The RM therefore keeps the stopped node among its active nodes until the NM liveliness monitor expires it after the configured timeout (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During those 10 minutes, Multi Node Placement keeps assigning containers to the stopped node. Multi Node Placement needs to exclude nodes that have not heartbeated within the configured heartbeat interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), similar to what the asynchronous Capacity Scheduler threads do (CapacityScheduler#shouldSkipNodeSchedule).
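The exclusion described above can be illustrated with a small, self-contained sketch. It is not the actual patch and does not use YARN classes: `StaleNodeFilter`, `Node`, and `MISSED_HEARTBEATS_ALLOWED` are hypothetical names, and the tolerance of two missed heartbeats is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleNodeFilter {

    // Hypothetical stand-in for a scheduler node; not a YARN class.
    static final class Node {
        final String name;
        final long lastHeartbeatMs;

        Node(String name, long lastHeartbeatMs) {
            this.name = name;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    // Assumption: a node is treated as stale once it has missed
    // more than this many heartbeat intervals.
    static final int MISSED_HEARTBEATS_ALLOWED = 2;

    // Returns true when the node has been silent for too long to be
    // considered for container placement.
    static boolean shouldSkip(Node node, long nowMs, long heartbeatIntervalMs) {
        long silentForMs = nowMs - node.lastHeartbeatMs;
        return silentForMs > heartbeatIntervalMs * MISSED_HEARTBEATS_ALLOWED;
    }

    // Keeps only the names of nodes that heartbeated recently enough.
    static List<String> candidates(List<Node> nodes, long nowMs, long heartbeatIntervalMs) {
        List<String> kept = new ArrayList<>();
        for (Node node : nodes) {
            if (!shouldSkip(node, nowMs, heartbeatIntervalMs)) {
                kept.add(node.name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long nowMs = 100_000L;
        long heartbeatIntervalMs = 1_000L; // mirrors the 1000 ms interval above
        List<Node> nodes = Arrays.asList(
            new Node("worker0", nowMs - 60_000L), // stopped NM, silent for 60 s
            new Node("worker1", nowMs - 500L));   // live NM, heartbeated 500 ms ago
        System.out.println(candidates(nodes, nowMs, heartbeatIntervalMs)); // prints [worker1]
    }
}
```

With a filter of this kind applied before placement, a stopped node would drop out of the candidate set after a few seconds rather than after the 10-minute liveliness expiry.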
Repro:
1. Enable Multi Node Placement (yarn.scheduler.capacity.multi-node-placement-enabled) and NM recovery (yarn.nodemanager.recovery.enabled).
2. Run only one NM, say worker0.
3. Stop worker0 and start another NM, say worker1.
4. Submit a sleep job. Its containers time out because they are assigned to the stopped NM, worker0.
Attachments
Issue Links
- is duplicated by: YARN-8557 Exclude lagged/unhealthy/decommissioned nodes in async allocating thread (Resolved)
- is related to: YARN-10357 Proactively relocate allocated containers from a stopped node (Open)
- relates to: YARN-10572 Merge YARN-8557 and YARN-10352, and rebase based YARN-10380. (Resolved)