[YARN-11512] Graceful decommission doesn't work when NM restart recovery is enabled - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.1
Fix Version/s: None
Component/s: graceful, nodemanager
Labels: None

Description

We have added the following properties to the yarn-site.xml file of our Hadoop YARN cluster:

<property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.nodemanager.recovery.supervised</name>
    <value>true</value>
</property>

The NM restart recovery feature has been working well: applications do not fail even when we restart NodeManager processes. However, when we try to decommission a node by adding its hostname to the yarn_exclude_hosts file and refreshing nodes on the ResourceManager, applications that had containers running on that node get stuck for a long time and then fail.
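
For context, here is a minimal sketch of the ResourceManager-side settings that this decommission procedure usually depends on. The exclude-file path is hypothetical and the timeout is the stock default shown for illustration. Neither value is taken from the reported cluster, and the report does not say which of these properties were set.

```xml
<!-- Sketch only: values below are assumptions, not the reporter's actual settings. -->
<property>
    <!-- File listing hosts to exclude; the path here is hypothetical. -->
    <name>yarn.resourcemanager.nodes.exclude-path</name>
    <value>/etc/hadoop/conf/yarn_exclude_hosts</value>
</property>
<property>
    <!-- How long the RM waits for running containers before forcing decommission. -->
    <name>yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs</name>
    <value>3600</value>
</property>
```

A graceful refresh is normally requested with `yarn rmadmin -refreshNodes -g [timeout in seconds] -client|server`. The report does not state whether the `-g` option was used when refreshing nodes.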

People

Assignee: Unassigned

Reporter: Akshesh Doshi

Votes: 0

Watchers: 1

Dates

Created: 13/Jun/23 20:35

Updated: 13/Jun/23 21:02