Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 3.1.1
- None
- Reviewed
Description
During internal testing, we found a nasty race condition which occurs during decommissioning.
Node manager, incorrect behaviour:
2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 hostname:node-6.hostname.com
Node manager, expected behaviour:
2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: DECOMMISSIONING node-6.hostname.com:8041 is ready to be decommissioned
Note the two different messages from the RM ("Disallowed NodeManager" vs "DECOMMISSIONING"). The problem is that ResourceTrackerService can see an inconsistent state of nodes while they're being updated:
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219} exclude:{node-6.hostname.com}
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node-6.hostname.com:8041 with state RUNNING
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: node-6.hostname.com
2018-06-18 21:00:17,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node node-6.hostname.com:8041 in DECOMMISSIONING.
2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn IP=172.26.22.115 OPERATION=refreshNodes TARGET=AdminService RESULT=SUCCESS
2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve original total capability: <memory:8192, vCores:8>
2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
When the decommissioning succeeds, there is no output logged from ResourceTrackerService.
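The window suggested by the RM log above can be illustrated with a small model. This is only a sketch and not Hadoop code: the class `DecommissionRaceSketch`, its methods and the `Response` enum are hypothetical names invented for illustration, and the model assumes (based on the log ordering) that the exclude list is reloaded before the node's state is moved to DECOMMISSIONING. It replays the interleaving on a single thread so the outcome is deterministic, and it simplifies the "ready to be decommissioned" check to the node state alone.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the race; none of these names exist in Hadoop.
public class DecommissionRaceSketch {
    public enum NodeState { RUNNING, DECOMMISSIONING }
    public enum Response { NORMAL, DISALLOWED, DECOMMISSIONING_READY }

    private final Set<String> excludedHosts = new HashSet<>();
    private final Map<String, NodeState> nodeStates = new HashMap<>();

    public void register(String host) {
        nodeStates.put(host, NodeState.RUNNING);
    }

    // Step 1 of a refresh (assumed): the exclude list is reloaded first.
    public void reloadExcludeList(String host) {
        excludedHosts.add(host);
    }

    // Step 2 of a refresh (assumed): the state transition is applied slightly later.
    public void applyDecommissioningTransition(String host) {
        nodeStates.put(host, NodeState.DECOMMISSIONING);
    }

    // Heartbeat handling that reads both pieces of state without coordinating
    // with the refresh, so it can observe step 1 without step 2.
    public Response heartbeat(String host) {
        boolean excluded = excludedHosts.contains(host);
        NodeState state = nodeStates.get(host);
        if (excluded && state != NodeState.DECOMMISSIONING) {
            return Response.DISALLOWED; // the "Disallowed NodeManager" case
        }
        if (state == NodeState.DECOMMISSIONING) {
            return Response.DECOMMISSIONING_READY; // the expected case
        }
        return Response.NORMAL;
    }

    public static void main(String[] args) {
        DecommissionRaceSketch rm = new DecommissionRaceSketch();
        rm.register("node-6");

        rm.reloadExcludeList("node-6");
        // A heartbeat arriving inside the window sees an excluded but still RUNNING node.
        System.out.println(rm.heartbeat("node-6")); // prints DISALLOWED

        rm.applyDecommissioningTransition("node-6");
        // A heartbeat arriving after the transition gets the expected answer.
        System.out.println(rm.heartbeat("node-6")); // prints DECOMMISSIONING_READY
    }
}
```

In this model, the incorrect shutdown message goes away if the heartbeat cannot observe the exclude-list update without the matching state transition, for example by making both steps visible together or by having the heartbeat check account for a pending decommission. Whether the actual fix takes either approach is not stated in this report.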
Attachments
Issue Links
- is related to
- YARN-9462 TestResourceTrackerService.testNodeRemovalGracefully fails sporadically (Resolved)