Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 2.6.0
- Labels: None
- Hadoop Flags: Reviewed
Description
Under the default NM port configuration (port 0, i.e. an ephemeral port is picked on each start), we have observed in the current version that the "lost nodes" count can be greater than the length of the lost node list. This happens when the same NM is restarted twice in a row:
- NM started at port 10001
- NM restarted at port 10002
- NM restarted at port 10003
- NM:10001 times out: ClusterMetrics#incrNumLostNMs() brings the lost-node count to 1; rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode) adds one entry, so inactiveNodes has 1 element
- NM:10002 times out: ClusterMetrics#incrNumLostNMs() brings the lost-node count to 2; the put() uses the same host as its key, so it overwrites the existing entry and inactiveNodes still has only 1 element (see the sketch after this list)
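A minimal standalone sketch of the divergence, using plain Java collections in place of the actual ClusterMetrics and RMContext classes (the class name LostNodeSketch, the host/port strings, and the counter are illustrative only, not RM code):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.atomic.AtomicInteger;

    public class LostNodeSketch {
      public static void main(String[] args) {
        // Stand-ins for ClusterMetrics' lost-NM counter and the inactive-node map.
        AtomicInteger numLostNMs = new AtomicInteger();
        ConcurrentMap<String, String> inactiveNodes = new ConcurrentHashMap<>();

        // NM:10001 times out: counter goes to 1, map gains one entry keyed by host.
        numLostNMs.incrementAndGet();
        inactiveNodes.put("host1", "NM at host1:10001");

        // NM:10002 times out: counter goes to 2, but put() reuses the "host1" key
        // and overwrites the previous entry, so the map still has one element.
        numLostNMs.incrementAndGet();
        inactiveNodes.put("host1", "NM at host1:10002");

        System.out.println("lost nodes count = " + numLostNMs.get());   // 2
        System.out.println("lost node list size = " + inactiveNodes.size()); // 1
      }
    }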
Since we allow multiple NodeManagers on one host (as discussed in YARN-1888), inactiveNodes should be of type ConcurrentMap<NodeId, RMNode> so that each NM instance gets its own entry. If that would break the current API, then the key string should at least include the NM's port. A sketch of keying by the full node identity follows.
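A rough sketch of the proposed keying, assuming a stand-in HostPort class in place of org.apache.hadoop.yarn.api.records.NodeId so the example stays self-contained; the point is only that the key's equality covers both host and port:

    import java.util.Objects;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class NodeIdKeySketch {
      // Stand-in for NodeId: equals/hashCode include both host and port.
      static final class HostPort {
        final String host;
        final int port;
        HostPort(String host, int port) { this.host = host; this.port = port; }
        @Override public boolean equals(Object o) {
          if (!(o instanceof HostPort)) return false;
          HostPort other = (HostPort) o;
          return port == other.port && host.equals(other.host);
        }
        @Override public int hashCode() { return Objects.hash(host, port); }
      }

      public static void main(String[] args) {
        ConcurrentMap<HostPort, String> inactiveNodes = new ConcurrentHashMap<>();
        // Each restarted NM instance now keeps its own entry, so the list
        // length stays in step with the lost-node counter.
        inactiveNodes.put(new HostPort("host1", 10001), "NM at host1:10001");
        inactiveNodes.put(new HostPort("host1", 10002), "NM at host1:10002");
        System.out.println(inactiveNodes.size()); // 2
      }
    }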
Thoughts?
Attachments
Issue Links
- duplicates
  - YARN-1391 Lost node list should be identify by NodeId (Resolved)