Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 2.6.0
- Labels: None
- Hadoop Flags: Reviewed
Description
Under the default NM port configuration (port 0, i.e. an ephemeral port is picked on each start), we have observed in the current version that the "lost nodes" count can be greater than the length of the lost node list. This happens when the same NM is restarted twice in a row:
- NM started at port 10001
- NM restarted at port 10002
- NM restarted at port 10003
- NM:10001 times out: ClusterMetrics#incrNumLostNMs() brings the lost-node count to 1; rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode) adds one entry, so inactiveNodes has 1 element
- NM:10002 times out: ClusterMetrics#incrNumLostNMs() brings the lost-node count to 2; the put() uses the same host as its key, so it overwrites the existing entry and inactiveNodes still has only 1 element (see the sketch after this list)
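A minimal standalone sketch of the divergence, using plain Java collections in place of the actual ClusterMetrics and RMContext classes (the class name LostNodeSketch, the host/port strings, and the counter are illustrative only, not RM code):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.atomic.AtomicInteger;

    public class LostNodeSketch {
      public static void main(String[] args) {
        // Stand-ins for ClusterMetrics' lost-NM counter and the inactive-node map.
        AtomicInteger numLostNMs = new AtomicInteger();
        ConcurrentMap<String, String> inactiveNodes = new ConcurrentHashMap<>();

        // NM:10001 times out: counter goes to 1, map gains one entry keyed by host.
        numLostNMs.incrementAndGet();
        inactiveNodes.put("host1", "NM at host1:10001");

        // NM:10002 times out: counter goes to 2, but put() reuses the "host1" key
        // and overwrites the previous entry, so the map still has one element.
        numLostNMs.incrementAndGet();
        inactiveNodes.put("host1", "NM at host1:10002");

        System.out.println("lost nodes count = " + numLostNMs.get());   // 2
        System.out.println("lost node list size = " + inactiveNodes.size()); // 1
      }
    }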
Since we allow multiple NodeManagers on one host (as discussed in YARN-1888), inactiveNodes should be of type ConcurrentMap<NodeId, RMNode> so that each NM instance gets its own entry. If that would break the current API, then the key string should at least include the NM's port. A sketch of keying by the full node identity follows.
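A rough sketch of the proposed keying, assuming a stand-in HostPort class in place of org.apache.hadoop.yarn.api.records.NodeId so the example stays self-contained; the point is only that the key's equality covers both host and port:

    import java.util.Objects;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class NodeIdKeySketch {
      // Stand-in for NodeId: equals/hashCode include both host and port.
      static final class HostPort {
        final String host;
        final int port;
        HostPort(String host, int port) { this.host = host; this.port = port; }
        @Override public boolean equals(Object o) {
          if (!(o instanceof HostPort)) return false;
          HostPort other = (HostPort) o;
          return port == other.port && host.equals(other.host);
        }
        @Override public int hashCode() { return Objects.hash(host, port); }
      }

      public static void main(String[] args) {
        ConcurrentMap<HostPort, String> inactiveNodes = new ConcurrentHashMap<>();
        // Each restarted NM instance now keeps its own entry, so the list
        // length stays in step with the lost-node counter.
        inactiveNodes.put(new HostPort("host1", 10001), "NM at host1:10001");
        inactiveNodes.put(new HostPort("host1", 10002), "NM at host1:10002");
        System.out.println(inactiveNodes.size()); // 2
      }
    }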
Thoughts?
Attachments
Issue Links
- duplicates
  - YARN-1391 Lost node list should be identify by NodeId (Resolved)