Hi Vinod and Arun, thanks for the comments.
Shall we rename NodeAction.DECOMMISSION to SHUTDOWN?
This would be a good idea. We can generalize it.
Need to send a SHUTDOWN command to the nodes even if it is invalid at the RM at the time of registration. This is a very common case: we exclude the node even before we start the cluster.
Please also add a test for the above.
An excluded node already shuts down today, because an IOException is thrown during registration. Instead, the shutdown command can be sent as part of RegisterNodeManagerResponse.
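A rough sketch of the idea, assuming simplified stand-in types: `NodeActionSketch`, `RegisterResponseSketch` and `TrackerSketch` are hypothetical names used only for illustration and are not the real NodeAction, RegisterNodeManagerResponse or ResourceTrackerService classes. It only shows the shape of returning an action in the registration response rather than failing with an exception.

```java
import java.util.Set;

// Hypothetical, simplified stand-ins for illustration only; these are
// NOT the real YARN classes or their actual APIs.
enum NodeActionSketch { NORMAL, SHUTDOWN }

final class RegisterResponseSketch {
    final NodeActionSketch action;

    RegisterResponseSketch(NodeActionSketch action) {
        this.action = action;
    }
}

final class TrackerSketch {
    private final Set<String> excludedHosts;

    TrackerSketch(Set<String> excludedHosts) {
        this.excludedHosts = excludedHosts;
    }

    // An excluded host gets a SHUTDOWN action in the response instead of
    // the registration call failing with an exception.
    RegisterResponseSketch register(String host) {
        if (excludedHosts.contains(host)) {
            return new RegisterResponseSketch(NodeActionSketch.SHUTDOWN);
        }
        return new RegisterResponseSketch(NodeActionSketch.NORMAL);
    }
}
```

With this shape the node can check the returned action and stop itself cleanly, and a test can assert on the action instead of expecting an exception.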
If the node is not valid, the correct component to send a RMNodeEventType.DECOMMISSION event to RMNode is NodeListManager. We can move this code out of ResourceTrackerService into NodeListManager.refreshNodes() - sending events to all nodes that get decommissioned during refreshNodes(). This will also ensure that the decommissioned-node-count gets incremented immediately instead of waiting for all the nodes to reach RM. Your tests in TestResourceTrackerService also simplify a bit.
I agree with Arun's comments on this. In this case, we would need to establish a new communication channel between RM and NM, other than the heartbeat, for generating events.
TestNodeStatusUpdater: The two second sleeps are error prone. I think it should simply wait till heartBeatID becomes more than 3 or a timeout occurs.
Similarly in TestNMExpiry, you should spin around till lost-nodes' count becomes two or a timeout happens.
TestResourceTrackerService is good work!
checkDecommissionedNMCount(): Again spin till the correct count or a timeout occurs.
Yes, I will address this problem.
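A minimal sketch of the spin-until-condition-or-timeout pattern suggested for the three tests. `WaitUtil.waitFor` is a hypothetical helper, not an existing class in the test code, and it assumes Java 8+ for `BooleanSupplier`; the poll interval and timeout values are placeholders.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical test helper for illustration; not part of the existing tests.
final class WaitUtil {
    private WaitUtil() {}

    // Polls the condition every intervalMs milliseconds. Returns as soon as
    // it holds, or throws TimeoutException once timeoutMs has elapsed.
    static void waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(intervalMs);
        }
    }
}
```

A test would then replace a fixed sleep with something like `WaitUtil.waitFor(() -> heartBeatID > 3, 100, 10000)`, and the same call shape would cover the lost-nodes count and the decommissioned-NM count.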