Thanks for the review.
bq. 1. Should the "Total Nodes" column header on the nodes pages be changed to something like "Active Nodes"? Currently it sounds like it should be a count of all the nodes (active or not) associated with the cluster, but it's only counting the active nodes. Related to this, should there be a page listing all nodes regardless of state (i.e.: a real Total Nodes page)?
Yes, currently it shows only the active nodes in the cluster. I agree that changing the header to "Active Nodes" would be more appropriate, since that is what it actually counts.
bq. 3. Maintenance: ClusterMetrics.decr(RMNodeState) doesn't handle all the node states. May be better to just to remove this method and have the one place it's used do the switch.
I think there are only three events that cause an NM to be lost, and all three have been handled.
The RUNNING -> UNHEALTHY and UNHEALTHY -> RUNNING transitions are handled in their corresponding transition hooks.
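To illustrate the idea of keeping the counters in step from the transition hooks, here is a minimal sketch. It is not the actual ClusterMetrics code: the class `NodeCountsSketch`, its nested `State` enum, and the methods `nodeAdded`, `transition`, and `count` are hypothetical names chosen for this example, and the set of states is an assumption.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical per-state node counters; not the real ClusterMetrics class. */
class NodeCountsSketch {
  /** Assumed set of node states, for illustration only. */
  enum State { RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED }

  private final Map<State, Integer> counts = new EnumMap<>(State.class);

  NodeCountsSketch() {
    // Start every counter at zero so lookups never return null.
    for (State s : State.values()) {
      counts.put(s, 0);
    }
  }

  /** Called when a node registers and becomes RUNNING. */
  void nodeAdded() {
    counts.merge(State.RUNNING, 1, Integer::sum);
  }

  /** Called from a transition hook: moves one node from one state to another. */
  void transition(State from, State to) {
    if (counts.get(from) == 0) {
      throw new IllegalStateException("No node currently in state " + from);
    }
    counts.merge(from, -1, Integer::sum);
    counts.merge(to, 1, Integer::sum);
  }

  int count(State s) {
    return counts.get(s);
  }

  public static void main(String[] args) {
    NodeCountsSketch m = new NodeCountsSketch();
    m.nodeAdded();
    m.nodeAdded();
    m.transition(State.RUNNING, State.UNHEALTHY); // node reports unhealthy
    m.transition(State.UNHEALTHY, State.RUNNING); // node recovers
    m.transition(State.RUNNING, State.LOST);      // node is lost
    System.out.println(m.count(State.RUNNING) + " running, " + m.count(State.LOST) + " lost");
  }
}
```

Because every hook names both the source and the target state, no single decrement method has to switch over all the states.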
bq. Nit: The "N/A" web address for inactive nodes shouldn't be a hyperlink, since it doesn't go anywhere useful.
For inactive nodes, the web address appears as "N/A", but clicking on it does not go anywhere.
Please refer to the attached screenshot.
bq. 2. There will be bookkeeping issues for the inactive nodes when a single host has been configured with multiple nodemanager instances. (A bit odd, but possible to setup.) Since the inactive nodes are tracked only by hostname, we will remove a node from the inactive list when a new nodemanager appears on a different port. Probably best to track that issue in a separate JIRA. There are other issues with that setup, e.g.: inability to detect redundant nodemanager launches, limit nodemanager instances, etc.
If there is only one NM running on a host, there won't be any problem. However, if there are multiple NMs running on a single host, it will be a problem.
If the NMs running on a particular host are configured to use ephemeral ports, there is no mechanism to identify that a given NM has come back.
I am filing a separate JIRA for this.
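The bookkeeping problem with multiple NMs on one host can be sketched as follows. This is an illustrative model only, not the ResourceManager code: `InactiveNodeTrackingSketch`, `nodeLost`, `nodeRegistered`, and the two sets are hypothetical names, and the host and port values are made up.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of inactive-node tracking; not the actual RM code. */
class InactiveNodeTrackingSketch {
  // Inactive nodes tracked by hostname only, as described in the review.
  private final Set<String> inactiveHosts = new HashSet<>();
  // Alternative for comparison: inactive nodes tracked by host:port.
  private final Set<String> inactiveNodeIds = new HashSet<>();

  /** Records an NM as inactive in both structures. */
  void nodeLost(String host, int port) {
    inactiveHosts.add(host);
    inactiveNodeIds.add(host + ":" + port);
  }

  /** A newly registered NM clears any matching inactive entry. */
  void nodeRegistered(String host, int port) {
    inactiveHosts.remove(host);
    inactiveNodeIds.remove(host + ":" + port);
  }

  boolean isInactiveByHost(String host) {
    return inactiveHosts.contains(host);
  }

  boolean isInactiveByNodeId(String host, int port) {
    return inactiveNodeIds.contains(host + ":" + port);
  }

  public static void main(String[] args) {
    InactiveNodeTrackingSketch t = new InactiveNodeTrackingSketch();
    t.nodeLost("host1", 1000);       // the NM on port 1000 is lost
    t.nodeRegistered("host1", 2000); // a different NM appears on port 2000
    // Hostname tracking has dropped the entry although host1:1000 is still gone.
    System.out.println("by host: " + t.isInactiveByHost("host1"));
    // host:port tracking keeps the entry for the lost NM.
    System.out.println("by node id: " + t.isInactiveByNodeId("host1", 1000));
  }
}
```

Keying by host:port avoids the wrong removal, but with ephemeral ports a returning NM would register under a new port, so its old entry would never be cleared. That is the identification gap described above.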