I think we can address this issue at a broader level.
Metrics is definitely one of the important aspects.
As Karam Singh mentioned, we do have metrics at NodeManager level.
What if we gather the metrics from all NodeManagers at a central point, which is the ResourceManager?
The following points should be considered:
1. Minimal processing and communication overhead on the cluster.
2. Easy addition of more metrics in the future.
3. Configurable, e.g. on demand (the admin should be able to trigger it from the ResourceManager web UI) or periodic refresh.
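To make point 3 a bit more concrete, here is a rough sketch of how on-demand and periodic refresh could share one code path. Everything in it is hypothetical: the class name MetricsRefreshPolicy, the periodSeconds setting and the collectTask callback are placeholders, not existing YARN classes or configuration keys.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a metrics collection task either only on demand
// or additionally at a fixed period.
final class MetricsRefreshPolicy implements AutoCloseable {
    private final Runnable collectTask;
    private final ScheduledExecutorService scheduler; // null when on-demand only

    // periodSeconds <= 0 is taken to mean "on demand only" (an assumption of this sketch).
    MetricsRefreshPolicy(Runnable collectTask, long periodSeconds) {
        this.collectTask = collectTask;
        if (periodSeconds > 0) {
            scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(
                collectTask, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        } else {
            scheduler = null;
        }
    }

    // Entry point for an admin action, e.g. a link on the web UI or a JMX operation.
    void triggerNow() {
        collectTask.run();
    }

    @Override
    public void close() {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
    }
}
```

With a non-positive period nothing runs until the admin calls triggerNow(), so an idle cluster pays no collection overhead, which also helps with point 1.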
I could think of the following two solutions.
Solution 1: We can provide a way to trigger the gathering of metrics from each NodeManager, through a configuration, a link on the ResourceManager UI, or a JMX trigger point.
This involves a service on the ResourceManager side, which can be an RPC service, that accepts metrics updates from all NodeManagers.
When the administrator triggers the gathering of metrics, each NodeManager is told through the heartbeat response to report its metrics to the ResourceManager.
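The flow described above could look roughly like the sketch below. It is only an in-memory model of the idea: HeartbeatResponse, MetricsCollector and NodeAgent are hypothetical names, not the real YARN heartbeat or RPC classes, and the request id used to mark a collection round is my own assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical heartbeat response carrying the id of the latest collection round.
final class HeartbeatResponse {
    final long metricsRequestId;

    HeartbeatResponse(long metricsRequestId) {
        this.metricsRequestId = metricsRequestId;
    }
}

// Hypothetical ResourceManager-side service that accepts metrics updates.
final class MetricsCollector {
    private final AtomicLong requestId = new AtomicLong(0);
    private final Map<String, Map<String, Long>> latest = new ConcurrentHashMap<>();

    // Called when the administrator triggers a collection (UI link, JMX, ...).
    long triggerCollection() {
        return requestId.incrementAndGet();
    }

    // Every heartbeat response advertises the current round id.
    HeartbeatResponse onHeartbeat(String nodeId) {
        return new HeartbeatResponse(requestId.get());
    }

    // Accepts a report only if it answers the current round; stale reports are dropped.
    void reportMetrics(String nodeId, long answeredId, Map<String, Long> metrics) {
        if (answeredId == requestId.get()) {
            latest.put(nodeId, new HashMap<>(metrics));
        }
    }

    Map<String, Map<String, Long>> snapshot() {
        return Collections.unmodifiableMap(new HashMap<>(latest));
    }
}

// Hypothetical NodeManager-side logic: report once per collection round.
final class NodeAgent {
    private final String nodeId;
    private long lastAnsweredId = 0;

    NodeAgent(String nodeId) {
        this.nodeId = nodeId;
    }

    void heartbeat(MetricsCollector rm, Map<String, Long> currentMetrics) {
        HeartbeatResponse response = rm.onHeartbeat(nodeId);
        if (response.metricsRequestId > lastAnsweredId) {
            rm.reportMetrics(nodeId, response.metricsRequestId, currentMetrics);
            lastAnsweredId = response.metricsRequestId;
        }
    }
}
```

Since each NodeManager reports at most once per round and only after a trigger, the normal heartbeat carries just one extra long value.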
Solution 2: ResourceManager, NodeManager and MRAppMaster all support org.apache.hadoop.metrics.MetricsServlet by default, which returns the data in JSON format.
The ResourceManager can have a service which connects to each NodeManager's MetricsServlet and uses the JSON data to prepare the metrics information.
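A minimal sketch of such a pulling service follows. MetricsFetcher, HttpMetricsFetcher and MetricsPoller are hypothetical names. The URL path /metrics?format=json is my assumption about where the servlet is mounted and how JSON output is requested, so please verify it against the actual deployment. Parsing the JSON is left out because the JDK has no built-in JSON parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction so that the transport can be replaced in tests.
interface MetricsFetcher {
    String fetch(String nodeHttpAddress) throws IOException;
}

// Fetches the raw JSON from one node's metrics servlet over HTTP.
final class HttpMetricsFetcher implements MetricsFetcher {
    @Override
    public String fetch(String nodeHttpAddress) throws IOException {
        // Assumed servlet location and query parameter; adjust if they differ.
        URI uri = URI.create("http://" + nodeHttpAddress + "/metrics?format=json");
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setConnectTimeout(2000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}

// Hypothetical ResourceManager-side service that polls every known node.
final class MetricsPoller {
    private final MetricsFetcher fetcher;

    MetricsPoller(MetricsFetcher fetcher) {
        this.fetcher = fetcher;
    }

    // Returns node address -> raw JSON; an unreachable node maps to null.
    Map<String, String> pollAll(List<String> nodeHttpAddresses) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String node : nodeHttpAddresses) {
            try {
                result.put(node, fetcher.fetch(node));
            } catch (IOException e) {
                result.put(node, null);
            }
        }
        return result;
    }
}
```

This version polls the nodes one by one. On a large cluster the polling would probably have to run in parallel with a bounded thread pool, otherwise one slow NodeManager delays the whole round.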
Please add your views.