I would like to make this metrics discussion a bit clearer for my own sanity. The current situation:
A1) ClusterMetrics, prior to YARN-2802, only had NM metrics. AM metrics were added in YARN-2802, partly because storing them on each node isn't useful for debugging. Review feedback from Vinod moved the metric out of the RM (since it really isn't RM related) and into ClusterMetrics.
A2) QueueMetrics (and derived classes) currently has metrics for App counts and MB/VCore/Container statistics.
This JIRA is the first of many. The goal is to start putting metrics in place so that YARN has some form of profiling, at least at a basic level.
B1) If it's put into ClusterMetrics, it is, as Anubhav mentioned, a good global metric/warning system, but it won't necessarily help with debugging at anything finer than the cluster level.
B2) If it's put into QueueMetrics, then there is the additional ability to distinguish queue issues from network/cluster issues with respect to container allocation.
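To make the B1/B2 trade-off concrete, here is a minimal, self-contained sketch. It does not use the real Hadoop ClusterMetrics or QueueMetrics classes; the class name, method names, queue names, and delay values are all hypothetical and chosen only for illustration. It records the same allocation delays once in a single cluster-wide aggregate and once per queue, which shows how a cluster-level mean can hide a problem that is confined to one queue.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: NOT the real Hadoop ClusterMetrics/QueueMetrics API.
public class AllocationDelaySketch {

    /** Running count and total of observed delays, in milliseconds. */
    static final class DelayStat {
        private long count;
        private long totalMs;

        void add(long delayMs) {
            count++;
            totalMs += delayMs;
        }

        double meanMs() {
            return count == 0 ? 0.0 : (double) totalMs / count;
        }
    }

    // B1: one aggregate for the whole cluster.
    private final DelayStat cluster = new DelayStat();
    // B2: one aggregate per queue, keyed by queue name.
    private final Map<String, DelayStat> perQueue = new TreeMap<>();

    /** Records one container allocation delay at both levels. */
    void recordAllocation(String queue, long delayMs) {
        cluster.add(delayMs);
        perQueue.computeIfAbsent(queue, q -> new DelayStat()).add(delayMs);
    }

    double clusterMeanMs() {
        return cluster.meanMs();
    }

    /** Returns 0.0 for a queue that has no recorded allocations. */
    double queueMeanMs(String queue) {
        DelayStat stat = perQueue.get(queue);
        return stat == null ? 0.0 : stat.meanMs();
    }

    public static void main(String[] args) {
        AllocationDelaySketch metrics = new AllocationDelaySketch();
        // Made-up sample values: one fast queue, one slow queue.
        metrics.recordAllocation("root.prod", 100);
        metrics.recordAllocation("root.prod", 300);
        metrics.recordAllocation("root.adhoc", 5000);
        metrics.recordAllocation("root.adhoc", 7000);

        System.out.println("cluster mean ms: " + metrics.clusterMeanMs());
        System.out.println("root.prod mean ms: " + metrics.queueMeanMs("root.prod"));
        System.out.println("root.adhoc mean ms: " + metrics.queueMeanMs("root.adhoc"));
    }
}
```

With these sample values the cluster-wide mean only says that allocation is slow somewhere, while the per-queue means point at the one queue responsible. The cost of the per-queue approach is one extra aggregate per queue.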
My feedback on the discussion so far:
C1) I do believe container allocation has a chance of being queue-dependent. Whether it's only useful for FairScheduler, as opposed to other schedulers, is debatable (which is why it was originally in FSQueueMetrics).
C2) QueueMetrics has the advantage of letting a customer take a metrics snapshot and use it to debug application delays (at least for this first metric). My goal for the near future is to keep adding to this area in order to get a clear snapshot of any RM-related application runtime metrics for each queue.
PS: I appreciate all the great feedback so far. It's definitely giving me places to look at the code and get a better overall understanding. Thanks.