Details
Bug
Status: Patch Available
Major
Resolution: Unresolved
2.8.0, 3.3.0
None
None
Description
After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still reports incorrect allocated JMX values, such as allocatedMB, allocatedVCores and allocatedContainers, when a node's partition is updated from "DEFAULT" to another label while applications are running on it.
Steps to reproduce
==============
- Configure capacity-scheduler.xml with node label configuration
- Submit one application to the default partition and let it run
- While the application is running, add label "tpcds" to the cluster and replace the labels on node1 and node2 with "tpcds"
- Note down "VCores Used" in the Web UI
- When the application finishes, the metrics are wrong (screenshots attached).
==============
FiCaSchedulerApp does not update queue metrics when CapacityScheduler handles the NODE_LABELS_UPDATE event.
So we should release the container's resources from the old partition and add them as used resources to the new partition in the queue metrics, just as is done when updating queueUsage.
public void nodePartitionUpdated(RMContainer rmContainer, String oldPartition,
    String newPartition) {
  Resource containerResource = rmContainer.getAllocatedResource();
  this.attemptResourceUsage.decUsed(oldPartition, containerResource);
  this.attemptResourceUsage.incUsed(newPartition, containerResource);
  getCSLeafQueue().decUsedResource(oldPartition, containerResource, this);
  getCSLeafQueue().incUsedResource(newPartition, containerResource, this);

  // Update new partition name if container is AM and also update AM resource
  if (rmContainer.isAMContainer()) {
    setAppAMNodePartitionName(newPartition);
    this.attemptResourceUsage.decAMUsed(oldPartition, containerResource);
    this.attemptResourceUsage.incAMUsed(newPartition, containerResource);
    getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this);
    getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this);
  }
}
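To illustrate the proposed behaviour, here is a minimal, self-contained sketch of moving allocated counters from the old partition to the new one when a node's label changes. It is not the attached patch and does not use the real QueueMetrics API: the class PartitionMetricsSketch, its methods, and the plain string partition keys are all hypothetical stand-ins chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-partition allocated metrics (hypothetical, not Hadoop's QueueMetrics). */
class PartitionMetricsSketch {

  /** Allocated counters tracked for a single partition. */
  private static final class Allocated {
    long memoryMB;
    int vCores;
    int containers;
  }

  private final Map<String, Allocated> byPartition = new HashMap<>();

  private Allocated of(String partition) {
    return byPartition.computeIfAbsent(partition, p -> new Allocated());
  }

  /** Records one container allocated on the given partition. */
  void allocate(String partition, long memoryMB, int vCores) {
    Allocated a = of(partition);
    a.memoryMB += memoryMB;
    a.vCores += vCores;
    a.containers += 1;
  }

  /** Records one container released from the given partition. */
  void release(String partition, long memoryMB, int vCores) {
    Allocated a = of(partition);
    a.memoryMB -= memoryMB;
    a.vCores -= vCores;
    a.containers -= 1;
  }

  /**
   * Moves a running container's allocation between partitions, mirroring
   * the dec/inc pattern used for resource usage in nodePartitionUpdated.
   */
  void nodePartitionUpdated(String oldPartition, String newPartition,
      long memoryMB, int vCores) {
    release(oldPartition, memoryMB, vCores);
    allocate(newPartition, memoryMB, vCores);
  }

  long allocatedMB(String partition) { return of(partition).memoryMB; }
  int allocatedVCores(String partition) { return of(partition).vCores; }
  int allocatedContainers(String partition) { return of(partition).containers; }

  public static void main(String[] args) {
    PartitionMetricsSketch metrics = new PartitionMetricsSketch();
    // A container starts on the default partition.
    metrics.allocate("DEFAULT", 2048, 2);
    // The node is relabelled while the container is still running.
    metrics.nodePartitionUpdated("DEFAULT", "tpcds", 2048, 2);
    // The container later finishes on the new partition.
    metrics.release("tpcds", 2048, 2);
    System.out.println("DEFAULT allocatedMB = " + metrics.allocatedMB("DEFAULT"));
    System.out.println("tpcds allocatedMB = " + metrics.allocatedMB("tpcds"));
  }
}
```

Without the move step, the final release would be charged to "tpcds" while the original allocation stayed on "DEFAULT", leaving one partition with a positive residue and the other negative. That is the kind of mismatch described in the steps to reproduce.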