Hadoop YARN / YARN-10517

QueueMetrics has incorrect Allocated Resource when labelled partitions updated

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.0, 3.3.0
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

    Description

      After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still reports incorrect allocated values over JMX, such as allocatedMB, allocatedVCores and allocatedContainers, when a node's partition is updated from "DEFAULT" to another label while applications are running.

      Steps to reproduce

      ==============

      1. Configure capacity-scheduler.xml with node label configuration
      2. Submit one application to the default partition and let it run
      3. Add the label "tpcds" to the cluster and replace the label on node1 and node2 with "tpcds" while the above application is running
      4. Note the "VCores Used" value in the Web UI
      5. When the application finishes, the metrics are wrong (screenshot attached).

      ==============

       

      FiCaSchedulerApp does not update queue metrics when CapacityScheduler handles the NODE_LABELS_UPDATE event.

      So we should release the container resource from the old partition and add the used resource to the new partition in the queue metrics, just as is already done when updating queueUsage.

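      // Current FiCaSchedulerApp#nodePartitionUpdated: moves resource usage between partitions, but does not touch queue metrics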
      public void nodePartitionUpdated(RMContainer rmContainer, String oldPartition,
          String newPartition) {
        Resource containerResource = rmContainer.getAllocatedResource();
        this.attemptResourceUsage.decUsed(oldPartition, containerResource);
        this.attemptResourceUsage.incUsed(newPartition, containerResource);
        getCSLeafQueue().decUsedResource(oldPartition, containerResource, this);
        getCSLeafQueue().incUsedResource(newPartition, containerResource, this);
      
        // Update new partition name if container is AM and also update AM resource
        if (rmContainer.isAMContainer()) {
          setAppAMNodePartitionName(newPartition);
          this.attemptResourceUsage.decAMUsed(oldPartition, containerResource);
          this.attemptResourceUsage.incAMUsed(newPartition, containerResource);
          getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this);
          getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this);
        }
      }
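
The proposed change can be illustrated with a small self-contained toy model. This is only a sketch under stated assumptions, not the actual patch: `ToyQueueMetrics` and `ToyContainer` are hypothetical stand-ins rather than real YARN classes, and the model assumes a metrics object that counts vcores only for the default partition (represented here by an empty string). It walks through the reproduction steps above, once without and once with the metrics being moved when the node is relabelled.

```java
import java.util.Objects;

/** Toy model only: none of these types are the real YARN classes. */
public class PartitionMetricsSketch {
    /** Assumption: the default partition is represented by an empty label. */
    static final String DEFAULT_PARTITION = "";

    /** Hypothetical stand-in for queue metrics that count the default partition only. */
    static class ToyQueueMetrics {
        int allocatedVCores;

        void allocate(String partition, int vcores) {
            if (DEFAULT_PARTITION.equals(partition)) {
                allocatedVCores += vcores;
            }
        }

        void release(String partition, int vcores) {
            if (DEFAULT_PARTITION.equals(partition)) {
                allocatedVCores -= vcores;
            }
        }
    }

    /** Hypothetical stand-in for a running container on some node partition. */
    static class ToyContainer {
        final int vcores;
        String partition;

        ToyContainer(int vcores, String partition) {
            this.vcores = vcores;
            this.partition = Objects.requireNonNull(partition);
        }
    }

    /** Runs the scenario and returns allocatedVCores after the application finishes. */
    static int run(boolean moveMetricsOnRelabel) {
        ToyQueueMetrics metrics = new ToyQueueMetrics();

        // Step 2: the application gets a container on the default partition.
        ToyContainer container = new ToyContainer(4, DEFAULT_PARTITION);
        metrics.allocate(container.partition, container.vcores);

        // Step 3: the node is relabelled to "tpcds" while the container is running.
        String oldPartition = container.partition;
        container.partition = "tpcds";
        if (moveMetricsOnRelabel) {
            // Proposed behaviour: release from the old partition, add to the new one.
            metrics.release(oldPartition, container.vcores);
            metrics.allocate(container.partition, container.vcores);
        }

        // Step 5: the application finishes; the release uses the current partition.
        metrics.release(container.partition, container.vcores);
        return metrics.allocatedVCores;
    }

    public static void main(String[] args) {
        System.out.println("without metrics update: " + run(false));
        System.out.println("with metrics update:    " + run(true));
    }
}
```

In this model, skipping the metrics update leaves the 4 vcores counted against the default partition after the application has finished, because the final release is attributed to "tpcds". Moving the metrics at relabel time brings the counter back to 0.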
      

      Attachments

        1. YARN-10517.001.patch
          5 kB
          Qi Zhu
        2. YARN-10517-branch-3.2.001.patch
          1 kB
          sibyl.lv
        3. wrong metrics.png
          1.18 MB
          sibyl.lv

          People

            Assignee: Qi Zhu (zhuqi)
            Reporter: sibyl.lv
            Votes: 1
            Watchers: 10

            Dates

              Created:
              Updated: