Hadoop YARN / YARN-11490

JMX QueueMetrics breaks after mutable config validation in CS

Details

    • Reviewed

    Description

      Reproduction steps:

      1. Submit a long running job

      $ hadoop-3.4.0-SNAPSHOT/bin/yarn jar hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar sleep -m 1 -r 1 -rt 1200000 -mt 20
      

      2. Verify that there is one running app

      $ curl http://localhost:8088/ws/v1/cluster/metrics | jq
      

      3. Verify that the JMX endpoint reports 1 running app as well

      $ curl http://localhost:8088/jmx | jq
      

      4. Validate the configuration twice

      $ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json localhost:8088/ws/v1/cluster/scheduler-conf/validate
      
      $ cat defaultqueue.json
      {"update-queue":{"queue-name":"root.default","params":{"entry":{"key":"maximum-applications","value":"100"}}},"subClusterId":"","global":null,"global-updates":null}
      

      5. Repeat steps 2 and 3. The cluster metrics still report one running app, but the JMX endpoint now shows 0 running apps. That is the bug.

      The regression was introduced by YARN-11211. Reverting that patch (or removing only the QueueMetrics.clearQueueMetrics(); line) fixes the issue, but I think that would reintroduce the memory leak.
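One plausible way that clearing a static cache can produce the symptom from step 5 is sketched below. This is a simplified model, not the actual Hadoop code: the class `StaleMetricsSketch`, the `PUBLISHED` map, and every method name are hypothetical, and the assumption that the monitoring endpoint publishes the most recently registered object per queue name is mine, not something the report states. The sketch also does not try to explain why two validation calls are needed.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of a static metrics cache; NOT the real Hadoop classes. */
public class StaleMetricsSketch {

    /** Stand-in for a per-queue metrics object. */
    static final class Metrics {
        int appsRunning;
    }

    // Analogous to the static QUEUE_METRICS map named in the report.
    static final Map<String, Metrics> CACHE = new HashMap<>();

    // Assumption: the monitoring endpoint publishes the most recently
    // registered metrics object for each queue name.
    static final Map<String, Metrics> PUBLISHED = new HashMap<>();

    /** Returns the cached metrics for a queue, creating and publishing them if absent. */
    static Metrics forQueue(String queueName) {
        Metrics cached = CACHE.get(queueName);
        if (cached == null) {
            cached = new Metrics();
            CACHE.put(queueName, cached);
            PUBLISHED.put(queueName, cached);
        }
        return cached;
    }

    /** Hypothetical counterpart of a "clear the whole cache" call. */
    static void clearCache() {
        CACHE.clear();
    }

    /** Returns {published before clear, live object after clear, published after clear}. */
    static int[] reproduce() {
        CACHE.clear();
        PUBLISHED.clear();

        // The running scheduler obtains its metrics object and counts one app.
        Metrics live = forQueue("root.default");
        live.appsRunning++;
        int publishedBefore = PUBLISHED.get("root.default").appsRunning;

        // A validation pass clears the cache and builds temporary queues,
        // which registers a fresh metrics object under the same name.
        clearCache();
        forQueue("root.default");
        int publishedAfter = PUBLISHED.get("root.default").appsRunning;

        // The live scheduler still updates its old object, but the
        // published object is now the fresh, empty one.
        return new int[] {publishedBefore, live.appsRunning, publishedAfter};
    }

    public static void main(String[] args) {
        int[] r = reproduce();
        System.out.println("published before clear: " + r[0]);
        System.out.println("live object after clear: " + r[1]);
        System.out.println("published after clear:  " + r[2]);
    }
}
```

In this model the live object keeps its count while the published value drops to zero, which matches the observation that the REST cluster metrics stay correct while the JMX value does not.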

      It looks like the QUEUE_METRICS hash map is "add-only": prior to YARN-11211, clearQueueMetrics() was only called from the ResourceManager.reinitialize() method (transitionToActive/transitionToStandby). Constantly adding and removing queues with unique names would cause a leak as well, because nothing ever removes entries from QUEUE_METRICS, so the problem is not limited to the validation API.
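The add-only behavior described above can be illustrated with a small stand-alone sketch. It is not taken from Hadoop: `QueueMetricsCacheSketch`, `remove`, and `sizeAfterChurn` are hypothetical names, and removing the entry when a queue is deleted is only one possible remedy, not necessarily the fix that was adopted for this issue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of a per-queue metrics cache; NOT the real QueueMetrics class. */
public class QueueMetricsCacheSketch {

    // Stand-in for the static QUEUE_METRICS map (an instance field here,
    // so each experiment starts from an empty cache).
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    /** Returns the cached entry for a queue, creating it on first use. */
    Object forQueue(String queueName) {
        return cache.computeIfAbsent(queueName, name -> new Object());
    }

    /** Hypothetical per-queue removal, which the report says is missing. */
    boolean remove(String queueName) {
        return cache.remove(queueName) != null;
    }

    int size() {
        return cache.size();
    }

    /**
     * Adds and then deletes `rounds` uniquely named queues and returns how
     * many entries remain in the cache afterwards.
     */
    static int sizeAfterChurn(int rounds, boolean removeOnDelete) {
        QueueMetricsCacheSketch metrics = new QueueMetricsCacheSketch();
        for (int i = 0; i < rounds; i++) {
            String queueName = "root.tmp" + i;
            metrics.forQueue(queueName);   // queue is added
            if (removeOnDelete) {
                metrics.remove(queueName); // queue is removed again
            }
        }
        return metrics.size();
    }

    public static void main(String[] args) {
        System.out.println("add-only cache:   " + sizeAfterChurn(1000, false));
        System.out.println("remove on delete: " + sizeAfterChurn(1000, true));
    }
}
```

Without a per-queue removal the cache grows by one entry for every unique queue name, regardless of whether the queue still exists. Removing a single entry also avoids touching the entries of queues that are still live, unlike clearing the whole map.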

      Attachments

        1. hadoop-tdomok-resourcemanager-tdomok-MBP16.log
          171 kB
          Tamas Domok
        2. addqueue.xml
          0.6 kB
          Tamas Domok
        3. stopqueue.json
          0.1 kB
          Tamas Domok
        4. removequeue.xml
          0.4 kB
          Tamas Domok
        5. defaultqueue.json
          0.2 kB
          Tamas Domok

            People

              tdomok Tamas Domok