Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 3.4.0
- Flags: Reviewed
Description
Reproduction steps:
1. Submit a long running job
hadoop-3.4.0-SNAPSHOT/bin/yarn jar hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar sleep -m 1 -r 1 -rt 1200000 -mt 20
2. Verify that there is one running app
$ curl http://localhost:8088/ws/v1/cluster/metrics | jq
3. Verify that the JMX endpoint reports 1 running app as well
$ curl http://localhost:8088/jmx | jq
4. Validate the configuration (run this twice)
$ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json localhost:8088/ws/v1/cluster/scheduler-conf/validate
$ cat defaultqueue.json
{
  "update-queue": {
    "queue-name": "root.default",
    "params": {
      "entry": {
        "key": "maximum-applications",
        "value": "100"
      }
    }
  },
  "subClusterId": "",
  "global": null,
  "global-updates": null
}
5. Repeat steps 2 and 3. The cluster metrics endpoint still reports one running app, but the JMX endpoint now shows 0 running apps; this is the bug.
The regression was introduced by YARN-11211; reverting that patch (or just removing the QueueMetrics.clearQueueMetrics(); call) fixes the issue, but that would likely re-introduce the memory leak.
The QUEUE_METRICS hash map appears to be "add-only": prior to YARN-11211, clearQueueMetrics() was only called from the ResourceManager.reinitialize() method (transitionToActive/transitionToStandby). Constantly adding and removing queues with unique names would also leak, because nothing ever removes individual entries from QUEUE_METRICS, so the validation API is not the only code path with this problem.
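The "add-only" growth pattern described above can be illustrated with a minimal sketch. This is not the real QueueMetrics code; the class, map, and forQueue method below are simplified stand-ins that only demonstrate how a static get-or-create cache with no per-entry removal accumulates entries when callers keep registering uniquely named queues:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AddOnlyCacheSketch {
  // Stand-in for the static QUEUE_METRICS map: entries are created on
  // demand and never removed individually.
  static final Map<String, Object> QUEUE_METRICS = new ConcurrentHashMap<>();

  // Stand-in for the get-or-create lookup; note there is no matching
  // removal path anywhere in the class.
  static Object forQueue(String queueName) {
    return QUEUE_METRICS.computeIfAbsent(queueName, k -> new Object());
  }

  public static void main(String[] args) {
    // Simulate repeated calls that each touch a uniquely named queue
    // (e.g. transient queues created and deleted over time): the map
    // only ever grows, which is the leak shape described above.
    for (int i = 0; i < 10_000; i++) {
      forQueue("root.transient-" + i);
    }
    System.out.println(QUEUE_METRICS.size()); // prints 10000
  }
}
```

A full clear() (as YARN-11211 added for the validation path) bounds the growth but, as this issue shows, also wipes metrics for live queues; a targeted per-entry removal when a queue is deleted would avoid both problems.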
Attachments
Issue Links
- is caused by: YARN-11211 QueueMetrics leaks Configuration objects when validation API is called multiple times (Resolved)
- links to