[YARN-11114] RMWebServices returns only apps matching exactly the submitted queue name - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: capacity scheduler, webapp
Labels: pull-request-available
Target Version/s: 3.4.0
Hadoop Flags: Reviewed

Description

I've added 2 testcases that demonstrate the issue with this commit.

1. With 'testAppsQueryByQueueShortname', there's a finishedApp submitted to "root.default" and there's a runningApp that is submitted to "default".
The testcase queries the apps by queue name "default" and the response only contains the runningApp, which was submitted to "default". The other app, submitted to "root.default", is not returned.

2. With 'testAppsQueryByQueueFullname', there's a finishedApp submitted to "root.default" and there's a runningApp that is submitted to "default" (same setup as above).
The testcase queries the apps by queue name "root.default" (which is the full queue path) and the response only contains the finishedApp, which was submitted to "root.default". The other app, submitted to "default", is not returned.

A trivial conclusion is that the response only includes applications whose stored queue name exactly matches the queried name. The stored queue name is the one the application was submitted to, either specified explicitly at submission or resolved by the placement engine.
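
To make the observed behaviour concrete, here is a minimal sketch of an exact-match queue filter. It is not the actual ClientRMService code: `SimpleApp`, `ExactQueueMatchDemo` and `filterByQueue` are hypothetical names used only for illustration, and the apps are reduced to an ID plus the stored queue name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for an RM application (not the real RMApp interface).
class SimpleApp {
  final String id;
  final String submittedQueue; // queue name exactly as stored on the app

  SimpleApp(String id, String submittedQueue) {
    this.id = id;
    this.submittedQueue = submittedQueue;
  }
}

class ExactQueueMatchDemo {
  // Mimics the observed behaviour: an app is returned only if its stored
  // queue name equals the filter string exactly.
  static List<String> filterByQueue(List<SimpleApp> apps, String queueFilter) {
    List<String> ids = new ArrayList<>();
    for (SimpleApp app : apps) {
      if (app.submittedQueue.equals(queueFilter)) {
        ids.add(app.id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    // Same setup as the two testcases described above.
    List<SimpleApp> apps = Arrays.asList(
        new SimpleApp("finishedApp", "root.default"),
        new SimpleApp("runningApp", "default"));
    System.out.println(filterByQueue(apps, "default"));      // [runningApp]
    System.out.println(filterByQueue(apps, "root.default")); // [finishedApp]
  }
}
```

Both names refer to the same queue, yet each query returns only one of the two apps.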

Before ~~YARN-9879~~ was implemented, Capacity Scheduler could only define a leaf queue with a specific name once in the whole hierarchy, meaning that leaf queue names were unique.
For example root.a.testQueue and root.b.testQueue couldn't coexist, as the leaf queue name is the same.

At this point, I suspected that ~~YARN-9879~~ was causing this issue. However, before ~~YARN-9879~~ was merged, CS didn't allow two leaf queues with the same name, so a query for either "root.default" or "default" could easily work: it was guaranteed that there was exactly one "default" leaf queue in the hierarchy. I dug a bit further.

I also noticed that ~~YARN-8659~~ (commit link) could have introduced this issue a long time ago, as it removed the iterator logic that queried the applications with method YarnScheduler#getAppsInQueue (see this).

Let's follow the implementation of YarnScheduler#getAppsInQueue for CS:
1. First of all, here is the method definition.
CapacityScheduler#getQueue is called from here.

2. CapacityScheduler#getQueue is then calling QueueManager#getQueue.

3. QueueManager#getQueue is then calling CSQueueStore#get.

4. CSQueueStore#get calls the getOrDefault method of the 'getMap' field here.

4.1 CSQueueStore#getMap (field) stores the Queue objects mapped to their short and full names (e.g. 'default' and 'root.default').
CSQueueStore#add is the method that is responsible for adding the CSQueue objects.

4.2 The first getMap.put call is invoked here with the full queue name.

4.3 The second getMap.put call is invoked via CSQueueStore#updateGetMapForShortName here.
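
The lookup chain above can be sketched as a single map that is keyed by both the full queue path and the short name. This is a simplified, hypothetical model and not the real CSQueueStore: `SketchQueueStore` and `shortNameToFullNames` are made-up names, queues are represented by their full path strings, and the handling of short names shared by several leaf queues (dropping the short-name entry) is an assumption for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of a queue store; the real CSQueueStore differs in detail.
class SketchQueueStore {
  // Lookup map keyed by full path and, where unambiguous, by short name.
  private final Map<String, String> getMap = new HashMap<>();
  // Tracks which full paths share a given short name.
  private final Map<String, Set<String>> shortNameToFullNames = new HashMap<>();

  void add(String fullName) {
    String shortName = fullName.substring(fullName.lastIndexOf('.') + 1);
    // First put: the full queue name.
    getMap.put(fullName, fullName);
    shortNameToFullNames.computeIfAbsent(shortName, k -> new HashSet<>()).add(fullName);
    // Second put: the short queue name.
    updateGetMapForShortName(shortName);
  }

  private void updateGetMapForShortName(String shortName) {
    Set<String> fullNames = shortNameToFullNames.get(shortName);
    if (fullNames.size() == 1) {
      getMap.put(shortName, fullNames.iterator().next());
    } else if (!fullNames.contains(shortName)) {
      // Assumption: an ambiguous short name is not resolvable, so drop it
      // (unless the same string is also a full path, e.g. "root").
      getMap.remove(shortName);
    }
  }

  // Returns the full path of the queue, or null if the name is unknown or ambiguous.
  String get(String name) {
    return getMap.getOrDefault(name, null);
  }

  public static void main(String[] args) {
    SketchQueueStore store = new SketchQueueStore();
    store.add("root");
    store.add("root.default");
    System.out.println(store.get("default"));      // root.default
    System.out.println(store.get("root.default")); // root.default
  }
}
```

In this model, "default" and "root.default" resolve to the same queue, which is why a scheduler-side lookup handles both name forms.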

In conclusion, the app filtering by queues in ClientRMService#getApplications seems wrong to me.
The block that filters by queues is here.

This should be enhanced by querying the apps from YarnScheduler#getAppsInQueue, as it ultimately handles both the short and the full queue names for CS.
It's crucial not to simply fall back to the logic that was replaced by ~~YARN-8659~~ (commit link), because the original problem there was that rmContext.getRMApps() returns both running and finished apps, while scheduler.getAppsInQueue only returns running apps.
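
One possible way to combine the two sources is sketched below. This is not necessarily what the attached patch does: `SketchScheduler`, `QueueFilterSketch` and `filter` are hypothetical names, and the scheduler call is simplified to return plain app IDs (the real YarnScheduler#getAppsInQueue returns application attempt IDs). An app passes the filter if its stored queue name matches exactly, or if the scheduler reports it as running in one of the queried queues.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the scheduler, simplified to plain app IDs.
interface SketchScheduler {
  // Returns the running apps in the queue, or null if the queue is unknown.
  List<String> getAppsInQueue(String queueName);
}

class QueueFilterSketch {
  // storedQueueByApp maps app ID -> queue name stored on the app
  // (covers running and finished apps alike).
  static List<String> filter(Map<String, String> storedQueueByApp,
      SketchScheduler scheduler, Set<String> queueFilters) {
    // Running apps, resolved by the scheduler for short and full names.
    Set<String> runningMatches = new HashSet<>();
    for (String queue : queueFilters) {
      List<String> inQueue = scheduler.getAppsInQueue(queue);
      if (inQueue != null) {
        runningMatches.addAll(inQueue);
      }
    }
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, String> e : storedQueueByApp.entrySet()) {
      boolean storedNameMatches = queueFilters.contains(e.getValue());
      if (storedNameMatches || runningMatches.contains(e.getKey())) {
        result.add(e.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> apps = new LinkedHashMap<>();
    apps.put("finishedApp", "root.default");
    apps.put("runningApp", "default");
    // Assumed scheduler behaviour: both name forms resolve to the same queue.
    SketchScheduler scheduler = q ->
        ("default".equals(q) || "root.default".equals(q))
            ? Arrays.asList("runningApp") : null;
    System.out.println(filter(apps, scheduler, Collections.singleton("default")));
    // [runningApp]
    System.out.println(filter(apps, scheduler, Collections.singleton("root.default")));
    // [finishedApp, runningApp]
  }
}
```

With this approach the running app is found under both name forms, while the finished app is still matched only by the queue name stored on it.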

NOTES

NOTE #1:
As there's no way to get both the short queue name and the full queue name from RMApp / RMAppImpl, it's currently not possible to compare the queue filter of the RM client request with both types of queue names of the application.

NOTE #2:
scheduler.getAppsInQueue(queue) will only return running apps, so for running apps it's possible to retrieve the apps by queue name, and it will work with both short and full names. However, for non-running apps, only the queue name stored at submission would work for filtering.

NOTE #3 (plan for implementation):
It would be completely reasonable to consider both running and non-running apps while querying by either name; however, I think it never worked that way.
Before ~~YARN-8659~~, only running apps were considered. Before ~~YARN-9879~~, both running and non-running apps were considered, but only the stored queue name (in RMAppImpl) was compared to the app filter's queue name, and that stored name was either the short or the full queue name.
All in all, I don't want to change this behavior, and I also think it would make the code more convoluted if RMAppImpl stored both the short and the full queue names.

Issue Links

Dependency

YARN-11123 ResourceManager webapps test failures due to org.apache.hadoop.metrics2.MetricsException and subsequent java.net.BindException: Address already in use

Resolved

links to

GitHub Pull Request #4235

People

Assignee: Szilard Nemeth

Reporter: Szilard Nemeth

Votes: 0

Watchers: 1

Dates

Created: 20/Apr/22 17:47

Updated: 28/Jan/24 05:30

Resolved: 11/May/22 16:06

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 50m