[YARN-9050] [Umbrella] Usability improvements for scheduler activities - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.3.0
Component/s: capacityscheduler
Labels: None

Hadoop Flags: Reviewed

Description

We have made some usability improvements to scheduler activities, based on YARN 3.1, in our cluster. The problems we found and the corresponding improvements are as follows:
1. Scheduler activities are not available for multi-thread asynchronous scheduling. App and node activities may be mixed up when multiple scheduling threads record activities of different allocation processes in the same variables, such as appsAllocation and recordingNodesAllocation in ActivitiesManager. I think these variables should be thread-local so that activities stay separate across threads.
2. Activities are incomplete for the multi-node lookup mechanism, because ActivitiesLogger skips recording through {{if (node == null || activitiesManager == null)}} when node is null, which indicates that the allocation is for multiple nodes. We need to support recording activities for the multi-node lookup mechanism.
3. Current app activities cannot meet the requirements of diagnostics. For example, we can see that a node doesn't match a request, but it is hard to know why. This is especially true when using placement constraints, where a detailed diagnosis is difficult to make manually. So I propose improving the diagnostics of activities: add diagnostics for the placement constraints check, update the insufficient-resource diagnostic with detailed info (like 'insufficient resource names:[memory-mb]'), and so on.
4. Add more useful fields to app activities. In some scenarios we need to distinguish different requests but can't locate them based on the app activities info. Other fields, such as allocation tags, can help to filter what we want. We have added the containerPriority, allocationRequestId and allocationTags fields to AppAllocation.
5. Filter app activities by key fields. Sometimes the results of app activities are massive and it is hard to find what we want. We have added support for filtering by allocation-tags to meet requirements from some apps. Moreover, we can take container-priority and allocation-request-id as candidates if necessary.
6. Aggregate app activities by diagnostics. Even for a single allocation process, activities can still be massive in a large cluster. We frequently want to know why a request can't be allocated in the cluster, and checking every node manually is impractical at that scale, so aggregating app activities by diagnostics is necessary. We have added a groupingType parameter to the app-activities REST API for this, which supports grouping by diagnostics.
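
To illustrate point 1, here is a minimal sketch of the thread-local idea. It is not the actual ActivitiesManager code: the class, the method names and the "SKIPPED"/"ALLOCATED" labels are hypothetical, and a plain list of strings stands in for variables such as appsAllocation. Each scheduling thread records into its own buffer, so records from concurrent allocation processes cannot interleave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only: hypothetical names, not the real ActivitiesManager. */
final class ThreadLocalActivitiesSketch {
    // One buffer per scheduling thread instead of one shared variable.
    private static final ThreadLocal<List<String>> CURRENT =
        ThreadLocal.withInitial(ArrayList::new);

    /** Records one activity into the calling thread's own buffer. */
    static void record(String nodeId, String state) {
        CURRENT.get().add(nodeId + ":" + state);
    }

    /** Returns the calling thread's records and clears its buffer. */
    static List<String> finish() {
        List<String> done = new ArrayList<>(CURRENT.get());
        CURRENT.remove();
        return done;
    }

    /** Simulates one allocation process per node id, each on its own thread. */
    static Map<String, List<String>> runConcurrently(String... nodeIds) {
        Map<String, List<String>> results = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (String nodeId : nodeIds) {
            Thread t = new Thread(() -> {
                for (int i = 0; i < 100; i++) {
                    record(nodeId, "SKIPPED");
                }
                record(nodeId, "ALLOCATED");
                results.put(nodeId, finish());
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Each list contains only the records of its own allocation process.
        runConcurrently("node-1", "node-2")
            .forEach((node, records) -> System.out.println(node + " -> " + records.size()));
    }
}
```

With a single shared list, the two threads above could interleave their records; with the thread-local buffer, each result list holds only the entries of its own allocation process.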

I think we can discuss these points; the improvements that are considered useful and accepted will be added to the patch. Thanks.

The running design doc is attached here.
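
As a rough illustration of point 6, aggregation by diagnostics can be sketched as below. This is an assumption-laden sketch, not the REST API implementation: the NodeActivity class, its fields, the "SKIPPED" state label and the placement-constraint message are hypothetical, and only the 'insufficient resource names:[memory-mb]' text comes from the description above. Node activities that share the same state and diagnostic collapse into one group that lists the affected nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: hypothetical types, not the real app-activities response. */
final class GroupByDiagnosticsSketch {
    /** Hypothetical per-node activity of a single allocation process. */
    static final class NodeActivity {
        final String nodeId;
        final String state;
        final String diagnostic;

        NodeActivity(String nodeId, String state, String diagnostic) {
            this.nodeId = nodeId;
            this.state = state;
            this.diagnostic = diagnostic;
        }
    }

    /** Groups node ids by "state | diagnostic", keeping first-seen order. */
    static Map<String, List<String>> groupByDiagnostic(List<NodeActivity> activities) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (NodeActivity a : activities) {
            String key = a.state + " | " + a.diagnostic;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(a.nodeId);
        }
        return groups;
    }

    /** Illustrative input: three nodes, two distinct diagnostics. */
    static List<NodeActivity> sample() {
        return Arrays.asList(
            new NodeActivity("node-1", "SKIPPED", "insufficient resource names:[memory-mb]"),
            new NodeActivity("node-2", "SKIPPED", "node does not match placement constraint"),
            new NodeActivity("node-3", "SKIPPED", "insufficient resource names:[memory-mb]"));
    }

    public static void main(String[] args) {
        groupByDiagnostic(sample())
            .forEach((key, nodes) -> System.out.println(key + " -> " + nodes));
    }
}
```

In a large cluster this turns one entry per node into one entry per distinct reason, which is usually the question being asked when a request stays pending.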

Attachments

image-2018-11-23-16-46-38-138.png
23/Nov/18 08:46
192 kB
Tao Yang

Sub-Tasks

1.	Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities	Resolved	Tao Yang
2.	Support asynchronized scheduling mode and multi-node lookup mechanism for app activities	Resolved	Tao Yang
3.	Improve diagnostics for scheduler and app activities	Resolved	Tao Yang
4.	Support filtering by request-priorities and allocation-request-ids for query results of app activities	Resolved	Tao Yang
5.	Support grouping by diagnostics for query results of scheduler and app activities	Resolved	Tao Yang
6.	Document scheduler/app activities and REST APIs	Resolved	Tao Yang
7.	Improve cleanup process of app activities and make some conditions configurable	Resolved	Tao Yang
8.	Add diagnostics for outstanding resource requests on app attempts page	Resolved	Tao Yang
9.	Add limit/actions/summarize options for app activities REST API	Resolved	Tao Yang
10.	Correct incompatible, incomplete and redundant activities	Resolved	Tao Yang
11.	Auto adjust max queue length of app activities to make sure activities on all nodes can be covered	Resolved	Tao Yang
12.	Improve response of scheduler/app activities for better understanding	Resolved	Tao Yang

People

Assignee: Tao Yang

Reporter: Tao Yang

Votes: 0

Watchers: 12

Dates

Created: 23/Nov/18 09:33

Updated: 14/Mar/20 05:41

Resolved: 13/Mar/20 21:40