[YARN-3480] Recovery may get very slow with lots of services with lots of app-attempts - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.6.0
Fix Version/s: 2.9.0, 3.0.0-alpha1
Component/s: resourcemanager
Labels: None
Target Version/s: 2.9.0
Hadoop Flags: Reviewed

Description

When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts' to a larger value. However, this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes the RM recovery process much slower. It might be better to limit the maximum number of attempts stored in RMStateStore.

BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small value, the number of retried attempts might be very large. So we need to delete some of the attempts stored in RMStateStore.
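
The idea of capping the number of stored attempts can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: `BoundedAttemptStore`, `maxStoredAttempts`, `storeAttempt` and `attemptsToRecover` are hypothetical names, and an in-memory deque stands in for the real RMStateStore backends. The oldest attempts are evicted first, so recovery only has to load a bounded number of attempts per app.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: cap how many app attempts a state store keeps. */
public class AttemptStorePruningSketch {

    /** In-memory stand-in for a persistent store; keeps attempt IDs oldest-first. */
    static final class BoundedAttemptStore {
        // Assumed cap on stored attempts; not a real YARN configuration key.
        private final int maxStoredAttempts;
        private final Deque<Integer> storedAttemptIds = new ArrayDeque<>();

        BoundedAttemptStore(int maxStoredAttempts) {
            if (maxStoredAttempts < 1) {
                throw new IllegalArgumentException("maxStoredAttempts must be >= 1");
            }
            this.maxStoredAttempts = maxStoredAttempts;
        }

        /** Stores a new attempt and returns the IDs of attempts evicted to honor the cap. */
        List<Integer> storeAttempt(int attemptId) {
            storedAttemptIds.addLast(attemptId);
            List<Integer> removed = new ArrayList<>();
            while (storedAttemptIds.size() > maxStoredAttempts) {
                // Evict the oldest attempt first.
                removed.add(storedAttemptIds.removeFirst());
            }
            return removed;
        }

        /** Attempts that a recovery pass would have to load, oldest first. */
        List<Integer> attemptsToRecover() {
            return new ArrayList<>(storedAttemptIds);
        }
    }

    public static void main(String[] args) {
        BoundedAttemptStore store = new BoundedAttemptStore(3);
        for (int id = 1; id <= 10; id++) {
            store.storeAttempt(id);
        }
        // Only the three most recent attempts remain to be recovered.
        System.out.println(store.attemptsToRecover()); // prints [8, 9, 10]
    }
}
```

With a cap like this, the recovery cost per app depends on the cap rather than on how many attempts the app has ever had, which is the effect the description asks for.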

Attachments

| File | Uploaded | Size | Author |
| --- | --- | --- | --- |
| YARN-3480.01.patch | 23/Apr/15 15:58 | 37 kB | Jun Gong |
| YARN-3480.02.patch | 24/Apr/15 14:23 | 37 kB | Jun Gong |
| YARN-3480.03.patch | 02/May/15 16:11 | 43 kB | Jun Gong |
| YARN-3480.04.patch | 08/May/15 16:12 | 32 kB | Jun Gong |
| YARN-3480.05.patch | 12/Dec/15 12:05 | 29 kB | Jun Gong |
| YARN-3480.06.patch | 13/Dec/15 03:59 | 29 kB | Jun Gong |
| YARN-3480.07.patch | 15/Dec/15 10:00 | 33 kB | Jun Gong |
| YARN-3480.08.patch | 16/Dec/15 08:25 | 25 kB | Jun Gong |
| YARN-3480.09.patch | 16/Dec/15 13:17 | 25 kB | Jun Gong |
| YARN-3480.10.patch | 17/Dec/15 08:21 | 36 kB | Jun Gong |
| YARN-3480.11.patch | 21/Dec/15 11:03 | 36 kB | Jun Gong |
| YARN-3480.12.patch | 22/Dec/15 16:53 | 41 kB | Jun Gong |
| YARN-3480.13.patch | 23/Dec/15 02:19 | 36 kB | Jun Gong |
| YARN-3480.14.patch | 29/Dec/15 01:52 | 37 kB | Jun Gong |
| YARN-3480.15.patch | 29/Dec/15 04:31 | 37 kB | Jun Gong |

Issue Links

is blocked by

- YARN-1039 Add parameter for YARN resource requests to indicate "long lived" (Open)

is related to

- YARN-4497 RM might fail to restart when recovering apps whose attempts are missing (Resolved)
- YARN-4584 RM startup failure when AM attempts greater than max-attempts (Resolved)
- YARN-3668 Long run service shouldn't be killed even if Yarn crashed (Open)

relates to

- YARN-4929 Explore a better way than sleeping for a while in some test cases (Open)

People

Assignee: Jun Gong

Reporter: Jun Gong

Votes: 0

Watchers: 17

Dates

Created: 13/Apr/15 13:57

Updated: 30/Aug/16 01:24

Resolved: 30/Dec/15 00:00