[YARN-611] Add an AM retry count reset window to YARN RM - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.3-alpha
Fix Version/s: 2.6.0
Component/s: resourcemanager
Labels: None
Target Version/s: 2.6.0
Hadoop Flags: Reviewed

Description

YARN currently has the following config:

yarn.resourcemanager.am.max-retries

This config defaults to 2 and defines how many times a "failed" AM is retried before the whole YARN job is failed. YARN counts an AM as failed if the AM itself dies, or if the node it was running on dies (the NM times out, which counts as a failure for the AM).

This configuration is insufficient for long-running (or infinitely running) YARN jobs, because the machine (or NM) that the AM is running on will eventually need to be restarted, or will fail. In such an event the AM has done nothing wrong, yet the RM counts it as a "failure". Because the AM's retry count is never reset, enough machine/NM failures will eventually push the AM failure count above the configured value of yarn.resourcemanager.am.max-retries. Once that happens, the RM marks the job as failed and shuts it down. This behavior is not ideal.

I propose that we add a second configuration:

yarn.resourcemanager.am.retry-count-window-ms

This configuration would define a window of time after which an AM is considered "well behaved" and it is safe to reset its failure count to zero. Every time an AM fails, the RMAppImpl would check when the AM last failed. If the last failure was less than retry-count-window-ms ago and the new failure count is > max-retries, the job should fail. If the AM has never failed, the new failure count does not exceed max-retries, or the last failure was OUTSIDE the retry-count-window-ms, the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, the failure count should be set back to 0.
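The reset rule described above can be sketched roughly as follows. This is only an illustration of the proposal, not the RM's actual code: the class AmRetryWindowTracker and its method names are hypothetical, timestamps are passed in explicitly, and a failure exactly retry-count-window-ms after the previous one is assumed to count as "outside" the window.

```java
// Hypothetical sketch of the proposed reset rule; not an actual YARN class.
class AmRetryWindowTracker {
    private final int maxRetries;          // stands in for yarn.resourcemanager.am.max-retries
    private final long retryCountWindowMs; // stands in for the proposed retry-count-window-ms
    private int failureCount = 0;
    private long lastFailureTimeMs = -1;   // -1 means the AM has never failed

    AmRetryWindowTracker(int maxRetries, long retryCountWindowMs) {
        this.maxRetries = maxRetries;
        this.retryCountWindowMs = retryCountWindowMs;
    }

    /**
     * Records an AM failure observed at nowMs.
     * Returns true if the AM should be restarted, false if the job should fail.
     */
    boolean onAmFailure(long nowMs) {
        boolean hasFailedBefore = lastFailureTimeMs >= 0;
        // Assumption: "less than window ms ago" is inside, anything else is outside.
        boolean lastFailureOutsideWindow =
            hasFailedBefore && (nowMs - lastFailureTimeMs) >= retryCountWindowMs;
        if (lastFailureOutsideWindow) {
            failureCount = 0; // the AM was well behaved for a full window
        }
        failureCount++;
        lastFailureTimeMs = nowMs;
        return failureCount <= maxRetries;
    }

    int failureCount() {
        return failureCount;
    }
}
```

With max-retries of 2 and a 1000 ms window, three failures at 0, 500 and 900 ms fail the job on the third, while failures at 0, 500 and 2000 ms do not, because the count is reset before the third failure is recorded.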

This would give developers a way to have well-behaved AMs run forever, while still failing misbehaving AMs after a short period of time.

I think the work to be done here is to change the RMAppImpl to actually look at app.attempts and see whether there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, the job should fail; if not, the job should go forward. We might also need to add an endTime to either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RMAppImpl can check the time of each failure.
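The check over app.attempts might look something like the sketch below. It assumes each failed attempt exposes an end time in milliseconds (the endTime suggested above); the class AmFailureWindowCheck, its method names, and the plain array of timestamps are hypothetical stand-ins for the real attempt objects. Note that this counts failures inside a sliding window rather than resetting a counter, so it can differ slightly from the reset rule in edge cases.

```java
// Hypothetical sketch of scanning attempt end times; not an actual YARN class.
class AmFailureWindowCheck {

    /** Counts failed attempts whose end time lies within the last windowMs before nowMs. */
    static int failuresInWindow(long[] failedAttemptEndTimesMs, long nowMs, long windowMs) {
        int count = 0;
        for (long endTimeMs : failedAttemptEndTimesMs) {
            boolean notInFuture = endTimeMs <= nowMs;
            boolean insideWindow = (nowMs - endTimeMs) < windowMs;
            if (notInFuture && insideWindow) {
                count++;
            }
        }
        return count;
    }

    /** The job fails only if more than maxRetries attempts failed inside the window. */
    static boolean shouldFailJob(long[] failedAttemptEndTimesMs, long nowMs,
                                 long windowMs, int maxRetries) {
        return failuresInWindow(failedAttemptEndTimesMs, nowMs, windowMs) > maxRetries;
    }
}
```

Called with the end times of all failed attempts, including the one that just failed, shouldFailJob tells the caller whether to fail the job or start another attempt.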

Thoughts?

Attachments

YARN-611.1.patch, 02/Jul/14 01:28, 54 kB, Xuan Gong
YARN-611.2.patch, 08/Jul/14 16:36, 70 kB, Xuan Gong
YARN-611.3.patch, 08/Jul/14 18:58, 71 kB, Xuan Gong
YARN-611.4.patch, 11/Jul/14 00:38, 98 kB, Xuan Gong
YARN-611.4.rebase.patch, 29/Jul/14 05:48, 97 kB, Xuan Gong
YARN-611.5.patch, 21/Aug/14 19:26, 107 kB, Xuan Gong
YARN-611.6.patch, 03/Sep/14 06:57, 32 kB, Xuan Gong
YARN-611.7.patch, 04/Sep/14 05:23, 34 kB, Xuan Gong
YARN-611.8.patch, 04/Sep/14 21:57, 34 kB, Xuan Gong
YARN-611.9.patch, 11/Sep/14 04:35, 48 kB, Xuan Gong
YARN-611.9.rebase.patch, 11/Sep/14 04:46, 49 kB, Xuan Gong
YARN-611.10.patch, 12/Sep/14 19:37, 49 kB, Xuan Gong
YARN-611.11.patch, 12/Sep/14 20:38, 50 kB, Xuan Gong

Issue Links

is depended upon by

SLIDER-930 Incorporate Yarn feature of resetting AM failure count into Slider AM (Resolved)
YARN-896 Roll up for long-lived services in YARN (Open)

is related to

SLIDER-77 use a window for tracking container failures (Resolved)
YARN-2074 Preemption of AM containers shouldn't count towards AM failures (Closed)
YARN-2355 MAX_APP_ATTEMPTS_ENV may no longer be a useful env var for a container (Resolved)
YARN-614 Separate AM failures from hardware failure or YARN error and do not count them to AM retry count (Closed)

relates to

YARN-4929 Explore a better way than sleeping for a while in some test cases (Open)


People

Assignee: Xuan Gong
Reporter: Chris Riccomini
Votes: 0
Watchers: 21

Dates

Created: 25/Apr/13 18:27
Updated: 06/Apr/16 22:47
Resolved: 14/Sep/14 01:06