[HADOOP-12317] Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8.0, 3.0.0-alpha1
Component/s: None
Labels: None

Hadoop Flags: Reviewed

Description

On a Debian machine, we have seen NodeManager recovery of containers fail because the signal syntax for a process group may not work. Checking whether the process is alive during container recovery produces errors, which causes the container to be declared LOST (exit code 154) on a NodeManager restart.

The application then fails with the following error, and the attempt is not retried.

Application application_1439244348718_0001 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1439244348718_0001_000001 exited with exitCode: 154
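
To illustrate the process-group signal syntax at issue, here is a minimal sketch of a liveness probe. It assumes the problem is that some `kill` implementations misparse a bare `-<pid>` argument, and that placing `--` before it makes the negative pid unambiguous. The function names `group_alive_command` and `is_group_alive` are hypothetical, and this is not the attached patch.

```python
import subprocess
from typing import List


def group_alive_command(pid: str) -> List[str]:
    """Build an argv that probes the process group led by `pid` with signal 0."""
    if not pid.isdigit() or int(pid) <= 0:
        raise ValueError(f"expected a positive numeric pid, got {pid!r}")
    # "--" ends option parsing, so "-<pid>" is read as a negative pid
    # (a process group) rather than as a signal name or an option.
    # Assumption: the local kill command accepts this form.
    return ["kill", "-0", "--", f"-{pid}"]


def is_group_alive(pid: str) -> bool:
    """Return True if the probe exits 0, meaning the group could be signaled."""
    result = subprocess.run(
        group_alive_command(pid),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Signal 0 delivers nothing and only reports whether the target exists and can be signaled, so a non-zero exit caused by a parse error looks the same to the caller as a dead process group. That is consistent with a running container being reported as LOST.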

Attachments

YARN-4096.001.patch, 11/Aug/15 21:12, 3 kB, Anubhav Dhoot
YARN-4046.002.patch, 12/Aug/15 00:56, 3 kB, Anubhav Dhoot
YARN-4046.002.patch, 12/Aug/15 00:57, 3 kB, Anubhav Dhoot

Issue Links

breaks

HADOOP-12441 Fix kill command behavior under some Linux distributions.

Resolved

YARN-3561 Non-AM Containers continue to run even after AM is stopped

Resolved

duplicates

YARN-3561 Non-AM Containers continue to run even after AM is stopped

Resolved

supersedes

HADOOP-11989 Kill command for process group id throws ExitCodeException

Resolved

People

Assignee: Anubhav Dhoot

Reporter: Anubhav Dhoot

Votes: 0

Watchers: 12

Dates

Created: 11/Aug/15 18:41

Updated: 30/Aug/16 01:24

Resolved: 20/Aug/15 02:02