[MAPREDUCE-6982] Containers on lost nodes are considered failed after a too long time. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 2.6.0
Fix Version/s: None
Component/s: mr-am
Labels: None
Environment: cdh5.5.0

Description

Containers on lost nodes (NodeManager unavailable or server unavailable) are only considered failed after too long a delay.
This is because the AppMaster tries to clean up the container on the unavailable node.
The proposed patch limits the impact of this timeout by handling NodeManager lost events in the AM, as described below:

On NodeManager service unavailability (crash, OOM, ...):
When the AM receives a lost NodeManager event, it fails the impacted attempt and does not go through the cleanup stage.
On NodeManager server unavailability, with default settings the AM first detects that the attempt has timed out and tries to clean up the attempt:
When the AM receives a lost NodeManager event, it stops the cleanup process on the impacted container and fails the attempt.

This reduces the duration of the timeout to the time needed to detect that a NodeManager is down.
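
The two cases above can be illustrated with a small, self-contained Java sketch. It is not the attached patch and does not use the MapReduce AM classes: `LostNodeSketch`, `Attempt`, `AttemptState`, and `handleNodeLost` are hypothetical names chosen for illustration, and the three-state model is a deliberate simplification of the real task attempt state machine. It only shows the intended behaviour: a node-lost event fails an attempt directly, whether the attempt is still running or already stuck in cleanup.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from Hadoop or the patch. */
public class LostNodeSketch {

    /** Simplified attempt states (the real AM state machine has many more). */
    enum AttemptState { RUNNING, CLEANING_UP, FAILED }

    static final class Attempt {
        final String id;
        final String nodeId;
        AttemptState state = AttemptState.RUNNING;

        Attempt(String id, String nodeId) {
            this.id = id;
            this.nodeId = nodeId;
        }

        /** Task timeout fired: the AM starts cleaning up the container. */
        void onTaskTimeout() {
            if (state == AttemptState.RUNNING) {
                state = AttemptState.CLEANING_UP;
            }
        }

        /** Node reported lost: skip or abort cleanup and fail right away. */
        void onNodeLost() {
            state = AttemptState.FAILED;
        }
    }

    /** Fails every non-failed attempt on the lost node; returns how many. */
    static int handleNodeLost(String lostNodeId, List<Attempt> attempts) {
        int failed = 0;
        for (Attempt a : attempts) {
            if (a.nodeId.equals(lostNodeId) && a.state != AttemptState.FAILED) {
                a.onNodeLost();
                failed++;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        Attempt running = new Attempt("attempt_1", "nm1");   // case 1: still running
        Attempt cleaning = new Attempt("attempt_2", "nm1");  // case 2: already timed out
        cleaning.onTaskTimeout();
        Attempt other = new Attempt("attempt_3", "nm2");     // unaffected node
        attempts.add(running);
        attempts.add(cleaning);
        attempts.add(other);

        int failed = handleNodeLost("nm1", attempts);
        System.out.println("failed attempts: " + failed);
        for (Attempt a : attempts) {
            System.out.println(a.id + " on " + a.nodeId + " -> " + a.state);
        }
    }
}
```

In this model the attempt never waits for a cleanup request to an unreachable node to time out; the only remaining delay is the time it takes for the node to be reported lost.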

Similar issue to MAPREDUCE-6659, to which I can't attach the patch.

Attachments

MAPREDUCE-6982.patch
13/Oct/17 10:00
13 kB
Nicolas Fraison

Issue Links

duplicates

MAPREDUCE-6659 Mapreduce App master waits long to kill containers on lost nodes.

Open

People

Assignee: Unassigned

Reporter: Nicolas Fraison

Votes: 0

Watchers: 2

Dates

Created: 13/Oct/17 09:59

Updated: 13/Oct/17 13:15

Resolved: 13/Oct/17 13:15