[MAPREDUCE-6659] Mapreduce App master waits long to kill containers on lost nodes. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.6.0
Fix Version/s: None
Component/s: mr-am
Labels: None

Description

The MR Application Master waits a very long time to clean up and relaunch tasks on lost nodes. The wait time is actually 2.5 hours (ipc.client.connect.max.retries * ipc.client.connect.max.retries.on.timeouts * ipc.client.connect.timeout = 10 * 45 * 20 s = 9000 seconds = 2.5 hours).
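The arithmetic above can be reproduced with a minimal sketch. It assumes the three values quoted in the description (10 retries, 45 retries on timeouts, 20 seconds per connect attempt) and that the retries multiply as described. The constant and function names are hypothetical and exist only for this illustration; they are not Hadoop APIs.

```python
# Values quoted in the description; the timeout is expressed in seconds here.
IPC_CLIENT_CONNECT_MAX_RETRIES = 10
IPC_CLIENT_CONNECT_MAX_RETRIES_ON_TIMEOUTS = 45
IPC_CLIENT_CONNECT_TIMEOUT_S = 20


def worst_case_wait_seconds(
    max_retries=IPC_CLIENT_CONNECT_MAX_RETRIES,
    retries_on_timeouts=IPC_CLIENT_CONNECT_MAX_RETRIES_ON_TIMEOUTS,
    connect_timeout_s=IPC_CLIENT_CONNECT_TIMEOUT_S,
):
    """Hypothetical helper: upper bound on the wait if every attempt times out."""
    return max_retries * retries_on_timeouts * connect_timeout_s


total_s = worst_case_wait_seconds()
print(total_s, total_s / 3600)  # prints "9000 2.5"
```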

A similar issue in the RM-AM RPC protocol was fixed in YARN-3809.
As was done in YARN-3809, we may need to introduce new configurations to control this RPC retry behavior.

Also, I feel this total retry time should honor, and be capped at, the global task timeout (mapreduce.task.timeout, default 600000 ms).
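One way to picture the proposed cap is the sketch below. It is only an illustration of the idea, not an existing Hadoop mechanism: the function names are hypothetical, and it assumes the retry budget is simply the smaller of the worst-case retry time and mapreduce.task.timeout.

```python
def capped_retry_budget_s(max_retries, retries_on_timeouts,
                          connect_timeout_s, task_timeout_ms):
    """Hypothetical helper: total retry time, capped at the task timeout."""
    worst_case_s = max_retries * retries_on_timeouts * connect_timeout_s
    return min(worst_case_s, task_timeout_ms / 1000)


def max_connect_attempts(budget_s, connect_timeout_s):
    """Hypothetical helper: whole connect attempts that fit in the budget."""
    return int(budget_s // connect_timeout_s)


# Values quoted in this issue: 10 * 45 * 20 s of retries, 600000 ms task timeout.
budget_s = capped_retry_budget_s(10, 45, 20, 600000)
attempts = max_connect_attempts(budget_s, 20)
print(budget_s, attempts)
```

With the values quoted in this issue, the budget comes out to 600 seconds (10 minutes) instead of 9000 seconds, which leaves room for 30 connect attempts of 20 seconds each.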

Issue Links

is duplicated by

MAPREDUCE-6982 Containers on lost nodes are considered failed after a too long time.

Resolved

relates to

YARN-3809 Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

Closed

People

Assignee: Nicolas Fraison

Reporter: Laxman

Votes: 0

Watchers: 8

Dates

Created: 24/Mar/16 12:01

Updated: 08/Nov/17 15:05