[MAPREDUCE-5877] Inconsistency between JT/TT for tasks taking a long time to launch - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.2.1
Fix Version/s: 1.3.0
Component/s: jobtracker, tasktracker
Labels: None
Hadoop Flags: Reviewed

Description

For tasks that take too long to launch (for genuine reasons, such as large distributed caches), the JT expires the task. Depending on whether job recovery is enabled and on the JT's restart state, another attempt may or may not be launched, even when the JT has not been restarted. The status of the expired attempt changes to "Error launching task". Meanwhile, the TT is not informed of the expiry and eventually launches the task anyway. In addition, the "new" attempt might be assigned to the same TT, leading to further inconsistent behavior.
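The divergence described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyJobTracker`, `ToyTaskTracker`, and the millisecond values are hypothetical names and numbers chosen for illustration, not Hadoop classes or the code touched by the attached patches.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJobTracker:
    """Hypothetical stand-in for the JT's bookkeeping of launching attempts."""
    expiry_ms: int
    launching: dict = field(default_factory=dict)  # attempt id -> assignment time (ms)
    status: dict = field(default_factory=dict)     # attempt id -> status string

    def assign(self, attempt, now_ms):
        self.launching[attempt] = now_ms
        self.status[attempt] = "ASSIGNED"

    def expire_launching(self, now_ms):
        # Expire attempts that have not reported a launch within expiry_ms.
        # Note that nothing here tells the tracker about the expiry.
        for attempt, assigned_ms in list(self.launching.items()):
            if now_ms - assigned_ms > self.expiry_ms:
                del self.launching[attempt]
                self.status[attempt] = "Error launching task"


@dataclass
class ToyTaskTracker:
    """Hypothetical stand-in for a TT that needs time to localize files."""
    localization_ms: int
    pending: dict = field(default_factory=dict)  # attempt id -> time it becomes ready
    running: set = field(default_factory=set)

    def accept(self, attempt, now_ms):
        self.pending[attempt] = now_ms + self.localization_ms

    def tick(self, now_ms):
        # Launch every attempt whose localization has finished.
        for attempt, ready_ms in list(self.pending.items()):
            if now_ms >= ready_ms:
                del self.pending[attempt]
                self.running.add(attempt)


# Assumed numbers: 10-minute expiry on the JT, 15-minute localization on the TT.
jt = ToyJobTracker(expiry_ms=600_000)
tt = ToyTaskTracker(localization_ms=900_000)
jt.assign("attempt_0", 0)
tt.accept("attempt_0", 0)
for now in (600_001, 900_000):
    jt.expire_launching(now)
    tt.tick(now)

print(jt.status["attempt_0"])     # JT's view of the attempt
print("attempt_0" in tt.running)  # TT's view of the attempt
```

In this model the JT ends up reporting the attempt as failed to launch while the TT is running it, which is the mismatch the description refers to.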

To avoid this, one can raise mapred.tasktracker.expiry.interval, but that leads to long TT failure-discovery times.

We should have a per-job timeout for task launches/heartbeats, and the JT and TT should be consistent in what they report.
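One possible shape for the proposal above, as a minimal sketch and not the change made in the attached patches: the timeout is looked up per job, and every expiry produces an explicit kill list per tracker so that both sides agree on the outcome. The key `example.task.launch.timeout.ms`, the default value, and both function names are hypothetical and exist only for this illustration.

```python
# Hypothetical per-job key; NOT a real Hadoop property.
LAUNCH_TIMEOUT_KEY = "example.task.launch.timeout.ms"
# Assumed fallback used when a job does not set its own timeout.
DEFAULT_LAUNCH_TIMEOUT_MS = 600_000


def launch_timeout_ms(job_conf):
    """Return the launch timeout for one job, falling back to the default."""
    return int(job_conf.get(LAUNCH_TIMEOUT_KEY, DEFAULT_LAUNCH_TIMEOUT_MS))


def expire_launching(launching, job_confs, now_ms):
    """Expire overdue attempts and report which ones each tracker must kill.

    launching: attempt id -> (job id, tracker name, assignment time in ms)
    job_confs: job id -> dict of that job's settings
    Returns tracker name -> list of attempt ids to kill; expired attempts
    are removed from `launching`.
    """
    kills = {}
    for attempt, (job_id, tracker, assigned_ms) in list(launching.items()):
        timeout = launch_timeout_ms(job_confs.get(job_id, {}))
        if now_ms - assigned_ms > timeout:
            del launching[attempt]
            kills.setdefault(tracker, []).append(attempt)
    return kills


# job1 asks for a 30-minute launch timeout; job2 uses the assumed default.
launching = {"a1": ("job1", "tt1", 0), "a2": ("job2", "tt1", 0)}
job_confs = {"job1": {LAUNCH_TIMEOUT_KEY: 1_800_000}, "job2": {}}
kills = expire_launching(launching, job_confs, now_ms=700_000)
print(kills)
```

With a per-job value, a job with a large distributed cache can tolerate a slow launch without the cluster-wide TT expiry interval being raised for everyone.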

Attachments

mr-5877-1.patch (2 kB), 05/May/14 21:26, Karthik Kambatla
repro-mr-5877.patch (1 kB), 05/May/14 21:15, Karthik Kambatla

People

Assignee: Karthik Kambatla

Reporter: Karthik Kambatla

Votes: 0

Watchers: 7

Dates

Created: 05/May/14 20:36

Updated: 03/Nov/14 18:33

Resolved: 07/May/14 00:50