[MAPREDUCE-3228] MR AM hangs when one node goes bad - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.23.0
Fix Version/s: 0.23.0
Component/s: applicationmaster, mrv2
Labels: None

Description

Found this on one of the gridmix runs, again. One of the nodes went bad while the job had three containers running on it. Eventually, the AM marked the tasks as timed out and initiated cleanup of the failed containers via stopContainer(). The latter got stuck at the faulty node, so the tasks are stuck in the FAIL_CONTAINER_CLEANUP stage and the job waits there forever.
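
To illustrate the failure mode, here is a minimal sketch of one way to keep a blocking cleanup call from hanging its caller: run it on a separate thread and give up after a deadline. This is not the attached patch. The class BoundedCleanup, the method runWithTimeout, and the timeout values are hypothetical names chosen for illustration, and a plain Callable stands in for the real stopContainer() call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Hadoop: bounds the time spent in a
// blocking cleanup call such as a stopContainer() request to a bad node.
final class BoundedCleanup {

    /**
     * Runs cleanupCall on a daemon thread and waits at most timeoutMillis.
     * Returns true if the call completed normally in time, false if it
     * timed out, threw an exception, or the waiting thread was interrupted.
     */
    static boolean runWithTimeout(Callable<Void> cleanupCall, long timeoutMillis) {
        // Daemon thread so a call that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-cleanup");
            t.setDaemon(true);
            return t;
        });
        Future<Void> future = executor.submit(cleanupCall);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The remote side did not answer in time; interrupt the worker
            // and let the caller move the task on to its next state.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The cleanup call itself failed.
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for the caller and stop waiting.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With a helper like this, a caller can treat a false return as "cleanup failed" and still let the task leave FAIL_CONTAINER_CLEANUP instead of waiting indefinitely. Note that interrupting the worker thread only helps if the underlying call responds to interruption.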

Thanks to Karams for helping with this.

Attachments

MAPREDUCE-3228-20111020.txt, 20/Oct/11 12:14, 12 kB, Vinod Kumar Vavilapalli
MAPREDUCE-3228-20111027.txt, 27/Oct/11 13:09, 18 kB, Vinod Kumar Vavilapalli

People

Assignee: Vinod Kumar Vavilapalli

Reporter: Vinod Kumar Vavilapalli

Votes: 0

Watchers: 3

Dates

Created: 20/Oct/11 12:12

Updated: 15/Nov/11 00:48

Resolved: 27/Oct/11 17:31