[MAPREDUCE-6190] If a task stucks before its first heartbeat, it never timeouts and the MR job becomes stuck - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.1
Fix Version/s: 3.3.0
Component/s: None
Labels:
None

Hadoop Flags:

Reviewed
Release Note:
Added "mapreduce.task.stuck.timeout-ms" parameter to timeout a task before sending the task's first heartbeat. The default value is 600000 milliseconds (10 minutes).

Description

Trying to figure out a weird issue we started seeing on our CDH5.1.0 cluster with map reduce jobs on YARN.

We had a job stuck for hours because one of the mappers never started up fully. Basically, the map task had 2 attempts, the first one failed and the AM tried to schedule a second one and the second attempt was stuck on STATE: STARTING, STATUS: NEW. A node never got assigned and the task along with the job was stuck indefinitely.

The AM logs had this being logged again and again:

2014-12-09 19:25:12,347 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1408745633994_450952_02_003807
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce preemption successful attempt_1408745633994_450952_r_000048_1000
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting attempt_1408745633994_450952_r_000050_1000
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: completedMapPercent 0.99968 totalMemLimit:1722880 finalMapMemLimit:2560 finalReduceMemLimit:1720320 netScheduledMapMem:2560 netScheduledReduceMem:1722880
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:77 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
2014-12-09 19:25:14,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:78 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
2014-12-09 19:25:14,359 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0

On killing the task manually, the AM started up the task again, scheduled and ran it successfully completing the task and the job with it.

Some quick code grepping led us here:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-app/2.3.0/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java#397

But still dont quite understand why this would happen once in a while and why the job would suddenly be ok once the stuck task is manually killed.

Note: Other jobs succeed on the cluster while this job is stuck.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

MAPREDUCE-6190.001.patch
23/Nov/18 04:09
5 kB
Zhaohui Xin
MAPREDUCE-6190.002.patch
25/Nov/18 06:35
9 kB
Zhaohui Xin
MAPREDUCE-6190.003.patch
27/Nov/18 09:58
11 kB
Zhaohui Xin
MAPREDUCE-6190.004.patch
28/Nov/18 03:49
11 kB
Zhaohui Xin
MAPREDUCE-6190.005.patch
28/Nov/18 06:49
11 kB
Zhaohui Xin

Issue Links

breaks

MAPREDUCE-7198 mapreduce.task.timeout=0 configuration used to disable timeout doesn't work

Resolved

Activity

People

Assignee:: Zhaohui Xin

Reporter:: Ankit Malhotra

Votes:: 5 Vote for this issue

Watchers:: 20 Start watching this issue

Dates

Created:: 10/Dec/14 20:29

Updated:: 23/Apr/19 02:35

Resolved:: 28/Nov/18 09:06