Hadoop Map/Reduce · MAPREDUCE-6689

MapReduce job can infinitely increase number of reducer resource requests

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
    • Component/s: None
    • Labels: None

      Description

      We have seen this issue on one of our clusters: when running a Terasort MapReduce job, some mappers failed after reducers had started, and the MR AM then tried to preempt reducers in order to schedule the failed mappers.

      After that, the MR AM enters an infinite loop: on every RMContainerAllocator#heartbeat run, it:

      • In preemptReducesIfNeeded, cancels all scheduled reducer requests (total scheduled reducers = 1024).
      • Then, in scheduleReduces, ramps all reducers back up (total = 1024).

      As a result, the total number of requested containers increases by 1024 on every MRAM-RM heartbeat (one heartbeat per second). The AM had been hanging for 18+ hours, so we ended up with 18 * 3600 * 1024 ~ 66M+ requested containers on the RM side.
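
      To make the arithmetic concrete, here is a minimal, self-contained simulation (illustrative only, not RMContainerAllocator code) of how the outstanding request count grows when every heartbeat cancels the scheduled reducers without decrementing the remote request table and then ramps them all back up:

        // Illustrative simulation only -- not Hadoop code. It shows why outstanding
        // reducer requests at the RM grow by ~1024 per heartbeat when the AM cancels
        // scheduled reducers without decrementing the ask already sent to the RM and
        // then ramps them all back up on the same heartbeat.
        public class RequestGrowthSketch {
          public static void main(String[] args) {
            final int scheduledReducers = 1024;   // reducers the job wants to schedule
            final int heartbeats = 18 * 3600;     // ~18 hours at one heartbeat per second
            long requestsSeenByRm = 0;

            for (int hb = 0; hb < heartbeats; hb++) {
              // preemptReducesIfNeeded(): cancels all scheduled reducer requests, but
              // the ask at the RM is never decremented, so nothing is subtracted.
              // scheduleReduces(): ramps all reducers back up, issuing fresh requests.
              requestsSeenByRm += scheduledReducers;
            }
            // Prints roughly 66 million, matching the observation above.
            System.out.println("Requested containers accumulated at RM: " + requestsSeenByRm);
          }
        }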

      This bug also triggered YARN-4844, which makes the RM stop scheduling anything.

      Thanks to Sidharta Seethana for helping with analysis.

        Issue Links

          Activity

          leftnoteasy Wangda Tan added a comment -

          One quick solution for this issue is to modify preemptReducesIfNeeded to return whether preemption happened. If preemption happened, skip the subsequent scheduleReduces.
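
          A minimal sketch of that idea, assuming (hypothetically) that preemptReducesIfNeeded is changed to return a boolean and that heartbeat calls it directly; the real method signatures and call sites in RMContainerAllocator may differ:

            // Hedged sketch only -- actual RMContainerAllocator methods and call sites
            // may differ. The point is that the ramp-up step is skipped in any heartbeat
            // that just ramped reducers down, so the two steps stop undoing each other
            // every second.
            class QuickFixSketch {
              // Changed (hypothetically) to report whether any reducers were ramped down.
              private boolean preemptReducesIfNeeded() {
                boolean preempted = false;
                // ... existing preemption logic; set preempted = true when scheduled
                // reducer requests are cancelled or running reducers are preempted ...
                return preempted;
              }

              private void scheduleReduces() {
                // ... existing reducer ramp-up logic ...
              }

              void heartbeat() {
                if (!preemptReducesIfNeeded()) {
                  scheduleReduces();
                }
              }
            }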

          CC: Karthik Kambatla, Jason Lowe.

          leftnoteasy Wangda Tan added a comment -

          Also, the following logic does not look correct to me:

            private void clearAllPendingReduceRequests() {
              LOG.info("Ramping down all scheduled reduces:"
                  + scheduledRequests.reduces.size());
              for (ContainerRequest req : scheduledRequests.reduces.values()) {
                pendingReduces.add(req);
              }
              scheduledRequests.reduces.clear();
            }
          

          Instead of only calling scheduledRequests.reduces.clear(), it should call decContainerReq for each request in scheduledRequests.reduces. The existing logic never modifies the remote request table.
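
          A sketch of the correction being described, assuming decContainerReq(ContainerRequest) behaves as in the allocator; the actual fix landed via MAPREDUCE-6514 and may differ in detail:

            private void clearAllPendingReduceRequests() {
              LOG.info("Ramping down all scheduled reduces:"
                  + scheduledRequests.reduces.size());
              for (ContainerRequest req : scheduledRequests.reduces.values()) {
                decContainerReq(req);     // decrement the ask that was sent to the RM
                pendingReduces.add(req);  // keep the reduce so it can be rescheduled later
              }
              scheduledRequests.reduces.clear();
            }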

          haibochen Haibo Chen added a comment -

          We saw an instance of a similar problem previously as well, but we were not able to investigate it in great detail because the logs were overwritten quickly. Do you still have the MR AM logs?

          MAPREDUCE-6514 has already been created to fix the issue of not updating the requests in clearAllPendingReduceRequests().

          leftnoteasy Wangda Tan added a comment -

          Thanks Haibo Chen for pointing out MAPREDUCE-6514.

          MAPREDUCE-6514 is one cause of this problem, but it only became a big problem after MAPREDUCE-6302 was committed.

          I discussed this offline with Varun Saxena; I will rebase and upload a patch to MAPREDUCE-6514 later. For this JIRA, I will fix the cancel-all-then-add-all reducer request behavior.

          The application log is available at https://www.dropbox.com/s/ckx1z993lt4ymh2/app.log.zip?dl=0 (it is too large to be uploaded to JIRA).

          leftnoteasy Wangda Tan added a comment -

          Uploaded patch for this (on top of MAPREDUCE-6514)

          varun_saxena Varun Saxena added a comment -

          Just to give some background, MAPREDUCE-6514 came up while analyzing the issue in MAPREDUCE-6513.
          During our analysis, we found two areas susceptible to problems.
          One was decrementing and updating container requests when we clear pending reduce requests, which has been handled in MAPREDUCE-6514.
          The other was whether we should ramp up reducers at all if maps have been hanging for a certain period of time.

          We wanted feedback on these potential issues from community members who have worked extensively on MapReduce.
          Refer to the comments below (made on MAPREDUCE-6513):
          comment1, comment2

          Basically, this point got lost along the way as we went ahead with rescheduling map requests with priority 5 in MAPREDUCE-6513.

          But I thought I would put it out there again.
          In RMContainerAllocator#scheduleReduces we ramp up reduces. The calculations in this method are such that, with default configurations, this is done too aggressively.
          The configuration yarn.app.mapreduce.am.job.reduce.rampup.limit has a default value of 0.5. If headroom is limited (as in the MAPREDUCE-6513 scenario), i.e. barely enough to launch one mapper/reducer, then because of this config value the AM thinks there is sufficient room to launch one more mapper, so there is no need to ramp down and reducers are ramped up. However, if this continues forever, that does not seem correct.
          Should we really be ramping up if we have hanging map requests, irrespective of the reduce ramp-up limit configuration?
          We could probably use the configuration introduced in MAPREDUCE-6302 to determine whether maps are hanging (i.e. stuck in the scheduled state) and not ramp up reduces if maps have been hanging for a while. How long to wait, however, would depend on the kind of job being run.

          We also have MAPREDUCE-6541, which adjusts the headroom received from the RM to determine whether we have enough resources for a map task to run.
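
          For reference, a job can tune the ramp-up knob itself; below is a minimal, hypothetical example. The 0.25 value is arbitrary and not a recommendation from this thread; the property name and its 0.5 default are the ones quoted above.

            // Illustration only: lowering the reducer ramp-up limit for a single job so
            // reducers are ramped up less aggressively.
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.mapreduce.Job;

            public class RampupLimitExample {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Default is 0.5; lower values leave more headroom for map tasks.
                conf.setFloat("yarn.app.mapreduce.am.job.reduce.rampup.limit", 0.25f);
                Job job = Job.getInstance(conf, "example-job");
                // ... set mapper/reducer classes, input/output paths, etc., as usual ...
              }
            }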

          jlowe Jason Lowe added a comment -

          Sorry for arriving late to the discussion.

          Should we really be ramping up if we have hanging map requests, irrespective of the reduce ramp-up limit configuration?

          Ramping up reducers when maps are hanging does sound a bit dubious, but it may make sense in some scenarios. Consider a case where a job issues tons of maps, far more than the queue can handle. Some of those maps are going to appear to be hanging for a very long time because they have to run in multiple waves. The whole point of ramping up reducers before the maps are complete is to try to reduce job latency (at the expense of overall cluster throughput) by pipelining the shuffle of the completed tasks with the remaining map tasks. If the job has tons of data to shuffle for each map then it may make sense to sacrifice some of the map resources to get the reducers running early so they can start chewing on the horde of completed map output. It all depends upon the map durations, the shuffle burden, etc. It is definitely safer from a correctness point of view to avoid ramping up reducers if there are any hanging maps at all, but I believe there could be some jobs whose latency could increase as a result of that change.

          I'm guessing the root cause of the issue is an incorrect headroom report, e.g.: there's technically enough free space in the headroom but it's fragmented across nodes in such a way that no single map can fit on any node. The unconditional preemption logic from MAPREDUCE-6302 was supposed to address this, but it looks like the container allocator can quickly "forget" this decision and re-schedule the reducers that were shot.

          jlowe Jason Lowe added a comment -

          Patch looks good to me, submitting to Jenkins now that MAPREDUCE-6514 has been committed.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 24s trunk passed
          +1 compile 0m 22s trunk passed with JDK v1.8.0_91
          +1 compile 0m 25s trunk passed with JDK v1.7.0_95
          +1 checkstyle 0m 17s trunk passed
          +1 mvnsite 0m 27s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 0m 42s trunk passed
          +1 javadoc 0m 15s trunk passed with JDK v1.8.0_91
          +1 javadoc 0m 18s trunk passed with JDK v1.7.0_95
          +1 mvninstall 0m 23s the patch passed
          +1 compile 0m 19s the patch passed with JDK v1.8.0_91
          +1 javac 0m 19s the patch passed
          +1 compile 0m 22s the patch passed with JDK v1.7.0_95
          +1 javac 0m 22s the patch passed
          -1 checkstyle 0m 14s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app: patch generated 1 new + 111 unchanged - 0 fixed = 112 total (was 111)
          +1 mvnsite 0m 26s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 0m 58s the patch passed
          +1 javadoc 0m 13s the patch passed with JDK v1.8.0_91
          +1 javadoc 0m 15s the patch passed with JDK v1.7.0_95
          +1 unit 8m 32s hadoop-mapreduce-client-app in the patch passed with JDK v1.8.0_91.
          +1 unit 9m 16s hadoop-mapreduce-client-app in the patch passed with JDK v1.7.0_95.
          +1 asflicense 0m 20s Patch does not generate ASF License warnings.
          33m 4s



          Subsystem Report/Notes
          Docker Image: yetus/hadoop:cf2ee45
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12802512/MAPREDUCE-6689.1.patch
          JIRA Issue MAPREDUCE-6689
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 009df3d66296 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 2835f14
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6484/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6484/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6484/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          +1 lgtm. Will commit this later today if there are no objections.

          varun_saxena Varun Saxena added a comment -

          Agree.
          If we do introduce a config to decide whether maps have been starved (and hence not ramp up reducers), it will have to be tuned per job, not only based on the type of job but also on the size of data it processes in each run, and several other factors.
          I do see that it would be almost impossible to decide an accurate value for such a config.

          We have had the fix from MAPREDUCE-6514 in our private branch for several months, but do not yet have MAPREDUCE-6302 in.
          Let us see how the recent fixes along with MAPREDUCE-6302 go on a real cluster. I think they should cover most of the scenarios.

          leftnoteasy Wangda Tan added a comment -

          Yeah, this is mainly caused by inaccurate headroom calculation. IMO, the headroom reported by the scheduler is not trustworthy: it misses many constraints, for example blacklists, hard locality, etc., and it would be expensive to calculate everything inside the scheduler. So MAPREDUCE-6302 looks like a good solution to me. However, we shouldn't "forget" that decision and immediately re-schedule the reducers that were just shot.

          Thanks for reviews, Jason Lowe/Varun Saxena.

          jlowe Jason Lowe added a comment -

          Thanks to Wangda Tan for the contribution and to Varun Saxena for additional review! I committed this to trunk, branch-2, branch-2.8, branch-2.7, and branch-2.6.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9731 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9731/)
          MAPREDUCE-6689. MapReduce job can infinitely increase number of reducer (jlowe: rev c9bb96fa81fc925e33ccc0b02c98cc2d929df120)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/rm/TestRMContainerAllocator.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Closing the JIRA as part of 2.7.3 release.


            People

            • Assignee: leftnoteasy Wangda Tan
            • Reporter: leftnoteasy Wangda Tan
            • Votes: 0
            • Watchers: 13
