[MAPREDUCE-5043] Fetch failure processing can cause AM event queue to backup and eventually OOM - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.23.7, 2.1.0-beta
Fix Version/s: 0.23.7, 2.1.0-beta
Component/s: mr-am
Labels: None
Target Version/s: 0.23.7, 2.1.0-beta

Description

We saw an MRAppMaster with a 3G heap run out of memory. While investigating another instance that was still running, we found the UI in an inconsistent state: the task table and the task attempt table on the job overview page did not agree. The AM log showed that the AsyncDispatcher had hundreds of thousands of events in its event queue, and jstacks showed it spending a lot of time in fetch failure processing.

It turns out fetch failure processing is currently very expensive: it runs a triple for loop whose inner loop calls the quite expensive TaskAttempt.getReport. That function type-converts the entire task report, counters and all, and performs locale conversions, among other things. It does this for every reduce task in the job, for every map task that failed. Once it has built the large task report, it pulls out one field, the phase, and throws the report away.
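The cost pattern can be illustrated with a minimal, self-contained Java sketch. It does not use the Hadoop classes: `Attempt`, `Report`, `Phase`, the `reportsBuilt` counter, and the direct `getPhase` accessor are hypothetical stand-ins invented for illustration, and the loop is simplified to two levels. It contrasts building a full report just to read the phase with reading the phase directly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FetchFailureSketch {
    enum Phase { SHUFFLE, SORT, REDUCE }

    /** Hypothetical stand-in for a task attempt report (not the Hadoop class). */
    static final class Report {
        final Phase phase;
        final Map<String, Long> counters;

        Report(Phase phase, Map<String, Long> counters) {
            this.phase = phase;
            this.counters = counters;
        }
    }

    /** Hypothetical stand-in for a reduce task attempt. */
    static final class Attempt {
        static int reportsBuilt = 0; // instrumentation for this sketch only
        private final Phase phase;
        private final Map<String, Long> counters = new HashMap<>();

        Attempt(Phase phase, int numCounters) {
            this.phase = phase;
            for (int i = 0; i < numCounters; i++) {
                counters.put("counter" + i, (long) i);
            }
        }

        /** Expensive path: copies every counter into a fresh report. */
        Report getReport() {
            reportsBuilt++;
            return new Report(phase, new HashMap<>(counters));
        }

        /** Cheap path (assumed accessor): returns the one field the caller needs. */
        Phase getPhase() {
            return phase;
        }
    }

    /** Builds a full report per reduce, per failed map, and keeps only the phase. */
    static int countShufflingViaReport(List<Attempt> reduces, int failedMaps) {
        int shuffling = 0;
        for (int m = 0; m < failedMaps; m++) {
            for (Attempt reduce : reduces) {
                if (reduce.getReport().phase == Phase.SHUFFLE) {
                    shuffling++;
                }
            }
        }
        return shuffling;
    }

    /** Same answer without allocating any reports. */
    static int countShufflingViaPhase(List<Attempt> reduces, int failedMaps) {
        int shuffling = 0;
        for (int m = 0; m < failedMaps; m++) {
            for (Attempt reduce : reduces) {
                if (reduce.getPhase() == Phase.SHUFFLE) {
                    shuffling++;
                }
            }
        }
        return shuffling;
    }

    public static void main(String[] args) {
        List<Attempt> reduces = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            reduces.add(new Attempt(i % 2 == 0 ? Phase.SHUFFLE : Phase.REDUCE, 50));
        }
        int viaReport = countShufflingViaReport(reduces, 20);
        int built = Attempt.reportsBuilt;
        int viaPhase = countShufflingViaPhase(reduces, 20);
        System.out.println(viaReport + " == " + viaPhase + ", reports built: " + built);
    }
}
```

Both methods return the same count, but the report-based one allocates one report (and copies every counter) per reduce per failed map. Whether the actual patch takes this exact approach is not stated here; the sketch only shows why the report-based lookup is wasteful.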

While the AM is busy processing fetch failures, task attempts continue to send it events, including memory-expensive ones such as status updates, which carry the counters. These events back up in the AsyncDispatcher event queue, and eventually even an AM with a large heap runs out of memory and crashes, or expires because it is thrashing in garbage collection.
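The backlog arithmetic can be sketched with a toy single-threaded simulation. This is not the AsyncDispatcher: the class name, the method `backlogAfter`, and the per-tick rates are all hypothetical. It only shows that an unbounded queue grows without limit whenever events arrive faster than the handler drains them.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class EventBacklogSketch {
    /**
     * Simulates an unbounded event queue. Each tick, arrivalsPerTick events
     * are enqueued and at most handledPerTick events are dequeued.
     * Returns the number of events still queued after the last tick.
     */
    static int backlogAfter(int ticks, int arrivalsPerTick, int handledPerTick) {
        Queue<Integer> queue = new ArrayDeque<>();
        int nextEventId = 0;
        for (int t = 0; t < ticks; t++) {
            // Producers (task attempts in the description) keep sending events.
            for (int i = 0; i < arrivalsPerTick; i++) {
                queue.add(nextEventId++);
            }
            // The single consumer handles only what it has time for.
            for (int i = 0; i < handledPerTick && !queue.isEmpty(); i++) {
                queue.poll();
            }
        }
        return queue.size();
    }

    public static void main(String[] args) {
        // Slow handler: the backlog grows by 90 events every tick.
        System.out.println(backlogAfter(1000, 100, 10));
        // Handler keeps pace: the queue stays empty.
        System.out.println(backlogAfter(1000, 100, 100));
    }
}
```

In the real AM each queued status update also holds its counters, so the heap cost per backlog entry is far larger than the integers used in this sketch.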

Attachments

MAPREDUCE-5043.patch (8 kB), 02/Mar/13 22:01, Jason Darrell Lowe

Issue Links

is related to: MAPREDUCE-5124 AM lacks flow control for task events (Resolved)

People

Assignee: Jason Darrell Lowe

Reporter: Jason Darrell Lowe

Votes: 0

Watchers: 7

Dates

Created: 02/Mar/13 00:41

Updated: 03/Sep/14 22:57

Resolved: 04/Mar/13 20:14