Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Invalid
Description
When processing large datasets, Hadoop can reach a state where all
containers run reduce tasks and no map tasks are scheduled. The
application does not fail but remains stuck in this state without making
any forward progress, and has to be terminated manually.
This bug is due to integer overflow in scheduleReduces() of
RMContainerAllocator. For large data sizes the variable
netScheduledMapMem overflows and takes a negative value, which produces
an excessively large finalReduceMemLimit and rampup value. In almost all
cases this rampup value exceeds the total number of reduce tasks, so the
AM tries to assign every reduce task at once. If the number of reduce
tasks is greater than the number of container slots, all slots are
taken up by reduce tasks, leaving none for maps.
With a 128 MB block size and a 2 GB map container, the overflow occurs at a 128 TB input. An example reproduction scenario:
- Input data size: 32 TB, block size: 128 MB
- Map container size: 10 GB, reduce container size: 10 GB
- #reducers = 50, cluster memory capacity = 7 x 40 GB, slowstart = 0.0
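The overflow in the scenario above can be checked with a few lines of arithmetic (a standalone sketch; the variable names mirror netScheduledMapMem from scheduleReduces(), but the computation here is deliberately simplified):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // 32 TB input / 128 MB blocks = 262,144 map tasks
        int totalMaps = (32 * 1024 * 1024) / 128;   // 262144
        int mapResourceReqtMb = 10 * 1024;          // 10 GB map container, in MB

        // int arithmetic, as in the buggy code: the product exceeds
        // Integer.MAX_VALUE (2,147,483,647) and wraps around to a negative value
        int netScheduledMapMemInt = totalMaps * mapResourceReqtMb;
        System.out.println(netScheduledMapMemInt);  // -1610612736

        // the same computation in long arithmetic gives the correct total
        long netScheduledMapMemLong = (long) totalMaps * mapResourceReqtMb;
        System.out.println(netScheduledMapMemLong); // 2684354560
    }
}
```

A negative netScheduledMapMem then inflates the remaining memory budget for reduces, which is what drives the ramp-up past the real reduce count.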
A better resolution might be to change the variables used in
RMContainerAllocator from int to long. A simpler fix would be to change
only the local variables of scheduleReduces() to long data types.
A patch is attached for 2.2.0.
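The simpler fix can be sketched as follows. This is a hypothetical simplification, not the actual method: the real scheduleReduces() takes more parameters and applies the slowstart/ramp-up policy, but the essential change is the same, namely doing the memory arithmetic in long locals:

```java
public class RampupSketch {
    // Hypothetical stand-in for the memory-limit computation inside
    // RMContainerAllocator.scheduleReduces(). All sizes are in MB.
    static int computeReduceRampup(int totalMaps, int completedMaps,
                                   int mapResourceReqtMb, int reduceResourceReqtMb,
                                   long availableMemMb) {
        // long locals: the product below can exceed Integer.MAX_VALUE
        // for large jobs, which is exactly the overflow in the bug report
        long netScheduledMapMem = (long) (totalMaps - completedMaps) * mapResourceReqtMb;
        long finalReduceMemLimit = Math.max(0L, availableMemMb - netScheduledMapMem);
        return (int) Math.min(Integer.MAX_VALUE,
                              finalReduceMemLimit / reduceResourceReqtMb);
    }

    public static void main(String[] args) {
        // Reproduction scenario: 262,144 maps of 10 GB each on a
        // 7 x 40 GB cluster; with long arithmetic no memory remains
        // for reduces, so the ramp-up stays at 0 instead of exploding
        System.out.println(computeReduceRampup(
                262144, 0, 10240, 10240, 7L * 40 * 1024)); // 0
    }
}
```

With int locals the subtraction would instead see a negative netScheduledMapMem and hand reduces a memory limit larger than the cluster, which is the failure mode described above.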