[MAPREDUCE-7180] Relaunching Failed Containers - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: mrv1, mrv2
Labels:
None

Description

In my experience, it is very common that a MR job completely fails because a single Mapper/Reducer container is using more memory than has been reserved in YARN. The following message is logging the the MapReduce ApplicationMaster:

Container [pid=46028,containerID=container_e54_1435155934213_16721_01_003666] is running beyond physical memory limits. 
Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.

In this case, the container is re-launched on another node, and of course, it is killed again for the same reason. This process happens three (maybe four?) times before the entire MapReduce job fails. It's often said that the definition of insanity is doing the same thing over and over and expecting different results.

For all intents and purposes, the amount of resources requested by Mappers and Reducers is a fixed amount; based on the default configuration values. Users can set the memory on a per-job basis, but it's a pain, not exact, and requires intimate knowledge of the MapReduce framework and its memory usage patterns.

I propose that if the MR ApplicationMaster detects that a container is killed because of this specific memory resource constraint, that it requests a larger container for the subsequent task attempt.

For example, increase the requested memory size by 50% each time the container fails and the task is retried. This will prevent many Job failures and allow for additional memory tuning, per-Job, after the fact, to get better performance (v.s. fail/succeed).

Attachments

Issue Links

is related to

YARN-9260 Re-Launch ApplicationMasters That Fail With OOM Using Larger Container

Open

YARN-1197 Support changing resources of an allocated container

Open

relates to

YARN-9347 Do not retry containers killed for "running beyond physical memory limits"

Open

Activity

People

Assignee:: Unassigned

Reporter:: David Mollitor

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 31/Jan/19 15:08

Updated:: 26/Feb/20 05:29