[REEF-42] Extra YARN container causing unexpected memory reservations - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: REEF-Runtime-YARN
Labels: None

Description

Our cluster has 4 nodes, each with 13.67GB of memory. When we launch Surf (a long-running job) with 4 evaluators (7GB each), the available memory becomes 6.67GB on 3 nodes and 5.67GB on 1 node (the AM takes 1GB). However, an extra 7GB container request hangs at the RM, for the following reasons.

1. Because Surf is a long-running job, the evaluators that have been allocated never exit, so no room is freed for the extra container. If there were room, REEF would be notified of the extra container's allocation and would release it right away.
2. To avoid YARN-314, we currently never send a 0-container request, which would in effect remove the hanging extra container request.

As a result, the RM keeps trying to allocate the hanging request indefinitely, reserving 7GB on each node. Consequently, the Memory Reserved metric increases and the Memory Available metric decreases.
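
The memory figures in the description can be reproduced with a short calculation. This is only a sketch: it assumes exactly one evaluator lands on each node and that the AM shares the first node, which matches the numbers reported above but is not something YARN guarantees. The function name and the choice of node 0 for the AM are illustrative.

```python
NODE_MEMORY_GB = 13.67  # per-node memory, from the description
EVALUATOR_GB = 7.0      # size of each evaluator container
AM_GB = 1.0             # size of the application master container


def free_memory_per_node(num_nodes, node_memory_gb, evaluator_gb, am_gb):
    """Free memory per node, assuming one evaluator per node and the AM on node 0."""
    free = []
    for node in range(num_nodes):
        used = evaluator_gb + (am_gb if node == 0 else 0.0)
        free.append(round(node_memory_gb - used, 2))
    return free


free = free_memory_per_node(4, NODE_MEMORY_GB, EVALUATOR_GB, AM_GB)
print(free)  # [5.67, 6.67, 6.67, 6.67]

# No node has 7GB free, so the extra 7GB request cannot be satisfied.
print(any(f >= EVALUATOR_GB for f in free))  # False
```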

The same thing happens when we explicitly request more than the cluster's capacity, say 5 evaluators of 8GB each. The difference is that the hang caused by the extra container is unpredictable.
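
To see why both cases leave exactly one request pending, here is a toy first-fit placement model. It is an assumption-laden illustration, not YARN's scheduler: `place_requests` is a hypothetical helper, and the 1GB AM is assumed to already occupy the first node.

```python
def place_requests(node_free_gb, request_gb, count):
    """Place `count` containers of `request_gb` first-fit; return how many stay pending."""
    free = list(node_free_gb)
    pending = 0
    for _ in range(count):
        for i, available in enumerate(free):
            if available >= request_gb:
                free[i] = available - request_gb
                break
        else:
            # No node can host this container, so the request stays pending.
            pending += 1
    return pending


# Free memory per node after the 1GB AM has been placed on the first node.
nodes = [12.67, 13.67, 13.67, 13.67]

# 4 evaluators plus the extra container, 7GB each.
extra_container_pending = place_requests(nodes, 7.0, 5)

# Explicit over-request: 5 evaluators of 8GB each.
over_request_pending = place_requests(nodes, 8.0, 5)

print(extra_container_pending, over_request_pending)  # 1 1
```

In this model the two cases look the same to the cluster; the only difference is that the second over-request is something the user asked for deliberately.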

chobrian and I discussed the tradeoff between the following options.

1. Send 0-container requests and address YARN-314 differently, either by adding another layer of indirection atop AMRMClient or by replacing it altogether.
2. Wait until YARN-314 is resolved, since our case is uncommon and can be discovered and fixed by the system administrator.

We think the second approach is better. Once YARN-314 is resolved, I'll create a patch that allows sending 0-container requests.

Any suggestions are welcome.

Issue Links

is blocked by

YARN-314 Schedulers should allow resource requests of different sizes at the same priority and location

Open

People

Assignee: John Yang

Reporter: John Yang

Votes: 0

Watchers: 2

Dates

Created: 14/Nov/14 10:29

Updated: 14/Nov/14 16:20