[YARN-10467] ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.10.0, 3.0.3, 3.2.1, 3.1.4
Fix Version/s: 3.2.2, 3.4.0, 3.3.1, 2.10.2, 3.2.3
Component/s: resourcemanager
Labels:
None

Description

In a recent heap analysis, we found that most of the heap was occupied by RMNodeImpl.completedContainers<ContainerIdPBImpl>, which accounted for 19 GB out of 24.3 GB. There were over 86 million ContainerIdPBImpl objects, compared with only 161,601 RMContainerImpl objects, which represent the number of active containers that the RM is still tracking. The ContainerIdPBImpl objects we inspected belong to applications that finished long ago. This indicates a memory leak of ContainerIdPBImpl objects in RMNodeImpl.

Right now, when a container is reported by an NM as completed, it is immediately added to RMNodeImpl.completedContainers and is cleaned up later, after the AM has been notified of its completion in the AM-RM heartbeat. The cleanup can be broken into two steps.

Step 1: The completed container is first added to RMAppAttemptImpl.justFinishedContainers (this happens asynchronously to its being added to RMNodeImpl.completedContainers).
Step 2: During the AM-RM heartbeat, the container is removed from RMAppAttemptImpl.justFinishedContainers and added to RMAppAttemptImpl.finishedContainersSentToAM.

Once a completed container has been added to RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned up from RMNodeImpl.completedContainers.

However, if the AM exits (whether it failed or succeeded) before some recently completed containers have been added to RMAppAttemptImpl.finishedContainersSentToAM by an earlier heartbeat, there will be no future AM-RM heartbeat to perform Step 2 for them. Hence, these objects stay in RMNodeImpl.completedContainers forever.
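The leak scenario can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyNode and ToyAttempt are hypothetical stand-ins for RMNodeImpl and RMAppAttemptImpl, container IDs are plain strings, and the guaranteed cleanup is collapsed into the heartbeat call itself for brevity. It is not the ResourceManager code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the bookkeeping described above. All class and method names
// are hypothetical; this is not the ResourceManager implementation.
class CompletedContainerLeakSketch {

    // Stand-in for RMNodeImpl.
    static class ToyNode {
        final Set<String> completedContainers = new HashSet<>();

        // Called when the NM reports a container as completed.
        void containerCompleted(String containerId) {
            completedContainers.add(containerId);
        }

        // Drops containers whose completion has been delivered to the AM.
        void cleanUp(List<String> containerIds) {
            completedContainers.removeAll(containerIds);
        }
    }

    // Stand-in for RMAppAttemptImpl.
    static class ToyAttempt {
        final List<String> justFinishedContainers = new ArrayList<>();
        final List<String> finishedContainersSentToAM = new ArrayList<>();

        // Step 1: record the completed container on the attempt.
        void containerFinished(String containerId) {
            justFinishedContainers.add(containerId);
        }

        // Step 2: the AM-RM heartbeat moves containers to
        // finishedContainersSentToAM. Assumption: the node cleanup is
        // triggered right here to keep the sketch short.
        void amHeartbeat(ToyNode node) {
            finishedContainersSentToAM.addAll(justFinishedContainers);
            node.cleanUp(justFinishedContainers);
            justFinishedContainers.clear();
        }
    }

    // Returns whatever the node still holds after the AM has exited.
    static Set<String> simulate() {
        ToyNode node = new ToyNode();
        ToyAttempt attempt = new ToyAttempt();

        // container_1 completes and is delivered in a heartbeat.
        node.containerCompleted("container_1");
        attempt.containerFinished("container_1");
        attempt.amHeartbeat(node);

        // container_2 completes, but the AM exits before its next
        // heartbeat, so Step 2 never runs for it.
        node.containerCompleted("container_2");
        attempt.containerFinished("container_2");

        return node.completedContainers;
    }

    public static void main(String[] args) {
        System.out.println("Still held by node: " + simulate());
    }
}
```

In this model, container_1 is cleaned up because a heartbeat followed its completion, while container_2 remains in the node's set indefinitely because nothing ever triggers Step 2 after the AM is gone.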

We have observed in MR that an AM can decide to exit once all of its tasks have succeeded, without waiting to be notified of the completion of every container, or the AM may simply die suddenly (e.g. from an OOM). Spark and other frameworks may behave similarly.

Attachments

YARN-10467.00.patch              27/Oct/20 03:21   11 kB   Haibo Chen
YARN-10467.01.patch              27/Oct/20 16:37   11 kB   Haibo Chen
YARN-10467.02.patch              28/Oct/20 16:38   11 kB   Haibo Chen
YARN-10467.branch-2.10.00.patch  27/Oct/20 01:06   11 kB   Haibo Chen
YARN-10467.branch-2.10.01.patch  27/Oct/20 16:41   26 kB   Haibo Chen
YARN-10467.branch-2.10.02.patch  27/Oct/20 21:21   11 kB   Haibo Chen
YARN-10467.branch-2.10.03.patch  28/Oct/20 16:23   11 kB   Haibo Chen

People

Assignee: Haibo Chen

Reporter: Haibo Chen

Votes: 0

Watchers: 7

Dates

Created: 19/Oct/20 21:16

Updated: 10/Jun/21 08:14

Resolved: 28/Oct/20 17:49