[YARN-9195] RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Critical
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: client
Labels: None

Description

Hi, all:

We recently encountered a serious problem in the ResourceManager (RM): after an RM failover, the pending container count of one RM queue became negative. Because RM queues are managed in a hierarchical structure, the root queue's pending container count eventually became negative as well, and scheduling for the whole cluster was affected.
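
The upward propagation described above can be illustrated with a minimal sketch. It is a toy model, not the RM's actual queue code: QueueNode, pendingContainers, and updatePending are hypothetical names, and the model assumes that every change to a leaf queue's pending count is also applied to each of its ancestors.

```java
/** Toy model of a hierarchical queue; hypothetical, not the RM's queue classes. */
class QueueNode {
    final String name;
    final QueueNode parent;   // null for the root queue
    int pendingContainers;    // pending containers in this queue's subtree

    QueueNode(String name, QueueNode parent) {
        this.name = name;
        this.parent = parent;
    }

    /** Applies a pending-container delta to this queue and to every ancestor. */
    void updatePending(int delta) {
        for (QueueNode q = this; q != null; q = q.parent) {
            q.pendingContainers += delta;
        }
    }

    public static void main(String[] args) {
        QueueNode root = new QueueNode("root", null);
        QueueNode batch = new QueueNode("root.batch", root);
        QueueNode adhoc = new QueueNode("root.adhoc", root);

        adhoc.updatePending(2);   // a legitimate ask for two containers
        batch.updatePending(-3);  // a bogus negative ask from a misbehaving AM

        // The negative leaf outweighs its sibling and drags root below zero.
        System.out.println(batch.name + " pending = " + batch.pendingContainers);
        System.out.println(root.name + " pending = " + root.pendingContainers);
    }
}
```

In this model a single negative leaf is enough to corrupt the totals of every queue above it, which is why the effect is not confined to the application that sent the bad request.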

Both our RM server and the AMRM client in our application are based on YARN 3.1, and our application uses the AMRMClientAsync#addSchedulingRequests() method to request resources from the RM.

After investigation, we found that the direct cause was that the numAllocations of some AMs' requests became negative after the RM failed over. At least three conditions are necessary:
(1) The AMRM client uses schedulingRequests, and the application sets numAllocations to zero for a schedulingRequest. In our batch job scenario, the numAllocations of a schedulingRequest can drop to zero because, in theory, a full batch job can run using only one container.
(2) The RM fails over.
(3) Before the AM re-registers with the RM after the RM restarts, the RM has already recovered some of the containers previously assigned to the application.

Here are some more details about the implementation:
(1) After the RM recovers, it sends all live containers to the AM when the AM re-registers, through RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
(2) During registerApplicationMaster, AMRMClientImpl calls removeFromOutstandingSchedulingRequests as soon as the AM receives ContainersFromPreviousAttempts, without checking whether these containers were already assigned to it before. As a consequence, its outstanding requests may be decreased unexpectedly, even in cases where they do not become negative.
(3) The RM performs no sanity check to validate requests from AMs.
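
To make the bookkeeping problem in (2) concrete, here is a simplified stand-alone model. It is not the AMRMClientImpl code: UncheckedOutstandingRequests and its methods are hypothetical, and outstanding requests are reduced to a map from allocationRequestId to the remaining numAllocations. The only behaviour it assumes is the one described above, namely that each container reported from previous attempts decrements the matching request without any check.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the AM-side bookkeeping; not Hadoop code. */
class UncheckedOutstandingRequests {
    // allocationRequestId -> remaining numAllocations
    final Map<Long, Integer> outstanding = new HashMap<>();

    void addRequest(long allocationRequestId, int numAllocations) {
        outstanding.put(allocationRequestId, numAllocations);
    }

    /**
     * Models re-registration: one decrement per container from previous
     * attempts, without checking whether it was already counted.
     */
    void onReRegister(List<Long> requestIdsOfPreviousContainers) {
        for (long id : requestIdsOfPreviousContainers) {
            Integer remaining = outstanding.get(id);
            if (remaining != null) {
                outstanding.put(id, remaining - 1);
            }
        }
    }

    public static void main(String[] args) {
        UncheckedOutstandingRequests am = new UncheckedOutstandingRequests();
        am.addRequest(1L, 0); // already fully satisfied before the failover
        am.addRequest(2L, 3); // 5 asked, 2 already received, 3 still pending

        // The recovered RM reports one old container for request 1 and two for request 2.
        am.onReRegister(Arrays.asList(1L, 2L, 2L));

        System.out.println("request 1: " + am.outstanding.get(1L)); // -1: negative
        System.out.println("request 2: " + am.outstanding.get(2L)); // 1: decreased unexpectedly
    }
}
```

Request 1 corresponds to the negative case and request 2 to the case where the count is wrong but still positive, so the second is much harder to notice in practice.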

To illustrate this case, I've written test cases based on the latest Hadoop trunk and posted them in the attachment. You can try testAMRMClientWithNegativePendingRequestsOnRMRestart and testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.

To solve this issue, I propose filtering out already-allocated containers before calling removeFromOutstandingSchedulingRequests in AMRMClientImpl during registerApplicationMaster. Some sanity checks are also needed to prevent things from getting worse.
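
One possible shape of this proposal, as a sketch under assumptions rather than the attached patch: GuardedOutstandingRequests and all of its members are hypothetical names, outstanding requests are modelled as a map from allocationRequestId to the remaining numAllocations, and the client is assumed to remember the IDs of containers it has already counted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the proposed filtering and sanity checks; not Hadoop code. */
class GuardedOutstandingRequests {
    private final Map<Long, Integer> outstanding = new HashMap<>();
    private final Set<String> countedContainerIds = new HashSet<>();

    void addRequest(long allocationRequestId, int numAllocations) {
        // Sanity check: never accept a negative ask.
        if (numAllocations < 0) {
            throw new IllegalArgumentException("numAllocations must be >= 0");
        }
        outstanding.put(allocationRequestId, numAllocations);
    }

    /** Normal allocation path: count each container at most once. */
    void onAllocated(String containerId, long allocationRequestId) {
        if (countedContainerIds.add(containerId)) {
            decrement(allocationRequestId);
        }
    }

    /** Re-registration path: skip containers that were already counted. */
    void onReRegister(Map<String, Long> previousContainers) {
        for (Map.Entry<String, Long> e : previousContainers.entrySet()) {
            onAllocated(e.getKey(), e.getValue());
        }
    }

    private void decrement(long allocationRequestId) {
        // Sanity check: clamp at zero so the count can never go negative.
        outstanding.computeIfPresent(allocationRequestId, (k, n) -> Math.max(0, n - 1));
    }

    int pending(long allocationRequestId) {
        return outstanding.getOrDefault(allocationRequestId, 0);
    }

    public static void main(String[] args) {
        GuardedOutstandingRequests am = new GuardedOutstandingRequests();
        am.addRequest(2L, 5);
        am.onAllocated("c1", 2L);
        am.onAllocated("c2", 2L);

        // After failover the RM reports c1 and c2 again, plus a new container c3.
        Map<String, Long> previous = new LinkedHashMap<>();
        previous.put("c1", 2L);
        previous.put("c2", 2L);
        previous.put("c3", 2L);
        am.onReRegister(previous);

        System.out.println("pending = " + am.pending(2L)); // 2: only c3 was newly counted
    }
}
```

The filter addresses the double counting on the client side, while the clamp and the argument check are defensive measures; a comparable validation on the RM side would still be needed to reject bad requests from clients that lack the fix.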

More comments and suggestions are welcome.

Attachments

cases_to_recreate_negative_pending_requests_scenario.diff, 14/Jan/19 04:02, 17 kB, MalcolmSanders
YARN-9195.001.patch, 25/Jan/19 07:56, 26 kB, MalcolmSanders
YARN-9195.002.patch, 25/Jan/19 12:03, 27 kB, MalcolmSanders
YARN-9195.003.patch, 21/Feb/19 14:32, 34 kB, MalcolmSanders

Issue Links

relates to

YARN-6168 Restarted RM may not inform AM about all existing containers (Resolved)

YARN-7565 Yarn service pre-maturely releases the container after AM restart (Resolved)

People

Assignee: Ashutosh Gupta

Reporter: MalcolmSanders

Votes: 0

Watchers: 8

Dates

Created: 14/Jan/19 07:40

Updated: 01/Sep/22 09:09