Description
Regarding AMRMClientImpl
Scenario 1:
Given a ContainerRequest x with Resource y: addContainerRequest is called z times with x, allocate is called, and at least one of the z allocated containers is started. If addContainerRequest is then called once more, followed by another allocate call to the RM, (z+1) containers are allocated where only 1 container is expected.
Scenario 2:
Same as scenario 1, except that no containers are started between the allocate calls.
Analyzing the debug logs of AMRMClientImpl, I found that (z+1) containers are indeed requested in both scenarios, but the correct behavior is observed only in the second scenario.
Looking at the implementation, I found that this (z+1) request is caused by the structure of the remoteRequestsTable. Because the table maps Resource to ResourceRequestInfo (Map<Resource, ResourceRequestInfo>), a ResourceRequestInfo holds no information about whether a request has already been sent to the RM.
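To illustrate the bookkeeping problem, here is a minimal sketch in plain Java. It is not the Hadoop code: RequestTableSketch, RequestInfo and the String resource key are hypothetical stand-ins, and allocate only returns the count that would be sent to the RM. It assumes the table keeps a single running count per resource and nothing that records what was already sent.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a request table keyed by resource.
class RequestTableSketch {
    // Stand-in for ResourceRequestInfo: only a running container count,
    // with no record of whether that count has been sent already.
    static final class RequestInfo {
        int numContainers;
    }

    private final Map<String, RequestInfo> table = new HashMap<>();

    // Each call bumps the running count for the resource.
    public void addContainerRequest(String resource) {
        table.computeIfAbsent(resource, r -> new RequestInfo()).numContainers++;
    }

    // Returns the count that would be sent to the RM; the count is not reset.
    public int allocate(String resource) {
        RequestInfo info = table.get(resource);
        return info == null ? 0 : info.numContainers;
    }

    public static void main(String[] args) {
        RequestTableSketch client = new RequestTableSketch();
        int z = 3;
        for (int i = 0; i < z; i++) {
            client.addContainerRequest("y");
        }
        System.out.println("first ask:  " + client.allocate("y"));  // 3
        client.addContainerRequest("y");
        System.out.println("second ask: " + client.allocate("y"));  // 4, i.e. z+1
    }
}
```

In this model the second ask carries z+1 even though only one new request was added, which matches the (z+1) seen in the debug logs.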
There are workarounds for this, such as releasing the excess containers received.
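A rough sketch of that workaround, assuming the AM knows how many containers it actually expected from an allocate call. ExcessContainerSketch and the String container IDs are hypothetical; in a real AM each returned ID would be handed to the client's release call (AMRMClient has releaseAssignedContainer for this purpose).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides which allocated containers are surplus.
class ExcessContainerSketch {
    // Keeps the first `expected` containers and returns the rest for release.
    public static List<String> excess(List<String> allocated, int expected) {
        int keep = Math.max(0, Math.min(expected, allocated.size()));
        return new ArrayList<>(allocated.subList(keep, allocated.size()));
    }

    public static void main(String[] args) {
        List<String> allocated = List.of("c1", "c2", "c3", "c4");
        // One container was expected, so three are surplus.
        System.out.println(excess(allocated, 1));  // [c2, c3, c4]
    }
}
```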
The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM.
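The idea can be sketched in plain Java as follows. This is an interpretation of the description, not the patch itself: ResetOnSendSketch, Request and RequestInfo are hypothetical stand-ins, the resource key is a String, and allocate is assumed to always succeed in sending.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of "start a fresh request once it is sent".
class ResetOnSendSketch {
    // Stand-in for ResourceRequest: only a container count.
    static final class Request {
        int numContainers;
    }

    // Stand-in for ResourceRequestInfo: holds the request still to be sent.
    static final class RequestInfo {
        Request remoteRequest = new Request();
    }

    private final Map<String, RequestInfo> table = new HashMap<>();

    public void addContainerRequest(String resource) {
        table.computeIfAbsent(resource, r -> new RequestInfo())
             .remoteRequest.numContainers++;
    }

    // Returns the count that would be sent, then starts a new request so that
    // containers already asked for are not asked for again.
    public int allocate(String resource) {
        RequestInfo info = table.get(resource);
        if (info == null) {
            return 0;
        }
        int asked = info.remoteRequest.numContainers;
        info.remoteRequest = new Request();  // assumption: the send succeeded
        return asked;
    }

    public static void main(String[] args) {
        ResetOnSendSketch client = new ResetOnSendSketch();
        for (int i = 0; i < 3; i++) {
            client.addContainerRequest("y");
        }
        System.out.println("first ask:  " + client.allocate("y"));  // 3
        client.addContainerRequest("y");
        System.out.println("second ask: " + client.allocate("y"));  // 1
    }
}
```

With the request replaced after each successful send, the second ask in this model carries only the one newly added request.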
The patch includes a test that covers scenario 1.
Attachments
Issue Links
- causes: YARN-9877 Intermittent TIME_OUT of LogAggregationReport (Resolved)
- is related to: SLIDER-829 when containers are allocated, explicitly cancel the request (Resolved)
- relates to: YARN-110 AM releases too many containers due to the protocol (Open)