In my investigation into the intermittent failures of tests for
TWILL-181 I discovered this bug. This code (starting on line 703 of ApplicationMasterService):
There is a case, when instances fail at different times rather than simultaneously, where the retries for those instances are spread over two invocations of `ApplicationMasterService.handleCompleted`. The retries then end up in separate `RunnableContainerRequests` and are provisioned separately. Because the code above does not anticipate this case, the first provisionRequest never appears to be satisfied, so it is never polled and the total can never be met.
The first provisionRequest does not appear to be satisfied because the number of expected containers never equals the number of running containers. The code as-is expects every request to be either an `ALLOCATE_ONE_INSTANCE_AT_A_TIME` request or a request for all instances. In the case of retries, requests may come in all at once or in other patterns that result in multiple provision requests.
When retrying instances, the code should set the request type to `ALLOCATE_ONE_INSTANCE_AT_A_TIME` to reflect the situation.
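
To make the failure mode concrete, here is a minimal, self-contained sketch of the situation described above. It is a toy model, not the actual `ApplicationMasterService` code: the class `ProvisionQueueSketch`, the `Request` type, the `simulate` method, and the enum constant `ALL_INSTANCES` are all hypothetical names chosen for illustration. Only `ALLOCATE_ONE_INSTANCE_AT_A_TIME` comes from the report. The sketch assumes the head of the provisioning queue is polled only when the running count equals the expected count or the head request is of the one-at-a-time type.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class ProvisionQueueSketch {
    // ALL_INSTANCES is a hypothetical name for the "all instances" request type.
    enum Type { ALL_INSTANCES, ALLOCATE_ONE_INSTANCE_AT_A_TIME }

    // Hypothetical stand-in for one provision request covering `count` containers.
    static final class Request {
        final Type type;
        final int count;

        Request(Type type, int count) {
            this.type = type;
            this.count = count;
        }
    }

    /**
     * Simulates launching the containers of each queued retry request in order.
     * The head request is polled only if the running count now equals the
     * expected count, or if the request is of the one-at-a-time type.
     * Returns the number of requests left stuck in the queue.
     */
    static int simulate(int expected, int running, Type retryType, int... retryBatches) {
        Queue<Request> provisioning = new ArrayDeque<>();
        for (int batch : retryBatches) {
            provisioning.add(new Request(retryType, batch));
        }
        while (!provisioning.isEmpty()) {
            Request head = provisioning.peek();
            running += head.count; // containers for the head request have started
            boolean satisfied = running == expected
                || head.type == Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME;
            if (!satisfied) {
                break; // head never leaves the queue, later requests are never reached
            }
            provisioning.poll();
        }
        return provisioning.size();
    }

    public static void main(String[] args) {
        // 3 instances expected, 1 still running, 2 retries split over two requests.
        System.out.println(simulate(3, 1, Type.ALL_INSTANCES, 1, 1));
        System.out.println(simulate(3, 1, Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME, 1, 1));
    }
}
```

In this model, two retries split across two requests of the "all instances" kind leave both requests stuck, because the running count reaches 2 after the first request while 3 are expected. Marking the retries as `ALLOCATE_ONE_INSTANCE_AT_A_TIME` lets the queue drain.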