  1. Apache Twill
  2. TWILL-211

Retries of failed runnable instances may result in unsatisfiable provisioning requests

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.10.0
    • Component/s: core
    • Labels:
      None

      Description

      While investigating the intermittent test failures for TWILL-181, I discovered this bug in the following code (starting on line 703 of ApplicationMasterService):

        if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
            provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) {
          provisioning.poll();
        }
      

      When instances fail at different times rather than simultaneously, their retries can be spread over two invocations of `ApplicationMasterService.handleCompleted`. The retries then end up in separate `RunnableContainerRequests` and are provisioned separately. Because the code above does not anticipate this case, the first provisionRequest never appears to be satisfied, is never polled, and the expected total can never be met.

      The first provisionRequest never appears to be satisfied because, on its own, it cannot bring the number of running containers up to the expected count. The code as written expects every request to be either an `ALLOCATE_ONE_INSTANCE_AT_A_TIME` request or a single request covering all instances. With retries, requests may come in all at once or in other patterns that result in multiple provision requests.

      When retrying instances, the code should set the type to be `ALLOCATE_ONE_INSTANCE_AT_A_TIME` to reflect the situation.
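
      The stuck queue and the proposed fix can be illustrated with a small, self-contained model. This is only a sketch and not the actual `ApplicationMasterService` code: the `ProvisioningSketch` class, the `Request` class, the `simulate` method, and the `DEFAULT` type name are hypothetical stand-ins, and the scenario (three expected instances, two of which failed at different times) is assumed for illustration. Only the polling condition mirrors the snippet quoted above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ProvisioningSketch {

  // Simplified stand-in for AllocationSpecification.Type (DEFAULT is an assumed name).
  public enum Type { DEFAULT, ALLOCATE_ONE_INSTANCE_AT_A_TIME }

  // Hypothetical model of one provision request for a single runnable.
  static final class Request {
    final Type type;
    int remaining; // containers this request still has to launch

    Request(Type type, int instances) {
      this.type = type;
      this.remaining = instances;
    }
  }

  /**
   * Models two retries that arrived in separate handleCompleted invocations,
   * each producing its own one-instance request of the given type.
   * Returns the number of requests left in the queue when no progress is possible.
   */
  public static int simulate(Type retryType) {
    int expected = 3; // instances the runnable should have
    int running = 1;  // two of the three instances have failed

    Queue<Request> provisioning = new ArrayDeque<>();
    provisioning.add(new Request(retryType, 1)); // retry for the first failure
    provisioning.add(new Request(retryType, 1)); // retry for the second failure

    while (!provisioning.isEmpty()) {
      Request head = provisioning.peek();
      if (head.remaining == 0) {
        // The head launched everything it asked for but was never polled: stuck.
        break;
      }
      head.remaining--;
      running++;

      // Same condition as in the snippet from the ticket.
      if (expected == running || head.type == Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME) {
        provisioning.poll();
      }
    }
    return provisioning.size();
  }

  public static void main(String[] args) {
    System.out.println(simulate(Type.DEFAULT));                         // prints 2
    System.out.println(simulate(Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)); // prints 0
  }
}
```

      In this model, the first retry request raises the running count to 2, which does not match the expected 3, so a request of the default type stays at the head of the queue and blocks the second one. Marking retry requests as `ALLOCATE_ONE_INSTANCE_AT_A_TIME` lets each request be polled as soon as its container is launched.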

              People

              • Assignee:
                Martin Serrano (mserrano)
              • Reporter:
                Martin Serrano (mserrano)
              • Votes:
                0
              • Watchers:
                2
