Apache Twill / TWILL-211

Retries of failed runnable instances may result in unsatisfiable provisioning requests

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.10.0
    • Component/s: core
    • Labels:
      None

      Description

      While investigating the intermittent test failures for TWILL-181, I discovered this bug in the following code (starting on line 703 of ApplicationMasterService):

       if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
           provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) {
         provisioning.poll();
       }

      When instances fail at different times rather than simultaneously, their retries can be spread over two invocations of `ApplicationMasterService.handleCompleted`. The retries then become part of separate `RunnableContainerRequests` and are provisioned separately. Because the code above does not anticipate this case, the first provision request never appears to be satisfied, is never polled, and the expected total can never be met.

      The first provision request does not appear to be satisfied because the number of expected containers never equals the number of running containers. The code as written expects every request either to be of type `ALLOCATE_ONE_INSTANCE_AT_A_TIME` or to cover all instances. In the case of retries, requests may come in all at once or in other patterns that result in multiple provision requests.

      When retrying instances, the code should set the type to be `ALLOCATE_ONE_INSTANCE_AT_A_TIME` to reflect the situation.
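The starvation described above can be illustrated with a small stand-alone simulation. This is only a sketch, not the Twill implementation: the class `ProvisioningSketch`, the `Request` type, the method `runningAfterRetries`, and the enum constant `DEFAULT` (a stand-in for any type other than `ALLOCATE_ONE_INSTANCE_AT_A_TIME`) are all hypothetical names. It assumes that each failure queues its own single-instance retry request and that only the head of the queue is ever served, and it reuses the poll condition quoted above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the provisioning queue; not Twill code.
final class ProvisioningSketch {

  // DEFAULT is an assumed stand-in for "not one instance at a time".
  enum Type { DEFAULT, ALLOCATE_ONE_INSTANCE_AT_A_TIME }

  // One provision request asking for a number of instances.
  static final class Request {
    final Type type;
    int remaining;

    Request(Type type, int remaining) {
      this.type = type;
      this.remaining = remaining;
    }
  }

  // Returns how many instances end up running after the retries are processed.
  static int runningAfterRetries(int expected, int failures, Type retryType) {
    Queue<Request> provisioning = new ArrayDeque<>();
    int running = expected;

    // Assumption: each failure is handled by a separate handleCompleted-style
    // call, so each one queues its own single-instance retry request.
    for (int i = 0; i < failures; i++) {
      running--;
      provisioning.add(new Request(retryType, 1));
    }

    // Only the head of the queue is served.
    while (!provisioning.isEmpty()) {
      Request head = provisioning.peek();
      if (head.remaining == 0) {
        // The head was fully launched but never polled: nothing more can happen.
        break;
      }
      head.remaining--;
      running++;

      // Same shape as the condition quoted in the description.
      if (expected == running || head.type == Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME) {
        provisioning.poll();
      }
    }
    return running;
  }

  public static void main(String[] args) {
    // Two staggered failures out of three instances.
    System.out.println(runningAfterRetries(3, 2, Type.DEFAULT));                         // prints 2
    System.out.println(runningAfterRetries(3, 2, Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)); // prints 3
  }
}
```

In this model, two staggered failures with the `DEFAULT` type leave the first request at the head of the queue forever, so the second retry is never served. Marking the retry requests as `ALLOCATE_ONE_INSTANCE_AT_A_TIME` lets each request be polled as soon as its single instance starts.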

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user serranom opened a pull request:

          https://github.com/apache/twill/pull/29

          (TWILL-211) use ALLOCATE_ONE_INSTANCE_AT_A_TIME for retries to prevent poll starvation.

          These changes fix a lingering bug identified by the tests for TWILL-181. When creating a runnable container request, we check whether the number of instances equals one. In such cases we always set the allocation request to use `ALLOCATE_ONE_INSTANCE_AT_A_TIME`. When retries occur, each failed instance results in its own provision request asking for one instance. But if `ALLOCATE_ONE_INSTANCE_AT_A_TIME` is not set, the `handleCompleted` code never considers the request satisfied, since it is waiting for all instances to have been started.

          The check for `null` at line 469 is necessary because these new provision requests do not have a placement policy.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/serranom/twill TWILL-211

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/twill/pull/29.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #29


          commit 999eb3c19427e7e95abb1a83d3e697febaa6741f
          Author: martin <martin@attivio.com>
          Date: 2017-01-29T20:28:18Z

          TWILL-211, use ALLOCATE_ONE_INSTANCE_AT_A_TIME for retries to prevent poll starvation


          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on the issue:

          https://github.com/apache/twill/pull/29

          This LGTM

          +1

          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on the issue:

          https://github.com/apache/twill/pull/29

          Will merge this if there are no more comments.

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/twill/pull/29

          githubbot ASF GitHub Bot added a comment -

          Github user poornachandra commented on the issue:

          https://github.com/apache/twill/pull/29

          LGTM from me too


            People

            • Assignee:
              mserrano Martin Serrano
              Reporter:
              mserrano Martin Serrano