Uploaded image for project: 'Apache Twill'
  1. Apache Twill
  2. TWILL-181

Control the maximum number of retries for failed application starts

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0-incubating
    • Fix Version/s: 0.10.0
    • Component/s: yarn
    • Labels:
      None

      Description

      If an application consistently exits with a non-zero code, twill will attempt to restart indefinitely. I ran into this issue and a list search also reveals others.

      There should be a mechanism to specify the maximum number of retries until the application fails. Ideally by default there would be a non-infinite maximum.

        Issue Links

          Activity

          Hide
          hsaputra Henry Saputra added a comment -

          Thanks for closing this. I wonder if GIt plugin could automatically close this

          Show
          hsaputra Henry Saputra added a comment - Thanks for closing this. I wonder if GIt plugin could automatically close this
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/twill/pull/23

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/twill/pull/23
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on the issue:

          https://github.com/apache/twill/pull/23

          Looks like not related to this PR. Will merge EOD if no more comment.

          Thanks again for the hard work, @serranom

          Show
          githubbot ASF GitHub Bot added a comment - Github user hsaputra commented on the issue: https://github.com/apache/twill/pull/23 Looks like not related to this PR. Will merge EOD if no more comment. Thanks again for the hard work, @serranom
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          I looked at the latest CI run and the failure is related to some `twill-zookeeper` project issue – not yarn

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 I looked at the latest CI run and the failure is related to some `twill-zookeeper` project issue – not yarn
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on the issue:

          https://github.com/apache/twill/pull/23

          I am having problem applying the patch, could you kindly rebase the patch into single commit?
          Thanks

          Show
          githubbot ASF GitHub Bot added a comment - Github user hsaputra commented on the issue: https://github.com/apache/twill/pull/23 I am having problem applying the patch, could you kindly rebase the patch into single commit? Thanks
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          My pleasure. It was a good way to get to know twill code base.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 My pleasure. It was a good way to get to know twill code base.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on the issue:

          https://github.com/apache/twill/pull/23

          Thanks much for working on the patch and for the update, @serranom.

          Did a pass for review.
          LGTM in general, add mini comment. Other than that +1 will merge after comment addressed.

          Show
          githubbot ASF GitHub Bot added a comment - Github user hsaputra commented on the issue: https://github.com/apache/twill/pull/23 Thanks much for working on the patch and for the update, @serranom. Did a pass for review. LGTM in general, add mini comment. Other than that +1 will merge after comment addressed.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r99416151

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -477,10 +493,30 @@ void handleCompleted(YarnContainerStatus status, Multiset<String> restartRunnabl
          }
          }

          • private boolean shouldRetry(int exitCode) {
          • return exitCode != ContainerExitCodes.SUCCESS
          • && exitCode != ContainerExitCodes.DISKS_FAILED
          • && exitCode != ContainerExitCodes.INIT_FAILED;
            + private boolean shouldRetry(String runnableName, int instanceId, int exitCode) {
            + boolean possiblyRetry =
            + exitCode != ContainerExitCodes.SUCCESS &&
            + exitCode != ContainerExitCodes.DISKS_FAILED &&
            + exitCode != ContainerExitCodes.INIT_FAILED;
            +
            + if (possiblyRetry) {
            + int max = getMaxRetries(runnableName);
            + if (max == Integer.MAX_VALUE) { + return true; // retry without special log msg + }

            +
            + if (getRetryCount(runnableName, instanceId) == max) {

              • End diff –

          Since we call `getRetryCount` for this check, might as well cache it as local var to make sure we got the right one for this request:

          ```
          int retryCount = getRetryCount(runnableName, instanceId);
          if (getRetryCount(runnableName, instanceId) == max)

          { ... }

          else {
          LOG.info("Attempting {} of {} retries for instance {} of runnable {}.", retryCount + 1,
          max, instanceId, runnableName);
          return true;
          }
          ```

          Show
          githubbot ASF GitHub Bot added a comment - Github user hsaputra commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r99416151 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -477,10 +493,30 @@ void handleCompleted(YarnContainerStatus status, Multiset<String> restartRunnabl } } private boolean shouldRetry(int exitCode) { return exitCode != ContainerExitCodes.SUCCESS && exitCode != ContainerExitCodes.DISKS_FAILED && exitCode != ContainerExitCodes.INIT_FAILED; + private boolean shouldRetry(String runnableName, int instanceId, int exitCode) { + boolean possiblyRetry = + exitCode != ContainerExitCodes.SUCCESS && + exitCode != ContainerExitCodes.DISKS_FAILED && + exitCode != ContainerExitCodes.INIT_FAILED; + + if (possiblyRetry) { + int max = getMaxRetries(runnableName); + if (max == Integer.MAX_VALUE) { + return true; // retry without special log msg + } + + if (getRetryCount(runnableName, instanceId) == max) { End diff – Since we call `getRetryCount` for this check, might as well cache it as local var to make sure we got the right one for this request: ``` int retryCount = getRetryCount(runnableName, instanceId); if (getRetryCount(runnableName, instanceId) == max) { ... } else { LOG.info("Attempting {} of {} retries for instance {} of runnable {}.", retryCount + 1, max, instanceId, runnableName); return true; } ```
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          I found that there is still a race condition in ApplicationMasterService.launchRunnable. If the number of instances is increased right after the original request is fullfilled, the current logic can result in the original request not being polled resulting in future requests hanging. This seems to be a fairly unlikely case in the real world, but I'll file a JIRA for it. For now, I've reworked the test to wait until launchRunnable has done the polling for the original request. This PR is now complete.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 I found that there is still a race condition in ApplicationMasterService.launchRunnable. If the number of instances is increased right after the original request is fullfilled, the current logic can result in the original request not being polled resulting in future requests hanging. This seems to be a fairly unlikely case in the real world, but I'll file a JIRA for it. For now, I've reworked the test to wait until launchRunnable has done the polling for the original request. This PR is now complete.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          ugh. I'll take a look, I may have messed up the merge.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 ugh. I'll take a look, I may have messed up the merge.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          I think this is set now. I've updated the PR with instance-based maxRetries changes.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 I think this is set now. I've updated the PR with instance-based maxRetries changes.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          I have new changes I will add to this PR once https://github.com/apache/twill/pull/29 is resolved since the tests will fail intermittently without that fix.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 I have new changes I will add to this PR once https://github.com/apache/twill/pull/29 is resolved since the tests will fail intermittently without that fix.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          I have discovered the cause of the intermittent failure.

          This code (starting on line 703 of ApplicationMasterService, comments added by me):

          /*

          • The provisionRequest will either contain a single container (ALLOCATE_ONE_INSTANCE_AT_A_TIME), or all the
          • containers to satisfy the expectedContainers count. In the later case, the provision request is complete once
          • all the containers have run at which point we poll() to remove the provisioning request.
            */
            if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) ||
            provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) { provisioning.poll(); }

          There is a case when instances are failing (but not simultaneously) where the retries for the instances will be spread over two invocations of `ApplicationMasterService.handleCompleted`. This means they will be part of separate `RunnableContainerRequests` and thus will be provisioned separately. But because the code above does not anticipate this case, the first provisionRequest will never appear to be satisfied, never be polled and the total can never be met. This is an existing bug, since I think it could happen before my new code – its just that my test tickles it. If there is agreement, I can file a new JIRA and PR request or just fix it with this PR. Let me know.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 I have discovered the cause of the intermittent failure. This code (starting on line 703 of ApplicationMasterService, comments added by me): /* The provisionRequest will either contain a single container (ALLOCATE_ONE_INSTANCE_AT_A_TIME), or all the containers to satisfy the expectedContainers count. In the later case, the provision request is complete once all the containers have run at which point we poll() to remove the provisioning request. */ if (expectedContainers.getExpected(runnableName) == runningContainers.count(runnableName) || provisioning.peek().getType().equals(AllocationSpecification.Type.ALLOCATE_ONE_INSTANCE_AT_A_TIME)) { provisioning.poll(); } There is a case when instances are failing (but not simultaneously) where the retries for the instances will be spread over two invocations of `ApplicationMasterService.handleCompleted`. This means they will be part of separate `RunnableContainerRequests` and thus will be provisioned separately. But because the code above does not anticipate this case, the first provisionRequest will never appear to be satisfied, never be polled and the total can never be met. This is an existing bug, since I think it could happen before my new code – its just that my test tickles it. If there is agreement, I can file a new JIRA and PR request or just fix it with this PR. Let me know.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user poornachandra commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r98337486

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
          private final Location applicationLocation;
          private final Set<String> runnableNames;
          private final Map<String, Map<String, String>> logLevels;
          + private final Map<String, Integer> maxRetries;
          — End diff –

          👍

          Show
          githubbot ASF GitHub Bot added a comment - Github user poornachandra commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r98337486 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -113,9 +117,11 @@ public Integer apply(BitSet input) { private final Location applicationLocation; private final Set<String> runnableNames; private final Map<String, Map<String, String>> logLevels; + private final Map<String, Integer> maxRetries; — End diff – 👍
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r98334746

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
          private final Location applicationLocation;
          private final Set<String> runnableNames;
          private final Map<String, Map<String, String>> logLevels;
          + private final Map<String, Integer> maxRetries;
          — End diff –

          but on second thought, the additional complexity of tracking by instance id would handle the following case better:

          • initially start with `x` instances, all of which succeed.
          • add 1 more instance, which fails

          If tracking by instance id, the new instance will get maxRetries. Without tracking by instance id, the new instance would get `maxRetries*(x+1)` retries. So I will add the complexity of keeping track by id. It seems worth it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r98334746 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -113,9 +117,11 @@ public Integer apply(BitSet input) { private final Location applicationLocation; private final Set<String> runnableNames; private final Map<String, Map<String, String>> logLevels; + private final Map<String, Integer> maxRetries; — End diff – but on second thought, the additional complexity of tracking by instance id would handle the following case better: initially start with `x` instances, all of which succeed. add 1 more instance, which fails If tracking by instance id, the new instance will get maxRetries. Without tracking by instance id, the new instance would get `maxRetries*(x+1)` retries. So I will add the complexity of keeping track by id. It seems worth it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r98334088

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
          private final Location applicationLocation;
          private final Set<String> runnableNames;
          private final Map<String, Map<String, String>> logLevels;
          + private final Map<String, Integer> maxRetries;
          — End diff –

          @poornachandra, I don't see anything linking one instance request to a retry of that failure. Failures get added to the multiset which then results in a new request for that many instances. This means in the end, there is nothing tying one attempt for a particular instanceId to the next one. So we could keep track of retries by instanceId, but it would end up functionally being the same as keeping track of retries for the entire Runnable. I think it would just add more complexity for little benefit to track by instanceId.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r98334088 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -113,9 +117,11 @@ public Integer apply(BitSet input) { private final Location applicationLocation; private final Set<String> runnableNames; private final Map<String, Map<String, String>> logLevels; + private final Map<String, Integer> maxRetries; — End diff – @poornachandra, I don't see anything linking one instance request to a retry of that failure. Failures get added to the multiset which then results in a new request for that many instances. This means in the end, there is nothing tying one attempt for a particular instanceId to the next one. So we could keep track of retries by instanceId, but it would end up functionally being the same as keeping track of retries for the entire Runnable. I think it would just add more complexity for little benefit to track by instanceId.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          @hsaputra , that is the failure I see as intermittent. My comment above has a relevant log excerpt. I'm planning on trying to dig into it today or tomorrow.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 @hsaputra , that is the failure I see as intermittent. My comment above has a relevant log excerpt. I'm planning on trying to dig into it today or tomorrow.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r98294126

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
          private final Location applicationLocation;
          private final Set<String> runnableNames;
          private final Map<String, Map<String, String>> logLevels;
          + private final Map<String, Integer> maxRetries;
          — End diff –

          I'm taking a look at this.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r98294126 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -113,9 +117,11 @@ public Integer apply(BitSet input) { private final Location applicationLocation; private final Set<String> runnableNames; private final Map<String, Map<String, String>> logLevels; + private final Map<String, Integer> maxRetries; — End diff – I'm taking a look at this.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on the issue:

          https://github.com/apache/twill/pull/23

          Seems like one of the tests failed related to new one you added: ` MaxRetriesTestRun.maxRetriesTwoInstances`

          Could you quickly take a peek?
          https://s3.amazonaws.com/archive.travis-ci.org/jobs/192819935/log.txt

          Show
          githubbot ASF GitHub Bot added a comment - Github user hsaputra commented on the issue: https://github.com/apache/twill/pull/23 Seems like one of the tests failed related to new one you added: ` MaxRetriesTestRun.maxRetriesTwoInstances` Could you quickly take a peek? https://s3.amazonaws.com/archive.travis-ci.org/jobs/192819935/log.txt
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user poornachandra commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r98088342

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
          private final Location applicationLocation;
          private final Set<String> runnableNames;
          private final Map<String, Map<String, String>> logLevels;
          + private final Map<String, Integer> maxRetries;
          — End diff –

          Instance ids for runnable instances go from `0` to `num-instances - 1`. If an instance gets restarted, then it retains the same instance id. Using this instance id we can track the number of times a particular instance was restarted.

          You can get the instance id in `handleCompleted` method by using `getInstanceId(RunId runId)` method.

          Show
          githubbot ASF GitHub Bot added a comment - Github user poornachandra commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r98088342 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -113,9 +117,11 @@ public Integer apply(BitSet input) { private final Location applicationLocation; private final Set<String> runnableNames; private final Map<String, Map<String, String>> logLevels; + private final Map<String, Integer> maxRetries; — End diff – Instance ids for runnable instances go from `0` to `num-instances - 1`. If an instance gets restarted, then it retains the same instance id. Using this instance id we can track the number of times a particular instance was restarted. You can get the instance id in `handleCompleted` method by using `getInstanceId(RunId runId)` method.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r98083899

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
          private final Location applicationLocation;
          private final Set<String> runnableNames;
          private final Map<String, Map<String, String>> logLevels;
          + private final Map<String, Integer> maxRetries;
          — End diff –

          from my analysis the association of instance ids does not necessarily correspond to specific processes. it looked like there is a pool of requests, a new request being serviced gets the lowest instance id and a failed request gets put back on the queue. this is why i went with an instance adjusted number of retries. did i miss something?

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r98083899 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -113,9 +117,11 @@ public Integer apply(BitSet input) { private final Location applicationLocation; private final Set<String> runnableNames; private final Map<String, Map<String, String>> logLevels; + private final Map<String, Integer> maxRetries; — End diff – from my analysis the association of instance ids does not necessarily correspond to specific processes. it looked like there is a pool of requests, a new request being serviced gets the lowest instance id and a failed request gets put back on the queue. this is why i went with an instance adjusted number of retries. did i miss something?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user poornachandra commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r98075925

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
          private final Location applicationLocation;
          private final Set<String> runnableNames;
          private final Map<String, Map<String, String>> logLevels;
          + private final Map<String, Integer> maxRetries;
          — End diff –

          On further thought, I think if we track the restarts per instance id then it would simplify the contract for max retries. This will be consistent with what the javadoc says too -
          ```
          Sets the maximum number of times (per instance) a runnable will be retried if it exits without success. The default behavior is to retry indefinitely.
          ```

          Show
          githubbot ASF GitHub Bot added a comment - Github user poornachandra commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r98075925 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -113,9 +117,11 @@ public Integer apply(BitSet input) { private final Location applicationLocation; private final Set<String> runnableNames; private final Map<String, Map<String, String>> logLevels; + private final Map<String, Integer> maxRetries; — End diff – On further thought, I think if we track the restarts per instance id then it would simplify the contract for max retries. This will be consistent with what the javadoc says too - ``` Sets the maximum number of times (per instance) a runnable will be retried if it exits without success. The default behavior is to retry indefinitely. ```
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          @hsaputra, since I have a bug to fix, the updated request will trigger the rebuild. thanks.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 @hsaputra, since I have a bug to fix, the updated request will trigger the rebuild. thanks.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r97994162

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
          private final Location applicationLocation;
          private final Set<String> runnableNames;
          private final Map<String, Map<String, String>> logLevels;
          + private final Map<String, Integer> maxRetries;
          — End diff –

          Good catch. Yes. I missed that. I will update the pull request with a fix. When the number of instances increases, `deltaInstances * maxRetries` will be added to this instance adjusted maximum. When the number of instances decreases, the `deltaInstance*maxRetries` will be subtracted from the instance adjusted maximum. Additionally upon a decrease, if the number of retries exceeds the instance adjusted maximum it will be set to the instance adjusted maximum.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r97994162 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -113,9 +117,11 @@ public Integer apply(BitSet input) { private final Location applicationLocation; private final Set<String> runnableNames; private final Map<String, Map<String, String>> logLevels; + private final Map<String, Integer> maxRetries; — End diff – Good catch. Yes. I missed that. I will update the pull request with a fix. When the number of instances increases, `deltaInstances * maxRetries` will be added to this instance adjusted maximum. When the number of instances decreases, the `deltaInstance*maxRetries` will be subtracted from the instance adjusted maximum. Additionally upon a decrease, if the number of retries exceeds the instance adjusted maximum it will be set to the instance adjusted maximum.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user poornachandra commented on a diff in the pull request:

          https://github.com/apache/twill/pull/23#discussion_r97944230

          — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java —
          @@ -113,9 +117,11 @@ public Integer apply(BitSet input) {
          private final Location applicationLocation;
          private final Set<String> runnableNames;
          private final Map<String, Map<String, String>> logLevels;
          + private final Map<String, Integer> maxRetries;
          — End diff –

          The max retries count in this map will have to be updated when the instance count of a runnable changes, right?

          Show
          githubbot ASF GitHub Bot added a comment - Github user poornachandra commented on a diff in the pull request: https://github.com/apache/twill/pull/23#discussion_r97944230 — Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java — @@ -113,9 +117,11 @@ public Integer apply(BitSet input) { private final Location applicationLocation; private final Set<String> runnableNames; private final Map<String, Map<String, String>> logLevels; + private final Map<String, Integer> maxRetries; — End diff – The max retries count in this map will have to be updated when the instance count of a runnable changes, right?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user serranom commented on the issue:

          https://github.com/apache/twill/pull/23

          Thanks Henry! No rush on my account, I was just checking in.

          Show
          githubbot ASF GitHub Bot added a comment - Github user serranom commented on the issue: https://github.com/apache/twill/pull/23 Thanks Henry! No rush on my account, I was just checking in.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on the issue:

          https://github.com/apache/twill/pull/23

          Reference: https://issues.apache.org/jira/browse/TWILL-181

          Hi @serranom , Could you also update the PR title to prefix with `(TWILL-181)` as shown in how to contribute page [1]?
          This will trigger the Git hook to close the JIRA when PR is merged.

          [1] http://twill.apache.org/HowToContribute.html

          Show
          githubbot ASF GitHub Bot added a comment - Github user hsaputra commented on the issue: https://github.com/apache/twill/pull/23 Reference: https://issues.apache.org/jira/browse/TWILL-181 Hi @serranom , Could you also update the PR title to prefix with `( TWILL-181 )` as shown in how to contribute page [1] ? This will trigger the Git hook to close the JIRA when PR is merged. [1] http://twill.apache.org/HowToContribute.html
          Hide
          chtyim Terence Yim added a comment -

          I think you can add it to TwillRuntimeSpecification (you do need to modify the TwillRuntimeSpecificationCodec as well)

          Show
          chtyim Terence Yim added a comment - I think you can add it to TwillRuntimeSpecification (you do need to modify the TwillRuntimeSpecificationCodec as well)
          Hide
          mserrano Martin Serrano added a comment -

          After looking at the save action, I don't see a good file to add the maxRetries to, but it seems overboard to add a file just for it. Any suggestions? I could stuff retries into the environment I suppose. But I don't want this to be too hacky.

          Show
          mserrano Martin Serrano added a comment - After looking at the save action, I don't see a good file to add the maxRetries to, but it seems overboard to add a file just for it. Any suggestions? I could stuff retries into the environment I suppose. But I don't want this to be too hacky.
          Hide
          mserrano Martin Serrano added a comment -

          Initial thoughts on design:

          • default maxRetries is Integer.MAX. we can change this at a later date if desired.
          • add withMaxTries() to TwillPreparer
          • add withMaxTries() to YarnTwillPreparer
          • add saveMaxTries() to YarnTwillPreparer.start()
          • it becomes part of the twillRuntimeSpecification and RuntimeSpecification interface
          • must be added to TwillRuntimeSpecificationCodec, each runnable gets its own max retries?
          • ApplicationMasterService.initRunningContainers is updated to pass a map of runnables to maxretries.
          • update RunningContainers so that it keeps count of the number of retries (per container?) and uses this in handleCompleted() to determine if it should retry. I'm a bit unsure of how to handle this. If my analysis of the code is correct, we do not keep track of failures of individual instances. In other words every instance is the same as any other. So if I'm starting 10 instances of a Runnable, and wanted a max retry count of 3, then I would scale that to 30. In other words, each instance gets 3 tries. But since the instances are interchangeable, there is no concept of a discrete instance being retried. If this is acceptable, then we'd just have a counter per runnable name.
          Show
          mserrano Martin Serrano added a comment - Initial thoughts on design: default maxRetries is Integer.MAX. we can change this at a later date if desired. add withMaxTries() to TwillPreparer add withMaxTries() to YarnTwillPreparer add saveMaxTries() to YarnTwillPreparer.start() it becomes part of the twillRuntimeSpecification and RuntimeSpecification interface must be added to TwillRuntimeSpecificationCodec, each runnable gets its own max retries? ApplicationMasterService.initRunningContainers is updated to pass a map of runnables to maxretries. update RunningContainers so that it keeps count of the number of retries (per container?) and uses this in handleCompleted() to determine if it should retry. I'm a bit unsure of how to handle this. If my analysis of the code is correct, we do not keep track of failures of individual instances. In other words every instance is the same as any other. So if I'm starting 10 instances of a Runnable, and wanted a max retry count of 3, then I would scale that to 30. In other words, each instance gets 3 tries. But since the instances are interchangeable, there is no concept of a discrete instance being retried. If this is acceptable, then we'd just have a counter per runnable name.
          Hide
          mserrano Martin Serrano added a comment -

          Will do.

          Show
          mserrano Martin Serrano added a comment - Will do.
          Hide
          chtyim Terence Yim added a comment -

          Thank Martin Serrano. When you have the design in mind, please share it in this JIRA so that we can discuss about the approach first.

          Show
          chtyim Terence Yim added a comment - Thank Martin Serrano . When you have the design in mind, please share it in this JIRA so that we can discuss about the approach first.
          Hide
          hsaputra Henry Saputra added a comment -

          Assign it to Martin Serrano

          Show
          hsaputra Henry Saputra added a comment - Assign it to Martin Serrano
          Hide
          mserrano Martin Serrano added a comment -

          I'm going to start taking a look at this.

          Show
          mserrano Martin Serrano added a comment - I'm going to start taking a look at this.

            People

            • Assignee:
              mserrano Martin Serrano
              Reporter:
              mserrano Martin Serrano
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development