Apache Twill / TWILL-186

ApplicationMaster keeps restarting with NPE in the log.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0-incubating
    • Fix Version/s: 0.11.0
    • Component/s: core, yarn
    • Labels:
      None

      Description

      It seems that certain combinations of the container sizes launched by the AM cause the AM to keep restarting.

      The following exceptions are seen in the app master container log:

      Aug 12, 2016 4:37:39 PM com.google.common.util.concurrent.AbstractExecutionThreadService$1$1 run
      WARNING: Error while attempting to shut down the service after failure.
      java.lang.NullPointerException
              at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.decResourceRequest(AMRMClientImpl.java:687)
              at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.removeContainerRequest(AMRMClientImpl.java:477)
              at org.apache.twill.internal.yarn.Hadoop21YarnAMClient.removeContainerRequest(Hadoop21YarnAMClient.java:116)
              at org.apache.twill.internal.yarn.Hadoop21YarnAMClient.removeContainerRequest(Hadoop21YarnAMClient.java:45)
              at org.apache.twill.internal.yarn.AbstractYarnAMClient.allocate(AbstractYarnAMClient.java:119)
              at org.apache.twill.internal.appmaster.ApplicationMasterService.doStop(ApplicationMasterService.java:281)
              at org.apache.twill.internal.AbstractTwillService.shutDown(AbstractTwillService.java:186)
              at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:55)
              at java.lang.Thread.run(Thread.java:745)
      
      Exception in thread "ApplicationMasterService" java.lang.NullPointerException
              at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.decResourceRequest(AMRMClientImpl.java:687)
              at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.removeContainerRequest(AMRMClientImpl.java:477)
              at org.apache.twill.internal.yarn.Hadoop21YarnAMClient.removeContainerRequest(Hadoop21YarnAMClient.java:116)
              at org.apache.twill.internal.yarn.Hadoop21YarnAMClient.removeContainerRequest(Hadoop21YarnAMClient.java:45)
              at org.apache.twill.internal.yarn.AbstractYarnAMClient.allocate(AbstractYarnAMClient.java:119)
              at org.apache.twill.internal.appmaster.ApplicationMasterService.doRun(ApplicationMasterService.java:369)
              at org.apache.twill.internal.AbstractTwillService.run(AbstractTwillService.java:179)
              at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52)
              at java.lang.Thread.run(Thread.java:745)
      


          Activity

          chtyim Terence Yim added a comment -

          In short, the root cause is that YARN gives Twill more containers than it asked for, and Twill simply uses them if there are any pending runnable container requests. This results in a mismatched container size, which causes the NPE when Twill tries to remove the container request after launching the runnable.
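The failure mode described above can be illustrated with a minimal sketch. This is a simplified analogy, not YARN's or Twill's actual code: `RequestTableSketch`, its methods, and the memory sizes are all hypothetical names chosen for illustration. It assumes outstanding requests are tracked by container size, so that removing a request for a size that was never requested finds no entry.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a request table keyed by container size. */
public class RequestTableSketch {
  // Memory size in MB -> IDs of the outstanding requests of that size.
  private final Map<Integer, Deque<String>> requestsBySize = new HashMap<>();

  /** Records an outstanding request for a container of the given size. */
  public void addRequest(int memoryMb, String requestId) {
    requestsBySize.computeIfAbsent(memoryMb, k -> new ArrayDeque<>()).add(requestId);
  }

  /** Unguarded removal: throws NullPointerException if no request of this size was recorded. */
  public String removeUnguarded(int memoryMb) {
    return requestsBySize.get(memoryMb).poll();
  }

  /** Guarded removal: returns null when no request of this size is on record. */
  public String removeGuarded(int memoryMb) {
    Deque<String> ids = requestsBySize.get(memoryMb);
    return ids == null ? null : ids.poll();
  }

  public static void main(String[] args) {
    RequestTableSketch table = new RequestTableSketch();
    table.addRequest(512, "runnable-a");

    // Assume a surplus 1024 MB container arrives and is used for "runnable-a".
    // Removing "the request" by the container's size then misses the table.
    try {
      table.removeUnguarded(1024);
    } catch (NullPointerException e) {
      System.out.println("Unguarded removal failed with NPE");
    }
    System.out.println("Guarded removal returned: " + table.removeGuarded(1024));
  }
}
```

The guarded variant only shows one way a caller could detect the mismatch; the approach taken by the actual fix is the one in the pull request referenced in this issue.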

          githubbot ASF GitHub Bot added a comment -

          Github user anwar6953 commented on a diff in the pull request:

          https://github.com/apache/twill/pull/34#discussion_r103595352

          --- Diff: twill-yarn/src/main/java/org/apache/twill/internal/yarn/AbstractYarnAMClient.java ---
          @@ -50,12 +51,11 @@
             private static final Logger LOG = LoggerFactory.getLogger(AbstractYarnAMClient.class);

             // Map from a unique ID to inflight requests
          -  private final Multimap<String, T> containerRequests;
          -
          -  // List of requests pending to send through allocate call
          -  private final List<T> requests;
          +  private final Multimap<String, T> inflightRequests;
          +  // Map from a unique ID to pending requests. It is for recording
          --- End diff --

          It is for recording what?
          (incomplete sentence?)

          githubbot ASF GitHub Bot added a comment -

          Github user chtyim commented on a diff in the pull request:

          https://github.com/apache/twill/pull/34#discussion_r103596214

          --- Diff: twill-yarn/src/main/java/org/apache/twill/internal/yarn/AbstractYarnAMClient.java ---
          @@ -50,12 +51,11 @@
             private static final Logger LOG = LoggerFactory.getLogger(AbstractYarnAMClient.class);

             // Map from a unique ID to inflight requests
          -  private final Multimap<String, T> containerRequests;
          -
          -  // List of requests pending to send through allocate call
          -  private final List<T> requests;
          +  private final Multimap<String, T> inflightRequests;
          +  // Map from a unique ID to pending requests. It is for recording
          --- End diff --

          Oh. It is for recording the container requests that have yet to be sent to the RM. Will update the comment.
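The split between pending and in-flight requests discussed in this review can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code from the pull request: it uses plain `java.util` maps instead of Guava's `Multimap`, represents requests as strings, and the class name, the `pendingRequests` field name, and all method names are hypothetical. Only `inflightRequests` appears in the quoted diff.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of tracking requests not yet sent separately from those already sent. */
public class RequestBookkeepingSketch {
  // Unique ID -> requests recorded locally but not yet sent to the RM (assumed field name).
  private final Map<String, List<String>> pendingRequests = new HashMap<>();
  // Unique ID -> requests already sent to the RM and awaiting containers.
  private final Map<String, List<String>> inflightRequests = new HashMap<>();

  /** Records a request under the given ID; it stays pending until the next send. */
  public void addRequest(String id, String request) {
    pendingRequests.computeIfAbsent(id, k -> new ArrayList<>()).add(request);
  }

  /** Stands in for an allocate call: moves every pending request to in-flight. */
  public int sendPending() {
    int sent = 0;
    for (Map.Entry<String, List<String>> entry : pendingRequests.entrySet()) {
      inflightRequests.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
          .addAll(entry.getValue());
      sent += entry.getValue().size();
    }
    pendingRequests.clear();
    return sent;
  }

  /** Marks one in-flight request under the ID as satisfied; false if none is in flight. */
  public boolean complete(String id) {
    List<String> requests = inflightRequests.get(id);
    if (requests == null || requests.isEmpty()) {
      return false;
    }
    requests.remove(requests.size() - 1);
    if (requests.isEmpty()) {
      inflightRequests.remove(id);
    }
    return true;
  }

  public int pendingCount(String id) {
    List<String> requests = pendingRequests.get(id);
    return requests == null ? 0 : requests.size();
  }

  public int inflightCount(String id) {
    List<String> requests = inflightRequests.get(id);
    return requests == null ? 0 : requests.size();
  }
}
```

Keeping both collections keyed by the same ID means a request can be looked up in either state by its ID; whether the actual change relies on this is not stated in the thread.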

          githubbot ASF GitHub Bot added a comment -

          Github user anwar6953 commented on the issue:

          https://github.com/apache/twill/pull/34

          LGTM

          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on the issue:

          https://github.com/apache/twill/pull/34

          Hi @chtyim looks like there is only 1 commit for this PR?

          githubbot ASF GitHub Bot added a comment -

          Github user chtyim commented on the issue:

          https://github.com/apache/twill/pull/34

          I squashed them after getting LGTM to prepare for the merge

          githubbot ASF GitHub Bot added a comment -

          Github user chtyim commented on the issue:

          https://github.com/apache/twill/pull/34

          Seems like the github sync is lagging. I merged this change about 6 hours ago

          githubbot ASF GitHub Bot added a comment -

          Github user hsaputra commented on the issue:

          https://github.com/apache/twill/pull/34

          Ah ok

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/twill/pull/34


            People

            • Assignee: chtyim Terence Yim
            • Reporter: sagark Sagar Kapare
            • Votes: 1
            • Watchers: 4
