Distributed shell AM fails when extra container arrives during finishing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Because of YARN-1902, an extra container can be allocated to the AM, which causes the AM to fail.

      Logs look like:

      16/05/17 07:58:39 INFO distributedshell.ApplicationMaster: Launching shell command on a new container., containerId=container_e44_1463470957478_0018_01_000007, containerNode=host1:25454, containerNodeURI=host1:8042, containerResourceMemory3072, containerResourceVirtualCores1
      16/05/17 07:58:39 INFO distributedshell.ApplicationMaster: Setting up container launch container for containerid=container_e44_1463470957478_0018_01_000007
      16/05/17 07:58:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_e44_1463470957478_0018_01_000007
      16/05/17 07:58:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : host1:25454
      .......
      16/05/17 07:58:39 INFO distributedshell.ApplicationMaster: Application completed. Stopping running containers
      16/05/17 07:58:39 INFO impl.NMClientAsyncImpl: NM Client is being stopped.
      16/05/17 07:58:39 INFO impl.NMClientAsyncImpl: Waiting for eventDispatcherThread to be interrupted.
      16/05/17 07:58:39 INFO impl.NMClientAsyncImpl: eventDispatcherThread exited.
      16/05/17 07:58:39 ERROR distributedshell.ApplicationMaster: Failed to start Container container_e44_1463470957478_0018_01_000007
      16/05/17 07:58:39 INFO impl.NMClientAsyncImpl: Stopping NM client.
      ........
      16/05/17 07:58:39 INFO distributedshell.ApplicationMaster: Diagnostics., total=5, completed=6, allocated=6, failed=1
      16/05/17 07:58:39 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
      16/05/17 07:58:40 INFO distributedshell.ApplicationMaster: Application Master failed. exiting
      16/05/17 07:58:40 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
      java.lang.InterruptedException
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
              at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
              at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:287)
      End of LogType:AppMaster.stde
      

        Activity

        hitesh Hitesh Shah added a comment - - edited

        The error in the description is not really an error. The thread was interrupted and does not match the title related to an NPE.

        leftnoteasy Wangda Tan added a comment -

        Hitesh Shah, yes, you're correct: the InterruptedException does not cause the AM failure. I am updating the title and description.

        The root cause is YARN-1902: the YARN scheduler can allocate more containers to the AM than it requested. If an extra container arrives while the AM is finishing, the container launch fails because the NMClient thread has been interrupted, and the failed launch causes the following check to fail:

            if (numFailedContainers.get() == 0 &&
                numCompletedContainers.get() == numTotalContainers) {
                // SUCCESSFUL
            }
        

        Instead, we should deduct the failed containers from the completed containers. Uploading a patch.
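A minimal sketch of the proposed change, assuming the fix simply subtracts failed containers from completed ones before comparing against the total. The class and method names here are hypothetical, and the `>=` comparison is an assumption; the committed patch may differ in detail. The sample values come from the Diagnostics log line in the description (total=5, completed=6, failed=1).

```java
// Illustrative only: not the actual ApplicationMaster code.
class FinishCheckSketch {

    // The check quoted above: any failed container marks the run as failed,
    // even when the failure came from an extra, unrequested container.
    static boolean succeededBefore(int total, int completed, int failed) {
        return failed == 0 && completed == total;
    }

    // Assumed shape of the fix: deduct failed containers from completed ones,
    // so the run succeeds once enough containers finished without failing.
    static boolean succeededAfter(int total, int completed, int failed) {
        return completed - failed >= total;
    }

    public static void main(String[] args) {
        // Values from the log: total=5, completed=6, allocated=6, failed=1
        System.out.println(succeededBefore(5, 6, 1)); // false
        System.out.println(succeededAfter(5, 6, 1));  // true
    }
}
```

With this form, a run in which a requested container really failed (for example total=5, completed=5, failed=1) is still reported as unsuccessful.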

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem   Runtime  Comment
        0     reexec      0m 11s   Docker mode activated.
        +1    @author     0m 0s    The patch does not contain any @author tags.
        -1    test4tests  0m 0s    The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1    mvninstall  8m 37s   trunk passed
        +1    compile     0m 15s   trunk passed
        +1    checkstyle  0m 13s   trunk passed
        +1    mvnsite     0m 18s   trunk passed
        +1    mvneclipse  0m 12s   trunk passed
        +1    findbugs    0m 30s   trunk passed
        +1    javadoc     0m 13s   trunk passed
        +1    mvninstall  0m 16s   the patch passed
        +1    compile     0m 13s   the patch passed
        +1    javac       0m 13s   the patch passed
        +1    checkstyle  0m 9s    the patch passed
        +1    mvnsite     0m 17s   the patch passed
        +1    mvneclipse  0m 10s   the patch passed
        +1    whitespace  0m 0s    Patch has no whitespace issues.
        +1    findbugs    0m 40s   the patch passed
        +1    javadoc     0m 9s    the patch passed
        +1    unit        7m 36s   hadoop-yarn-applications-distributedshell in the patch passed.
        +1    asflicense  0m 21s   Patch does not generate ASF License warnings.
                          20m 58s



        Subsystem       Report/Notes
        Docker Image    yetus/hadoop:2c91fd8
        JIRA Patch URL  https://issues.apache.org/jira/secure/attachment/12805765/YARN-5131.1.patch
        JIRA Issue      YARN-5131
        Optional Tests  asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname           Linux c102a2d9de82 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool      maven
        Personality     /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision    trunk / 4b0f55b
        Default Java    1.8.0_91
        findbugs        v3.0.0
        Test Results    https://builds.apache.org/job/PreCommit-YARN-Build/11640/testReport/
        modules         C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
        Console output  https://builds.apache.org/job/PreCommit-YARN-Build/11640/console
        Powered by      Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        leftnoteasy Wangda Tan added a comment -

        Varun Vasudev, could you help review this patch?

        Thanks,

        djp Junping Du added a comment -

        +1. Patch LGTM. I will commit it shortly if there are no further comments from others.

        djp Junping Du added a comment -

        I have committed the patch to trunk and branch-2. Thanks, Wangda Tan, for the patch contribution!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #9857 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9857/)
        YARN-5131. Distributed shell AM fails when extra container arrives (junping_du: rev 48c931331cc43970e31866732f9ac82ee806ee03)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
        leftnoteasy Wangda Tan added a comment -

        Thanks, Junping Du, for the review and commit!


          People

          • Assignee:
            leftnoteasy Wangda Tan
            Reporter:
            ssathish@hortonworks.com Sumana Sathish
          • Votes:
            0
            Watchers:
            5