Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-6771

RMContainerAllocator sends container diagnostics event after corresponding completion event

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.3
    • Fix Version/s: 2.8.0, 2.7.4, 3.0.0-alpha2
    • Component/s: mrv2
    • Labels:
      None
    • Target Version/s:

      Description

      Task containers can go over their resource limit, and killed by Node Manager. Then MR AM gets notified of the container status and diagnostics information through its heartbeat with RM. However, it is possible that the diagnostics information never gets into .jhist file, so when the job completes, the diagnostics information associated with the failed task attempts is empty. This makes it hard for users to root cause job failures that are often caused by memory leak.

      1. mapreduce6771.branch-2.8.patch
        8 kB
        Haibo Chen
      2. mapreduce6771.004.patch
        8 kB
        Haibo Chen
      3. mapreduce6771.003.patch
        8 kB
        Haibo Chen
      4. mapreduce6771.002.patch
        8 kB
        Haibo Chen
      5. TaUnsuccessfullyEventEmission.jpg
        1.69 MB
        Haibo Chen
      6. mapreduce6771.001.patch
        1 kB
        Haibo Chen

        Issue Links

          Activity

          Hide
          haibochen Haibo Chen added a comment -

          Thanks for your reviews Jason!

          Show
          haibochen Haibo Chen added a comment - Thanks for your reviews Jason!
          Hide
          jlowe Jason Lowe added a comment -

          Thanks, Haibo Chen! I committed this to trunk, branch-2, branch-2.8, and branch-2.7.

          Show
          jlowe Jason Lowe added a comment - Thanks, Haibo Chen ! I committed this to trunk, branch-2, branch-2.8, and branch-2.7.
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10513 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10513/)
          MAPREDUCE-6771. RMContainerAllocator sends container diagnostics event (jlowe: rev a1b8251bf7a7e9b776c4483fa01f7d453420eba4)

          • (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
          • (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/rm/TestRMContainerAllocator.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10513 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10513/ ) MAPREDUCE-6771 . RMContainerAllocator sends container diagnostics event (jlowe: rev a1b8251bf7a7e9b776c4483fa01f7d453420eba4) (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/rm/TestRMContainerAllocator.java
          Hide
          jlowe Jason Lowe added a comment -

          +1 for the 2.8 patch as well. Committing this.

          Show
          jlowe Jason Lowe added a comment - +1 for the 2.8 patch as well. Committing this.
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 22m 58s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 12m 9s branch-2.8 passed
          +1 compile 0m 29s branch-2.8 passed with JDK v1.8.0_101
          +1 compile 0m 29s branch-2.8 passed with JDK v1.7.0_111
          +1 checkstyle 0m 21s branch-2.8 passed
          +1 mvnsite 0m 45s branch-2.8 passed
          +1 mvneclipse 0m 21s branch-2.8 passed
          +1 findbugs 0m 57s branch-2.8 passed
          +1 javadoc 0m 17s branch-2.8 passed with JDK v1.8.0_101
          +1 javadoc 0m 20s branch-2.8 passed with JDK v1.7.0_111
          +1 mvninstall 0m 28s the patch passed
          +1 compile 0m 23s the patch passed with JDK v1.8.0_101
          +1 javac 0m 23s the patch passed
          +1 compile 0m 25s the patch passed with JDK v1.7.0_111
          +1 javac 0m 25s the patch passed
          +1 checkstyle 0m 16s the patch passed
          +1 mvnsite 0m 32s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 6s the patch passed
          +1 javadoc 0m 14s the patch passed with JDK v1.8.0_101
          +1 javadoc 0m 18s the patch passed with JDK v1.7.0_111
          +1 unit 10m 24s hadoop-mapreduce-client-app in the patch passed with JDK v1.8.0_101.
          +1 unit 10m 12s hadoop-mapreduce-client-app in the patch passed with JDK v1.7.0_111.
          +1 asflicense 0m 19s The patch does not generate ASF License warnings.
          65m 17s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:5af2af1
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12830731/mapreduce6771.branch-2.8.patch
          JIRA Issue MAPREDUCE-6771
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux f6c96e1f11af 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.8 / 7091753
          Default Java 1.7.0_111
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
          findbugs v3.0.0
          JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6746/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6746/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 22m 58s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 12m 9s branch-2.8 passed +1 compile 0m 29s branch-2.8 passed with JDK v1.8.0_101 +1 compile 0m 29s branch-2.8 passed with JDK v1.7.0_111 +1 checkstyle 0m 21s branch-2.8 passed +1 mvnsite 0m 45s branch-2.8 passed +1 mvneclipse 0m 21s branch-2.8 passed +1 findbugs 0m 57s branch-2.8 passed +1 javadoc 0m 17s branch-2.8 passed with JDK v1.8.0_101 +1 javadoc 0m 20s branch-2.8 passed with JDK v1.7.0_111 +1 mvninstall 0m 28s the patch passed +1 compile 0m 23s the patch passed with JDK v1.8.0_101 +1 javac 0m 23s the patch passed +1 compile 0m 25s the patch passed with JDK v1.7.0_111 +1 javac 0m 25s the patch passed +1 checkstyle 0m 16s the patch passed +1 mvnsite 0m 32s the patch passed +1 mvneclipse 0m 15s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 6s the patch passed +1 javadoc 0m 14s the patch passed with JDK v1.8.0_101 +1 javadoc 0m 18s the patch passed with JDK v1.7.0_111 +1 unit 10m 24s hadoop-mapreduce-client-app in the patch passed with JDK v1.8.0_101. +1 unit 10m 12s hadoop-mapreduce-client-app in the patch passed with JDK v1.7.0_111. +1 asflicense 0m 19s The patch does not generate ASF License warnings. 65m 17s Subsystem Report/Notes Docker Image:yetus/hadoop:5af2af1 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12830731/mapreduce6771.branch-2.8.patch JIRA Issue MAPREDUCE-6771 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux f6c96e1f11af 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision branch-2.8 / 7091753 Default Java 1.7.0_111 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111 findbugs v3.0.0 JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6746/testReport/ modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6746/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          haibochen Haibo Chen added a comment -

          I'll upload one for 2.8 shortly

          Show
          haibochen Haibo Chen added a comment - I'll upload one for 2.8 shortly
          Hide
          jlowe Jason Lowe added a comment -

          +1 for the latest patch. This seems like an important fix to get into 2.8 as well. Could you provide a patch for branch-2.8?

          Show
          jlowe Jason Lowe added a comment - +1 for the latest patch. This seems like an important fix to get into 2.8 as well. Could you provide a patch for branch-2.8?
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 42s trunk passed
          +1 compile 0m 28s trunk passed
          +1 checkstyle 0m 21s trunk passed
          +1 mvnsite 0m 34s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 0m 40s trunk passed
          +1 javadoc 0m 17s trunk passed
          +1 mvninstall 0m 26s the patch passed
          +1 compile 0m 23s the patch passed
          +1 javac 0m 23s the patch passed
          +1 checkstyle 0m 16s the patch passed
          +1 mvnsite 0m 30s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 0m 48s the patch passed
          +1 javadoc 0m 14s the patch passed
          +1 unit 9m 18s hadoop-mapreduce-client-app in the patch passed.
          +1 asflicense 0m 22s The patch does not generate ASF License warnings.
          24m 44s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12830355/mapreduce6771.004.patch
          JIRA Issue MAPREDUCE-6771
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 164c4a2c2ded 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8ae4729
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6743/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6743/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 16s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 8m 42s trunk passed +1 compile 0m 28s trunk passed +1 checkstyle 0m 21s trunk passed +1 mvnsite 0m 34s trunk passed +1 mvneclipse 0m 17s trunk passed +1 findbugs 0m 40s trunk passed +1 javadoc 0m 17s trunk passed +1 mvninstall 0m 26s the patch passed +1 compile 0m 23s the patch passed +1 javac 0m 23s the patch passed +1 checkstyle 0m 16s the patch passed +1 mvnsite 0m 30s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 48s the patch passed +1 javadoc 0m 14s the patch passed +1 unit 9m 18s hadoop-mapreduce-client-app in the patch passed. +1 asflicense 0m 22s The patch does not generate ASF License warnings. 24m 44s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12830355/mapreduce6771.004.patch JIRA Issue MAPREDUCE-6771 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 164c4a2c2ded 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 8ae4729 Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6743/testReport/ modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6743/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          jlowe Jason Lowe added a comment -

          Odd, the precommit came back for patch v3 instead of v4. Kicking it again.

          Show
          jlowe Jason Lowe added a comment - Odd, the precommit came back for patch v3 instead of v4. Kicking it again.
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 57s trunk passed
          +1 compile 0m 27s trunk passed
          +1 checkstyle 0m 21s trunk passed
          +1 mvnsite 0m 33s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 0m 43s trunk passed
          +1 javadoc 0m 18s trunk passed
          +1 mvninstall 0m 27s the patch passed
          +1 compile 0m 24s the patch passed
          +1 javac 0m 24s the patch passed
          +1 checkstyle 0m 18s the patch passed
          +1 mvnsite 0m 30s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 0m 47s the patch passed
          +1 javadoc 0m 14s the patch passed
          +1 unit 9m 11s hadoop-mapreduce-client-app in the patch passed.
          +1 asflicense 0m 22s The patch does not generate ASF License warnings.
          24m 57s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828914/mapreduce6771.003.patch
          JIRA Issue MAPREDUCE-6771
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 4bf69bcf8535 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 4815d02
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6739/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6739/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 16s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 8m 57s trunk passed +1 compile 0m 27s trunk passed +1 checkstyle 0m 21s trunk passed +1 mvnsite 0m 33s trunk passed +1 mvneclipse 0m 16s trunk passed +1 findbugs 0m 43s trunk passed +1 javadoc 0m 18s trunk passed +1 mvninstall 0m 27s the patch passed +1 compile 0m 24s the patch passed +1 javac 0m 24s the patch passed +1 checkstyle 0m 18s the patch passed +1 mvnsite 0m 30s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 47s the patch passed +1 javadoc 0m 14s the patch passed +1 unit 9m 11s hadoop-mapreduce-client-app in the patch passed. +1 asflicense 0m 22s The patch does not generate ASF License warnings. 24m 57s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828914/mapreduce6771.003.patch JIRA Issue MAPREDUCE-6771 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 4bf69bcf8535 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 4815d02 Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6739/testReport/ modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6739/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          haibochen Haibo Chen added a comment -

          Thanks Jason! That works perfectly. Uploading a new patch.

          Show
          haibochen Haibo Chen added a comment - Thanks Jason! That works perfectly. Uploading a new patch.
          Hide
          jlowe Jason Lowe added a comment -

          Use Mockito.isA instead of Mockito.any. The latter is throwing away the type information.

          Show
          jlowe Jason Lowe added a comment - Use Mockito.isA instead of Mockito.any. The latter is throwing away the type information.
          Hide
          haibochen Haibo Chen added a comment -

          Tried the following code to verify with InOrder

              InOrder inOrder = inOrder(eventHandler);
              inOrder.verify(eventHandler).handle(
                  any(TaskAttemptDiagnosticsUpdateEvent.class));
              inOrder.verify(eventHandler).handle(any(TaskAttemptEvent.class));
              inOrder.verifyNoMoreInteractions();
          

          Getting the following error:

          org.mockito.exceptions.verification.VerificationInOrderFailure: 
          Verification in order failure:
          eventHandler.handle(<any>);
          Wanted 1 time:
          -> at org.apache.hadoop.mapreduce.v2.app.rm.TestRMContainerAllocator.testHandlingFinishedContainers(TestRMContainerAllocator.java:3049)
          But was 2 times. Undesired invocation:
          -> at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.processFinishedContainer(RMContainerAllocator.java:864)
          
          	at org.apache.hadoop.mapreduce.v2.app.rm.TestRMContainerAllocator.testHandlingFinishedContainers(TestRMContainerAllocator.java:3049)
          
          

          I seems like mockito does not distinguish the two types of events passed to event handler.

          Show
          haibochen Haibo Chen added a comment - Tried the following code to verify with InOrder InOrder inOrder = inOrder(eventHandler); inOrder.verify(eventHandler).handle( any(TaskAttemptDiagnosticsUpdateEvent.class)); inOrder.verify(eventHandler).handle(any(TaskAttemptEvent.class)); inOrder.verifyNoMoreInteractions(); Getting the following error: org.mockito.exceptions.verification.VerificationInOrderFailure: Verification in order failure: eventHandler.handle(<any>); Wanted 1 time: -> at org.apache.hadoop.mapreduce.v2.app.rm.TestRMContainerAllocator.testHandlingFinishedContainers(TestRMContainerAllocator.java:3049) But was 2 times. Undesired invocation: -> at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.processFinishedContainer(RMContainerAllocator.java:864) at org.apache.hadoop.mapreduce.v2.app.rm.TestRMContainerAllocator.testHandlingFinishedContainers(TestRMContainerAllocator.java:3049) I seems like mockito does not distinguish the two types of events passed to event handler.
          Hide
          haibochen Haibo Chen added a comment -

          Thank you for your reivews, Jason!

          I did think about have a black-box test for this, but it seems too much efforts to generate finished containers from RM side. (Scheduler cannot inject finished containers into the AllocateReponse directly. Need to generate events in RM to trigger the RMAppAttemp to update its finished containers, which will then be returned to RMContainerAllocator in the heartbeat) But please let me know if there is an easy way to do so that I am not aware of.

          Will address the rest of your comments in the new patch.

          Show
          haibochen Haibo Chen added a comment - Thank you for your reivews, Jason! I did think about have a black-box test for this, but it seems too much efforts to generate finished containers from RM side. (Scheduler cannot inject finished containers into the AllocateReponse directly. Need to generate events in RM to trigger the RMAppAttemp to update its finished containers, which will then be returned to RMContainerAllocator in the heartbeat) But please let me know if there is an easy way to do so that I am not aware of. Will address the rest of your comments in the new patch.
          Hide
          jlowe Jason Lowe added a comment -

          Thanks for updating the patch!

          Rather than having the test reach into what is essentially a private function, processFinishedContainer, I think we can test this as a black-box. By providing a mocked event handler in the AppContext we can see what events are sent to it for order verification, and we can use Mockito InOrder verification to verify the order of events rather than having to manually capture them. Then we can write a test like the others using MockScheduler to test this without reaching into the internals of RMContainerAllocator.

          If you decide that's not worth it, processFinishedContainer should not be public but rather package-private which removes the need to mark it with the Private annotation.

          testHandleingFinishedContainers s/b testHandlingFinishedContainers

          Show
          jlowe Jason Lowe added a comment - Thanks for updating the patch! Rather than having the test reach into what is essentially a private function, processFinishedContainer, I think we can test this as a black-box. By providing a mocked event handler in the AppContext we can see what events are sent to it for order verification, and we can use Mockito InOrder verification to verify the order of events rather than having to manually capture them. Then we can write a test like the others using MockScheduler to test this without reaching into the internals of RMContainerAllocator. If you decide that's not worth it, processFinishedContainer should not be public but rather package-private which removes the need to mark it with the Private annotation. testHandleingFinishedContainers s/b testHandlingFinishedContainers
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 12s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 34s trunk passed
          +1 compile 0m 29s trunk passed
          +1 checkstyle 0m 22s trunk passed
          +1 mvnsite 0m 36s trunk passed
          +1 mvneclipse 0m 18s trunk passed
          +1 findbugs 0m 54s trunk passed
          +1 javadoc 0m 19s trunk passed
          +1 mvninstall 0m 25s the patch passed
          +1 compile 0m 25s the patch passed
          +1 javac 0m 25s the patch passed
          +1 checkstyle 0m 16s the patch passed
          +1 mvnsite 0m 30s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 0m 50s the patch passed
          +1 javadoc 0m 14s the patch passed
          +1 unit 9m 24s hadoop-mapreduce-client-app in the patch passed.
          +1 asflicense 0m 21s The patch does not generate ASF License warnings.
          25m 6s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828914/mapreduce6771.003.patch
          JIRA Issue MAPREDUCE-6771
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 20789ead66d6 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / f6f3a44
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6725/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6725/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 12s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 8m 34s trunk passed +1 compile 0m 29s trunk passed +1 checkstyle 0m 22s trunk passed +1 mvnsite 0m 36s trunk passed +1 mvneclipse 0m 18s trunk passed +1 findbugs 0m 54s trunk passed +1 javadoc 0m 19s trunk passed +1 mvninstall 0m 25s the patch passed +1 compile 0m 25s the patch passed +1 javac 0m 25s the patch passed +1 checkstyle 0m 16s the patch passed +1 mvnsite 0m 30s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 50s the patch passed +1 javadoc 0m 14s the patch passed +1 unit 9m 24s hadoop-mapreduce-client-app in the patch passed. +1 asflicense 0m 21s The patch does not generate ASF License warnings. 25m 6s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828914/mapreduce6771.003.patch JIRA Issue MAPREDUCE-6771 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 20789ead66d6 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / f6f3a44 Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6725/testReport/ modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6725/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          haibochen Haibo Chen added a comment -

          Uploaded a new one to address check style issues.

          Show
          haibochen Haibo Chen added a comment - Uploaded a new one to address check style issues.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 58s trunk passed
          +1 compile 0m 23s trunk passed
          +1 checkstyle 0m 19s trunk passed
          +1 mvnsite 0m 29s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 0m 35s trunk passed
          +1 javadoc 0m 15s trunk passed
          +1 mvninstall 0m 21s the patch passed
          +1 compile 0m 19s the patch passed
          -1 javac 0m 19s hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10)
          -1 checkstyle 0m 15s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app: The patch generated 2 new + 226 unchanged - 0 fixed = 228 total (was 226)
          +1 mvnsite 0m 26s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 0m 39s the patch passed
          +1 javadoc 0m 13s the patch passed
          +1 unit 8m 49s hadoop-mapreduce-client-app in the patch passed.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          21m 40s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828906/mapreduce6771.002.patch
          JIRA Issue MAPREDUCE-6771
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux b492be6fb499 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 0e68e14
          Default Java 1.8.0_101
          findbugs v3.0.0
          javac https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6723/artifact/patchprocess/diff-compile-javac-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6723/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6723/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6723/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 14s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 58s trunk passed +1 compile 0m 23s trunk passed +1 checkstyle 0m 19s trunk passed +1 mvnsite 0m 29s trunk passed +1 mvneclipse 0m 16s trunk passed +1 findbugs 0m 35s trunk passed +1 javadoc 0m 15s trunk passed +1 mvninstall 0m 21s the patch passed +1 compile 0m 19s the patch passed -1 javac 0m 19s hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10) -1 checkstyle 0m 15s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app: The patch generated 2 new + 226 unchanged - 0 fixed = 228 total (was 226) +1 mvnsite 0m 26s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 39s the patch passed +1 javadoc 0m 13s the patch passed +1 unit 8m 49s hadoop-mapreduce-client-app in the patch passed. +1 asflicense 0m 16s The patch does not generate ASF License warnings. 21m 40s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828906/mapreduce6771.002.patch JIRA Issue MAPREDUCE-6771 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux b492be6fb499 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 0e68e14 Default Java 1.8.0_101 findbugs v3.0.0 javac https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6723/artifact/patchprocess/diff-compile-javac-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6723/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6723/testReport/ modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6723/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          haibochen Haibo Chen added a comment -

          Sorry for the long delay. Uploading a new patch that adds a unit test.

          Show
          haibochen Haibo Chen added a comment - Sorry for the long delay. Uploading a new patch that adds a unit test.
          Hide
          jlowe Jason Lowe added a comment -

          Yep, let's get an incremental improvement while the complete solution is being thought out. I only brought up MAPREDUCE-4955 because the original summary implied this would fix all cases of the container diagnostic being dropped. Updating the headline to reflect the reduction in scope.

          Show
          jlowe Jason Lowe added a comment - Yep, let's get an incremental improvement while the complete solution is being thought out. I only brought up MAPREDUCE-4955 because the original summary implied this would fix all cases of the container diagnostic being dropped. Updating the headline to reflect the reduction in scope.
          Hide
          haibochen Haibo Chen added a comment -

          Note that we aren't stuck with TaskAttemptUnsuccessfulCompletion event for doing diagnostics.

          Agree. I am guessing the reason why diagnostics is included in TaskAttemptUnsuccessfulCompletionEvent is users only want to see diagnostics when task attempts fail. Parsing a new event and ignoring such events for successful task attempts does need additional change.

          but waiting for a container completion event is not something the state machine does today.

          There is no need to wait for container completion event. My proposal is to wait for transition into FAILED state. As long as the task attempt goes into FAILED state, which does not necessarily need to be triggered by a container completion event (Time out (TA_TIMED_OUT) is already built-in in transitions from FAIL_FINISHING_CONTAINER to FAILED), the diagnostics will be written into jhist file. But your point of having a wide window is susceptible to AM crash is still very convincing.

          Given that there is no clear preferable approach to address the case in MAPREDUCE-4955, do you think I can go ahead address the issue in this jira? The symptom of MAPREDUCE-4955 and this one is the same, but the cause is not quite exactly. The case in MAPREDUCE-4955 happens when AM thinks the task attempt is already dead, or the diagnostics comes after a taskUnsuccessfulCompletionEvent is generated already, whereas the case in this jira happens when the diagnostics comes in while task attempt is still in running state, or before a taskUnsuccessfulCompletionEvent. The case in this jira is easy to fix, and we can keep MAPREDUCE-4955 to address the other when we decide what to do.

          Show
          haibochen Haibo Chen added a comment - Note that we aren't stuck with TaskAttemptUnsuccessfulCompletion event for doing diagnostics. Agree. I am guessing the reason why diagnostics is included in TaskAttemptUnsuccessfulCompletionEvent is users only want to see diagnostics when task attempts fail. Parsing a new event and ignoring such events for successful task attempts does need additional change. but waiting for a container completion event is not something the state machine does today. There is no need to wait for container completion event. My proposal is to wait for transition into FAILED state. As long as the task attempt goes into FAILED state, which does not necessarily need to be triggered by a container completion event (Time out (TA_TIMED_OUT) is already built-in in transitions from FAIL_FINISHING_CONTAINER to FAILED), the diagnostics will be written into jhist file. But your point of having a wide window is susceptible to AM crash is still very convincing. Given that there is no clear preferable approach to address the case in MAPREDUCE-4955 , do you think I can go ahead address the issue in this jira? The symptom of MAPREDUCE-4955 and this one is the same, but the cause is not quite exactly. The case in MAPREDUCE-4955 happens when AM thinks the task attempt is already dead, or the diagnostics comes after a taskUnsuccessfulCompletionEvent is generated already, whereas the case in this jira happens when the diagnostics comes in while task attempt is still in running state, or before a taskUnsuccessfulCompletionEvent. The case in this jira is easy to fix, and we can keep MAPREDUCE-4955 to address the other when we decide what to do.
          Hide
          jlowe Jason Lowe added a comment -

          so my understanding of this is there should be ideally one such event in the jhist file

          Yes, ideally we should avoid emitting more than one TaskAttemptUnsuccessfulCompletion event. There are other tools besides the JHS that look at these jhist files, and I don't know how well they will handle more than one of these for the same attempt.

          Note that we aren't stuck with TaskAttemptUnsuccessfulCompletion event for doing diagnostics. We could use some new diagnostic event just for this purpose, but that too could cause troubles for jhist parsers that don't skip unknown records.

          As for the postponing we probably can move it farther down the state machine, but waiting for a container completion event is not something the state machine does today. For example, the FAIL_FINISHING_CONTAINER state is just waiting for the AM to send a kill container request to the NM and not actually waiting for the container completion event. It would need to do so. Another issue with postponing is the dependency on the container completion event. There have been issues in the past where the MR AM "missed" a container completion event and caused a scheduling hang. We'd need some kind of safety valve to prevent the AM from waiting forever for a completion event that would never arrive. Another issue with waiting is that if the AM crashes after a task reported failure but before the container completion event arrived then that won't be noticed by the subsequent AM attempt. (Yes, this race occurs today but the window would be significantly wider.) Those kinds of issues makes the "lets record some additional diagnostics after the fact" approach more appealing, since we do exactly what we do today with an addendum if a container completion event has more info after an attempt completion has already been recorded in the jhist file.

          Both approaches have pros and cons, and I'm not sure which I prefer yet.

          Show
          jlowe Jason Lowe added a comment - so my understanding of this is there should be ideally one such event in the jhist file Yes, ideally we should avoid emitting more than one TaskAttemptUnsuccessfulCompletion event. There are other tools besides the JHS that look at these jhist files, and I don't know how well they will handle more than one of these for the same attempt. Note that we aren't stuck with TaskAttemptUnsuccessfulCompletion event for doing diagnostics. We could use some new diagnostic event just for this purpose, but that too could cause troubles for jhist parsers that don't skip unknown records. As for the postponing we probably can move it farther down the state machine, but waiting for a container completion event is not something the state machine does today. For example, the FAIL_FINISHING_CONTAINER state is just waiting for the AM to send a kill container request to the NM and not actually waiting for the container completion event. It would need to do so. Another issue with postponing is the dependency on the container completion event. There have been issues in the past where the MR AM "missed" a container completion event and caused a scheduling hang. We'd need some kind of safety valve to prevent the AM from waiting forever for a completion event that would never arrive. Another issue with waiting is that if the AM crashes after a task reported failure but before the container completion event arrived then that won't be noticed by the subsequent AM attempt. (Yes, this race occurs today but the window would be significantly wider.) Those kinds of issues makes the "lets record some additional diagnostics after the fact" approach more appealing, since we do exactly what we do today with an addendum if a container completion event has more info after an attempt completion has already been recorded in the jhist file. Both approaches have pros and cons, and I'm not sure which I prefer yet.
          Hide
          haibochen Haibo Chen added a comment -

          Thanks Jason Lowe for bringing up the case in MAPREDUCE-4955. Looking at that jira, it is indeed another case where diagnostics could be lost.

          The AM would either need to postpone recording the attempt completion event until it receives the container completion event to see if there are any diagnostics or there needs to be a way to record postmortem diagnostics for attempts in the jhist file.

          The diagnostics are included as part of a TaskAttemptUnsuccessfullyCompletionEvent, so my understanding of this is there should be ideally one such event in the jhist file (If we emit multiple instances, JobHistoryParser will always take the last instance seen in .jhist file). Please correct me if I am wrong. Therefore, I am thinking of postponing recording the unsuccessfully completion event.

          postpone recording the attempt completion event until it receives the container completion event to see if there are any diagnostics

          TaskAttemptUnsuccessfullyCompletionEvent is generated upon receipt of TA_KILL, TA_TooManyFetchFailures and TA_FailMsg Postponing the event emission until a container completion event is received makes the handling of TA_FAILMSG semantically inconsistent with that of other cases. I wonder if it is semantically cleaner to postpone the completion event until the transition into the final states (FAILED, KILLED). The emission of TaskAttemptUnsuccessfullyCompletionEvents happens currently before transition into FAIL_FINISHING_CONTAINER, FAILED or KILLED state, but given that FAIL_FINISHING_CONTAINER will eventually transition into FAILED state, we could reduce the three cases into two (See the attachment show transitions during which an TaskAttemptUnsuccessfullyCompletionEvent is generated). That is, right before a task attempt goes into KILLED or FAILED, a TaskAttemptUnsuccessfullyCompletionEvents is written into the .jhist file.

          Show
          haibochen Haibo Chen added a comment - Thanks Jason Lowe for bringing up the case in MAPREDUCE-4955 . Looking at that jira, it is indeed another case where diagnostics could be lost. The AM would either need to postpone recording the attempt completion event until it receives the container completion event to see if there are any diagnostics or there needs to be a way to record postmortem diagnostics for attempts in the jhist file. The diagnostics are included as part of a TaskAttemptUnsuccessfullyCompletionEvent, so my understanding of this is there should be ideally one such event in the jhist file (If we emit multiple instances, JobHistoryParser will always take the last instance seen in .jhist file). Please correct me if I am wrong. Therefore, I am thinking of postponing recording the unsuccessfully completion event. postpone recording the attempt completion event until it receives the container completion event to see if there are any diagnostics TaskAttemptUnsuccessfullyCompletionEvent is generated upon receipt of TA_KILL, TA_TooManyFetchFailures and TA_FailMsg Postponing the event emission until a container completion event is received makes the handling of TA_FAILMSG semantically inconsistent with that of other cases. I wonder if it is semantically cleaner to postpone the completion event until the transition into the final states (FAILED, KILLED). The emission of TaskAttemptUnsuccessfullyCompletionEvents happens currently before transition into FAIL_FINISHING_CONTAINER, FAILED or KILLED state, but given that FAIL_FINISHING_CONTAINER will eventually transition into FAILED state, we could reduce the three cases into two (See the attachment show transitions during which an TaskAttemptUnsuccessfullyCompletionEvent is generated). That is, right before a task attempt goes into KILLED or FAILED, a TaskAttemptUnsuccessfullyCompletionEvents is written into the .jhist file.
          Hide
          jlowe Jason Lowe added a comment -

          Not sure how a unit test can be written. Any suggestion is greatly appreciated.

          A unit test could verify that when the RMCommunicator receives a container completion event with diagnostics it sends the diagnostic event before it sends the completion event. That test will fail before this change and pass afterwards.

          Show
          jlowe Jason Lowe added a comment - Not sure how a unit test can be written. Any suggestion is greatly appreciated. A unit test could verify that when the RMCommunicator receives a container completion event with diagnostics it sends the diagnostic event before it sends the completion event. That test will fail before this change and pass afterwards.
          Hide
          jlowe Jason Lowe added a comment -

          Thanks for the report and patch! This is closely related to MAPREDUCE-4955.

          I think this is necessary but not sufficient to fix the problem. As noted in MAPREDUCE-4955 the task can fail due to the SIGTERM and report that failure via the umbilical before the container completion event arrives at the AM. At that point the task attempt is already dead from the AM perspective and the .jhist entry already recorded, so the extra diagnostics have nowhere to go. The AM would either need to postpone recording the attempt completion event until it receives the container completion event to see if there are any diagnostics or there needs to be a way to record postmortem diagnostics for attempts in the jhist file.

          Show
          jlowe Jason Lowe added a comment - Thanks for the report and patch! This is closely related to MAPREDUCE-4955 . I think this is necessary but not sufficient to fix the problem. As noted in MAPREDUCE-4955 the task can fail due to the SIGTERM and report that failure via the umbilical before the container completion event arrives at the AM. At that point the task attempt is already dead from the AM perspective and the .jhist entry already recorded, so the extra diagnostics have nowhere to go. The AM would either need to postpone recording the attempt completion event until it receives the container completion event to see if there are any diagnostics or there needs to be a way to record postmortem diagnostics for attempts in the jhist file.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 13m 21s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 7m 17s trunk passed
          +1 compile 0m 22s trunk passed
          +1 checkstyle 0m 16s trunk passed
          +1 mvnsite 0m 29s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          +1 findbugs 0m 34s trunk passed
          +1 javadoc 0m 15s trunk passed
          +1 mvninstall 0m 21s the patch passed
          +1 compile 0m 20s the patch passed
          +1 javac 0m 20s the patch passed
          +1 checkstyle 0m 13s the patch passed
          +1 mvnsite 0m 26s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 0m 39s the patch passed
          +1 javadoc 0m 12s the patch passed
          +1 unit 8m 44s hadoop-mapreduce-client-app in the patch passed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          34m 58s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12825749/mapreduce6771.001.patch
          JIRA Issue MAPREDUCE-6771
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 3f99f0b54520 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 19c743c
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6700/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6700/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 13m 21s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 7m 17s trunk passed +1 compile 0m 22s trunk passed +1 checkstyle 0m 16s trunk passed +1 mvnsite 0m 29s trunk passed +1 mvneclipse 0m 15s trunk passed +1 findbugs 0m 34s trunk passed +1 javadoc 0m 15s trunk passed +1 mvninstall 0m 21s the patch passed +1 compile 0m 20s the patch passed +1 javac 0m 20s the patch passed +1 checkstyle 0m 13s the patch passed +1 mvnsite 0m 26s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 39s the patch passed +1 javadoc 0m 12s the patch passed +1 unit 8m 44s hadoop-mapreduce-client-app in the patch passed. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 34m 58s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12825749/mapreduce6771.001.patch JIRA Issue MAPREDUCE-6771 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 3f99f0b54520 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 19c743c Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6700/testReport/ modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6700/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          haibochen Haibo Chen added a comment -

          Uploading a patch to fix this. Not sure how a unit test can be written. Any suggestion is greatly appreciated.

          Show
          haibochen Haibo Chen added a comment - Uploading a patch to fix this. Not sure how a unit test can be written. Any suggestion is greatly appreciated.
          Hide
          haibochen Haibo Chen added a comment -

          If tasked are killed or failed on NM before they can notify AM, the user need to dig through NM logs, or task logs hoping they can find some useful information as to why the task attempt failed.

          Show
          haibochen Haibo Chen added a comment - If tasked are killed or failed on NM before they can notify AM, the user need to dig through NM logs, or task logs hoping they can find some useful information as to why the task attempt failed.
          Hide
          haibochen Haibo Chen added a comment -

          Analysis:

          RMContainerAllocator.getResources() {
            ...
                for (ContainerStatus cont : finishedContainers) {
                LOG.info("Received completed container " + cont.getContainerId());
                TaskAttemptId attemptID = assignedRequests.get(cont.getContainerId());
                if (attemptID == null) {
                  LOG.error("Container complete event for unknown container id "
                      + cont.getContainerId());
                } else {
                  pendingRelease.remove(cont.getContainerId());
                  assignedRequests.remove(attemptID);
                  
                  // send the container completed event to Task attempt
                  eventHandler.handle(createContainerFinishedEvent(cont, attemptID));
                  
                  // Send the diagnostics
                  String diagnostics = StringInterner.weakIntern(cont.getDiagnostics());
                  eventHandler.handle(new TaskAttemptDiagnosticsUpdateEvent(attemptID,
                      diagnostics));
          
                  preemptionPolicy.handleCompletedContainer(attemptID);
                }
            ...
          }
          

          The scenario in question is described as follows: A job is running, and one of tasks attempt running on a NM is killed by the NM because the container exceeds its resource limit. The container status/diagnostics is sent to RM by the NM and then later to MR AM in its periodical heartbeat with RM as shown above. In MR AM, the task attempt is still in RUNNING state from AM's perspective, since the task heartbeat has not timed out.

          Upon receiving from RM that the task attempt container has finished, the RMCommunicator thread will place a ContainerFinishedEvent and a TaskAttemptDiagnosticsUpdateEvent in the event queue.

          The ContainerFinishedEvent will cause the task attempt in MR AM to transition from RUNNING to FAILED and a TaskAttemptUnsuccessfulCompletionEvent that contains the associated diagnostics information to be written to the .jhist file. The TaskAttemptDiagnosticsUpdateEvent will update the diagnostics information associated with the task attempt.

          But since the ContainerFinishedEvent is placed and processed before the TaskAttemptDiagnosticsUpdateEvent, the TaskAttemptUnsuccessfulCompletionEvent written to .jhist file will not contain the diagnostics info received from RM.

          After the job is completed, the user tries to access the failed task attempts through JHS, the TaskAttemptUnsuccessfulCompletionEvent is parsed to generate the failed attempt page. The page will not have diagnostics info from RM (such as container killed by Node Manager...) because it was never written to .jhist in the first place.

          Show
          haibochen Haibo Chen added a comment - Analysis: RMContainerAllocator.getResources() { ... for (ContainerStatus cont : finishedContainers) { LOG.info( "Received completed container " + cont.getContainerId()); TaskAttemptId attemptID = assignedRequests.get(cont.getContainerId()); if (attemptID == null ) { LOG.error( "Container complete event for unknown container id " + cont.getContainerId()); } else { pendingRelease.remove(cont.getContainerId()); assignedRequests.remove(attemptID); // send the container completed event to Task attempt eventHandler.handle(createContainerFinishedEvent(cont, attemptID)); // Send the diagnostics String diagnostics = StringInterner.weakIntern(cont.getDiagnostics()); eventHandler.handle( new TaskAttemptDiagnosticsUpdateEvent(attemptID, diagnostics)); preemptionPolicy.handleCompletedContainer(attemptID); } ... } The scenario in question is described as follows: A job is running, and one of tasks attempt running on a NM is killed by the NM because the container exceeds its resource limit. The container status/diagnostics is sent to RM by the NM and then later to MR AM in its periodical heartbeat with RM as shown above. In MR AM, the task attempt is still in RUNNING state from AM's perspective, since the task heartbeat has not timed out. Upon receiving from RM that the task attempt container has finished, the RMCommunicator thread will place a ContainerFinishedEvent and a TaskAttemptDiagnosticsUpdateEvent in the event queue. The ContainerFinishedEvent will cause the task attempt in MR AM to transition from RUNNING to FAILED and a TaskAttemptUnsuccessfulCompletionEvent that contains the associated diagnostics information to be written to the .jhist file. The TaskAttemptDiagnosticsUpdateEvent will update the diagnostics information associated with the task attempt. But since the ContainerFinishedEvent is placed and processed before the TaskAttemptDiagnosticsUpdateEvent, the TaskAttemptUnsuccessfulCompletionEvent written to .jhist file will not contain the diagnostics info received from RM. After the job is completed, the user tries to access the failed task attempts through JHS, the TaskAttemptUnsuccessfulCompletionEvent is parsed to generate the failed attempt page. The page will not have diagnostics info from RM (such as container killed by Node Manager...) because it was never written to .jhist in the first place.

            People

            • Assignee:
              haibochen Haibo Chen
              Reporter:
              haibochen Haibo Chen
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development