Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-4152

map task left hanging after AM dies trying to connect to RM

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.2
    • Fix Version/s: 0.23.3, 2.0.2-alpha
    • Component/s: mrv2
    • Labels:
      None

      Description

      We had an instance where the RM went down for more then an hour. The application master exited with "Could not contact RM after 360000 milliseconds"

      2012-04-11 10:43:36,040 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1333003059741_15999Job Transitioned from RUNNING to ERROR

      1. MAPREDUCE-4152.patch
        19 kB
        Thomas Graves
      2. MAPREDUCE-4152.patch
        19 kB
        Thomas Graves
      3. MAPREDUCE-4152.patch
        19 kB
        Thomas Graves
      4. MAPREDUCE-4152.patch
        8 kB
        Thomas Graves

        Activity

        Arun C Murthy made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Arun C Murthy made changes -
        Fix Version/s 2.0.2-alpha [ 12322471 ]
        Fix Version/s 3.0.0 [ 12320355 ]
        Fix Version/s 2.1.0-alpha [ 12321442 ]
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1096 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1096/)
        MAPREDUCE-4152. map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #1096 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1096/ ) MAPREDUCE-4152 . map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #273 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/273/)
        svn merge -c 1344283. FIXES: MAPREDUCE-4152. map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344289)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344289
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #273 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/273/ ) svn merge -c 1344283. FIXES: MAPREDUCE-4152 . map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344289) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344289 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1062 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1062/)
        MAPREDUCE-4152. map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #1062 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1062/ ) MAPREDUCE-4152 . map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #2318 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2318/)
        MAPREDUCE-4152. map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283)

        Result = FAILURE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #2318 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2318/ ) MAPREDUCE-4152 . map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283) Result = FAILURE bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #2372 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2372/)
        MAPREDUCE-4152. map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #2372 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2372/ ) MAPREDUCE-4152 . map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #2299 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2299/)
        MAPREDUCE-4152. map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #2299 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2299/ ) MAPREDUCE-4152 . map task left hanging after AM dies trying to connect to RM (Tom Graves via bobby) (Revision 1344283) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1344283 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncherImpl.java
        Robert Joseph Evans made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.23.3 [ 12320060 ]
        Fix Version/s 2.0.1-alpha [ 12321442 ]
        Fix Version/s 3.0.0 [ 12320355 ]
        Resolution Fixed [ 1 ]
        Hide
        Robert Joseph Evans added a comment -

        Thanks Tom,

        I just put this into trunk, branch-2, and branch-0.23

        Show
        Robert Joseph Evans added a comment - Thanks Tom, I just put this into trunk, branch-2, and branch-0.23
        Hide
        Robert Joseph Evans added a comment -

        +1 looks good to me, I'll check it in.

        Show
        Robert Joseph Evans added a comment - +1 looks good to me, I'll check it in.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12528775/MAPREDUCE-4152.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2415//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2415//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12528775/MAPREDUCE-4152.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2415//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2415//console This message is automatically generated.
        Thomas Graves made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Thomas Graves made changes -
        Attachment MAPREDUCE-4152.patch [ 12528775 ]
        Hide
        Thomas Graves added a comment -

        slightly different check to avoid sending the event if the container is already done.

        Show
        Thomas Graves added a comment - slightly different check to avoid sending the event if the container is already done.
        Thomas Graves made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Thomas Graves made changes -
        Attachment MAPREDUCE-4152.patch [ 12528772 ]
        Hide
        Thomas Graves added a comment -

        I added the check to only try to kill it if its not already done/failed.

        Show
        Thomas Graves added a comment - I added the check to only try to kill it if its not already done/failed.
        Hide
        Robert Joseph Evans added a comment -

        I like the patch it looks good to me. I can agree with you Tom that there should be no issues with calling kill multiple times, because the protocol does not even provide a way for the NM to inform the AM that the kill failed. That being said I do think it would be cleaner to have the kill method check for the DONE state, and do nothing if it is in the DONE state.

        I also agree that once RM restart is in place and it can pick up where it left off we don't need this any more. But I have not seem much work going into RM restart lately, and I suspect that we will be in production before that happens. So I would like to see this go in.

        I am +1 on the patch even without the DONE state check in the Container Launcher.

        Show
        Robert Joseph Evans added a comment - I like the patch it looks good to me. I can agree with you Tom that there should be no issues with calling kill multiple times, because the protocol does not even provide a way for the NM to inform the AM that the kill failed. That being said I do think it would be cleaner to have the kill method check for the DONE state, and do nothing if it is in the DONE state. I also agree that once RM restart is in place and it can pick up where it left off we don't need this any more. But I have not seem much work going into RM restart lately, and I suspect that we will be in production before that happens. So I would like to see this go in. I am +1 on the patch even without the DONE state check in the Container Launcher.
        Hide
        Thomas Graves added a comment -

        Vinod, comments?

        Show
        Thomas Graves added a comment - Vinod, comments?
        Hide
        Thomas Graves added a comment -
        • I agree, but RM restart doesn't work, AM currently times out (See RMContainerAllocator - conf setting MR_AM_TO_RM_WAIT_INTERVAL_MS.), so I think it should clean up. When RM restart is implemented, the timeout of AM can possibly be removed and it won't cleanup. The killing of its task on shutdown being there won't hurt anything.
        • The RM does not kill all containers that were running because it doesn't know what containers were running. On restart it loses everything. Also when the RM does come back up, it tells all the node managers that heart beat in to reboot, so they lose the containers also.
        Show
        Thomas Graves added a comment - I agree, but RM restart doesn't work, AM currently times out (See RMContainerAllocator - conf setting MR_AM_TO_RM_WAIT_INTERVAL_MS.), so I think it should clean up. When RM restart is implemented, the timeout of AM can possibly be removed and it won't cleanup. The killing of its task on shutdown being there won't hurt anything. The RM does not kill all containers that were running because it doesn't know what containers were running. On restart it loses everything. Also when the RM does come back up, it tells all the node managers that heart beat in to reboot, so they lose the containers also.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        Going back and forth on this one, apologies.

        So the situation is that RM went down somehow and AM exited without killing its tasks. This is expected IIRC. Here's what I think:

        • When RM restart works, AMs should never exit because of connection issues. (Of course, there is a corner case of AMs network itself being down, we should handle that somehow)
        • Even in the short term, if RM goes down and AM exits in the mean while, whenever RM is back up, it will(should) kill all the containers of this application( by commanding the NMs to do so).

        Given above, I don't see why the AM needs to handle this specially. May be I am missing something?

        Show
        Vinod Kumar Vavilapalli added a comment - Going back and forth on this one, apologies. So the situation is that RM went down somehow and AM exited without killing its tasks. This is expected IIRC. Here's what I think: When RM restart works, AMs should never exit because of connection issues. (Of course, there is a corner case of AMs network itself being down, we should handle that somehow) Even in the short term, if RM goes down and AM exits in the mean while, whenever RM is back up, it will(should) kill all the containers of this application( by commanding the NMs to do so). Given above, I don't see why the AM needs to handle this specially. May be I am missing something?
        Hide
        Thomas Graves added a comment -

        java doc warnings are not from this change - were there prior.

        Show
        Thomas Graves added a comment - java doc warnings are not from this change - were there prior.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12525188/MAPREDUCE-4152.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        -1 javadoc. The javadoc tool appears to have generated 16 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2342//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2342//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525188/MAPREDUCE-4152.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified test files. -1 javadoc. The javadoc tool appears to have generated 16 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2342//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2342//console This message is automatically generated.
        Thomas Graves made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Thomas Graves made changes -
        Attachment MAPREDUCE-4152.patch [ 12525188 ]
        Hide
        Thomas Graves added a comment -
        • changed Container constructor to take individual parameters.
        • make TaskAttemptImpl handle TA_CONTAINER_CLEANED in RUNNING and after states.
        Show
        Thomas Graves added a comment - changed Container constructor to take individual parameters. make TaskAttemptImpl handle TA_CONTAINER_CLEANED in RUNNING and after states.
        Hide
        Thomas Graves added a comment -

        TA_CONTAINER_CLEANED is not handled in TaskAttempt when it is in running state, it will cause more cascading errors which we can avoid by making it a legal state at running too.

        Note, the app master isn't going to process this event when its called from the stop() service as the dispatcher is busy in the handleFinishEvent routine in the MRAppMaster. But we should handle it since its using the async dispatcher and we could add more threads to it.

        kill() can and should be made reentrant in case the container was already killed. (There is a small race when a container can be killed twice. This happens only after the patch, as stop() is out-of-band)

        I don't see anything wrong with this as is, do you see something specific I'm missing? Nothing bad happens by calling kill of a container that is already being killed or killed. We could add a check in the kill() routine to see if the container state is already ContainerState.DONE or FAILED and skip the extra call if you think that is better?

        Show
        Thomas Graves added a comment - TA_CONTAINER_CLEANED is not handled in TaskAttempt when it is in running state, it will cause more cascading errors which we can avoid by making it a legal state at running too. Note, the app master isn't going to process this event when its called from the stop() service as the dispatcher is busy in the handleFinishEvent routine in the MRAppMaster. But we should handle it since its using the async dispatcher and we could add more threads to it. kill() can and should be made reentrant in case the container was already killed. (There is a small race when a container can be killed twice. This happens only after the patch, as stop() is out-of-band) I don't see anything wrong with this as is, do you see something specific I'm missing? Nothing bad happens by calling kill of a container that is already being killed or killed. We could add a check in the kill() routine to see if the container state is already ContainerState.DONE or FAILED and skip the extra call if you think that is better?
        Thomas Graves made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hide
        Thomas Graves added a comment -

        Canceling patch to address the comments.

        Thanks for the review Vinod.

        From what I could find the NM doesn't actually kill anything except when told to do so or in the over the memory limit case. Since the RM was down and the AM went away there was no one to tell the NM to kill it. The thing that might make sense is to have the NM kill any containers when it is gracefully shutting down and when its starting up (in case of the crash case). It might not make sense for it to just kill the containers immediately since those containers could be running and finish just fine. When its shutting down its a bit easier since it knows what containers are running, the starting up is a bit harder since the NM then needs to know exactly what was running before it shut down and make sure it doesn't kill something it shouldn't. In this particular case when the RM comes back up, it tells the NM to reboot so it would kill the containers at that point.

        I think its a bit more of a corner case because normally the AM would have killed the task or the task would have finished normally. But I will file the jiras for those. Let me know if you have additional thoughts.

        I originally had a check in kill() to make sure it wasn't done but had somehow thought it wasn't needed, perhaps I misread something, will look again.

        I will make the other changes.

        Show
        Thomas Graves added a comment - Canceling patch to address the comments. Thanks for the review Vinod. From what I could find the NM doesn't actually kill anything except when told to do so or in the over the memory limit case. Since the RM was down and the AM went away there was no one to tell the NM to kill it. The thing that might make sense is to have the NM kill any containers when it is gracefully shutting down and when its starting up (in case of the crash case). It might not make sense for it to just kill the containers immediately since those containers could be running and finish just fine. When its shutting down its a bit easier since it knows what containers are running, the starting up is a bit harder since the NM then needs to know exactly what was running before it shut down and make sure it doesn't kill something it shouldn't. In this particular case when the RM comes back up, it tells the NM to reboot so it would kill the containers at that point. I think its a bit more of a corner case because normally the AM would have killed the task or the task would have finished normally. But I will file the jiras for those. Let me know if you have additional thoughts. I originally had a check in kill() to make sure it wasn't done but had somehow thought it wasn't needed, perhaps I misread something, will look again. I will make the other changes.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        The Job did not kill off the map task that it had running before exiting. In JobImpl when it moves from RUNNING to ERROR, all it does is send the JobUnsuccessfulCompletion event. I would think it would atleast try to kill any tasks it has.

        This is reasonable.

        Now there might also be another issue with NM as to why it didn't kill it.

        Can you please investigate? This seems a more important issue that needs fixing.

        Regarding the patch:

        • TA_CONTAINER_CLEANED is not handled in TaskAttempt when it is in running state, it will cause more cascading errors which we can avoid by making it a legal state at running too.
        • kill() can and should be made reentrant in case the container was already killed. (There is a small race when a container can be killed twice. This happens only after the patch, as stop() is out-of-band)
        • Very minor-nit: It is natural for the Container constructor to take in all the info it needs, instead of launcher event.
        Show
        Vinod Kumar Vavilapalli added a comment - The Job did not kill off the map task that it had running before exiting. In JobImpl when it moves from RUNNING to ERROR, all it does is send the JobUnsuccessfulCompletion event. I would think it would atleast try to kill any tasks it has. This is reasonable. Now there might also be another issue with NM as to why it didn't kill it. Can you please investigate? This seems a more important issue that needs fixing. Regarding the patch: TA_CONTAINER_CLEANED is not handled in TaskAttempt when it is in running state, it will cause more cascading errors which we can avoid by making it a legal state at running too. kill() can and should be made reentrant in case the container was already killed. (There is a small race when a container can be killed twice. This happens only after the patch, as stop() is out-of-band) Very minor-nit: It is natural for the Container constructor to take in all the info it needs, instead of launcher event.
        Hide
        Thomas Graves added a comment -

        The test was failing before this change and when I run manually it passes.

        Show
        Thomas Graves added a comment - The test was failing before this change and when I run manually it passes.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12524905/MAPREDUCE-4152.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.mapreduce.TestYarnClientProtocolProvider

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2322//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2322//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12524905/MAPREDUCE-4152.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.mapreduce.TestYarnClientProtocolProvider +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2322//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2322//console This message is automatically generated.
        Hide
        Thomas Graves added a comment -

        I also manually tested this. For the error case, I started a job with a small RM timeout set -Dyarn.app.mapreduce.am.scheduler.connection.wait.interval-ms=60 , brought down the RM, and then verified the AM killed its containers before exiting. Without this change the containers stay around after the AM goes away.

        Show
        Thomas Graves added a comment - I also manually tested this. For the error case, I started a job with a small RM timeout set -Dyarn.app.mapreduce.am.scheduler.connection.wait.interval-ms=60 , brought down the RM, and then verified the AM killed its containers before exiting. Without this change the containers stay around after the AM goes away.
        Thomas Graves made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Thomas Graves made changes -
        Field Original Value New Value
        Attachment MAPREDUCE-4152.patch [ 12524905 ]
        Hide
        Thomas Graves added a comment -

        It looks like the best way to have the app master clean up any containers that aren't completed is to do it via a service. I first investigating just having the job send the kill event when it transitioned to the ERROR state but it sends a kill event to each task, which then each task has to send a kill event to the task attempt, and then the task attempt send an event to the container launcher to tell the node manager to kill the container. That is a lot to do and I don't want to have the job wait for that to happen since its an error state. If you don't wait, then you have a race as to whether everything is actually processed. The other issue with sending events is that the final jobfinish event is handled by the same async dispatcher so it will be busy finishing/shutting down and won't process any further events. So it seems it has to be done by a service during the stop call. The container launcher already knows what containers it has that aren't complete so I chose to have the container launcher kill any containers that haven't completed when its stop routine is called. The bad part is the container launcher didn't have all the information required to actually kill the container so I had to add it, which I'm not completely happy with but seemed the best fit.

        I will attach a patch shortly.

        Show
        Thomas Graves added a comment - It looks like the best way to have the app master clean up any containers that aren't completed is to do it via a service. I first investigating just having the job send the kill event when it transitioned to the ERROR state but it sends a kill event to each task, which then each task has to send a kill event to the task attempt, and then the task attempt send an event to the container launcher to tell the node manager to kill the container. That is a lot to do and I don't want to have the job wait for that to happen since its an error state. If you don't wait, then you have a race as to whether everything is actually processed. The other issue with sending events is that the final jobfinish event is handled by the same async dispatcher so it will be busy finishing/shutting down and won't process any further events. So it seems it has to be done by a service during the stop call. The container launcher already knows what containers it has that aren't complete so I chose to have the container launcher kill any containers that haven't completed when its stop routine is called. The bad part is the container launcher didn't have all the information required to actually kill the container so I had to add it, which I'm not completely happy with but seemed the best fit. I will attach a patch shortly.
        Hide
        Thomas Graves added a comment -

        The Job did not kill off the map task that it had running before exiting. In JobImpl when it moves from RUNNING to ERROR, all it does is send the JobUnsuccessfulCompletion event. I would think it would atleast try to kill any tasks it has.

        Now there might also be another issue with NM as to why it didn't kill it. I need to investigate that further. The NM was also not able to connect to RM and I saw one of the threads restart. I'm guessing when that restarted it lost that container but I need to investigate that further.

        Show
        Thomas Graves added a comment - The Job did not kill off the map task that it had running before exiting. In JobImpl when it moves from RUNNING to ERROR, all it does is send the JobUnsuccessfulCompletion event. I would think it would atleast try to kill any tasks it has. Now there might also be another issue with NM as to why it didn't kill it. I need to investigate that further. The NM was also not able to connect to RM and I saw one of the threads restart. I'm guessing when that restarted it lost that container but I need to investigate that further.
        Thomas Graves created issue -

          People

          • Assignee:
            Thomas Graves
            Reporter:
            Thomas Graves
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development