Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-4437

Race in MR ApplicationMaster can cause reducers to never be scheduled

    Details

      Description

      If the MR AM is notified of container completion by the RM before the AM receives notification of the container cleanup from the NM then it can fail to schedule reducers indefinitely. Logs showing the issue to follow.

        Activity

        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1139 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1139/)
        MAPREDUCE-4437. Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209)

        Result = FAILURE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #1139 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1139/ ) MAPREDUCE-4437 . Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209) Result = FAILURE bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #316 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/316/)
        svn merge -c 1362209 FIXES: MAPREDUCE-4437. Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362214)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362214
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #316 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/316/ ) svn merge -c 1362209 FIXES: MAPREDUCE-4437 . Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362214) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362214 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1106 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1106/)
        MAPREDUCE-4437. Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209)

        Result = FAILURE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #1106 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1106/ ) MAPREDUCE-4437 . Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209) Result = FAILURE bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #2503 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2503/)
        MAPREDUCE-4437. Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209)

        Result = FAILURE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #2503 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2503/ ) MAPREDUCE-4437 . Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209) Result = FAILURE bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #2483 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2483/)
        MAPREDUCE-4437. Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #2483 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2483/ ) MAPREDUCE-4437 . Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #2548 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2548/)
        MAPREDUCE-4437. Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #2548 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2548/ ) MAPREDUCE-4437 . Race in MR ApplicationMaster can cause reducers to never be scheduled (Jason Lowe via bobby) (Revision 1362209) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362209 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
        Hide
        Robert Joseph Evans added a comment -

        Missed 2.1.0 branch, this went into 2.2 instead.

        Show
        Robert Joseph Evans added a comment - Missed 2.1.0 branch, this went into 2.2 instead.
        Hide
        Robert Joseph Evans added a comment -

        Thanks Jason,

        I put this patch into trunk, branch-2, and branch-0.23

        Show
        Robert Joseph Evans added a comment - Thanks Jason, I put this patch into trunk, branch-2, and branch-0.23
        Hide
        Robert Joseph Evans added a comment -

        The change looks good, and it is generic enough to be sure that we recalculate every time a container finishes.

        +1.

        Show
        Robert Joseph Evans added a comment - The change looks good, and it is generic enough to be sure that we recalculate every time a container finishes. +1.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12536456/MAPREDUCE-4437.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2588//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2588//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12536456/MAPREDUCE-4437.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2588//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2588//console This message is automatically generated.
        Hide
        Jason Lowe added a comment -

        Patch to recalculate the reduce schedule if the number of completed tasks in the job changes from the last time we recalculated.

        Show
        Jason Lowe added a comment - Patch to recalculate the reduce schedule if the number of completed tasks in the job changes from the last time we recalculated.
        Hide
        Jason Lowe added a comment -

        This is an excerpt of the AM log from a sleep job with 20 mappers and 20 reducers and mapreduce.job.reduce.slowstart.completedmaps=1.0:

        ...
        2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1342108144362_0002_01_000017
        2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1342108144362_0002_01_000015
        2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule...
        2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 20
        2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000014_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000005_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000009_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000007_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000010_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000003_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000018_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000012_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000008_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000002_0: 
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000017_0: 
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000011_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000001_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000016_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000015_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:35,723 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000013_0: Container killed by the ApplicationMaster.
        
        2012-07-12 15:51:36,695 INFO [Socket Reader #1 for port 50500] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successfull for 2 (auth:SIMPLE)
        2012-07-12 15:51:36,696 INFO [Socket Reader #1 for port 50500] SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successfull for 2 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$BlockingInterface
        2012-07-12 15:51:36,715 INFO [Socket Reader #2 for port 50500] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successfull for 2 (auth:SIMPLE)
        2012-07-12 15:51:36,716 INFO [Socket Reader #2 for port 50500] SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successfull for 2 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$BlockingInterface
        2012-07-12 15:51:36,723 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:36,725 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:36,752 INFO [IPC Server handler 0 on 50500] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for MAP   job_1342108144362_0002. Report-size will be 20
        2012-07-12 15:51:36,774 INFO [IPC Server handler 0 on 50500] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for REDUCE   job_1342108144362_0002. Report-size will be 20
        2012-07-12 15:51:37,727 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:37,729 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:38,731 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:38,733 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:38,905 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1342108144362_0002_m_000002_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
        2012-07-12 15:51:38,905 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1342108144362_0002_m_000002_0
        2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1342108144362_0002_m_000002 Task Transitioned from RUNNING to SUCCEEDED
        2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1342108144362_0002_m_000017_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
        2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 19
        2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1342108144362_0002_m_000017_0
        2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1342108144362_0002_m_000017 Task Transitioned from RUNNING to SUCCEEDED
        2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 20
        2012-07-12 15:51:39,735 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:39,736 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:40,739 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:40,741 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:41,743 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        2012-07-12 15:51:41,744 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
        ...
        

        Note the Recalculating schedule... message in the above log is the last occurrence of that message in the entire log. What appears to have happened is the RM notified the AM of all 20 containers having completed before all of them were fully processed by the AM. Therefore when it went to recalculate the reducer schedule, the slow start threshold had not yet been met since it thought only 18 of the 20 maps had completed. Later it processed the remaining two map task completions, but that did not recalculate the reducer schedule. No other container completions or events will occur after that, so the reducer schedule never gets recalculated. Without recalculating the reducer schedule, the AM will perpetually mistakenly think it should not schedule any reducers, and the job hangs.

        Show
        Jason Lowe added a comment - This is an excerpt of the AM log from a sleep job with 20 mappers and 20 reducers and mapreduce.job.reduce.slowstart.completedmaps=1.0: ... 2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1342108144362_0002_01_000017 2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1342108144362_0002_01_000015 2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule... 2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 20 2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000014_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000005_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000009_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000007_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000010_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000003_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000018_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000012_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000008_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000002_0: 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000017_0: 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000011_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000001_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000016_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000015_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:35,723 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000013_0: Container killed by the ApplicationMaster. 2012-07-12 15:51:36,695 INFO [Socket Reader #1 for port 50500] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successfull for 2 (auth:SIMPLE) 2012-07-12 15:51:36,696 INFO [Socket Reader #1 for port 50500] SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successfull for 2 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$BlockingInterface 2012-07-12 15:51:36,715 INFO [Socket Reader #2 for port 50500] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successfull for 2 (auth:SIMPLE) 2012-07-12 15:51:36,716 INFO [Socket Reader #2 for port 50500] SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successfull for 2 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$BlockingInterface 2012-07-12 15:51:36,723 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:36,725 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:36,752 INFO [IPC Server handler 0 on 50500] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for MAP job_1342108144362_0002. Report-size will be 20 2012-07-12 15:51:36,774 INFO [IPC Server handler 0 on 50500] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for REDUCE job_1342108144362_0002. Report-size will be 20 2012-07-12 15:51:37,727 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:37,729 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:38,731 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:38,733 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:38,905 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1342108144362_0002_m_000002_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED 2012-07-12 15:51:38,905 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1342108144362_0002_m_000002_0 2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1342108144362_0002_m_000002 Task Transitioned from RUNNING to SUCCEEDED 2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1342108144362_0002_m_000017_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED 2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 19 2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1342108144362_0002_m_000017_0 2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1342108144362_0002_m_000017 Task Transitioned from RUNNING to SUCCEEDED 2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 20 2012-07-12 15:51:39,735 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:39,736 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:40,739 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:40,741 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:41,743 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 2012-07-12 15:51:41,744 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640 ... Note the Recalculating schedule... message in the above log is the last occurrence of that message in the entire log. What appears to have happened is the RM notified the AM of all 20 containers having completed before all of them were fully processed by the AM. Therefore when it went to recalculate the reducer schedule, the slow start threshold had not yet been met since it thought only 18 of the 20 maps had completed. Later it processed the remaining two map task completions, but that did not recalculate the reducer schedule. No other container completions or events will occur after that, so the reducer schedule never gets recalculated. Without recalculating the reducer schedule, the AM will perpetually mistakenly think it should not schedule any reducers, and the job hangs.

          People

          • Assignee:
            Jason Lowe
            Reporter:
            Jason Lowe
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development