Hadoop Map/Reduce
MAPREDUCE-6513

MR job hangs forever when one NM is unstable for some time

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      While a job with a large number of tasks was in progress, one node became unstable due to an OS issue. After the node became unstable, the maps that had run on this node changed to the KILLED state.

      The maps which had been running on the unstable node were rescheduled; all of them stayed in the SCHEDULED state, waiting for the RM to assign containers. Ask requests for the maps were seen until the node became good again (all of those failed), and there were no ask requests after that. But the AM keeps on preempting the reducers (it keeps recycling them).

      In the end, the reducers are waiting for the mappers to complete, and the mappers never get containers.

      My question is:
      ============
      Why did the AM not send the map requests again once the node recovered?
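
      Below is a minimal, self-contained sketch (not the actual Hadoop source) of why the rescheduled maps can starve: the AM asks for reduces at a numerically lower, and therefore more important, YARN priority than maps, so while reduce asks are outstanding the RM keeps handing back reduce-priority containers. The priority values used here are assumptions for illustration.

      // Illustration only: the priority values are assumptions, not a quote of
      // RMContainerAllocator.
      import java.util.ArrayList;
      import java.util.Comparator;
      import java.util.List;

      public class PriorityStarvationSketch {
        static final int PRIORITY_REDUCE = 10; // assumed: reduce asks use a lower (more important) number
        static final int PRIORITY_MAP = 20;    // assumed: map asks use a higher (less important) number

        record Ask(String type, int priority) {}

        public static void main(String[] args) {
          List<Ask> asks = new ArrayList<>();
          asks.add(new Ask("651 reduces (already outstanding)", PRIORITY_REDUCE));
          asks.add(new Ask("16 rescheduled maps", PRIORITY_MAP));

          // A numerically lower priority is served first, so reduce-priority
          // containers keep coming back while the rescheduled maps wait.
          asks.sort(Comparator.comparingInt(Ask::priority));
          System.out.println("RM serves asks in this order: " + asks);
        }
      }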

      1. MAPREDUCE-6513.01.patch
        37 kB
        Varun Saxena
      2. MAPREDUCE-6513.02.patch
        34 kB
        Varun Saxena
      3. MAPREDUCE-6513.03.patch
        34 kB
        Varun Saxena
      4. MAPREDUCE-6513.3_1.branch-2.7.patch
        38 kB
        Wangda Tan
      5. MAPREDUCE-6513.3_1.branch-2.8.patch
        34 kB
        Wangda Tan
      6. MAPREDUCE-6513.3.branch-2.8.patch
        34 kB
        Wangda Tan

        Issue Links

          Activity

          varun_saxena Varun Saxena added a comment -

          Took the logs from Bob offline for analysis. The scenario is as follows:
          1. All the maps have completed.

          2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:0 AssignedReds:0 CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14

          2. One node becomes unstable, and hence some of the map tasks that had succeeded on that node are killed

          2015-10-13 04:53:41,127 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000077_0
          2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000026_0
          2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000007_0
          2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000034_0
          2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000015_0
          2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000036_0
          

          3. As can be seen below, 16 maps are now scheduled

          2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:16 ScheduledReds:651 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14

          4. Node comes back up again after a while.

          5. After this, we keep seeing the reducers get preempted and scheduled again, over and over in a cycle, and the mappers are never assigned (due to their lower priority).

          2015-10-13 04:38:40,219 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:2 AssignedReds:0 CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
          2015-10-13 04:38:40,223 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:1 AssignedReds:0 CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
          2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:0 AssignedReds:0 CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
          2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:16 ScheduledReds:651 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
          2015-10-13 04:53:42,132 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
          2015-10-13 04:54:49,433 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:84 ContRel:6 HostLocal:64 RackLocal:14
          2015-10-13 04:54:50,451 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:90 ContRel:12 HostLocal:64 RackLocal:14
          2015-10-13 04:54:51,470 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:95 ContRel:17 HostLocal:64 RackLocal:14
          2015-10-13 04:54:52,501 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:114 ContRel:36 HostLocal:64 RackLocal:14
          2015-10-13 04:54:53,553 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:129 ContRel:51 HostLocal:64 RackLocal:14
          2015-10-13 04:54:54,657 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:147 ContRel:69 HostLocal:64 RackLocal:14
          2015-10-13 04:54:55,708 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:155 ContRel:77 HostLocal:64 RackLocal:14
          .......
          
          2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:216 ContRel:138 HostLocal:64 RackLocal:14
          2015-10-13 04:55:05,923 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:5 CompletedMaps:62 CompletedReds:0 ContAlloc:221 ContRel:138 HostLocal:64 RackLocal:14
          2015-10-13 04:55:06,929 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:631 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:11 CompletedMaps:62 CompletedReds:0 ContAlloc:227 ContRel:138 HostLocal:64 RackLocal:14
          2015-10-13 04:55:07,945 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:630 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:17 CompletedMaps:62 CompletedReds:0 ContAlloc:233 ContRel:138 HostLocal:64 RackLocal:14
          2015-10-13 04:55:08,967 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:630 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21 CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64 RackLocal:14
          2015-10-13 04:55:09,967 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21 CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64 RackLocal:14
          2015-10-13 04:55:09,979 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:16 CompletedMaps:62 CompletedReds:0 ContAlloc:253 ContRel:154 HostLocal:64 RackLocal:14
          2015-10-13 04:55:11,013 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10 CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64 RackLocal:14
          2015-10-13 04:55:12,013 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:646 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10 CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64 RackLocal:14
          2015-10-13 04:55:12,031 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:8 CompletedMaps:62 CompletedReds:0 ContAlloc:267 ContRel:168 HostLocal:64 RackLocal:14
          2015-10-13 04:55:13,053 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:624 ScheduledMaps:16 ScheduledReds:15 AssignedMaps:0 AssignedReds:12 CompletedMaps:62 CompletedReds:0 ContAlloc:274 ContRel:168 HostLocal:64 RackLocal:14
          2015-10-13 04:55:14,061 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:614 ScheduledMaps:16 ScheduledReds:18 AssignedMaps:0 AssignedReds:19 CompletedMaps:62 CompletedReds:0 ContAlloc:281 ContRel:168 HostLocal:64 RackLocal:14
          ....
          
          2015-10-13 04:58:18,813 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:44 CompletedMaps:62 CompletedReds:0 ContAlloc:1372 ContRel:964 HostLocal:64 RackLocal:14
          2015-10-13 04:58:18,830 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:43 CompletedMaps:62 CompletedReds:0 ContAlloc:1386 ContRel:978 HostLocal:64 RackLocal:14
          2015-10-13 04:58:19,855 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:615 ScheduledMaps:16 ScheduledReds:8 AssignedMaps:0 AssignedReds:32 CompletedMaps:62 CompletedReds:0 ContAlloc:1394 ContRel:986 HostLocal:64 RackLocal:14
          2015-10-13 04:58:20,877 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:614 ScheduledMaps:16 ScheduledReds:3 AssignedMaps:0 AssignedReds:38 CompletedMaps:62 CompletedReds:0 ContAlloc:1400 ContRel:986 HostLocal:64 RackLocal:14
          2015-10-13 04:58:21,890 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:38 CompletedMaps:62 CompletedReds:0 ContAlloc:1405 ContRel:988 HostLocal:64 RackLocal:14
          2015-10-13 04:58:22,897 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:43 CompletedMaps:62 CompletedReds:0 ContAlloc:1410 ContRel:988 HostLocal:64 RackLocal:14
          ...
          
          
          varun_saxena Varun Saxena added a comment -

          The headroom is not very high (it sometimes even comes back as 0 in the response) because other heavy applications are running. We notice that ramp up always happens and ramp down essentially never does, which schedules reducers too aggressively. As can be seen below, there is no ramp down (except the first time, when all 651 scheduled reduces were ramped down), while ramp up keeps happening. A simplified sketch of the resulting cycle follows the logs.

          2015-10-13 04:36:53,038 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:42,132 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:651
          2015-10-13 04:53:43,135 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:44,137 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:45,140 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:46,143 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:47,146 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:48,149 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:49,152 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:50,155 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:51,158 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:52,161 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:53,164 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:54,167 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:55,170 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:56,181 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:57,184 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:58,187 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:53:59,190 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:00,193 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:01,205 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:02,208 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:03,211 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:04,213 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:05,216 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:06,219 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:07,221 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:08,225 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:09,228 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:10,231 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:11,235 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:12,239 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:13,242 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:14,245 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:15,248 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:16,276 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:17,280 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:18,283 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:19,286 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:20,289 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:21,292 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:22,295 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:23,298 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:24,301 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:25,304 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          2015-10-13 04:54:26,307 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
          
          2015-10-13 04:37:39,685 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: All maps assigned. Ramping up all remaining reduces:651
          2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 10
          2015-10-13 04:55:05,923 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 5
          2015-10-13 04:55:06,929 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 5
          2015-10-13 04:55:07,945 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 1
          2015-10-13 04:55:12,031 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 10
          2015-10-13 04:55:13,053 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 12
          2015-10-13 04:55:14,061 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 10
          2015-10-13 04:55:16,075 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:55:17,092 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 1
          2015-10-13 04:55:20,147 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 1
          2015-10-13 04:55:21,165 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 10
          2015-10-13 04:55:22,175 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:55:23,184 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 5
          2015-10-13 04:55:24,197 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 1
          2015-10-13 04:55:29,299 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 8
          2015-10-13 04:55:30,311 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 15
          2015-10-13 04:55:31,320 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 10
          2015-10-13 04:55:32,327 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 1
          2015-10-13 04:55:43,496 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 1
          2015-10-13 04:55:44,509 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 4
          2015-10-13 04:55:45,521 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 5
          2015-10-13 04:55:46,530 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 4
          2015-10-13 04:55:47,543 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:55:57,680 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 1
          2015-10-13 04:55:58,698 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 6
          2015-10-13 04:55:59,715 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 5
          2015-10-13 04:56:00,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 6
          2015-10-13 04:56:05,795 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:56:07,820 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 1
          2015-10-13 04:56:08,831 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:56:09,841 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:56:10,853 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:56:22,018 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 15
          2015-10-13 04:56:23,036 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:56:24,043 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 6
          2015-10-13 04:56:29,114 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:56:31,138 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 4
          2015-10-13 04:56:32,148 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 3
          2015-10-13 04:56:33,157 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 3
          2015-10-13 04:56:45,328 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 6
          2015-10-13 04:56:46,349 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 2
          2015-10-13 04:56:47,356 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 3
          2015-10-13 04:56:57,499 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 8
          2015-10-13 04:56:58,514 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 7
          2015-10-13 04:56:59,521 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping up 10
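
          To make the cycle concrete, here is a simplified, self-contained sketch of the behaviour described above. It is an illustration under the assumption that the headroom reported by the RM stays too small for the 16 hanging maps; it is not the allocator code itself.

          public class RampCycleSketch {
            public static void main(String[] args) {
              int scheduledMaps = 16;    // rescheduled maps that never get containers
              int pendingReduces = 651;  // reduces waiting to be (re)scheduled
              int scheduledReduces = 0;
              int headroom = 10;         // assumed: containers the RM can offer per heartbeat

              for (int round = 1; round <= 3 && scheduledMaps > 0; round++) {
                // "Ramp up": the headroom looks usable, so pending reduces are scheduled again.
                int rampUp = Math.min(headroom, pendingReduces);
                pendingReduces -= rampUp;
                scheduledReduces += rampUp;
                System.out.printf("round %d: ramping up %d reduces%n", round, rampUp);

                // The maps are still starved, so the reduce containers are preempted/released
                // to make room for them ...
                System.out.printf("round %d: preempting %d reduces for %d hanging maps%n",
                    round, scheduledReduces, scheduledMaps);
                pendingReduces += scheduledReduces;
                scheduledReduces = 0;
                // ... and the next heartbeat ramps them up again, so the job never finishes.
              }
            }
          }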
          
          varun_saxena Varun Saxena added a comment -

          yarn.app.mapreduce.am.job.reduce.rampup.limit is at its default value of 0.5.
          Because of this value, it is deemed that the maps have enough resources, and the reducers are ramped up.
          Should we really be ramping up when we have hanging map requests, irrespective of the configuration value?
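
          The arithmetic behind that, as I understand it, looks roughly like the sketch below. It is a rough illustration with assumed numbers, not a quote of scheduleReduces: the limit caps the share of the job's total resource that reduces may take while maps are incomplete, and the map need is judged against the remainder, which on paper always looks sufficient.

          public class RampupLimitSketch {
            public static void main(String[] args) {
              float rampupLimit = 0.5f;      // yarn.app.mapreduce.am.job.reduce.rampup.limit (default)
              int totalMemLimit = 64 * 4096; // assumed: assigned memory plus headroom, in MB
              int mapMemPerTask = 4096;      // per-map container size in this job
              int scheduledMaps = 16;        // the hanging map requests

              int reduceMemLimit = (int) (totalMemLimit * rampupLimit);
              int memLeftForMaps = totalMemLimit - reduceMemLimit;
              int mapMemNeeded = scheduledMaps * mapMemPerTask;

              // With the default 0.5, half of the computed limit is nominally reserved for
              // maps, so the 16 maps look satisfiable and reduces keep getting ramped up,
              // even though the real headroom reported by the RM is close to zero.
              System.out.printf("maps need %d MB, %d MB nominally left for maps -> ramp up reduces: %b%n",
                  mapMemNeeded, memLeftForMaps, mapMemNeeded <= memLeftForMaps);
            }
          }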

          Hide
          varun_saxena Varun Saxena added a comment -

          One more thing I noticed is that in RMContainerAllocator#preemptReducesIfNeeded, we simply clear the scheduled reduces map and move these reducers to pending. This is not reflected in the ask, so the RM keeps assigning containers which the AM cannot use because no reducer is scheduled (see the logs below the code). Although this eventually leads to these reducers not being assigned, why are we not updating the ask immediately?

                  LOG.info("Ramping down all scheduled reduces:"
                      + scheduledRequests.reduces.size());
                  for (ContainerRequest req : scheduledRequests.reduces.values()) {
                    pendingReduces.add(req);
                  }
                  scheduledRequests.reduces.clear();
          
          2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container not assigned : container_1437451211867_1485_01_000215
          2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Cannot assign container Container: [ContainerId: container_1437451211867_1485_01_000216, NodeId: hdszzdcxdat6g06u04p:26009, NodeHttpAddress: hdszzdcxdat6g06u04p:26010, Resource: <memory:4096, vCores:1>, Priority: 10, Token: Token { kind: ContainerToken, service: 10.2.33.236:26009 }, ] for a reduce as either  container memory less than required 4096 or no pending reduce tasks - reduces.isEmpty=true
          2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container not assigned : container_1437451211867_1485_01_000216
          2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Cannot assign container Container: [ContainerId: container_1437451211867_1485_01_000217, NodeId: hdszzdcxdat6g06u06p:26009, NodeHttpAddress: hdszzdcxdat6g06u06p:26010, Resource: <memory:4096, vCores:1>, Priority: 10, Token: Token { kind: ContainerToken, service: 10.2.33.239:26009 }, ] for a reduce as either  container memory less than required 4096 or no pending reduce tasks - reduces.isEmpty=true
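          A minimal sketch of the kind of change being suggested, i.e. shrinking the ask at the same time the reduces are moved back to pending; the decContainerReq-style helper used here is an assumption for illustration, not the actual patch:

          // Hypothetical sketch only: ramp down scheduled reduces and also decrement the
          // outstanding ask so the RM stops handing out reducer containers the AM will reject.
          LOG.info("Ramping down all scheduled reduces:" + scheduledRequests.reduces.size());
          for (ContainerRequest req : scheduledRequests.reduces.values()) {
            pendingReduces.add(req);
            decContainerReq(req); // assumed RMContainerRequestor-style helper that updates the ask
          }
          scheduledRequests.reduces.clear();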
          
          varun_saxena Varun Saxena added a comment -

          cc Jason Lowe, Karthik Kambatla, Devaraj K, your thoughts on this?

          kasha Karthik Kambatla added a comment -

          Yep, looks like a bug.

          rohithsharma Rohith Sharma K S added a comment -

          Varun Saxena, thanks for your detailed analysis.
          From the logs you extracted in your previous comment, I see that ramping up of reducers happens regardless of whether scheduledMaps is zero or greater than zero. I think the code below should not blindly ramp up the reducers:

          if (rampUp > 0) {
            rampUp = Math.min(rampUp, numPendingReduces);
            LOG.info("Ramping up " + rampUp);
            rampUpReduces(rampUp);
          }
          

          I think checking for scheduledMaps == 0 while ramping up should avoid the issue, regardless of mapper priority. But then the question is: what if the scheduled maps are failed map attempts? A better way to handle this is to check the priority of all the scheduled maps. If the priority of all scheduled maps is less than that of the reducers, then ramping up can be done.

          // If scheduledMaps is non-zero then, regardless of mapper priority, do not ramp up reducers.
          if (rampUp > 0 && scheduledMaps == 0) {
            rampUp = Math.min(rampUp, numPendingReduces);
            LOG.info("Ramping up " + rampUp);
            rampUpReduces(rampUp);
          }
          

          Any thoughts?

          rohithsharma Rohith Sharma K S added a comment -

          Oh!! The solution suggested above, i.e. rampUp > 0 && scheduledMaps == 0, breaks ramping up of a few reducers. But I still feel that ramping up a few intermediate reducer requests should not be done. I do not know the story behind why ramping up was introduced!!?

          varun_saxena Varun Saxena added a comment -

          Yes, I agree. If there are map requests hanging around for a while, we should probably not ramp up the reducers.
          Maybe a config can be added to decide how long to wait before we consider that the mappers have been starved? Thoughts?

          One more thing which I pointed out above is that we do not update the ask when we ramp down all the reducers (in preemptReducesIfNeeded()). I am not sure why we do not do so.

          kasha Karthik Kambatla added a comment -

          Maybe a config can be added to decide how long to wait before we consider that the mappers have been starved? Thoughts?

          MAPREDUCE-6302 essentially adds that. Can we re-use the same config?

          rohithsharma Rohith Sharma K S added a comment -

          Right. The method RMContainerAllocator#getNumHangingRequests can be reused to get the hanging mapper requests, and ramp-up can proceed only if there are no hanging mappers.
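          A rough sketch of that idea, reusing the names from the discussion above (the method's argument shape and the variable names are assumptions, not the final patch):

          // Hypothetical sketch only: gate reducer ramp-up on starving map requests.
          int hangingMapRequests = getNumHangingRequests(scheduledRequests.maps);
          if (rampUp > 0 && hangingMapRequests == 0) {
            rampUp = Math.min(rampUp, numPendingReduces);
            LOG.info("Ramping up " + rampUp);
            rampUpReduces(rampUp);
          }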

          sunilg Sunil G added a comment -

          Hi Rohith Sharma K S
          Yes, getNumHangingRequests looks like the correct metric. Just to add a thought to this discussion: the already-placed reducer requests will still be served by the RM, and the AM has to reject all of them; only after that can the newly placed map requests be served. So, as we discussed earlier, could we also spin out a discussion on resetting the already-placed reducer requests for a faster solution?

          varun_saxena Varun Saxena added a comment -

          Yes Sunil, we need to update the ask to indicate to the RM that it need not allocate for these reducers. This is what I talked about in one of my comments yesterday.
          In short, in this JIRA I intend to take a two-pronged approach to resolve it.
          1. Update the ask to tell the RM that it need not allocate for ramped-down reducers (ramped down in the preemptReducesIfNeeded() method). We are currently testing this change.
          2. Introduce a config, or reuse the MAPREDUCE-6302 config, to determine hanging map requests, and do not ramp up reducers if mappers are starved. I have not looked at the post-MAPREDUCE-6302 code, but this is the basic idea.

          sunilg Sunil G added a comment -

          Hi Varun Saxena
          I feel point 1 can be tracked separately, as it may bring more complexity. I can give an example.

          Initially the AM has placed 10 reducer requests at timeframe1. Assume that in the next AM heartbeat we try to reset this count to 5 because of the issues we found. However, the RM could already have allocated some containers against the previously placed requests.

          So in the new AM heartbeat we will have an updated ask of 5 reducers for timeframe1, and in the response we may receive containers the RM newly allocated for the earlier requests. The AM then has to reject them or update the count again in the next heartbeat, and this may go on.

          The AM will reject the allocated reducer containers, but a lot of rejections may occur in these corner cases, so we need to be careful here.

          varun_saxena Varun Saxena added a comment -

          Yes, we see rejections in our case too. I am fine with tracking it separately. I will file a JIRA and we can discuss further there.

          sunilg Sunil G added a comment -

          OK, I also think so. Rohith Sharma K S, how do you feel?

          varun_saxena Varun Saxena added a comment -

          Filed MAPREDUCE-6514

          chen317 chong chen added a comment -

          Varun, thanks for your detailed analysis. I do have a question though.

          Looking at the flow, if the mapper tasks failed because of the node, why did the MapReduce Application Master not treat this as a map task failure? If it did, the current logic would reset the map priority to PRIORITY_FAST_FAIL_MAP instead of PRIORITY_MAP, so by design it would have a higher priority than the reducers, and the problem you mention would no longer be a problem. Is there any particular reason why the failed map task was not recognized as such?

          Of course, the current YARN RM/AM protocol is not a strict delta-based protocol; it suffers from inconsistency between the parties and causes lots of race conditions. Redesigning the protocol is not easy work; for now, what we can do is fix the issues one by one. So I agree with logging MAPREDUCE-6514 to track this individual case.

          Chong

          devaraj.k Devaraj K added a comment -

          I agree with chong chen. Failed maps have a higher priority (PRIORITY_FAST_FAIL_MAP) than the reducers (PRIORITY_REDUCE), so the MR AM should get a container for a failed map before a reducer here, if resources are available for the map.

          Bob.zhao/Varun Saxena, what is the map memory request for this job? And do you have a chance to share the complete MR App Master log?
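          For reference, a small self-contained sketch of the priority ordering being discussed; 5 (fast-fail map) and 10 (reduce) are visible in the logs earlier in this thread, while 20 for ordinary maps is an assumption added only for illustration:

          import org.apache.hadoop.yarn.api.records.Priority;

          public class RequestPrioritySketch {
            // Lower value = served earlier by the scheduler.
            static final Priority FAST_FAIL_MAP = Priority.newInstance(5);
            static final Priority REDUCE = Priority.newInstance(10);
            static final Priority MAP = Priority.newInstance(20); // assumption for illustration

            public static void main(String[] args) {
              // A re-requested map at priority 5 outranks reducers at priority 10, which is why
              // bumping maps killed on a bad node avoids the starvation described above.
              System.out.println("fast-fail map = " + FAST_FAIL_MAP.getPriority()
                  + ", reduce = " + REDUCE.getPriority() + ", map = " + MAP.getPriority());
            }
          }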

          varun_saxena Varun Saxena added a comment -

          Thanks chong chen and Devaraj K for sharing your thoughts on this.

          The obvious solution we considered when we hit this issue was to mark the map task as failed so that its priority becomes 5, which would mean the scheduler assigns resources to it before the reducers. But after a long internal discussion, we decided against it. The main reason is: should we mark a mapper as failed when it is perfectly fine and has already been marked succeeded? Also, this would be counted towards task attempt failures. Whether to kill it or fail it is frankly a debatable topic, and there was a long discussion on it in the JIRA where this code was added (refer to MAPREDUCE-3921).
          cc Bikas Saha, Vinod Kumar Vavilapalli so that they can also share their thoughts on this.

          Moreover, once the map task has been killed, it is as good as an original task attempt in the scheduled stage (with a new task attempt scheduled). So if resources could be assigned to the original attempt, they should be assignable to this new attempt as well (if headroom is available). This made me think that there must be some other problem as well. Kindly note that the slowstart.completedmaps config here was 0.05.

          Assuming the headroom coming from the RM was correct, we dug into the logs and found a couple of issues. As pointed away there was a loop of reducers being preempted and ramped up again and again.
          Firstly, we noticed that the AM was always ramping up and never ramping down reducers. So we thought we could add a configuration which decides when the maps are starved, and not ramp up reducers if maps are starving. This would ensure that maps get more of a chance to be assigned in the above scenario.
          Secondly, when we ramp down all the scheduled reduces, we were not updating the ask, and hence the RM kept allocating resources for reducers (which were later rejected by the AM) even though it could have assigned these resources to mappers straight away.

          varun_saxena Varun Saxena added a comment -

          Sorry meant "As pointed out there was a loop of reducers being preempted and ramped up again and again."

          chen317 chong chen added a comment -

          How to re-schedule failed/killed tasks and how to account for the task exit reason are two different things.

          In your case the node is not healthy, which is a typical abnormal cause of task failure, and a low-probability event in a healthy cluster. For a small set of map task reruns, we should be smart enough to let them complete quickly rather than going through this heavy reducer ramp-up/ramp-down flow, because it not only slows down overall job scheduling throughput but also adds unnecessary load on the YARN core scheduler. Workload requests (over 600 reducers) have already been submitted to the system; for a small set of map tasks, having the AM ramp down all the reducers, push those few mappers to the front of the queue to get them scheduled, and then gradually re-submit the reducers is not an efficient way to handle things. It generates unnecessary load on the core scheduler. YARN is the central brain of a big data system and manages large-scale multi-tenant clusters; the design philosophy should always keep that in mind and try to reduce unnecessary load on the core.

          I think what you discovered later is a problem, and we need to correct it. But for this particular case, I still prefer treating these as abnormal failures and bumping up the task priority.

          chen317 chong chen added a comment -

          Another way to think about this: the current reducer ramp-up/ramp-down is designed to handle the normal case, like this one where the slowstart.completedmaps config was 0.05. Once all mapper tasks are scheduled and allocated, the AM has already submitted all reducers to the system. At this stage, it is natural to handle a failed mapper as an abnormal case rather than resetting the whole thing and going through ramp-up/ramp-down again.

          sunilg Sunil G added a comment -

          In my opinion, the failure of the node is not an issue caused by the job (or AM). It is a case where a node went down due to some other problem (an OS bug or maintenance work). I feel it is better not to count such cases as a task attempt failure, because that can ultimately result in the job failing. (A problem in the cluster/YARN need not count towards any job/application failure counts.)
          So handling the bug along the lines of "do not ramp up reducers when there is a hanging map" seems like a better approach here. Thoughts?

          chen317 chong chen added a comment -

          How to account for task failure and how to re-schedule tasks are two different things; I don't understand why we have to tie the two together. This seems to be a design limitation. Clearly, for this case, raising the priority is the optimal solution. Since the AM has already finished ramping up the reducers once (651 reducers), repeating that process means ramping the whole thing down and gradually ramping up again, which generates another round of communication overhead between the AM and the RM/scheduler.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Went through the discussion. Here's what we should do, mostly agreeing with what chong chen says.

          • Node failure should not be counted towards the task-attempt failure count. So, yes, let's continue to mark such tasks as killed.
          • Rescheduling of this killed task can (and must) take higher priority independent of whether it is marked as killed or failed. In fact, this is how we originally designed the failed-map-should-have-higher-priority concept. In spirit, fail-fast-map actually meant maps which retroactively failed, like in this case.

          Varun Saxena, I can take a stab at this if you don't have cycles. Let me know either way.

          IAC, this has been a long-standing problem (though I'm very surprised nobody caught it till now), so I'd propose we move this out to 2.7.3 so I can make progress on the 2.7.2 release. Thoughts? /cc Bob.zhao

          varun_saxena Varun Saxena added a comment -

          Thanks Vinod Kumar Vavilapalli for your input.

          During an offline chat with Varun Vasudev, Sunil G and Rohith Sharma K S yesterday, this JIRA came up for discussion, and we too were in general agreement with chong chen that we should not mix up rescheduling-with-higher-priority and task failure. If a node becomes unusable, since the maps had already completed, they should be taken up again immediately, and setting a higher priority achieves that. We can still avoid marking this as a failed attempt, though.

          I was in fact about to raise a JIRA to handle that separately, to draw attention to this issue.
          But based on your comment on MAPREDUCE-6514, let's move what I was planning to do here over there, so that we can discuss it further. If required, one more JIRA can be raised.

          And we can adopt that approach here.
          I think I will get cycles for this, as this issue came from our customer.

          Also, I think there is no need to hold up 2.7.2 for this and we can move it to 2.7.3. Bob.zhao should be OK with this as well, as he is on my team. If required, i.e. if we decide not to use 2.7.3 or 2.7.3 is late, I will merge this into our internal branch.

          rohithsharma Rohith Sharma K S added a comment -

          I think release 2.7.2 need not be held up because of this issue, since it appears very rarely and is very hard to reproduce!! That said, if a solution is ready, agreed and available, then it is fine to keep it in 2.7.2. I am fine with either way too!!

          Coming back to the issue discussion,

          Rescheduling of this killed task can (and must) take higher priority independent of whether it is marked as killed or failed

          This is the best way to solve it. It also covers another, so far uncovered, scenario that can lead to the same issue, i.e. when completed OR running tasks are killed using the MR client. While trying to reproduce the current issue, I killed completed tasks using the MR client, and for 3-4 iterations ramping up happened similarly to this issue, but at some point the calculations went from ABNORMAL back to NORMAL!!

          And one of the challenges is regression. Even though increasing the priority solves the hang in one way, I am wondering whether configuring the slow-start value differently could still cause a hang, i.e. going into a loop. Any thoughts?

          varun_saxena Varun Saxena added a comment -

          Vinod Kumar Vavilapalli, attaching an initial patch. Kindly review.

          This patch primarily does the following:

          1. When an unusable node is reported, task attempt kill events are sent for the completed and running map tasks which ran on that node. A flag has been added to this event to indicate whether the next task attempt will be rescheduled (scheduled with the higher priority of 5). For an unusable node it is marked to be rescheduled. If a task attempt is killed by the client, it will not be rescheduled with higher priority. I am not 100% convinced whether a user-initiated kill should lead to a higher priority. Your thoughts on this? (A minimal sketch of the flag idea follows this list.)
          2. This reschedule flag is then forwarded to TaskImpl in the attempt-killed event, after the killing of the attempt is complete.
          3. Based on this flag, the task will create a new attempt and send a TA_RESCHEDULE or TA_SCHEDULE event while processing the attempt-kill event. As it is a kill event, it is not counted towards failed attempts. If the attempt has to be rescheduled, TaskAttemptImpl will send a container request event to RMContainerAllocator. From here on, this will be treated like a failed map and hence its priority will be 5. As with failed maps, node or rack locality is not ensured; node locality cannot be ensured anyway until the node comes back up.
          4. As on recovery we only consider SUCCESSFUL tasks, I think we need not update this flag in the history file.
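          A minimal, framework-free sketch of the flag idea from step 1; the class and method names here are illustrative and are not the classes touched by the patch:

          // Hypothetical sketch only: a kill event that records whether the next attempt
          // should be re-requested at the higher (fast-fail) priority.
          class KillEventSketch {
            private final String attemptId;
            private final boolean rescheduleNextAttempt;

            KillEventSketch(String attemptId, boolean rescheduleNextAttempt) {
              this.attemptId = attemptId;
              this.rescheduleNextAttempt = rescheduleNextAttempt;
            }

            String getAttemptId() {
              return attemptId;
            }

            // A TaskImpl-equivalent would consult this when creating the next attempt.
            boolean shouldRescheduleWithHigherPriority() {
              return rescheduleNextAttempt;
            }
          }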
          leftnoteasy Wangda Tan added a comment -

          I linked MAPREDUCE-6541 to this JIRA, they're different fixes for similar issues.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Tx for the update, Varun Saxena!

          Apologies for missing your updated patch for this long!

          (Reviewing an MR patch after a looong time!)

          First up, the patch doesn't apply anymore, can you please update?

          I tried to review it despite the conflicts, some comments:

          • The logic looks good overall! You are right that user initiated kill should not lead to a higher priority.
          • We want to be sure that existing semantics in RMContainerAllocator about failed-maps are really about task-attempts that need to be rescheduled and not just failed-maps. I briefly looked, but it will be good for you to also reverify!
          • TestTaskAttempt.java
            • Most (all?) of the code can be reused between testContainerKillOnNew and testContainerKillOnUnassigned.
            • Also, in the existing tests we should leave rescheduleAttempt as false except in the new one, testKillMapTaskAfterSuccess. You have enough coverage elsewhere that we should simply drop these changes except for the new tests.
          • TestMRApp.java.testUpdatedNodes: Instead of checking for reschedule events, is it possible to explicitly check for the higher priority of the corresponding request?
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Varun Saxena, let me know if you can update this soon enough for 2.7.3, i.e. in a couple of days. Otherwise, we can simply move this to 2.8 in a few weeks.

          varun_saxena Varun Saxena added a comment -

          Vinod Kumar Vavilapalli, sorry, I was taken over by some internal work so I could not update it.
          I will update the patch by tomorrow.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
          +1 mvninstall 6m 41s trunk passed
          +1 compile 0m 18s trunk passed with JDK v1.8.0_77
          +1 compile 0m 23s trunk passed with JDK v1.7.0_95
          +1 checkstyle 0m 23s trunk passed
          +1 mvnsite 0m 28s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 0m 43s trunk passed
          +1 javadoc 0m 15s trunk passed with JDK v1.8.0_77
          +1 javadoc 0m 17s trunk passed with JDK v1.7.0_95
          +1 mvninstall 0m 22s the patch passed
          +1 compile 0m 16s the patch passed with JDK v1.8.0_77
          +1 javac 0m 16s the patch passed
          +1 compile 0m 21s the patch passed with JDK v1.7.0_95
          +1 javac 0m 21s the patch passed
          -1 checkstyle 0m 20s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app: patch generated 2 new + 548 unchanged - 1 fixed = 550 total (was 549)
          +1 mvnsite 0m 25s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          -1 whitespace 0m 0s The patch has 1 line(s) with tabs.
          +1 findbugs 0m 52s the patch passed
          +1 javadoc 0m 12s the patch passed with JDK v1.8.0_77
          +1 javadoc 0m 14s the patch passed with JDK v1.7.0_95
          +1 unit 9m 9s hadoop-mapreduce-client-app in the patch passed with JDK v1.8.0_77.
          +1 unit 9m 48s hadoop-mapreduce-client-app in the patch passed with JDK v1.7.0_95.
          +1 asflicense 0m 21s Patch does not generate ASF License warnings.
          33m 23s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:fbe3e86
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12797702/MAPREDUCE-6513.02.patch
          JIRA Issue MAPREDUCE-6513
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 568c8ea75ff0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 594c70f
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6420/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
          whitespace https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6420/artifact/patchprocess/whitespace-tabs.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6420/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6420/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
          +1 mvninstall 6m 43s trunk passed
          +1 compile 0m 20s trunk passed with JDK v1.8.0_77
          +1 compile 0m 23s trunk passed with JDK v1.7.0_95
          +1 checkstyle 0m 22s trunk passed
          +1 mvnsite 0m 27s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 0m 44s trunk passed
          +1 javadoc 0m 16s trunk passed with JDK v1.8.0_77
          +1 javadoc 0m 19s trunk passed with JDK v1.7.0_95
          +1 mvninstall 0m 22s the patch passed
          +1 compile 0m 16s the patch passed with JDK v1.8.0_77
          +1 javac 0m 16s the patch passed
          +1 compile 0m 20s the patch passed with JDK v1.7.0_95
          +1 javac 0m 20s the patch passed
          -1 checkstyle 0m 21s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app: patch generated 1 new + 548 unchanged - 1 fixed = 549 total (was 549)
          +1 mvnsite 0m 25s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 0m 54s the patch passed
          +1 javadoc 0m 13s the patch passed with JDK v1.8.0_77
          +1 javadoc 0m 15s the patch passed with JDK v1.7.0_95
          +1 unit 9m 10s hadoop-mapreduce-client-app in the patch passed with JDK v1.8.0_77.
          +1 unit 9m 47s hadoop-mapreduce-client-app in the patch passed with JDK v1.7.0_95.
          +1 asflicense 0m 17s Patch does not generate ASF License warnings.
          33m 40s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:fbe3e86
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12797710/MAPREDUCE-6513.03.patch
          JIRA Issue MAPREDUCE-6513
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 81a3b046ca4a 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 594c70f
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6421/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6421/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6421/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          varun_saxena Varun Saxena added a comment -

          To fix the checkstyle issue I would need to change the indentation of surrounding code that does not otherwise need to change, so I have left it as it is.

          Regarding checking for the priority instead of the reschedule event: the priority is set in RMContainerAllocator, and TestMRApp uses a custom allocator, so we cannot check it there.
          We can, however, check ContainerRequestEvent and see whether the flag indicating that the earlier map task attempt failed is set. If it is set, RMContainerAllocator will set the priority of the next map task to 5.
          And we have coverage in TestRMContainerAllocator for that part of the flow.

          Show
          varun_saxena Varun Saxena added a comment - For checkstyle issue to be fixed I would need to change indentation of surrounding code which is not required to be changed. So I have left it as it is. Regarding checking for priority as compared to rescheduled event, well the priority is set in RMContainerAllocator. In TestMRApp, there is a custom allocator so we cannot check that. We can however check ContainerRequestEvent and see if the flag for earlier map task-attempt failed is set or not. If its set RMContainerAllocator will set the priority of next map task to 5. And we have coverage in TestRMContainerAllocator for that part of the flow.
          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          I'm doing a final pass of the review; in the meanwhile, Wangda Tan, can you take a look too?

          leftnoteasy Wangda Tan added a comment -

          Patch looks good to me, thanks Varun Saxena!

          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9613 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9613/)
          MAPREDUCE-6513. MR job got hanged forever when one NM unstable for some (wangda: rev 8b2880c0b62102fc5c8b6962752f72cb2c416a01)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskTAttemptKilledEvent.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
          Hide
          leftnoteasy Wangda Tan added a comment -

          Committed to branch-2 / trunk.

          Thanks Varun Saxena for working on the patch, and thanks Devaraj K/chong chen/Sunil G/Vinod Kumar Vavilapalli/Rohith Sharma K S for reviews!

          Rebased & attached patch for branch-2.8, pending Jenkins.

          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 4s MAPREDUCE-6513 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12798789/MAPREDUCE-6513-1-branch-2.8.patch
          JIRA Issue MAPREDUCE-6513
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6430/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 11m 59s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
          +1 mvninstall 9m 7s branch-2.8 passed
          +1 compile 0m 21s branch-2.8 passed with JDK v1.8.0_77
          +1 compile 0m 22s branch-2.8 passed with JDK v1.7.0_95
          +1 checkstyle 0m 26s branch-2.8 passed
          +1 mvnsite 0m 30s branch-2.8 passed
          +1 mvneclipse 0m 18s branch-2.8 passed
          +1 findbugs 0m 54s branch-2.8 passed
          +1 javadoc 0m 15s branch-2.8 passed with JDK v1.8.0_77
          +1 javadoc 0m 18s branch-2.8 passed with JDK v1.7.0_95
          +1 mvninstall 0m 22s the patch passed
          +1 compile 0m 16s the patch passed with JDK v1.8.0_77
          +1 javac 0m 16s the patch passed
          +1 compile 0m 20s the patch passed with JDK v1.7.0_95
          +1 javac 0m 20s the patch passed
          -1 checkstyle 0m 19s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app: patch generated 3 new + 560 unchanged - 3 fixed = 563 total (was 563)
          +1 mvnsite 0m 24s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 0m 58s the patch passed
          +1 javadoc 0m 13s the patch passed with JDK v1.8.0_77
          +1 javadoc 0m 15s the patch passed with JDK v1.7.0_95
          -1 unit 8m 40s hadoop-mapreduce-client-app in the patch failed with JDK v1.8.0_77.
          -1 unit 9m 25s hadoop-mapreduce-client-app in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 25s Patch does not generate ASF License warnings.
          47m 20s



          Reason Tests
          JDK v1.8.0_77 Failed junit tests hadoop.mapreduce.v2.app.TestMRApp
          JDK v1.7.0_95 Failed junit tests hadoop.mapreduce.v2.app.TestMRApp



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:c60792e
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12798806/MAPREDUCE-6513.3.branch-2.8.patch
          JIRA Issue MAPREDUCE-6513
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux d3074c6027b3 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.8 / 8b1e784
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6431/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
          unit https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6431/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.8.0_77.txt
          unit https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6431/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6431/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.8.0_77.txt https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6431/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6431/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6431/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          Hide
          leftnoteasy Wangda Tan added a comment -

          Committed MAPREDUCE-4785 to branch-2.7/branch-2.8. Attached a new patch.

          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
          +1 mvninstall 8m 43s branch-2.8 passed
          +1 compile 0m 18s branch-2.8 passed with JDK v1.8.0_77
          +1 compile 0m 21s branch-2.8 passed with JDK v1.7.0_95
          +1 checkstyle 0m 26s branch-2.8 passed
          +1 mvnsite 0m 28s branch-2.8 passed
          +1 mvneclipse 0m 19s branch-2.8 passed
          +1 findbugs 0m 53s branch-2.8 passed
          +1 javadoc 0m 15s branch-2.8 passed with JDK v1.8.0_77
          +1 javadoc 0m 17s branch-2.8 passed with JDK v1.7.0_95
          +1 mvninstall 0m 22s the patch passed
          +1 compile 0m 15s the patch passed with JDK v1.8.0_77
          +1 javac 0m 15s the patch passed
          +1 compile 0m 19s the patch passed with JDK v1.7.0_95
          +1 javac 0m 19s the patch passed
          -1 checkstyle 0m 19s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app: patch generated 3 new + 560 unchanged - 3 fixed = 563 total (was 563)
          +1 mvnsite 0m 23s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 0m 51s the patch passed
          +1 javadoc 0m 12s the patch passed with JDK v1.8.0_77
          +1 javadoc 0m 15s the patch passed with JDK v1.7.0_95
          +1 unit 8m 38s hadoop-mapreduce-client-app in the patch passed with JDK v1.8.0_77.
          +1 unit 9m 20s hadoop-mapreduce-client-app in the patch passed with JDK v1.7.0_95.
          +1 asflicense 0m 17s Patch does not generate ASF License warnings.
          34m 35s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:c60792e
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12798844/MAPREDUCE-6513.3_1.branch-2.8.patch
          JIRA Issue MAPREDUCE-6513
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 7c4e45f0f19e 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.8 / 8da0a49
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6432/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6432/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6432/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          Hide
          leftnoteasy Wangda Tan added a comment -

          Committed to branch-2.8.

          We need to backport MAPREDUCE-5817 to branch-2.7 before this patch; otherwise it will cause a couple of conflicts. Waiting for suggestions from Sangjin and Karthik on backporting MAPREDUCE-5817.

          Hide
          leftnoteasy Wangda Tan added a comment -

          Rebased branch-2.7 patch.

          Since MAPREDUCE-6513 is on top of MAPREDUCE-5465, and the scope of MAPREDUCE-5465 seems too big to pull into branch-2.7, I just manually resolved a couple of conflicts. Ran the related unit tests; all passed.

          Varun Saxena, Vinod Kumar Vavilapalli, could you take a final look at attached patch?

          Thanks,

          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 8m 37s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
          +1 mvninstall 8m 5s branch-2.7 passed
          +1 compile 0m 16s branch-2.7 passed with JDK v1.8.0_77
          +1 compile 0m 21s branch-2.7 passed with JDK v1.7.0_95
          +1 checkstyle 0m 41s branch-2.7 passed
          +1 mvnsite 0m 29s branch-2.7 passed
          +1 mvneclipse 0m 19s branch-2.7 passed
          +1 findbugs 0m 50s branch-2.7 passed
          +1 javadoc 0m 16s branch-2.7 passed with JDK v1.8.0_77
          +1 javadoc 0m 16s branch-2.7 passed with JDK v1.7.0_95
          +1 mvninstall 0m 21s the patch passed
          +1 compile 0m 15s the patch passed with JDK v1.8.0_77
          -1 javac 2m 24s hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.8.0_77 with JDK v1.8.0_77 generated 1 new + 84 unchanged - 0 fixed = 85 total (was 84)
          +1 javac 0m 15s the patch passed
          +1 compile 0m 18s the patch passed with JDK v1.7.0_95
          -1 javac 2m 42s hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.7.0_95 with JDK v1.7.0_95 generated 1 new + 85 unchanged - 0 fixed = 86 total (was 85)
          +1 javac 0m 18s the patch passed
          -1 checkstyle 0m 35s hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app: patch generated 33 new + 1673 unchanged - 2 fixed = 1706 total (was 1675)
          +1 mvnsite 0m 23s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          -1 whitespace 0m 0s The patch has 2871 line(s) that end in whitespace. Use git apply --whitespace=fix.
          -1 whitespace 1m 11s The patch has 303 line(s) with tabs.
          +1 findbugs 0m 50s the patch passed
          +1 javadoc 0m 11s the patch passed with JDK v1.8.0_77
          +1 javadoc 0m 14s the patch passed with JDK v1.7.0_95
          -1 unit 8m 5s hadoop-mapreduce-client-app in the patch failed with JDK v1.8.0_77.
          +1 unit 8m 46s hadoop-mapreduce-client-app in the patch passed with JDK v1.7.0_95.
          -1 asflicense 0m 57s Patch generated 67 ASF License warnings.
          43m 47s



          Reason Tests
          JDK v1.8.0_77 Failed junit tests hadoop.mapreduce.v2.app.TestRuntimeEstimators



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:c420dfe
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12799439/MAPREDUCE-6513.3_1.branch-2.7.patch
          JIRA Issue MAPREDUCE-6513
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 5f68ddfc0c13 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.7 / cc6ae6f
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          javac hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.8.0_77: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/artifact/patchprocess/diff-compile-javac-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.8.0_77.txt
          javac hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.7.0_95: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/artifact/patchprocess/diff-compile-javac-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.7.0_95.txt
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/artifact/patchprocess/diff-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
          whitespace https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/artifact/patchprocess/whitespace-eol.txt
          whitespace https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/artifact/patchprocess/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.8.0_77.txt
          unit test logs https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app-jdk1.8.0_77.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/testReport/
          asflicense https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6448/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          Hide
          jianhe Jian He added a comment -

          It looks like TaskAttemptKillEvent will be sent twice for each mapper.
          The first is at the code below in RMContainerAllocator#handleUpdatedNodes; JobImpl will in turn send a TaskAttemptKillEvent for each mapper that ran on the unusable node.

                // send event to the job to act upon completed tasks
                eventHandler.handle(new JobUpdatedNodesEvent(getJob().getID(),
                    updatedNodes));
          

          The second is sent by this code in the same method:

                      // If map, reschedule next task attempt.
                      boolean rescheduleNextAttempt = (i == 0) ? true : false;
                      eventHandler.handle(new TaskAttemptKillEvent(tid,
                          "TaskAttempt killed because it ran on unusable node"
                              + taskAttemptNodeId, rescheduleNextAttempt));
                    }
          

          This is how it has been for a long time; I'm not sure why. With the new change, will this cause more container requests to get scheduled?

          Hide
          varun_saxena Varun Saxena added a comment -

          Jian He, the former is for rescheduling completed maps (as their output may be unusable) and the latter is for currently assigned maps.
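          A rough, self-contained sketch of those two paths follows. It is not the actual RMContainerAllocator#handleUpdatedNodes code: the event names follow the snippets quoted above, and every type and helper here is a stand-in for illustration.

            import java.util.Arrays;
            import java.util.List;

            // All types below are stand-ins; the real code dispatches real events through an EventHandler.
            final class UnusableNodeHandlingSketch {

              /** Stand-in for the AM event bus. */
              private static void dispatch(String event) {
                System.out.println("dispatch: " + event);
              }

              static void handleUpdatedNodes(List<String> unusableNodes,
                                             List<String> assignedMaps,
                                             List<String> assignedReduces) {
                // Path 1: JobUpdatedNodesEvent -> JobImpl kills the *completed* map attempts that
                // ran on these nodes (their output may be unusable) and reschedules them.
                dispatch("JobUpdatedNodesEvent" + unusableNodes);

                // Path 2: TaskAttemptKillEvent for attempts currently *assigned* to containers on
                // these nodes; only map attempts request rescheduling of the next attempt.
                for (String tid : assignedMaps) {
                  dispatch("TaskAttemptKillEvent(" + tid + ", rescheduleNextAttempt=true)");
                }
                for (String tid : assignedReduces) {
                  dispatch("TaskAttemptKillEvent(" + tid + ", rescheduleNextAttempt=false)");
                }
              }

              public static void main(String[] args) {
                handleUpdatedNodes(Arrays.asList("nodeA:26009"),
                    Arrays.asList("map_attempt_0"),
                    Arrays.asList("reduce_attempt_0"));
              }
            }

          In other words, the two dispatch sites target disjoint sets of attempts (completed versus currently assigned), so no attempt is killed twice.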

          Hide
          varun_saxena Varun Saxena added a comment -

          Wangda Tan, I will check the 2.7 patch and let you know.

          Hide
          varun_saxena Varun Saxena added a comment -

          The rebased 2.7 patch LGTM.

          Hide
          jianhe Jian He added a comment -

          Committed to branch-2.7, thanks Wangda!
          Thanks Varun for reviewing the patch!

          Hide
          leftnoteasy Wangda Tan added a comment -

          Credit to Varun Saxena for working on this patch!

          Hide
          varun_saxena Varun Saxena added a comment -

          Thanks Wangda Tan, Jian He and Vinod Kumar Vavilapalli for the review and commit.
          Thanks chong chen, Devaraj K, Rohith Sharma K S and Sunil G for the additional reviews and discussions.

          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Closing the JIRA as part of 2.7.3 release.


            People

            • Assignee:
              varun_saxena Varun Saxena
            • Reporter:
              Jobo Bob.zhao
            • Votes:
              0
            • Watchers:
              26
