Hadoop Map/Reduce: MAPREDUCE-6485

MR job hangs forever because all resources are taken up by reducers and the last map attempt never gets a resource to run

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.4.1, 2.6.0, 2.7.1, 3.0.0-alpha1
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: applicationmaster
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The scenario is like this:
      With mapreduce.job.reduce.slowstart.completedmaps=0.8, reduces acquire resources and start to run before all of the maps have finished.
      It can then happen that all cluster resources are taken up by running reduces while one map is still unfinished.
      Under this condition, the last map has two task attempts.
      The first attempt was killed due to timeout (mapreduce.task.timeout); its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP and then to FAILED, but the failed map attempt is not rescheduled because a speculative map attempt is still in progress.
      The second attempt, started because map task speculation is enabled, is stuck in the UNASSIGNED state because no resource is available. Since the second map attempt's request has a lower priority than the reduces, no preemption happens.
      As a result none of the reduces can finish because one map is left, and the last map hangs because no resource is available, so the job never finishes.
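      For reference, a minimal sketch of the job configuration behind this scenario (the property keys are the standard MapReduce ones; the concrete values are assumptions for illustration only):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.mapreduce.Job;

      public class SlowstartJobSetup {
        public static Job createJob() throws Exception {
          Configuration conf = new Configuration();
          // Start reducers once 80% of the maps have completed.
          conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.8f);
          // Map speculation is on, so a second (speculative) attempt can be
          // spawned for a straggling map; it is requested at normal map priority.
          conf.setBoolean("mapreduce.map.speculative", true);
          // A running attempt that makes no progress for this long is killed.
          conf.setLong("mapreduce.task.timeout", 600000L); // 600 secs, as in the logs below
          return Job.getInstance(conf, "slowstart-repro");
        }
      }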

      1. MAPREDUCE-6485.001.patch
        3 kB
        Xianyin Xin
      2. MAPREDUCE-6485.004.patch
        10 kB
        Xianyin Xin
      3. MAPREDUCE-6485.005.patch
        10 kB
        Xianyin Xin
      4. MAPREDUCE-6485.006.patch
        10 kB
        Xianyin Xin
      5. MAPREDUCE-6845.002.patch
        11 kB
        Xianyin Xin
      6. MAPREDUCE-6845.003.patch
        11 kB
        Xianyin Xin

        Activity

        varun_saxena Varun Saxena added a comment -

        Do you have the AM logs?

        Jobo Bob.zhao added a comment -

        @Varun Saxena Thanks for checking this issue. Below are some related logs you can refer to.
        1. All logs in the AM about the first attempt:

        03:30:32,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from NEW to UNASSIGNED
        03:33:22,037 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_003425 to attempt_1439640536651_4710_m_002807_0
        03:33:22,038 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED
        03:33:22,044 INFO [ContainerLauncher #27] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_1439640536651_4710_01_003425 taskAttempt attempt_1439640536651_4710_m_002807_0
        03:33:22,044 INFO [ContainerLauncher #27] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1439640536651_4710_m_002807_0
        03:33:22,071 INFO [ContainerLauncher #27] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port returned by ContainerManager for attempt_1439640536651_4710_m_002807_0 : 26008
        03:33:22,071 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1439640536651_4710_m_002807_0] using containerId: [container_1439640536651_4710_01_003425 on NM: [SCCHDPHIV02129:26009]
        03:33:22,071 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from ASSIGNED to RUNNING
        03:33:31,481 INFO [IPC Server handler 24 on 27102] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1439640536651_4710_m_003425 given task: attempt_1439640536651_4710_m_002807_0
        03:43:31,473 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439640536651_4710_m_002807_0: AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs
        03:43:31,473 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439640536651_4710_m_002807_0: AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs
        03:43:31,474 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
        03:43:31,474 INFO [ContainerLauncher #23] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1439640536651_4710_01_003425 taskAttempt attempt_1439640536651_4710_m_002807_0
        03:43:31,474 INFO [ContainerLauncher #23] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1439640536651_4710_m_002807_0
        03:43:31,478 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP
        03:43:31,478 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED
        03:43:32,701 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439640536651_4710_m_002807_0: Container killed by the ApplicationMaster.
        

        2. All logs in the AM about the second attempt (only one log entry):

        03:39:55,339 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_1 TaskAttempt Transitioned from NEW to UNASSIGNED
        

        3. Checking the log timestamps, we can see that after the second attempt started, all later available resources were allocated to reduces, not to the last map attempt:

        03:39:55,978 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015149 to attempt_1439640536651_4710_r_000669_0
        03:39:55,978 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:330 AssignedMaps:1 AssignedReds:669 CompletedMaps:14257 CompletedReds:0 ContAlloc:14958 ContRel:30 HostLocal:14221 RackLocal:38
        03:40:01,004 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
        03:40:01,004 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
        03:40:01,004 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015150 to attempt_1439640536651_4710_r_000670_0
        03:40:01,004 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:329 AssignedMaps:1 AssignedReds:670 CompletedMaps:14257 CompletedReds:0 ContAlloc:14959 ContRel:30 HostLocal:14221 RackLocal:38
        03:40:05,025 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
        03:40:05,025 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
        03:40:05,025 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015151 to attempt_1439640536651_4710_r_000671_0
        03:40:05,025 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:328 AssignedMaps:1 AssignedReds:671 CompletedMaps:14257 CompletedReds:0 ContAlloc:14960 ContRel:30 HostLocal:14221 RackLocal:38
        03:40:15,078 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
        03:40:15,078 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
        03:40:15,140 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015152 to attempt_1439640536651_4710_r_000672_0
        03:40:15,140 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:327 AssignedMaps:1 AssignedReds:672 CompletedMaps:14257 CompletedReds:0 ContAlloc:14961 ContRel:30 HostLocal:14221 RackLocal:38
        03:40:19,161 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
        03:40:19,161 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
        03:40:19,161 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015153 to attempt_1439640536651_4710_r_000673_0
        03:40:19,161 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:326 AssignedMaps:1 AssignedReds:673 CompletedMaps:14257 CompletedReds:0 ContAlloc:14962 ContRel:30 HostLocal:14221 RackLocal:38
        03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1439640536651_4710_01_003425
        03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
        03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
        03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015154 to attempt_1439640536651_4710_r_000674_0
        03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:325 AssignedMaps:0 AssignedReds:674 CompletedMaps:14257 CompletedReds:0 ContAlloc:14963 ContRel:30 HostLocal:14221 RackLocal:38
        ....
        
        kasha Karthik Kambatla added a comment -

        I believe this is a duplicate of MAPREDUCE-6302, and an artifact of the scheduler's headroom calculation not being accurate.

        Please re-open if you disagree.

        xinxianyin Xianyin Xin added a comment -

        Hi Karthik Kambatla, the two are similar but not the same. In MAPREDUCE-6302, a map is marked as failed, so it can raise another resource request with higher priority than the reduces and thus can consume the preempted resource. In this case, the map is killed due to a timeout. In the current implementation, the killed map is not marked as failed, so it won't raise a new resource request and retry the map. It then fails completely and the whole job hangs.

        Reopening it now.

        rohithsharma Rohith Sharma K S added a comment -

        the killed map is not marked as failed, so it won't raise a new resource request and retry the map

        Timed-out task attempts are marked as FAILED. A new attempt will be launched with priority 5, i.e. the failed-map request priority.

        rohithsharma Rohith Sharma K S added a comment -

        Got the logs from Bob.zhao offline. One potential issue is when a speculative task attempt launches. Below is the flow of events that causes the problem.

        1. There are map tasks T1..Tn. Tasks T1..Tn-1 have finished and task attempt Tn is running.
        2. Reducers are scheduled (a ResourceRequest is sent to the RM). Some of the reducer attempts are running, so the Tn-th task plus the reducer tasks occupy the full cluster. 50 more reducer tasks are queued, i.e. the RM still has to assign 50 containers at priority 10, the reducer priority.
        3. A speculative attempt for map task Tn is spawned with priority 20 and a ResourceRequest is sent to the RM.
        4. Since reducers have higher priority, the RM does not assign any containers to the mapper's resource request (the speculative attempt).
        5. The running attempt of map task Tn times out for some reason, so that task attempt is marked as failed.
        6. But a new attempt with the failed-map priority WILL NOT be created, since there is already one attempt (the speculative one) counted as a scheduled map. The MR job hangs forever here.

        Correct me if I am wrong, but I think the MAPREDUCE-6302 solution will not fully solve this, because even if reducer preemption happens, all the scheduled reducer containers would have to be preempted, which means the RM first has to assign containers to the reducers and then preempt them. Only after all the queued reducer requests are served (since reducers have higher priority) would the mapper resource request be picked up for assignment.
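        For context, a minimal sketch of the request priorities discussed in this thread (5 for re-running a failed map, 10 for reduces, 20 for ordinary and speculative maps; a lower number means a higher YARN scheduling priority). This is for illustration only:

        import org.apache.hadoop.yarn.api.records.Priority;

        // Illustration only: a failed-map re-run (5) outranks reduces (10),
        // but a speculative map attempt (20) does not, so it starves while
        // the cluster is full of reducers.
        public class MrAmPriorities {
          static final Priority FAST_FAIL_MAP = Priority.newInstance(5);
          static final Priority REDUCE = Priority.newInstance(10);
          static final Priority MAP = Priority.newInstance(20);
        }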

        xinxianyin Xianyin Xin added a comment -

        Thanks Rohith Sharma K S, you are right. I find this can be fixed with a simple approach. Currently in TaskImpl, a failed task is only rescheduled when there are no inProgressAttempts and no successfulAttempt:

              if (task.failedAttempts.size() < task.maxAttempts) {
                task.handleTaskAttemptCompletion(
                    taskAttemptId, 
                    TaskAttemptCompletionEventStatus.FAILED);
                // we don't need a new event if we already have a spare
                task.inProgressAttempts.remove(taskAttemptId);
                if (task.inProgressAttempts.size() == 0
                    && task.successfulAttempt == null) {
                  task.addAndScheduleAttempt(Avataar.VIRGIN);
                }
              } else {
        

        In this case, the speculative attempt is in progress, so no new attempt will be created. However, the speculative attempt can't get a resource since its priority is 20 and the cluster is full of reduces. We can instead check whether all of the inProgressAttempts are in the SCHEDULED state (i.e. no in-progress attempt is really running). If so, it indicates that the speculative attempts are all waiting for resources, which means cluster resources are short. A failed attempt in this case should create a new attempt with a higher priority that can preempt resources, like this,

                if ((task.inProgressAttempts.size() == 0 || all_of_in_progress_attempts_are_waiting_for_resource)
                    && task.successfulAttempt == null) {
                  task.addAndScheduleAttempt(Avataar.VIRGIN);
                }
        

        thoughts?

        xinxianyin Xianyin Xin added a comment -

        Correction to the above comment: we should check whether all of the inProgressAttempts are in the UNASSIGNED state, rather than SCHEDULED.

        Jobo Bob.zhao added a comment -

        Xianyin Xin, thanks for your deep analysis. Now that we have found the root cause of this issue, could you provide a patch for it?

        xinxianyin Xianyin Xin added a comment -

        Uploaded a patch, please review.

        rohithsharma Rohith Sharma K S added a comment -

        Looked deeper into the reducer preemption code and found that even if headroom is available for assigning the map request, when pendingReducers is zero and there are still many scheduledReducers, neither preemption is triggered nor are the reducers ramped down. The RM keeps allocating containers to reducers, and if there are many scheduledReducers, at some point cluster resources are fully acquired by reducers. Say reducer memory is 5GB and mapper memory is 4GB, and 4GB of headroom is available into which one mapper could be assigned. But since there are more reducer requests to serve, the RM keeps skipping their assignment because the reducers' 5GB capacity is greater than the 4GB headroom.

        rohithsharma Rohith Sharma K S added a comment -

        Overall the patch looks good to me. Can you add a test to guard against regression?

        rohithsharma Rohith Sharma K S added a comment -

        nit: can you check for greater than rather than not equal?
        task.inProgressAttempts.size() != 0

        xinxianyin Xianyin Xin added a comment -

        Thanks for the review, Rohith Sharma K S.

        Can you add a test to guard against regression?

        Added a test case which mocks the scenario.

        nit: can you check for greater than rather than not equal?
        task.inProgressAttempts.size() != 0

        Fixed.

        rohithsharma Rohith Sharma K S added a comment -

        Thinking about it, TaskAttemptStateInternal is exposed only for testing, but here it is being used in TaskImpl. I am not sure how good it is to have it this way.
        How about adding a new method to TaskAttemptImpl which checks for the NEW and UNASSIGNED states and returns a boolean, like below. The newly exposed method can then be used in TaskImpl.

        // The meaning of "scheduled" in the life cycle of a task is that the
        // ResourceRequest has been sent to the RM but not yet assigned.
        private static final EnumSet<TaskAttemptStateInternal> SCHEDULED_TASK_STATES =
            EnumSet.of(TaskAttemptStateInternal.NEW, TaskAttemptStateInternal.UNASSIGNED);

        public boolean isScheduledTask() {
          return SCHEDULED_TASK_STATES.contains(getInternalState());
        }
        
        xinxianyin Xianyin Xin added a comment -

        Thanks Rohith Sharma K S, better solution! Attached new patch.

        xinxianyin Xianyin Xin added a comment -

        Discussed with Rohith Sharma K S offline, and changed the method of checking whether the taskAttempt is hanging waiting for a resource, following Rohith Sharma K S's suggestion. Now the check looks like this,

        boolean isContainerAssigned() {
          return container == null ? false : true;
        }
        

        rather than checking the internal states as in patch ver-003.
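        For illustration only, a sketch of how the failed-attempt handling in TaskImpl could use such a check to break the deadlock; this is an assumption about the shape of the fix, not the committed patch, and it presumes it sits inside TaskImpl's failed-attempt transition where task, inProgressAttempts and Avataar are in scope:

        // Sketch (hypothetical): reschedule a fresh attempt not only when no
        // attempt is in progress, but also when none of the in-progress
        // attempts has been assigned a container, i.e. all of them are starved.
        boolean anyAttemptAssigned = false;
        for (TaskAttemptId id : task.inProgressAttempts) {
          if (((TaskAttemptImpl) task.getAttempt(id)).isContainerAssigned()) {
            anyAttemptAssigned = true;
            break;
          }
        }
        if ((task.inProgressAttempts.size() == 0 || !anyAttemptAssigned)
            && task.successfulAttempt == null) {
          // The new attempt is requested at the failed-map priority, which can
          // trigger reducer preemption and unblock the job.
          task.addAndScheduleAttempt(Avataar.VIRGIN);
        }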

        xinxianyin Xianyin Xin added a comment -

        Kicking Jenkins.

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote Subsystem Runtime Comment
        0 pre-patch 19m 18s Pre-patch trunk compilation is healthy.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
        +1 javac 9m 6s There were no new javac warning messages.
        +1 javadoc 11m 55s There were no new javadoc warning messages.
        +1 release audit 0m 27s The applied patch does not increase the total number of release audit warnings.
        -1 checkstyle 0m 43s The applied patch generated 1 new checkstyle issues (total was 356, now 355).
        -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
        +1 install 1m 44s mvn install still works.
        +1 eclipse:eclipse 0m 38s The patch built with eclipse:eclipse.
        +1 findbugs 1m 18s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1 mapreduce tests 10m 5s Tests passed in hadoop-mapreduce-client-app.
            55m 18s  



        Subsystem Report/Notes
        Patch URL http://issues.apache.org/jira/secure/attachment/12764223/MAPREDUCE-6485.004.patch
        Optional Tests javadoc javac unit findbugs checkstyle
        git revision trunk / d6fa34e
        checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6032/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-app.txt
        whitespace https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6032/artifact/patchprocess/whitespace.txt
        hadoop-mapreduce-client-app test log https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6032/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
        Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6032/testReport/
        Java 1.7.0_55
        uname Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6032/console

        This message was automatically generated.

        xinxianyin Xianyin Xin added a comment -

        The checkstyle and whitespace warnings are not introduced by this patch. However, attaching ver-005 to fix the whitespace warning.

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote Subsystem Runtime Comment
        0 pre-patch 17m 11s Pre-patch trunk compilation is healthy.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
        +1 javac 7m 50s There were no new javac warning messages.
        +1 javadoc 10m 19s There were no new javadoc warning messages.
        +1 release audit 0m 24s The applied patch does not increase the total number of release audit warnings.
        -1 checkstyle 0m 34s The applied patch generated 1 new checkstyle issues (total was 356, now 355).
        +1 whitespace 0m 1s The patch has no lines that end in whitespace.
        +1 install 1m 27s mvn install still works.
        +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
        +1 findbugs 1m 9s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1 mapreduce tests 9m 40s Tests passed in hadoop-mapreduce-client-app.
            49m 13s  



        Subsystem Report/Notes
        Patch URL http://issues.apache.org/jira/secure/attachment/12764243/MAPREDUCE-6485.005.patch
        Optional Tests javadoc javac unit findbugs checkstyle
        git revision trunk / d6fa34e
        checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6034/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-app.txt
        hadoop-mapreduce-client-app test log https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6034/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
        Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6034/testReport/
        Java 1.7.0_55
        uname Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6034/console

        This message was automatically generated.

        kasha Karthik Kambatla added a comment - - edited

        Looks good. One minor comment - isContainerAssigned() - can we do the following please

        boolean isContainerAssigned() {
          return container != null;
        }
        
        xinxianyin Xianyin Xin added a comment -

        Thanks, Karthik Kambatla.

        isContainerAssigned() - can we do the following please

        boolean isContainerAssigned() {
          return container != null;
        }
        

        fixed.

        rohithsharma Rohith Sharma K S added a comment -

        +1 for the latest patch, pending Jenkins.

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote Subsystem Runtime Comment
        0 pre-patch 16m 41s Pre-patch trunk compilation is healthy.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
        +1 javac 7m 56s There were no new javac warning messages.
        +1 javadoc 10m 2s There were no new javadoc warning messages.
        +1 release audit 0m 24s The applied patch does not increase the total number of release audit warnings.
        -1 checkstyle 0m 36s The applied patch generated 1 new checkstyle issues (total was 356, now 355).
        +1 whitespace 0m 1s The patch has no lines that end in whitespace.
        +1 install 1m 29s mvn install still works.
        +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
        +1 findbugs 1m 9s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1 mapreduce tests 9m 39s Tests passed in hadoop-mapreduce-client-app.
            48m 34s  



        Subsystem Report/Notes
        Patch URL http://issues.apache.org/jira/secure/attachment/12764359/MAPREDUCE-6485.006.patch
        Optional Tests javadoc javac unit findbugs checkstyle
        git revision trunk / 071733d
        checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6039/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-app.txt
        hadoop-mapreduce-client-app test log https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6039/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
        Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6039/testReport/
        Java 1.7.0_55
        uname Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6039/console

        This message was automatically generated.

        rohithsharma Rohith Sharma K S added a comment -

        I will commit it tomorrow if there are no more comments on the patch.

        rohithsharma Rohith Sharma K S added a comment -

        Committing shortly..

        rohithsharma Rohith Sharma K S added a comment -

        Committed to branch-2/trunk. Thanks Xianyin Xin for the contribution, and Karthik Kambatla for the additional review.

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #8554 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8554/)
        MAPREDUCE-6485. Create a new task attempt with failed map task priority (rohithsharmaks: rev 439f43ad3defbac907eda2d139a793f153544430)

        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        • hadoop-mapreduce-project/CHANGES.txt
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #479 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/479/)
        MAPREDUCE-6485. Create a new task attempt with failed map task priority (rohithsharmaks: rev 439f43ad3defbac907eda2d139a793f153544430)

        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • hadoop-mapreduce-project/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #471 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/471/)
        MAPREDUCE-6485. Create a new task attempt with failed map task priority (rohithsharmaks: rev 439f43ad3defbac907eda2d139a793f153544430)

        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • hadoop-mapreduce-project/CHANGES.txt
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #1209 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1209/)
        MAPREDUCE-6485. Create a new task attempt with failed map task priority (rohithsharmaks: rev 439f43ad3defbac907eda2d139a793f153544430)

        • hadoop-mapreduce-project/CHANGES.txt
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2414 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2414/)
        MAPREDUCE-6485. Create a new task attempt with failed map task priority (rohithsharmaks: rev 439f43ad3defbac907eda2d139a793f153544430)

        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • hadoop-mapreduce-project/CHANGES.txt
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #445 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/445/)
        MAPREDUCE-6485. Create a new task attempt with failed map task priority (rohithsharmaks: rev 439f43ad3defbac907eda2d139a793f153544430)

        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        • hadoop-mapreduce-project/CHANGES.txt
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #2385 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2385/)
        MAPREDUCE-6485. Create a new task attempt with failed map task priority (rohithsharmaks: rev 439f43ad3defbac907eda2d139a793f153544430)

        • hadoop-mapreduce-project/CHANGES.txt
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
        xinxianyin Xianyin Xin added a comment -

        Thanks Rohith Sharma K S and Karthik Kambatla for review and valuable comments!

        daemon YunFan Zhou added a comment -

        Varun Saxena Rohith Sharma K S Karthik Kambatla, hi, could you please help me with a problem?
        The job I ran hung, but the scenario is different from the one you encountered. What I observed is that there are 15564 maps in total and only one map is pending; all reduces are pending because that map has not finished, even though the cluster and queue resources are largely idle. The job stayed pending for about 12 hours until I used the MR CLI to fail the map manually, after which the job finished normally.
        The AM log reports the following:

        2017-08-17 07:58:44,401 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
        2017-08-17 07:58:44,401 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
        2017-08-17 07:58:44,401 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:814012, vCores:-1>
        2017-08-17 07:58:44,401 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
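
        The "completedMapsForReduceSlowstart 15564" line above comes from the AM's reduce ramp-up logic: reduces are not scheduled until the number of completed maps reaches roughly ceil(slowstart * totalMaps). Below is a minimal, illustrative sketch of that calculation, assuming mapreduce.job.reduce.slowstart.completedmaps is effectively 1.0 for this job; the class name and values are hypothetical and this is not the actual RMContainerAllocator code.

        // Illustrative sketch only; not the actual RMContainerAllocator implementation.
        public class SlowstartThresholdSketch {
          public static void main(String[] args) {
            int totalMaps = 15564;      // total maps reported for this job
            double slowstart = 1.0;     // assumed mapreduce.job.reduce.slowstart.completedmaps
            // The AM holds back reduce requests until this many maps have completed.
            int completedMapsForReduceSlowstart = (int) Math.ceil(slowstart * totalMaps);
            // With slowstart = 1.0 the threshold equals the total map count (15564),
            // so a single unfinished map keeps every reduce pending.
            System.out.println("completedMapsForReduceSlowstart = "
                + completedMapsForReduceSlowstart);
          }
        }

        With these assumed values the printed threshold matches the 15564 seen in the log, which is consistent with all reduces staying pending while one map remains unfinished.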
        

          People

          • Assignee:
            xinxianyin Xianyin Xin
          • Reporter:
            Jobo Bob.zhao
          • Votes:
            0
          • Watchers:
            9
