Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-6944

MR job got hanged forever when some NMs unstable for some time

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 2.7.2
    • Fix Version/s: None
    • Labels:
      None

      Description

      We encountered several jobs in the production environment due to the fact that some of the NM unstable cause one MAP of the job to be stuck there, and the job can't finish properly.
      However, the problems we encountered were different from those mentioned in https://issues.apache.org/jira/browse/MAPREDUCE-6513. Because in our scenario, all of MR REDUCEs does not start executing.
      But when I manually kill the hanged MAP, the job will be finished normally.

      2017-08-17 12:25:06,548 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
      2017-08-17 12:25:07,555 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e84_1502793246072_73922_01_015700
      2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:2218677, vCores:2225>
      2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
      2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:1009 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:15563 CompletedReds:0 ContAlloc:15723 ContRel:26 HostLocal:4575 RackLocal:8121
      
      2017-08-17 14:49:41,793 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:1009 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:15563 CompletedReds:0 ContAlloc:15724 ContRel:26 HostLocal:4575 RackLocal:8121
      2017-08-17 14:49:41,794 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Applying ask limit of 1 for priority:5 and capability:<memory:1024, vCores:1>
      2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1502793246072_73922: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:1711989, vCores:1688> knownNMs=4236
      2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:1711989, vCores:1688>
      2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
      2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
      2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigning container Container: [ContainerId: container_e84_1502793246072_73922_01_015726, NodeId: bigdata-hdp-apache1960.xg01.diditaxi.com:8041, NodeHttpAddress: bigdata-hdp-apache1960.xg01.diditaxi.com:8042, Resource: <memory:1024, vCores:1>, Priority: 5, Token: Token { kind: ContainerToken, service: 10.93.111.36:8041 }, ] to fast fail map
      2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned from earlierFailedMaps
      2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_e84_1502793246072_73922_01_015726 to attempt_1502793246072_73922_m_012103_5
      2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:1727349, vCores:1703>
      2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
      2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:1009 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:2 AssignedReds:0 CompletedMaps:15563 CompletedReds:0 ContAlloc:15725 ContRel:26 HostLocal:4575 RackLocal:8121
      
      !screenshot-1.png!
      

        Attachments

        1. screenshot-1.png
          98 kB
          YunFan Zhou

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              daemon YunFan Zhou
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated: