Hadoop Map/Reduce: MAPREDUCE-5689

MRAppMaster does not preempt reducers when scheduled maps cannot be fulfilled

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 2.2.0
    • Fix Version/s: 0.23.11, 2.3.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We saw a corner case where jobs running on the cluster hung. The scenario was roughly as follows. A job was running within a pool that was at its capacity. All available containers were occupied by reducers and the last 2 mappers, and a few more reducers were waiting in the pipeline to be scheduled.
      At this point the two running mappers failed and went back to the scheduled state. The two freed containers were assigned to reducers, so the whole pool was now full of reducers waiting on the two maps to complete. The 2 maps never got scheduled because the pool was full.
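      The deadlock in this scenario comes down to a capacity check. The sketch below is a simplified, hypothetical model: the class and method names are illustrative, not Hadoop APIs. It counts containers, assuming one task per container and equal container sizes for maps and reduces, and computes how many running reducers would have to be preempted so that the pending maps fit. The pool size of 10 is made up for illustration.

```java
// Hypothetical, simplified model of the pool described above.
// Assumes one task per container and equal container sizes; this is
// not how RMContainerAllocator measures resources.
public class PoolDeadlockModel {

    // Number of running reducers that must be preempted so that all
    // pending (scheduled but unassigned) maps can get a container.
    static int reducersToPreempt(int capacity, int runningMaps,
                                 int runningReducers, int pendingMaps) {
        int free = capacity - runningMaps - runningReducers;
        int shortfall = pendingMaps - free;
        if (shortfall <= 0) {
            return 0; // enough free containers already
        }
        // Cannot preempt more reducers than are running.
        return Math.min(shortfall, runningReducers);
    }

    public static void main(String[] args) {
        // Hypothetical pool of 10 containers: after the two maps failed,
        // their containers went to reducers, so reducers hold all 10.
        int n = reducersToPreempt(10, 0, 10, 2);
        System.out.println("reducers to preempt: " + n); // prints "reducers to preempt: 2"
    }
}
```

      Without preemption the free count stays at zero, and because the reducers cannot finish until the maps do, no container is ever released on its own.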

      Ideally, reducer preemption should have kicked in to make room for the mappers, via this code in RMContainerAllocator:

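      int completedMaps = getJob().getCompletedMaps();
      int completedTasks = completedMaps + getJob().getCompletedReduces();
      // Only recompute the reduce schedule (and check for preemption) when
      // the number of completed tasks changed since the last heartbeat.
      if (lastCompletedTasks != completedTasks) {
        lastCompletedTasks = completedTasks;
        recalculateReduceSchedule = true;
      }

      if (recalculateReduceSchedule) {
        preemptReducesIfNeeded();
        // ... (rest of the block not shown in this report)
      }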

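      In this scenario, however, lastCompletedTasks always equals completedTasks, because the maps never complete (and the reducers cannot finish while they wait on them). This check therefore never sets recalculateReduceSchedule, the preemption path shown above is never reached, and the job hangs forever. As a workaround, killing a few reducers lets the mappers get scheduled, and the job then completes.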
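      The following hypothetical simulation isolates the trigger from the excerpt above. It reuses the excerpt's field names but replaces preemptReducesIfNeeded() with a counter, and it compares the original gating against an alternative that evaluates preemption on every heartbeat. That alternative is an assumption made for illustration; it is not necessarily what the attached patches do. The starting value of 5 completed tasks is also made up.

```java
// Hypothetical simulation of the heartbeat trigger from the excerpt above.
// preemptReducesIfNeeded() is replaced by a counter; this is not the
// real RMContainerAllocator.
public class PreemptionTriggerSim {
    int lastCompletedTasks;
    boolean recalculateReduceSchedule = false;
    int preemptCalls = 0;

    PreemptionTriggerSim(int initialCompletedTasks) {
        this.lastCompletedTasks = initialCompletedTasks;
    }

    void preemptReducesIfNeeded() {
        preemptCalls++; // stand-in for the real preemption check
    }

    // Mirrors the excerpt: preemption is only considered when the
    // completed-task count changed since the previous heartbeat.
    void heartbeatGated(int completedTasks) {
        if (lastCompletedTasks != completedTasks) {
            lastCompletedTasks = completedTasks;
            recalculateReduceSchedule = true;
        }
        if (recalculateReduceSchedule) {
            preemptReducesIfNeeded();
            recalculateReduceSchedule = false;
        }
    }

    // Assumed alternative: check preemption on every heartbeat and keep the
    // flag only for the (omitted) reduce ramp-up recalculation.
    void heartbeatUngated(int completedTasks) {
        if (lastCompletedTasks != completedTasks) {
            lastCompletedTasks = completedTasks;
            recalculateReduceSchedule = true;
        }
        preemptReducesIfNeeded();
        recalculateReduceSchedule = false; // ramp-up work omitted here
    }

    public static void main(String[] args) {
        // Hypothetical: 5 tasks had completed before the two maps failed,
        // and no task completes afterwards.
        PreemptionTriggerSim gated = new PreemptionTriggerSim(5);
        PreemptionTriggerSim ungated = new PreemptionTriggerSim(5);
        for (int i = 0; i < 10; i++) {
            gated.heartbeatGated(5);
            ungated.heartbeatUngated(5);
        }
        System.out.println("gated preemption checks: " + gated.preemptCalls);     // 0
        System.out.println("ungated preemption checks: " + ungated.preemptCalls); // 10
    }
}
```

      With the gated trigger the counter stays at 0 for as long as the job is stuck, which matches the hang described above. Checking on every heartbeat costs one extra check per heartbeat, but it cannot be starved by a lack of task completions.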

      Attachments

      1. MAPREDUCE-5689.1.patch (3 kB, Lohit Vijayarenu)
      2. MAPREDUCE-5689.2.patch (3 kB, Karthik Kambatla)


          People

          • Assignee:
            Lohit Vijayarenu
          • Reporter:
            Lohit Vijayarenu
          • Votes:
            0
          • Watchers:
            9
