Hadoop Common / HADOOP-5241

Reduce tasks get stuck because of over-estimated task size (regression from 0.18)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.19.2, 0.20.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Red Hat Enterprise Linux Server release 5.2
      JDK 1.6.0_11
      Hadoop 0.19.0

    • Hadoop Flags:
      Reviewed

      Description

      I have a simple MR benchmark job that computes PageRank on about 600 GB of HTML files using a 100-node cluster. For some reason, my reduce tasks get stuck in a pending state. The JobTracker's log fills up with the following messages:

      2009-02-12 15:47:29,839 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-59.cs.wisc.edu:localhost/127.0.0.1:33227 has 110125027328 bytes free; but we expect reduce input to take 399642198235
      2009-02-12 15:47:29,852 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-67.cs.wisc.edu:localhost/127.0.0.1:48626 has 107537776640 bytes free; but we expect reduce input to take 399642198235
      2009-02-12 15:47:29,885 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-73.cs.wisc.edu:localhost/127.0.0.1:58849 has 113631690752 bytes free; but we expect reduce input to take 399642198235
      <SNIP>
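
      To compare the two byte counts in these warnings across many nodes, the lines can be parsed mechanically. The following is a minimal sketch, not part of Hadoop: the class name NoRoomWarning and its methods are hypothetical, and the regular expression is written only against the three lines quoted above, so other Hadoop versions may word the message differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NoRoomWarning {
    // Assumption: the message format matches the log lines quoted in this report.
    private static final Pattern NO_ROOM = Pattern.compile(
        "No room for reduce task\\. Node (\\S+) has (\\d+) bytes free; "
        + "but we expect reduce input to take (\\d+)");

    final String node;
    final long freeBytes;
    final long expectedBytes;

    NoRoomWarning(String node, long freeBytes, long expectedBytes) {
        this.node = node;
        this.freeBytes = freeBytes;
        this.expectedBytes = expectedBytes;
    }

    /** Returns the parsed warning, or null if the line is not a "No room" warning. */
    static NoRoomWarning parse(String line) {
        Matcher m = NO_ROOM.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new NoRoomWarning(
            m.group(1), Long.parseLong(m.group(2)), Long.parseLong(m.group(3)));
    }

    /** Ratio of the expected reduce input to the free space on the node. */
    double overshoot() {
        return (double) expectedBytes / freeBytes;
    }

    public static void main(String[] args) {
        String line = "2009-02-12 15:47:29,839 WARN org.apache.hadoop.mapred.JobInProgress: "
            + "No room for reduce task. Node tracker_d-59.cs.wisc.edu:localhost/127.0.0.1:33227 "
            + "has 110125027328 bytes free; but we expect reduce input to take 399642198235";
        NoRoomWarning w = parse(line);
        System.out.printf("%s: estimate is %.2fx the free space%n", w.node, w.overshoot());
    }
}
```

      For the three lines shown, the expected reduce input is between three and four times the free space on each node.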

      The weird thing is that about 70 reduce tasks complete before the job hangs. If I reduce the amount of input data on 100 nodes to 200 GB, it seems to work. When I scale the amount of input in proportion to the number of nodes, I can get it to work some of the time on 50 nodes and without any problems on 25 nodes or fewer.

      Note that the same job ran without any problems on Hadoop 0.18 late last year, with the same input data and the same MR code.
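
      The warning text suggests that a reduce task is only handed to a node whose free disk space covers the estimated reduce input. The sketch below illustrates that kind of check with the numbers from the log excerpt. It is an assumption about the shape of the logic, not the actual JobInProgress code: the class ReduceRoomCheck and the method hasRoom are hypothetical, and the real comparison may apply additional factors.

```java
public class ReduceRoomCheck {
    /**
     * Hypothetical admission check: a node can accept the reduce task only if
     * its free space is at least the estimated reduce input size.
     */
    static boolean hasRoom(long freeBytes, long estimatedReduceInputBytes) {
        return freeBytes >= estimatedReduceInputBytes;
    }

    public static void main(String[] args) {
        // Estimate and free-space values copied from the JobTracker log excerpt.
        long estimate = 399642198235L;
        long[] freeBytesPerNode = {110125027328L, 107537776640L, 113631690752L};

        for (long free : freeBytesPerNode) {
            System.out.println(free + " bytes free, room for reduce: " + hasRoom(free, estimate));
        }
    }
}
```

      Under this assumption, an estimate of roughly 400 GB can never be satisfied by nodes with roughly 110 GB free each, so the remaining reduce tasks would stay pending indefinitely, which is consistent with the symptom described above.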

      1. hadoop-patched-jobtracker.log.gz
        905 kB
        Andy Pavlo
      2. hadoop-jobtracker.log.gz
        1.02 MB
        Andy Pavlo
      3. hadoop_task_screenshot.png
        78 kB
        Andy Pavlo
      4. 5241_v1.patch
        6 kB
        Sharad Agarwal
      5. 5241_v1.patch
        6 kB
        Sharad Agarwal

        Activity

        Andy Pavlo created issue -
        Andy Pavlo made changes -
        Field Original Value New Value
        Attachment hadoop-jobtracker.log.gz [ 12400147 ]
        Devaraj Das made changes -
        Assignee Sharad Agarwal [ sharadag ]
        Fix Version/s 0.19.1 [ 12313473 ]
        Sharad Agarwal made changes -
        Attachment 5241_v1.patch [ 12400325 ]
        Andy Pavlo made changes -
        Attachment hadoop-patched-jobtracker.log.gz [ 12400379 ]
        Andy Pavlo made changes -
        Attachment hadoop_task_screenshot.png [ 12400381 ]
        Sharad Agarwal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Sharad Agarwal made changes -
        Attachment 5241_v1.patch [ 12400731 ]
        Sharad Agarwal made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Sharad Agarwal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Devaraj Das made changes -
        Fix Version/s 0.19.1 [ 12313473 ]
        Fix Version/s 0.20.0 [ 12313438 ]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.21.0 [ 12313563 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Devaraj Das made changes -
        Fix Version/s 0.19.2 [ 12313650 ]
        Nigel Daley made changes -
        Fix Version/s 0.21.0 [ 12313563 ]
        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Owen O'Malley made changes -
        Component/s mapred [ 12310690 ]

          People

          • Assignee:
            Sharad Agarwal
            Reporter:
            Andy Pavlo
          • Votes:
            0
            Watchers:
            1