Hadoop Common / HADOOP-5241

Reduce tasks get stuck because of over-estimated task size (regression from 0.18)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.19.2, 0.20.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Red Hat Enterprise Linux Server release 5.2
      JDK 1.6.0_11
      Hadoop 0.19.0

    • Hadoop Flags:
      Reviewed

      Description

      I have a simple MR benchmark job that computes PageRank on about 600 GB of HTML files using a 100 node cluster. For some reason, my reduce tasks get caught in a pending state. The JobTracker's log gets filled with the following messages:

      2009-02-12 15:47:29,839 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-59.cs.wisc.edu:localhost/127.0.0.1:33227 has 110125027328 bytes free; but we expect reduce input to take 399642198235
      2009-02-12 15:47:29,852 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-67.cs.wisc.edu:localhost/127.0.0.1:48626 has 107537776640 bytes free; but we expect reduce input to take 399642198235
      2009-02-12 15:47:29,885 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_d-73.cs.wisc.edu:localhost/127.0.0.1:58849 has 113631690752 bytes free; but we expect reduce input to take 399642198235
      <SNIP>

      The weird thing is that about 70 reduce tasks complete before it hangs. If I reduce the input data on 100 nodes down to 200 GB, it seems to work. As I scale the amount of input with the number of nodes, I can get it to work some of the time on 50 nodes, and it runs without any problems on 25 nodes or fewer.

      Note that it worked without any problems on Hadoop 0.18 late last year without changing any of the input data or the actual MR code.

      Attachments

      1. 5241_v1.patch
        6 kB
        Sharad Agarwal
      2. hadoop_task_screenshot.png
        78 kB
        Andy Pavlo
      3. hadoop-patched-jobtracker.log.gz
        905 kB
        Andy Pavlo
      4. 5241_v1.patch
        6 kB
        Sharad Agarwal
      5. hadoop-jobtracker.log.gz
        1.02 MB
        Andy Pavlo

        Activity

        Devaraj Das added a comment -

        I committed this to the 0.19 branch.

        Hudson added a comment -

        Integrated in Hadoop-trunk #763 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/763/)
        HADOOP-5241. Fixes a bug in disk-space resource estimation. Makes the estimation formula linear where blowUp = Total-Output/Total-Input. Contributed by Sharad Agarwal.

        Devaraj Das added a comment -

        I just committed this to the 0.20 and 0.21 branches. Thanks, Sharad! (After 0.19.1 is released, we should commit this to 0.19 branch as well)

        Sharad Agarwal added a comment -

        The test case failure is unrelated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12400731/5241_v1.patch
        against trunk revision 746864.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3901/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3901/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3901/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3901/console

        This message is automatically generated.

        Sharad Agarwal added a comment -

        Resubmitting the same patch, so that Hudson can pick the correct one.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12400381/hadoop_task_screenshot.png
        against trunk revision 746864.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3900/console

        This message is automatically generated.

        Sharad Agarwal added a comment -

        "one of my reduce tasks is reporting to be over 100% complete!"

        This is a known bug; see HADOOP-5210 and HADOOP-4000.

        Andy Pavlo added a comment -

        The benchmark is working again without any errors on 100 nodes with the full 600 GB data set. I'm also getting roughly the same speed as before. I think Sharad's patch fixes my problem. I have two other experiments that I'll run tomorrow morning to double-check whether this is truly fixed.

        Lastly, I don't know whether this is a side-effect of the patch, but I noticed that one of my reduce tasks is reporting to be over 100% complete! The setup is as follows: I run the full PageRank experiment on all 100 nodes and then combine the results into a single file using a second MR job with a single reduce task. The execution time is about the same as 0.18, so I'm not seeing any performance problems.

        Andy Pavlo added a comment -

        Screenshot of running job with over 100% completion rate?

        Andy Pavlo added a comment -

        JobTracker logfile running same benchmarks as before without any errors.

        Sharad Agarwal added a comment -

        Attaching a patch with the fix. Andy, could you try out this patch with your installation?

        Sharad Agarwal added a comment -

        The high estimate comes from the way the blow-up ratio is averaged in for each completed task: maps that have 0 input create a lot of skew. In the current code, consider this case:
        1. The first map completes with blowupOnThisTask (output/input) = 1000/200 = 5,
        so mapBlowupRatio = 5.
        2. The second map completes with blowupOnThisTask (output/input) = 50/1 = 50 (here the input was effectively 0 but some output was produced),
        so mapBlowupRatio becomes 50/2 + ((2 - 1) / 2) * 5 = 27.5.
        This is an unreasonable jump in the blow-up ratio from 5 to 27.5.

        The fix I propose is as follows:
        Instead of averaging the per-task blow-up ratios, calculate the blow-up as (cumulative completed map output size) / (cumulative completed map input size). For the example above, this works out as follows:
        1. The first map completes with blowupOnThisTask (output/input) = 1000/200 = 5,
        so mapBlowupRatio = 5.
        2. The second map completes with blowupOnThisTask (output/input) = 50/1 = 50 (again with effectively 0 input and some output),
        so mapBlowupRatio becomes (1000 + 50)/(200 + 1) ~ 5.2.
        This is a reasonable change in the blow-up ratio from 5 to 5.2.

        Andy Pavlo added a comment (edited) -

        I've attached the complete log of the JobTracker for a single MR job submission that hits this issue on 100 nodes.

        Vinod Kumar Vavilapalli added a comment -

        Hm, there does seem to be a bug in the estimation, as Jothi pointed out quite some time back here (https://issues.apache.org/jira/browse/HADOOP-657?focusedCommentId=12622464#action_12622464). The estimate calculated in that case was ~1000 GB!

        Logs would definitely help track this one down.

        Vinod Kumar Vavilapalli added a comment -

        The patch that takes disk space into account for scheduling was committed to 0.19 and later (HADOOP-657), so the difference in behaviour between 0.18 and 0.19 is expected. But it is definitely weird for the estimate to shoot up as high as 400 GB. Can you attach the whole JT log from when this particular problem occurred, so that we can see how the estimates were calculated over time and when they started going wayward?


          People

          • Assignee:
            Sharad Agarwal
          • Reporter:
            Andy Pavlo
          • Votes:
            0
          • Watchers:
            1
