Issue Details (XML | Word | Printable)

Key: HADOOP-1862
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Arun C Murthy
Reporter: Owen O'Malley
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

reduces are getting stuck trying to find map outputs

Created: 07/Sep/07 09:39 PM   Updated: 08/Jul/09 04:52 PM
Return to search
Component/s: None
Affects Version/s: 0.13.0, 0.13.1, 0.14.0, 0.14.1
Fix Version/s: 0.14.2

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-1862_1_20070912.patch 2007-09-12 01:54 PM Arun C Murthy 17 kB
Text File Licensed for inclusion in ASF works HADOOP-1862_2_20070927.patch 2007-09-27 05:07 AM Arun C Murthy 17 kB
Text File Licensed for inclusion in ASF works HADOOP-1862_3_20071002.patch 2007-10-02 02:36 PM Arun C Murthy 17 kB
Text File Licensed for inclusion in ASF works HADOOP-1862_branch14_1_20071003.patch 2007-10-02 08:17 PM Arun C Murthy 17 kB
Text File HADOOP-1862_debug.patch 2007-09-11 10:40 AM Arun C Murthy 4 kB
Text File HADOOP-1862_prelim.patch 2007-09-10 12:08 PM Arun C Murthy 3 kB
Issue Links:
Duplicate
 

Resolution Date: 03/Oct/07 06:05 AM


 Description  « Hide
Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 to 6 times each. I do see some of the events in the task event log are marked OBSOLETE.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Arun C Murthy added a comment - 10/Sep/07 10:49 AM
Hmm... one straw to clutch:

$ cat 1862-event.log | grep task_200709041519_0023_m_001149
OBSOLETE task_200709041519_0023_m_001149_0 http://a.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_0
FAILED task_200709041519_0023_m_001149_0 null
SUCCEEDED task_200709041519_0023_m_001149_1 http://b.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_1
SUCCEEDED task_200709041519_0023_m_001149_2 http://c.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_2


$ cat 1862-event.log | grep task_200709041519_0023_m_001816
OBSOLETE task_200709041519_0023_m_001816_0 http://x.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_0
FAILED task_200709041519_0023_m_001816_0 null
SUCCEEDED task_200709041519_0023_m_001816_1 http://y.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_1
SUCCEEDED task_200709041519_0023_m_001816_2 http://z.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_2


Essentially, in JobInProgress.updateTaskStatuses(TaskInProgress, TaskStatus, JobTrackerMetrics) the TaskCompletionEvent.Status.SUCCEEDED is added irrespective of whether the TIP is already complete or not, leading to each reducer seeing 2 TaskCompletionEvent.Status.SUCCEEDED events as above... clearly the fetch from one of them will fail since either _1 or _2 will be KILLED, not a happy situation.

Like I said, I'll try to dig deeper, maybe this could help someone beat me to it. smile


Arun C Murthy added a comment - 10/Sep/07 12:08 PM
Illustrative fix...

Arun C Murthy added a comment - 11/Sep/07 10:34 AM
Ok, I was fortunate to be able to look at an actual job exhibiting this same behaviour (thanks to Christian and HADOOP-1874) and here are some insights:

Essentially there seems like a issue with TaskCompletionEvent}}s received by the {{ReduceTask. More than one stuck reducer (looking for map outputs) never actually scheduled a copy from the maps from which it is missing outputs, which leads me to believe that there is an issue with missing {{TaskCompletionEvents}}s.

The other minor bug is the one I pointed out in my previous comment, so I've attached another patch which incorporates:
a) previous fix
b) a debug statement to help track the received TaskCompletionEvent}}s at the {{Reduce task.

Christian: I'd appreciate if you could try this out... Thanks!


Owen O'Malley added a comment - 12/Sep/07 05:43 AM
Ok, I just found this one and it is a bad one. The problem is that the TaskTracker in fetchMapCompletionEvents stores a cache of the completion events indexed by TIP id instead TASK id. So there is a event race condition between tasks in the same tip and if the last event is the failed one, then the reduces get stuck, because that map is marked as never completed. I've marked this as a 0.14.2 fix, but we might need a 0.13 fix too.

Arun C Murthy added a comment - 12/Sep/07 05:57 AM
Ok, here is the proposal to fix this one:

a) Maintain a straight cache at the TaskTracker i.e. straight copy of the events sent by the JobTracker and don't manipulate them.
b) Process OBSOLETE TaskCompletionEvent}}s at the {{ReduceTask.
c) Keep my earlier fix to TaskInProgress.completedTask i.e. do not add a SUCCESS event until it is clear that this is the true successful event, and that the TIP hadn't already been completed by some other task-attempt.


Arun C Murthy added a comment - 12/Sep/07 01:54 PM
Here is a first take on this one, with very preliminary testing done.

I know Devaraj has some second-thoughts, but yet... smile


Devaraj Das added a comment - 17/Sep/07 06:24 AM
Sorry, I am not convinced that the patch solves the issue at hand. I would like to wait for the problem to be replicated with a version of hadoop that includes the patch for HADOOP-1874. Then I would like to see whether we still run into the situation reported by this bug, apply this patch on that version, and see whether it solves the problem. This is a significant change and I just want to make sure that this patch really addresses the reported bug.

Arun C Murthy added a comment - 27/Sep/07 05:07 AM
Here is a well-tested patch - works with and without speculative-execution, lots of lost tasktrackers etc.

However, I want to wait for Devaraj to run a few tests with this and HADOOP-1874 before we get this in... hence I'm not making it PA right-away. Clearly any reviews/ideas on this are welcome.


Devaraj Das added a comment - 28/Sep/07 08:51 AM
Submitting the patch on behalf of Arun to get it through Hudson.

Devaraj Das added a comment - 28/Sep/07 08:53 AM
The patch looks good.

Hadoop QA added a comment - 28/Sep/07 09:57 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12366638/HADOOP-1862_2_20070927.patch
against trunk revision r580461.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests -1. The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/console

This message is automatically generated.


Arun C Murthy added a comment - 02/Oct/07 02:36 PM
Updated patch, also resets the 'size' variable MapOutputCopier.run on fetch-failures.

Hadoop QA added a comment - 02/Oct/07 04:42 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12366927/HADOOP-1862_3_20071002.patch
against trunk revision r581101.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/console

This message is automatically generated.


Devaraj Das added a comment - 02/Oct/07 05:38 PM
I just committed this. Thanks, Arun!

Devaraj Das added a comment - 02/Oct/07 06:10 PM
Reopening the issue since the patch doesn't apply to 0.14. Please submit a separate patch for that.

Arun C Murthy added a comment - 02/Oct/07 08:17 PM
Attached patch for branch-0.14

Hudson added a comment - 02/Oct/07 10:58 PM

Devaraj Das added a comment - 03/Oct/07 06:05 AM
I just committed this. Thanks, Arun!