|
Ok, I was fortunate to be able to look at an actual job exhibiting this same behaviour (thanks to Christian and
Essentially there seems like a issue with TaskCompletionEvent}}s received by the {{ReduceTask. More than one stuck reducer (looking for map outputs) never actually scheduled a copy from the maps from which it is missing outputs, which leads me to believe that there is an issue with missing {{TaskCompletionEvents}}s. The other minor bug is the one I pointed out in my previous comment, so I've attached another patch which incorporates: Christian: I'd appreciate if you could try this out... Thanks! Ok, I just found this one and it is a bad one. The problem is that the TaskTracker in fetchMapCompletionEvents stores a cache of the completion events indexed by TIP id instead TASK id. So there is a event race condition between tasks in the same tip and if the last event is the failed one, then the reduces get stuck, because that map is marked as never completed. I've marked this as a 0.14.2 fix, but we might need a 0.13 fix too.
Ok, here is the proposal to fix this one:
a) Maintain a straight cache at the TaskTracker i.e. straight copy of the events sent by the JobTracker and don't manipulate them. Here is a first take on this one, with very preliminary testing done.
I know Devaraj has some second-thoughts, but yet... smile Sorry, I am not convinced that the patch solves the issue at hand. I would like to wait for the problem to be replicated with a version of hadoop that includes the patch for
Here is a well-tested patch - works with and without speculative-execution, lots of lost tasktrackers etc.
However, I want to wait for Devaraj to run a few tests with this and Submitting the patch on behalf of Arun to get it through Hudson.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12366638/HADOOP-1862_2_20070927.patch against trunk revision r580461. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/testReport/ This message is automatically generated. Updated patch, also resets the 'size' variable MapOutputCopier.run on fetch-failures.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12366927/HADOOP-1862_3_20071002.patch against trunk revision r581101. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/testReport/ This message is automatically generated. I just committed this. Thanks, Arun!
Reopening the issue since the patch doesn't apply to 0.14. Please submit a separate patch for that.
Attached patch for branch-0.14
Integrated in Hadoop-Nightly #258 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/258/
I just committed this. Thanks, Arun!
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Essentially, in JobInProgress.updateTaskStatuses(TaskInProgress, TaskStatus, JobTrackerMetrics) the TaskCompletionEvent.Status.SUCCEEDED is added irrespective of whether the TIP is already complete or not, leading to each reducer seeing 2 TaskCompletionEvent.Status.SUCCEEDED events as above... clearly the fetch from one of them will fail since either _1 or _2 will be KILLED, not a happy situation.
Like I said, I'll try to dig deeper, maybe this could help someone beat me to it. smile