|
Owen O'Malley made changes - 01/Feb/07 06:10 PM
Doug Cutting made changes - 02/Mar/07 10:17 PM
Attached patch for review. The unit tests run fine with the patch and I am in the process of running the sort benchmark.
Devaraj Das made changes - 05/Apr/07 04:37 PM
Devaraj Das made changes - 06/Apr/07 05:55 PM
This is w.r.t the current (
Devaraj Das made changes - 10/Apr/07 06:05 AM
This has the potential to quickly go stale since the patch touches many files in major ways. So would appreciate a quick review/commit on this one. Thanks.
Devaraj Das made changes - 13/Apr/07 06:43 AM
+1
http://issues.apache.org/jira/secure/attachment/12355215/968.apr10.patch Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/42/testReport/ 1. I notice that a lot of your iterators are not typed causing you to do casts of itr.next().
2. In many cases, the loop "for(Item item: itemSet){..}" is easier to read and more concise. 3. Maps should not be iterated through using: for(Map.Entry<Key,Value> item: myMap) {...} rather than: Iterator itr = myMap.keySet().iterator(); while (itr.hasNext()) { Value value = myMap.get(itr.next()); ... } 4. It looks like each reduce from a job will cause its job's FetchState to be added to the list a multiple time, so it will fetch multiple times per a loop. 5. I'd remove the sleep from queryJobTracker and move it to the MapEventsFetcherThread's run loop. 6. The doFetch is badly named, since it doesn't actually do the fetch. It should be called findReduces or something. 7. The name of the parameter of the first parameter in TaskUmbilicalProtocol.getMapCompletionEvents is "taskid", but if fact it is a job id. 8. The MapEventsFetcherThread's name doesn't need to include the task in the normal case, but I guess for unit tests it might be useful. 9. I assume that the shuffle code in ReduceTask matches the old code in ReduceTaskRunner. smile
Owen O'Malley made changes - 13/Apr/07 09:16 PM
Point 3 should be "Maps SHOULD BE iterated through using". Sorry for any confusion.
Thanks for the review, Owen. Some comments below.
> 1. I notice that a lot of your iterators are not typed causing you to do casts of itr.next Done (old habits die hard smile). > 4. It looks like each reduce from a job will cause its job's FetchState to be added to > 5. I'd remove the sleep from queryJobTracker and move it to the > 6. The doFetch is badly named, since it doesn't actually do the fetch. It should be > 7. The name of the parameter of the first parameter in > 8. The MapEventsFetcherThread's name doesn't need to include the task in the > 9. I assume that the shuffle code in ReduceTask matches the old code in
Devaraj Das made changes - 14/Apr/07 01:04 PM
Devaraj Das made changes - 14/Apr/07 01:24 PM
This removes a redundant call to System.currentTimeMillis in the TaskTracker.
Devaraj Das made changes - 14/Apr/07 06:11 PM
-1, build or testing failed
2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12355549/968.apr14.patch Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/50/testReport/ Please note that this message is automatically generated and may represent a problem with the automation system and not the patch. Here's a re-indented version of this patch.
Doug Cutting made changes - 16/Apr/07 10:47 PM
This patch addresses an issue to do with metrics reporting wherein stopMonitoring was not called for the ReduceTask. This potentially could lead to hung ReduceTasks after they are finished since the task JVM might not be able to exit until the (non-daemon) monitoring thread goes away.
Devaraj Das made changes - 17/Apr/07 06:29 AM
+1. The latest patch passes my nightly suit of benchmarks. I have resubmitted it to the patch process.
Devaraj, please can you clarify what you fixed with regard to the lack of calls to stopMonitoring? I think you're right that it is a problem, but I would expect the fix to be to change the Timer constructor in o.a.h.metrics.spi.AbstractMetricsContext. Instead of the zero-arg constructor we should use the two arg constructor that takes a thread name and a boolean isDaemon. I didn't see this change in the patch. I am pasting the relevant block of code from the latest patch (methodname :TaskTracker.java::Child::main()). Basically, I invoke the close() method of metricsContext for the context "mapred" just before the point where the child task closes the log manager (LogManager.shutdown()) and it is about to die. I thought this was the best way to stop the monitoring and let the task exit nicely without having to touch other parts of the metrics code.
} finally { Makes sense? I think David's point was that it would be best to fix the metrics code so that its thread is a daemon thread, rather than to expect client code to explicitly stop that thread.
Ok, changed the Timer constructor to the two-argument one. Also, retained the metricsContext.close() call that I had in the last patch. In general, I think that it is a good idea to call close() on an object if the close() method is clearly documented. Does this seem all right?
Devaraj Das made changes - 17/Apr/07 07:23 PM
I just committed this. Thanks, Devaraj!
Doug Cutting made changes - 17/Apr/07 07:57 PM
-1, could not apply patch.
The patch command could not apply the latest attachment http://issues.apache.org/jira/secure/attachment/12355713/968-with-metrics-fix.new.patch Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/58/console Please note that this message is automatically generated and may represent a problem with the automation system and not the patch. Integrated in Hadoop-Nightly #61 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/61/
Doug Cutting made changes - 08/Jun/07 08:40 PM
Owen O'Malley made changes - 08/Jul/09 04:52 PM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
On the TaskTracker
1) The TaskTracker maintains the list of TaskCompletionEvents for a job. Whenever a ReduceTask is assigned to a TaskTracker it extracts the JobId out of that.
2) For that jobid it starts fetching MapTask completion events as long as any ReduceTask for that job is in the SHUFFLE phase (this ensures that the TaskTracker sees all MapTask lost events and keeps an updated cache of all events). When all the ReduceTasks for a given job have gone past the SHUFFLE phase, the TaskTracker does not fetch any more MapTask completion events until another ReduceTask gets assigned to it. If no other ReduceTask from the same job gets assigned to it, and the job completes, it clears the cache of TaskCompletionEvents.
3) The event-fetcher thread blocks on runningJobs object. Whenever the method addTaskToJob in TaskTracker adds a new Task to a job, it invokes runningJobs.notify(), so that the event-fetcher thread can unblock and continue.
4) The event-fetcher thread also goes through the runningJobs and immediately stops fetching events for those jobs that have been killed/failed.
On the TaskUmbilicalProtocol, ReduceTaskRunner & ReduceTask:
1) A new method - TaskCompletionEvent[] getSuccessMapCompleteEvents(String taskId, int fromIndex, int maxLocs) throws IOException; - has been added for enabling the ReduceTask to fetch TaskCompletionEvents cached at the TaskTracker. The semantics of this method are mirrored to the one in InterTrackerProtocol - getTaskCompletionEvents, except that in the umbilical protocol, we are interested in just the successful map events. Fetch failures are handled in the same way as is done today. Thus, most of the fetcher code in ReduceTaskRunner remains the same (the code now is part of ReduceTask in a new class called ReduceCopier, and the ReduceTaskRunner very closely matches to MapTaskRunner in terms of functionality/code).
Comments?