|
Attached patch pushes the creation of the temporary output directory into the FileOutputFormat, there-by ensuring Maps which do not produce output to HDFS do not create the temporary directories.
Is this an incompatible change? Once we reach consensus, I'll attach another patch with the relevant documentation fix also. This makes sense. The thing that we need to take care of is the map/reduce side files. But yes, with the implicit creation of task directories things will continue to work. If and when that implicit creation is addressed, we need to introduce an API that the users must call to create side files and that api could in turn create the directory tree.
Ok, then I'm PA'ing this patch. We don't need any other changes...
The checks for the output directory's existence in each OutputFormat implementation are now redundant, no?
Also, each implementation now calls both createWorkDir() and getWorkOutputPath(), and these are the only calls to either of these methods, always as a pair. Wouldn't it be simpler to combine these in a single method? For example, getWorkOutputPath() could be changed to create the directory, guarantee its existence, etc. This would reduce the amount of code duplicated in each FileOutputFormat subclass. I kept the checks since it's consistent with current behaviour - currently TaskTracker.localizeConfiguration does those checks.
We need getWorkOutputPath() so that client code can use it to query for the temp. path and create side-effect files. Hence I have introduced a new (non-public) createWorkDir() api... If back-compatibility is your concern, then leave getWorkOutputPath() alone and add a new protected method createWorkOutputPath() or somesuch that creates the directory and returns the path within it. Duplicating the same four lines in every subclass is not good, especially when these four lines check the existence of the same directory twice.
Incorporated feedback from Doug...
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12384313/HADOOP-3598_0_20080619.patch against trunk revision 669646. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/testReport/ This message is automatically generated. Every call to createWorkDir is in a sequence like:
Path taskOutputDir = createWorkDir(job);
if (taskOutputDir == null) {
throw new IOException("Could not create task output directory");
Path file = new Path(taskOutputDir, name);
At a minimum, the null test and exception can be added to createWorkDir, no? And perhaps the method can also create the path, e.g., the equivalent of: protected Path createOutputPath(job, name) { return new Path(createWorkDir(job), name); } +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12384351/HADOOP-3598_0_20080619.patch against trunk revision 669790. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/testReport/ This message is automatically generated. One comment..
Creation of _temporary/_<taskid> should not throw exception if mkdirs fails, but the directory already exists. Because the directory could be already created by creating a side-file or from an earlier call to getTaskOutputPath(). You can add fs.exists(taskTmpDir) check before throwing exception Otherthan that, with documentation fixed for getTaskOutputPath patch looks good.
FileSystem.mkdirs will not fail if directory already exists... Fixed the javadoc. Thanks for the review. +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12384404/HADOOP-3598_1_20080620.patch against trunk revision 669986. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/testReport/ This message is automatically generated. Integrated in Hadoop-trunk #525 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/525/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Agree completely.
This also has caused performance degradation and even job failures