Issue Details (XML | Word | Printable)

Key: HADOOP-3598
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Arun C Murthy
Reporter: Arun C Murthy
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Map-Reduce framework needlessly creates temporary _${taskid} directories for Maps

Created: 19/Jun/08 03:04 PM   Updated: 08/Jul/09 04:52 PM
Return to search
Component/s: None
Affects Version/s: 0.18.0
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-3598_0_20080619.patch 2008-06-20 07:35 AM Arun C Murthy 12 kB
Text File Licensed for inclusion in ASF works HADOOP-3598_0_20080619.patch 2008-06-19 08:02 PM Arun C Murthy 8 kB
Text File Licensed for inclusion in ASF works HADOOP-3598_0_20080619.patch 2008-06-19 04:07 PM Arun C Murthy 7 kB
Text File Licensed for inclusion in ASF works HADOOP-3598_1_20080620.patch 2008-06-20 06:56 PM Arun C Murthy 12 kB

Hadoop Flags: Incompatible change
Release Note: Changed Map-Reduce framework to no longer create temporary task output directories for staging outputs if staging outputs isn't necessary. ${mapred.out.dir}/_temporary/_${taskid}
Resolution Date: 20/Jun/08 10:37 PM


 Description  « Hide
The staging directory for task-outputs (i.e. ${mapred.out.dir}/temporary/${taskid}) should only be created when Maps produce output on HDFS, which usually isn't the case. This plays very badly with HDFS quotas and may lead to thousands of temp names in the FS namespace, there-by overhauling the quotas. IAC, it isn't good to needlessly create these directories.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Runping Qi added a comment - 19/Jun/08 03:12 PM
+1.

Agree completely.
This also has caused performance degradation and even job failures


Arun C Murthy added a comment - 19/Jun/08 04:07 PM
Attached patch pushes the creation of the temporary output directory into the FileOutputFormat, there-by ensuring Maps which do not produce output to HDFS do not create the temporary directories.

Is this an incompatible change?
I think so - the framework no longer creates the directory and once FileSystem.create doens't create the parent-directory implicitly all output-formats will need to be aware of this.
Thoughts?

Once we reach consensus, I'll attach another patch with the relevant documentation fix also.


Devaraj Das added a comment - 19/Jun/08 04:56 PM
This makes sense. The thing that we need to take care of is the map/reduce side files. But yes, with the implicit creation of task directories things will continue to work. If and when that implicit creation is addressed, we need to introduce an API that the users must call to create side files and that api could in turn create the directory tree.

Arun C Murthy added a comment - 19/Jun/08 05:30 PM
Ok, then I'm PA'ing this patch. We don't need any other changes...

Doug Cutting added a comment - 19/Jun/08 05:41 PM
The checks for the output directory's existence in each OutputFormat implementation are now redundant, no?

Also, each implementation now calls both createWorkDir() and getWorkOutputPath(), and these are the only calls to either of these methods, always as a pair. Wouldn't it be simpler to combine these in a single method? For example, getWorkOutputPath() could be changed to create the directory, guarantee its existence, etc. This would reduce the amount of code duplicated in each FileOutputFormat subclass.


Arun C Murthy added a comment - 19/Jun/08 05:53 PM
I kept the checks since it's consistent with current behaviour - currently TaskTracker.localizeConfiguration does those checks.

We need getWorkOutputPath() so that client code can use it to query for the temp. path and create side-effect files. Hence I have introduced a new (non-public) createWorkDir() api...


Doug Cutting added a comment - 19/Jun/08 06:12 PM
If back-compatibility is your concern, then leave getWorkOutputPath() alone and add a new protected method createWorkOutputPath() or somesuch that creates the directory and returns the path within it. Duplicating the same four lines in every subclass is not good, especially when these four lines check the existence of the same directory twice.

Arun C Murthy added a comment - 19/Jun/08 07:41 PM
Fair point.

Arun C Murthy added a comment - 19/Jun/08 08:02 PM
Incorporated feedback from Doug...

Hadoop QA added a comment - 19/Jun/08 09:44 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12384313/HADOOP-3598_0_20080619.patch
against trunk revision 669646.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/console

This message is automatically generated.


Doug Cutting added a comment - 19/Jun/08 10:04 PM
Every call to createWorkDir is in a sequence like:
Path taskOutputDir = createWorkDir(job);
    if (taskOutputDir == null) {
      throw new IOException("Could not create task output directory");
    Path file = new Path(taskOutputDir, name);

At a minimum, the null test and exception can be added to createWorkDir, no?

And perhaps the method can also create the path, e.g., the equivalent of:

protected Path createOutputPath(job, name) {
  return new Path(createWorkDir(job), name);
}

Arun C Murthy added a comment - 20/Jun/08 07:35 AM
Updated patch.

Arun C Murthy added a comment - 20/Jun/08 07:36 AM
Resubmitting to hudson.

Hadoop QA added a comment - 20/Jun/08 09:07 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12384351/HADOOP-3598_0_20080619.patch
against trunk revision 669790.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/console

This message is automatically generated.


Amareshwari Sriramadasu added a comment - 20/Jun/08 09:26 AM - edited
One comment..
Creation of _temporary/_<taskid> should not throw exception if mkdirs fails, but the directory already exists. Because the directory could be already created by creating a side-file or from an earlier call to getTaskOutputPath().
You can add fs.exists(taskTmpDir) check before throwing exception

Otherthan that, with documentation fixed for getTaskOutputPath patch looks good.


Arun C Murthy added a comment - 20/Jun/08 06:56 PM

Creation of temporary/<taskid> should not throw exception if mkdirs fails, but the directory already exists.

FileSystem.mkdirs will not fail if directory already exists...

Fixed the javadoc. Thanks for the review.


Doug Cutting added a comment - 20/Jun/08 09:08 PM
+1 This looks good.

Hadoop QA added a comment - 20/Jun/08 10:25 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12384404/HADOOP-3598_1_20080620.patch
against trunk revision 669986.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/console

This message is automatically generated.


Arun C Murthy added a comment - 20/Jun/08 10:37 PM
I just committed this.

Hudson added a comment - 21/Jun/08 12:24 PM