Hadoop Common
  1. Hadoop Common
  2. HADOOP-3598

Map-Reduce framework needlessly creates temporary _${taskid} directories for Maps

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change
    • Release Note:
      Changed Map-Reduce framework to no longer create temporary task output directories for staging outputs if staging outputs isn't necessary. ${mapred.out.dir}/_temporary/_${taskid}

      Description

      The staging directory for task-outputs (i.e. $

      {mapred.out.dir}

      /temporary/$

      {taskid}

      ) should only be created when Maps produce output on HDFS, which usually isn't the case. This plays very badly with HDFS quotas and may lead to thousands of temp names in the FS namespace, there-by overhauling the quotas. IAC, it isn't good to needlessly create these directories.

      1. HADOOP-3598_0_20080619.patch
        12 kB
        Arun C Murthy
      2. HADOOP-3598_0_20080619.patch
        8 kB
        Arun C Murthy
      3. HADOOP-3598_0_20080619.patch
        7 kB
        Arun C Murthy
      4. HADOOP-3598_1_20080620.patch
        12 kB
        Arun C Murthy

        Activity

        Hide
        Runping Qi added a comment -

        +1.

        Agree completely.
        This also has caused performance degradation and even job failures

        Show
        Runping Qi added a comment - +1. Agree completely. This also has caused performance degradation and even job failures
        Hide
        Arun C Murthy added a comment -

        Attached patch pushes the creation of the temporary output directory into the FileOutputFormat, there-by ensuring Maps which do not produce output to HDFS do not create the temporary directories.

        Is this an incompatible change?
        I think so - the framework no longer creates the directory and once FileSystem.create doens't create the parent-directory implicitly all output-formats will need to be aware of this.
        Thoughts?

        Once we reach consensus, I'll attach another patch with the relevant documentation fix also.

        Show
        Arun C Murthy added a comment - Attached patch pushes the creation of the temporary output directory into the FileOutputFormat, there-by ensuring Maps which do not produce output to HDFS do not create the temporary directories. Is this an incompatible change? I think so - the framework no longer creates the directory and once FileSystem.create doens't create the parent-directory implicitly all output-formats will need to be aware of this. Thoughts? Once we reach consensus, I'll attach another patch with the relevant documentation fix also.
        Hide
        Devaraj Das added a comment -

        This makes sense. The thing that we need to take care of is the map/reduce side files. But yes, with the implicit creation of task directories things will continue to work. If and when that implicit creation is addressed, we need to introduce an API that the users must call to create side files and that api could in turn create the directory tree.

        Show
        Devaraj Das added a comment - This makes sense. The thing that we need to take care of is the map/reduce side files. But yes, with the implicit creation of task directories things will continue to work. If and when that implicit creation is addressed, we need to introduce an API that the users must call to create side files and that api could in turn create the directory tree.
        Hide
        Arun C Murthy added a comment -

        Ok, then I'm PA'ing this patch. We don't need any other changes...

        Show
        Arun C Murthy added a comment - Ok, then I'm PA'ing this patch. We don't need any other changes...
        Hide
        Doug Cutting added a comment -

        The checks for the output directory's existence in each OutputFormat implementation are now redundant, no?

        Also, each implementation now calls both createWorkDir() and getWorkOutputPath(), and these are the only calls to either of these methods, always as a pair. Wouldn't it be simpler to combine these in a single method? For example, getWorkOutputPath() could be changed to create the directory, guarantee its existence, etc. This would reduce the amount of code duplicated in each FileOutputFormat subclass.

        Show
        Doug Cutting added a comment - The checks for the output directory's existence in each OutputFormat implementation are now redundant, no? Also, each implementation now calls both createWorkDir() and getWorkOutputPath(), and these are the only calls to either of these methods, always as a pair. Wouldn't it be simpler to combine these in a single method? For example, getWorkOutputPath() could be changed to create the directory, guarantee its existence, etc. This would reduce the amount of code duplicated in each FileOutputFormat subclass.
        Hide
        Arun C Murthy added a comment -

        I kept the checks since it's consistent with current behaviour - currently TaskTracker.localizeConfiguration does those checks.

        We need getWorkOutputPath() so that client code can use it to query for the temp. path and create side-effect files. Hence I have introduced a new (non-public) createWorkDir() api...

        Show
        Arun C Murthy added a comment - I kept the checks since it's consistent with current behaviour - currently TaskTracker.localizeConfiguration does those checks. We need getWorkOutputPath() so that client code can use it to query for the temp. path and create side-effect files. Hence I have introduced a new (non-public) createWorkDir() api...
        Hide
        Doug Cutting added a comment -

        If back-compatibility is your concern, then leave getWorkOutputPath() alone and add a new protected method createWorkOutputPath() or somesuch that creates the directory and returns the path within it. Duplicating the same four lines in every subclass is not good, especially when these four lines check the existence of the same directory twice.

        Show
        Doug Cutting added a comment - If back-compatibility is your concern, then leave getWorkOutputPath() alone and add a new protected method createWorkOutputPath() or somesuch that creates the directory and returns the path within it. Duplicating the same four lines in every subclass is not good, especially when these four lines check the existence of the same directory twice.
        Hide
        Arun C Murthy added a comment -

        Fair point.

        Show
        Arun C Murthy added a comment - Fair point.
        Hide
        Arun C Murthy added a comment -

        Incorporated feedback from Doug...

        Show
        Arun C Murthy added a comment - Incorporated feedback from Doug...
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12384313/HADOOP-3598_0_20080619.patch
        against trunk revision 669646.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12384313/HADOOP-3598_0_20080619.patch against trunk revision 669646. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2700/console This message is automatically generated.
        Hide
        Doug Cutting added a comment -

        Every call to createWorkDir is in a sequence like:

            Path taskOutputDir = createWorkDir(job);
            if (taskOutputDir == null) {
              throw new IOException("Could not create task output directory");
            Path file = new Path(taskOutputDir, name);
        

        At a minimum, the null test and exception can be added to createWorkDir, no?

        And perhaps the method can also create the path, e.g., the equivalent of:

        protected Path createOutputPath(job, name) {
          return new Path(createWorkDir(job), name);
        }
        
        Show
        Doug Cutting added a comment - Every call to createWorkDir is in a sequence like: Path taskOutputDir = createWorkDir(job); if (taskOutputDir == null ) { throw new IOException( "Could not create task output directory" ); Path file = new Path(taskOutputDir, name); At a minimum, the null test and exception can be added to createWorkDir, no? And perhaps the method can also create the path, e.g., the equivalent of: protected Path createOutputPath(job, name) { return new Path(createWorkDir(job), name); }
        Hide
        Arun C Murthy added a comment -

        Updated patch.

        Show
        Arun C Murthy added a comment - Updated patch.
        Hide
        Arun C Murthy added a comment -

        Resubmitting to hudson.

        Show
        Arun C Murthy added a comment - Resubmitting to hudson.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12384351/HADOOP-3598_0_20080619.patch
        against trunk revision 669790.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12384351/HADOOP-3598_0_20080619.patch against trunk revision 669790. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2705/console This message is automatically generated.
        Hide
        Amareshwari Sriramadasu added a comment - - edited

        One comment..
        Creation of _temporary/_<taskid> should not throw exception if mkdirs fails, but the directory already exists. Because the directory could be already created by creating a side-file or from an earlier call to getTaskOutputPath().
        You can add fs.exists(taskTmpDir) check before throwing exception

        Otherthan that, with documentation fixed for getTaskOutputPath patch looks good.

        Show
        Amareshwari Sriramadasu added a comment - - edited One comment.. Creation of _temporary/_<taskid> should not throw exception if mkdirs fails, but the directory already exists. Because the directory could be already created by creating a side-file or from an earlier call to getTaskOutputPath(). You can add fs.exists(taskTmpDir) check before throwing exception Otherthan that, with documentation fixed for getTaskOutputPath patch looks good.
        Hide
        Arun C Murthy added a comment -

        Creation of temporary/<taskid> should not throw exception if mkdirs fails, but the directory already exists.

        FileSystem.mkdirs will not fail if directory already exists...

        Fixed the javadoc. Thanks for the review.

        Show
        Arun C Murthy added a comment - Creation of temporary/ <taskid> should not throw exception if mkdirs fails, but the directory already exists. FileSystem.mkdirs will not fail if directory already exists... Fixed the javadoc. Thanks for the review.
        Hide
        Doug Cutting added a comment -

        +1 This looks good.

        Show
        Doug Cutting added a comment - +1 This looks good.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12384404/HADOOP-3598_1_20080620.patch
        against trunk revision 669986.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12384404/HADOOP-3598_1_20080620.patch against trunk revision 669986. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2711/console This message is automatically generated.
        Hide
        Arun C Murthy added a comment -

        I just committed this.

        Show
        Arun C Murthy added a comment - I just committed this.
        Hide
        Hudson added a comment -
        Show
        Hudson added a comment - Integrated in Hadoop-trunk #525 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/525/ )

          People

          • Assignee:
            Arun C Murthy
            Reporter:
            Arun C Murthy
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development