Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1305

Running distcp with -delete incurs avoidable penalties

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: distcp
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      First problem

      In org.apache.hadoop.tools.DistCp#deleteNonexisting we serialize FileStatus objects when the path is all we need.

      The performance problem comes from org.apache.hadoop.fs.RawLocalFileSystem.RawLocalFileStatus#write which tries to retrieve file permissions by issuing a "ls -ld <path>" which is painfully slow.

      Changed that to just serialize Path and not FileStatus.

      Second problem

      To delete the files we invoke the "hadoop" command line tool with option "-rmr <path>". Again, for each file.

      Changed that to dstfs.delete(path, true)

      1. M1305-1.patch
        4 kB
        Chris Douglas
      2. M1305-2.patch
        6 kB
        Chris Douglas
      3. MAPREDUCE-1305.patch
        2 kB
        Peter Romianowski

        Activity

        Hide
        Allen Wittenauer added a comment -

        Hmm. Why would RawLocalFileSystem be used at all? Are you using distcp against file://?

        Show
        Allen Wittenauer added a comment - Hmm. Why would RawLocalFileSystem be used at all? Are you using distcp against file://?
        Hide
        Peter Romianowski added a comment -

        Patch is against trunk.

        Show
        Peter Romianowski added a comment - Patch is against trunk.
        Hide
        Peter Romianowski added a comment -

        Yes I am using distcp against file:/// to make a backup of some crucial files to a NAS.

        The better question would be: Why serializing FileStatus if Path is all we need?

        Show
        Peter Romianowski added a comment - Yes I am using distcp against file:/// to make a backup of some crucial files to a NAS. The better question would be: Why serializing FileStatus if Path is all we need?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12428234/MAPRED-1305.patch
        against trunk revision 891524.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/208/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428234/MAPRED-1305.patch against trunk revision 891524. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/208/console This message is automatically generated.
        Hide
        Peter Romianowski added a comment -

        We even do not need the absolute path serialized. Using NullWritable now.

        Patch is against trunk, rev 891812

        Show
        Peter Romianowski added a comment - We even do not need the absolute path serialized. Using NullWritable now. Patch is against trunk, rev 891812
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12428316/MAPREDUCE-1305.patch
        against trunk revision 893469.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428316/MAPREDUCE-1305.patch against trunk revision 893469. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/console This message is automatically generated.
        Hide
        Chris Douglas added a comment -

        The patch looks good, but this:

        To delete the files we invoke the "hadoop" command line tool with option "-rmr <path>". Again, for each file.

        Doesn't appear to be fixed. Are you planning to address this in a separate issue?

        Show
        Chris Douglas added a comment - The patch looks good, but this: To delete the files we invoke the "hadoop" command line tool with option "-rmr <path>". Again, for each file. Doesn't appear to be fixed. Are you planning to address this in a separate issue?
        Hide
        Chris Douglas added a comment -

        Modified Peter's patch to remove FsShell invocations.

        That part isn't actually horrible, performance-wise; it reuses the instance, so while there's certainly avoidable overhead in parsing and whatnot, it's not forking a process or anything too notable. It also supports the Trash, which may be useful/appreciated.

        Is supporting Trash useful for DistCp users running with -delete?

        Show
        Chris Douglas added a comment - Modified Peter's patch to remove FsShell invocations. That part isn't actually horrible, performance-wise; it reuses the instance, so while there's certainly avoidable overhead in parsing and whatnot, it's not forking a process or anything too notable. It also supports the Trash, which may be useful/appreciated. Is supporting Trash useful for DistCp users running with -delete?
        Hide
        Koji Noguchi added a comment -

        Is supporting Trash useful for DistCp users running with -delete?

        To me, yes.
        I've seen many of our users deleting their files accidentally.
        Trash has saved us great time.

        I'd like to request the Trash part to stay if there's not much performance problem.

        Show
        Koji Noguchi added a comment - Is supporting Trash useful for DistCp users running with -delete? To me, yes. I've seen many of our users deleting their files accidentally. Trash has saved us great time. I'd like to request the Trash part to stay if there's not much performance problem.
        Hide
        Chris Douglas added a comment -

        nod This version uses Trash directly, avoiding FsShell but keeping existing behavior

        Show
        Chris Douglas added a comment - nod This version uses Trash directly, avoiding FsShell but keeping existing behavior
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12435423/M1305-2.patch
        against trunk revision 908321.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12435423/M1305-2.patch against trunk revision 908321. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/console This message is automatically generated.
        Hide
        Peter Romianowski added a comment -

        Thanks Chris for remove calls to FsShell. I've been very busy lately so I did not manage to compile the patch.

        Show
        Peter Romianowski added a comment - Thanks Chris for remove calls to FsShell. I've been very busy lately so I did not manage to compile the patch.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

        Show
        Tsz Wo Nicholas Sze added a comment - +1 patch looks good.
        Hide
        Chris Douglas added a comment -

        I committed this. Thanks, Peter!

        Show
        Chris Douglas added a comment - I committed this. Thanks, Peter!
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #236 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/236/)
        . Improve efficiency of distcp -delete. Contributed by Peter Romianowski

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #236 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/236/ ) . Improve efficiency of distcp -delete. Contributed by Peter Romianowski
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #234 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/234/)
        . Improve efficiency of distcp -delete. Contributed by Peter Romianowski

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #234 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/234/ ) . Improve efficiency of distcp -delete. Contributed by Peter Romianowski

          People

          • Assignee:
            Peter Romianowski
            Reporter:
            Peter Romianowski
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development