Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1305

Running distcp with -delete incurs avoidable penalties

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: distcp
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      First problem

      In org.apache.hadoop.tools.DistCp#deleteNonexisting we serialize FileStatus objects when the path is all we need.

      The performance problem comes from org.apache.hadoop.fs.RawLocalFileSystem.RawLocalFileStatus#write which tries to retrieve file permissions by issuing a "ls -ld <path>" which is painfully slow.

      Changed that to just serialize Path and not FileStatus.

      Second problem

      To delete the files we invoke the "hadoop" command line tool with option "-rmr <path>". Again, for each file.

      Changed that to dstfs.delete(path, true)

      1. MAPREDUCE-1305.patch
        2 kB
        Peter Romianowski
      2. M1305-2.patch
        6 kB
        Chris Douglas
      3. M1305-1.patch
        4 kB
        Chris Douglas

        Activity

        Peter Romianowski created issue -
        Hide
        Allen Wittenauer added a comment -

        Hmm. Why would RawLocalFileSystem be used at all? Are you using distcp against file://?

        Show
        Allen Wittenauer added a comment - Hmm. Why would RawLocalFileSystem be used at all? Are you using distcp against file://?
        Peter Romianowski made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Peter Romianowski added a comment -

        Patch is against trunk.

        Show
        Peter Romianowski added a comment - Patch is against trunk.
        Peter Romianowski made changes -
        Attachment MAPRED-1305.patch [ 12428234 ]
        Hide
        Peter Romianowski added a comment -

        Yes I am using distcp against file:/// to make a backup of some crucial files to a NAS.

        The better question would be: Why serializing FileStatus if Path is all we need?

        Show
        Peter Romianowski added a comment - Yes I am using distcp against file:/// to make a backup of some crucial files to a NAS. The better question would be: Why serializing FileStatus if Path is all we need?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12428234/MAPRED-1305.patch
        against trunk revision 891524.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/208/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428234/MAPRED-1305.patch against trunk revision 891524. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/208/console This message is automatically generated.
        Peter Romianowski made changes -
        Attachment MAPRED-1305.patch [ 12428234 ]
        Hide
        Peter Romianowski added a comment -

        We even do not need the absolute path serialized. Using NullWritable now.

        Patch is against trunk, rev 891812

        Show
        Peter Romianowski added a comment - We even do not need the absolute path serialized. Using NullWritable now. Patch is against trunk, rev 891812
        Peter Romianowski made changes -
        Attachment MAPREDUCE-1305.patch [ 12428316 ]
        Chris Douglas made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Chris Douglas made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12428316/MAPREDUCE-1305.patch
        against trunk revision 893469.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428316/MAPREDUCE-1305.patch against trunk revision 893469. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/241/console This message is automatically generated.
        Hide
        Chris Douglas added a comment -

        The patch looks good, but this:

        To delete the files we invoke the "hadoop" command line tool with option "-rmr <path>". Again, for each file.

        Doesn't appear to be fixed. Are you planning to address this in a separate issue?

        Show
        Chris Douglas added a comment - The patch looks good, but this: To delete the files we invoke the "hadoop" command line tool with option "-rmr <path>". Again, for each file. Doesn't appear to be fixed. Are you planning to address this in a separate issue?
        Hide
        Chris Douglas added a comment -

        Modified Peter's patch to remove FsShell invocations.

        That part isn't actually horrible, performance-wise; it reuses the instance, so while there's certainly avoidable overhead in parsing and whatnot, it's not forking a process or anything too notable. It also supports the Trash, which may be useful/appreciated.

        Is supporting Trash useful for DistCp users running with -delete?

        Show
        Chris Douglas added a comment - Modified Peter's patch to remove FsShell invocations. That part isn't actually horrible, performance-wise; it reuses the instance, so while there's certainly avoidable overhead in parsing and whatnot, it's not forking a process or anything too notable. It also supports the Trash, which may be useful/appreciated. Is supporting Trash useful for DistCp users running with -delete?
        Chris Douglas made changes -
        Attachment M1305-1.patch [ 12435417 ]
        Hide
        Koji Noguchi added a comment -

        Is supporting Trash useful for DistCp users running with -delete?

        To me, yes.
        I've seen many of our users deleting their files accidentally.
        Trash has saved us great time.

        I'd like to request the Trash part to stay if there's not much performance problem.

        Show
        Koji Noguchi added a comment - Is supporting Trash useful for DistCp users running with -delete? To me, yes. I've seen many of our users deleting their files accidentally. Trash has saved us great time. I'd like to request the Trash part to stay if there's not much performance problem.
        Hide
        Chris Douglas added a comment -

        nod This version uses Trash directly, avoiding FsShell but keeping existing behavior

        Show
        Chris Douglas added a comment - nod This version uses Trash directly, avoiding FsShell but keeping existing behavior
        Chris Douglas made changes -
        Attachment M1305-2.patch [ 12435423 ]
        Chris Douglas made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Chris Douglas made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Chris Douglas made changes -
        Summary Massive performance problem with DistCp and -delete Running distcp with -delete incurs avoidable penalties
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12435423/M1305-2.patch
        against trunk revision 908321.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12435423/M1305-2.patch against trunk revision 908321. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/441/console This message is automatically generated.
        Hide
        Peter Romianowski added a comment -

        Thanks Chris for remove calls to FsShell. I've been very busy lately so I did not manage to compile the patch.

        Show
        Peter Romianowski added a comment - Thanks Chris for remove calls to FsShell. I've been very busy lately so I did not manage to compile the patch.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

        Show
        Tsz Wo Nicholas Sze added a comment - +1 patch looks good.
        Tsz Wo Nicholas Sze made changes -
        Hadoop Flags [Reviewed]
        Hide
        Chris Douglas added a comment -

        I committed this. Thanks, Peter!

        Show
        Chris Douglas added a comment - I committed this. Thanks, Peter!
        Chris Douglas made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.22.0 [ 12314184 ]
        Resolution Fixed [ 1 ]
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #236 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/236/)
        . Improve efficiency of distcp -delete. Contributed by Peter Romianowski

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #236 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/236/ ) . Improve efficiency of distcp -delete. Contributed by Peter Romianowski
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #234 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/234/)
        . Improve efficiency of distcp -delete. Contributed by Peter Romianowski

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #234 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/234/ ) . Improve efficiency of distcp -delete. Contributed by Peter Romianowski
        Tom White made changes -
        Fix Version/s 0.21.0 [ 12314045 ]
        Fix Version/s 0.22.0 [ 12314184 ]
        Tom White made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Patch Available Patch Available Open Open
        55d 9h 50m 2 Chris Douglas 10/Feb/10 08:35
        Open Open Patch Available Patch Available
        11m 54s 3 Chris Douglas 10/Feb/10 08:35
        Patch Available Patch Available Resolved Resolved
        3d 1h 36m 1 Chris Douglas 13/Feb/10 10:11
        Resolved Resolved Closed Closed
        192d 11h 8m 1 Tom White 24/Aug/10 22:19

          People

          • Assignee:
            Peter Romianowski
            Reporter:
            Peter Romianowski
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development