Hadoop Common / HADOOP-3939

DistCp should support an option for deleting non-existing files.

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Added a new option -delete to DistCp so that files/directories that exist in dst but not in src will be deleted. It uses FsShell to do the delete, so it will use the trash if the trash is enabled.

      Description

      One use case of DistCp is to sync two directories. Currently, DistCp has an -update option for overwriting dst files if src is different from dst. However, this is not enough for a sync: if there are files in dst that do not exist in src, there is no easy way to delete them. We should add a new option, say -delete, so that DistCp will delete the files in dst that do not exist in src.
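
      A minimal sketch of the sync semantics described above, assuming the src and dst listings are already available as sets of paths relative to their roots. The names SyncDiff and pathsToDelete are hypothetical and are not part of DistCp; the patch may compute the difference in another way.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper, not part of DistCp: shows what a -delete option would remove. */
final class SyncDiff {
  private SyncDiff() {}

  /**
   * Returns the relative paths that are present in dst but absent from src,
   * in sorted order. These are the entries a sync would delete from dst.
   */
  static SortedSet<String> pathsToDelete(Set<String> srcRelPaths, Set<String> dstRelPaths) {
    SortedSet<String> extra = new TreeSet<>(dstRelPaths);
    extra.removeAll(srcRelPaths);
    return extra;
  }

  public static void main(String[] args) {
    Set<String> src = new HashSet<>(Arrays.asList("part-00000"));
    Set<String> dst = new HashSet<>(Arrays.asList("part-00000", "part-00001", "part-00002"));
    // prints [part-00001, part-00002]
    System.out.println(pathsToDelete(src, dst));
  }
}
```

      Files that exist on both sides are left to -update or -overwrite; only the extra dst entries are candidates for deletion.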

      1. 3939_20080825.patch
        6 kB
        Tsz Wo Nicholas Sze
      2. 3939_20080825b.patch
        10 kB
        Tsz Wo Nicholas Sze
      3. 3939_20080826.patch
        10 kB
        Tsz Wo Nicholas Sze
      4. 3939_20080828.patch
        11 kB
        Tsz Wo Nicholas Sze
      5. 3939_20080829.patch
        11 kB
        Tsz Wo Nicholas Sze
      6. 3939_20080829b_0.18+3873_20080811b_0.18.patch
        39 kB
        Tsz Wo Nicholas Sze
      7. 3939_20080829b.patch
        11 kB
        Tsz Wo Nicholas Sze

          Activity

          Koji Noguchi added a comment -

          I can see users misusing this feature and deleting some of their important files.
          Can we use Trash if it's enabled?

          Tsz Wo Nicholas Sze added a comment -

          > Can we use Trash if it's enabled?

          +1 I think this is a good idea. It can be done by reusing the code in FsShell.
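
          The trash behaviour discussed here can be illustrated with a local-filesystem analogy. This is only a sketch under stated assumptions: it uses java.nio.file rather than Hadoop's FsShell or Trash classes, and the names TrashAwareDelete and remove are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical local-filesystem analogy; this is not Hadoop's FsShell or Trash API. */
final class TrashAwareDelete {
  private TrashAwareDelete() {}

  /**
   * Removes a file. When trashDir is non-null (trash "enabled"), the file is
   * moved there so it can still be recovered; otherwise it is deleted for good.
   * Returns the file's new location in the trash, or null if it was deleted.
   */
  static Path remove(Path file, Path trashDir) throws IOException {
    if (trashDir == null) {
      Files.delete(file);
      return null;
    }
    Files.createDirectories(trashDir);
    Path target = trashDir.resolve(file.getFileName());
    return Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
  }
}
```

          With trash enabled, a mistaken -delete only relocates files, which is the safety net asked for in the previous comment.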

          Tsz Wo Nicholas Sze added a comment -

          3939_20080825.patch: first version. Needs some tests.

          Tsz Wo Nicholas Sze added a comment -

          3939_20080825b.patch: fixed some bugs.

          Tsz Wo Nicholas Sze added a comment -

          3939_20080826.patch: added a test.

          Tsz Wo Nicholas Sze added a comment -

          Passed test-patch and all tests locally. Submitting ...

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12388941/3939_20080826.patch
          against trunk revision 689363.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3117/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3117/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3117/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3117/console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          3939_20080826.patch only changed DistCp and fixed a bug in FileStatus.hashCode(). The failed unit tests are not related.

          Chris Douglas added a comment -
          • Would it make sense to require either -update or -overwrite if -delete is specified? Without either of these options, the semantics are a little confusing. For example:
            • In this case, the destination doesn't exist. Everything that isn't the source is deleted, which seems reasonable.
              $ bin/hadoop fs -ls a b
              Found 2 items
              -rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
              Found 4 items
              -rw-r--r--   1 someuser somegroup  105177784 2008-08-28 11:46 /user/someuser/b/part-00000
              -rw-r--r--   1 someuser somegroup  105177884 2008-08-28 11:46 /user/someuser/b/part-00001
              -rw-r--r--   1 someuser somegroup  105177754 2008-08-28 11:46 /user/someuser/b/part-00002
              $ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
              08/08/28 11:51:18 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
              08/08/28 11:51:18 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
              Deleted hdfs://host/user/someuser/b/part-00000
              Deleted hdfs://host/user/someuser/b/part-00001
              Deleted hdfs://host/user/someuser/b/part-00002
              [snip]
              $ bin/hadoop fs -ls a b
              Found 2 items
              -rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
              Found 2 items
              drwxr-xr-x   - someuser somegroup          0 2008-08-28 11:51 /user/someuser/b/a
              
            • Here, the destination does exist, but it is deleted anyway, as though -overwrite were specified.
              $ bin/hadoop fs -lsr a b
              -rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
              -rw-r--r--   1 someuser somegroup  105177784 2008-08-28 11:51 /user/someuser/b/part-00000
              -rw-r--r--   1 someuser somegroup  105177884 2008-08-28 11:51 /user/someuser/b/part-00001
              -rw-r--r--   1 someuser somegroup  105177754 2008-08-28 11:51 /user/someuser/b/part-00002
              drwxr-xr-x   - someuser somegroup          0 2008-08-28 13:34 /user/someuser/b/a
              -rw-r--r--   1 someuser somegroup  105177784 2008-08-28 13:34 /user/someuser/b/a/part-00000
              $ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
              08/08/28 13:35:14 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
              08/08/28 13:35:14 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
              Deleted hdfs://host:8020/user/someuser/b/part-00000
              Deleted hdfs://host:8020/user/someuser/b/part-00001
              Deleted hdfs://host:8020/user/someuser/b/part-00002
              Deleted hdfs://host:8020/user/someuser/b/a
              [snip]
              $ bin/hadoop fs -lsr a b
              -rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
              drwxr-xr-x   - someuser somegroup          0 2008-08-28 13:35 /user/someuser/b/a
              -rw-r--r--   1 someuser somegroup      92934 2008-08-28 13:35 /user/someuser/b/a/part-00000
              

          Adding this dependency would also help prevent casual errors and potentially serious mistakes if the Trash is disabled.

          • It might help to always add a message about FsShell failing, and set the cause rather than:
            +            } catch(Exception e) {
            +              throw e instanceof IOException? (IOException)e: new IOException(e);
            +            }
            
          • When -delete is specified, the client is doing a lot of work to recursively list the destination, then to delete individual files there. In the future it might make sense to leave it to the maps to delete entries, since the source list is sorted. The client (or a reduce) would have to do some work on the boundaries, but it should scale well. The current patch is clearer given distcp's current organization, though.
          • The fix to FileStatus makes sense, but when is the Path null?
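
          One way to read the suggestion above about FsShell failures: always wrap the failure in an IOException that carries a message and keeps the original exception as its cause. This is a sketch only; the class name, method name, and message text are hypothetical, and the committed patch may word this differently.

```java
import java.io.IOException;

/** Hypothetical helper illustrating "add a message and set the cause". */
final class DeleteFailures {
  private DeleteFailures() {}

  /** Wraps any failure from the delete step, preserving it as the cause. */
  static IOException wrap(String path, Exception e) {
    // IOException(String, Throwable) is available since Java 6.
    return new IOException("FsShell failed to delete " + path, e);
  }

  public static void main(String[] args) {
    IOException wrapped =
        wrap("/user/someuser/b/part-00001", new IllegalStateException("boom"));
    System.out.println(wrapped.getMessage());
    System.out.println(wrapped.getCause());
  }
}
```

          Compared with rethrowing an IOException unchanged, the caller always sees which step failed, and the original stack trace stays reachable through getCause().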
          Tsz Wo Nicholas Sze added a comment -

          > Would it make sense to require either -update or -overwrite if -delete is specified?

          We should enforce that.

          > The fix to FileStatus makes sense, but when is the Path null?

          I hit this when creating a FileStatus with the default constructor and then putting it in some data structure (I forget which one). The current implementation does not need this operation, so I will revert this change.
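
          The -delete requirement agreed on above could be enforced with a check along these lines. This is a sketch under assumptions: DeleteOptionCheck and validate are hypothetical names, the error text is made up, and DistCp's real argument parser is organised differently.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical argument check, not DistCp's actual parser. */
final class DeleteOptionCheck {
  private DeleteOptionCheck() {}

  /** Throws if -delete is given without -update or -overwrite. */
  static void validate(List<String> args) {
    boolean delete = args.contains("-delete");
    boolean updateOrOverwrite = args.contains("-update") || args.contains("-overwrite");
    if (delete && !updateOrOverwrite) {
      throw new IllegalArgumentException(
          "-delete must be specified with -update or -overwrite");
    }
  }

  public static void main(String[] args) {
    validate(Arrays.asList("-update", "-delete", "src", "dst")); // accepted
    try {
      validate(Arrays.asList("-delete", "src", "dst"));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

          Rejecting a bare -delete up front means nothing is removed from dst unless the user has also stated how existing files should be treated.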

          Tsz Wo Nicholas Sze added a comment -

          3939_20080828.patch: incorporated all comments from Chris.

          Chris Douglas added a comment -

          +1

          Tsz Wo Nicholas Sze added a comment -

          submit again.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389133/3939_20080828.patch
          against trunk revision 690096.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3141/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3141/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3141/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3141/console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          3939_20080829.patch: fixed a bug for path checking.

          Tsz Wo Nicholas Sze added a comment -

          3939_20080829b.patch: updated the new unit test.

          Tsz Wo Nicholas Sze added a comment -

          Tested locally. 3939_20080829b.patch is ready to be committed.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389205/3939_20080829b.patch
          against trunk revision 690641.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3150/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3150/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3150/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3150/console

          This message is automatically generated.

          Chris Douglas added a comment -

          I just committed this. Thanks Nicholas

          Hudson added a comment -

          Integrated in Hadoop-trunk #590 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/590/)

          Hudson added a comment -

          Integrated in Hadoop-trunk #622 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/622/)

          Tsz Wo Nicholas Sze added a comment -

          3939_20080829b_0.18+3873_20080811b_0.18.patch: for 0.18. It also includes HADOOP-3873. This patch won't be committed.

          Benoit Sigoure added a comment -

          Oops, sorry I meant to edit HBASE-3939.


            People

            • Assignee:
              Tsz Wo Nicholas Sze
              Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0
              Watchers:
              2
