
Key: HADOOP-3939
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Tsz Wo (Nicholas), SZE
Reporter: Tsz Wo (Nicholas), SZE
Votes: 0
Watchers: 2
Hadoop Common

DistCp should support an option for deleting non-existing files.

Created: 12/Aug/08 07:58 PM   Updated: 08/Jul/09 04:51 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
File | Date | Author | Size
3939_20080825.patch | 2008-08-25 10:20 PM | Tsz Wo (Nicholas), SZE | 6 kB
3939_20080825b.patch | 2008-08-26 03:06 AM | Tsz Wo (Nicholas), SZE | 10 kB
3939_20080826.patch | 2008-08-26 07:46 PM | Tsz Wo (Nicholas), SZE | 10 kB
3939_20080828.patch | 2008-08-29 12:18 AM | Tsz Wo (Nicholas), SZE | 11 kB
3939_20080829.patch | 2008-08-29 06:13 PM | Tsz Wo (Nicholas), SZE | 11 kB
3939_20080829b.patch | 2008-08-29 10:38 PM | Tsz Wo (Nicholas), SZE | 11 kB
3939_20080829b_0.18+3873_20080811b_0.18.patch | 2009-01-13 12:02 AM | Tsz Wo (Nicholas), SZE | 39 kB
Issue Links:
Dependants
 
Reference
 

Hadoop Flags: Reviewed
Release Note: Added a new option -delete to DistCp so that files/directories that exist in dst but not in src are deleted. It uses FsShell to perform the delete, so it will use the trash if the trash is enabled.
Resolution Date: 01/Sep/08 08:45 PM


Description
One use case of DistCp is to sync two directories. Currently, DistCp has an -update option for overwriting dst files if src is different from dst. However, this is not enough for a sync: if some files exist in dst but not in src, there is no easy way to delete them. We should add a new option, say -delete, so that DistCp will delete the files in dst that do not exist in src.
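
To make the intended semantics concrete, here is a minimal sketch of the set difference that -delete would have to compute. It is an illustration only, not the code in the attached patches: the class name SyncDeleteSketch, the method pathsToDelete, and the use of plain relative-path strings are all assumptions made for this example.

```java
import java.util.Set;
import java.util.TreeSet;

final class SyncDeleteSketch {
    /**
     * Returns the relative paths that exist in dst but not in src,
     * in sorted order. These are the candidates for deletion.
     * (Hypothetical helper, not part of DistCp.)
     */
    static Set<String> pathsToDelete(Set<String> srcRelative, Set<String> dstRelative) {
        Set<String> extra = new TreeSet<>(dstRelative);
        extra.removeAll(srcRelative);
        return extra;
    }

    public static void main(String[] args) {
        Set<String> src = Set.of("part-00000");
        Set<String> dst = Set.of("part-00000", "part-00001", "part-00002");
        // prints [part-00001, part-00002]
        System.out.println(pathsToDelete(src, dst));
    }
}
```

Files present on both sides are left to -update or -overwrite; only the extra entries in dst are removed.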

Koji Noguchi added a comment - 12/Aug/08 09:15 PM
I can see users mis-using this feature and deleting some of their important files.
Can we use Trash if it's enabled ?

Tsz Wo (Nicholas), SZE added a comment - 12/Aug/08 10:05 PM
> Can we use Trash if it's enabled ?

+1 I think this is a good idea. It can be done by reusing the code in FsShell.


Tsz Wo (Nicholas), SZE added a comment - 25/Aug/08 10:20 PM
3939_20080825.patch: first version. Needs some tests.

Tsz Wo (Nicholas), SZE added a comment - 26/Aug/08 03:06 AM
3939_20080825b.patch: fixed some bugs.

Tsz Wo (Nicholas), SZE added a comment - 26/Aug/08 07:46 PM
3939_20080826.patch: added a test.

Tsz Wo (Nicholas), SZE added a comment - 26/Aug/08 08:23 PM
Passed test-patch and all tests locally. Submitting ...

Hadoop QA added a comment - 27/Aug/08 07:54 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12388941/3939_20080826.patch
against trunk revision 689363.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3117/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3117/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3117/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3117/console

This message is automatically generated.


Tsz Wo (Nicholas), SZE added a comment - 27/Aug/08 03:54 PM
3939_20080826.patch only changed DistCp and fixed a bug in FileStatus.hashCode(). The failed unit tests are not related.

Chris Douglas added a comment - 28/Aug/08 09:07 PM
  • Would it make sense to require either -update or -overwrite if -delete is specified? Without either of these options, the semantics are a little confusing. For example:
    • In this case, the destination doesn't exist. Everything that isn't the source is deleted, which seems reasonable.
      $ bin/hadoop fs -ls a b
      Found 2 items
      -rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
      Found 4 items
      -rw-r--r--   1 someuser somegroup  105177784 2008-08-28 11:46 /user/someuser/b/part-00000
      -rw-r--r--   1 someuser somegroup  105177884 2008-08-28 11:46 /user/someuser/b/part-00001
      -rw-r--r--   1 someuser somegroup  105177754 2008-08-28 11:46 /user/someuser/b/part-00002
      $ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
      08/08/28 11:51:18 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
      08/08/28 11:51:18 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
      Deleted hdfs://host/user/someuser/b/part-00000
      Deleted hdfs://host/user/someuser/b/part-00001
      Deleted hdfs://host/user/someuser/b/part-00002
      [snip]
      $ bin/hadoop fs -ls a b
      Found 2 items
      -rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
      Found 2 items
      drwxr-xr-x   - someuser somegroup          0 2008-08-28 11:51 /user/someuser/b/a
      
    • Here, the destination does exist, but it is deleted anyway, as though -overwrite were specified.
      $ bin/hadoop fs -lsr a b
      -rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
      -rw-r--r--   1 someuser somegroup  105177784 2008-08-28 11:51 /user/someuser/b/part-00000
      -rw-r--r--   1 someuser somegroup  105177884 2008-08-28 11:51 /user/someuser/b/part-00001
      -rw-r--r--   1 someuser somegroup  105177754 2008-08-28 11:51 /user/someuser/b/part-00002
      drwxr-xr-x   - someuser somegroup          0 2008-08-28 13:34 /user/someuser/b/a
      -rw-r--r--   1 someuser somegroup  105177784 2008-08-28 13:34 /user/someuser/b/a/part-00000
      $ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
      08/08/28 13:35:14 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
      08/08/28 13:35:14 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
      Deleted hdfs://host:8020/user/someuser/b/part-00000
      Deleted hdfs://host:8020/user/someuser/b/part-00001
      Deleted hdfs://host:8020/user/someuser/b/part-00002
      Deleted hdfs://host:8020/user/someuser/b/a
      [snip]
      $ bin/hadoop fs -lsr a b
      -rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
      drwxr-xr-x   - someuser somegroup          0 2008-08-28 13:35 /user/someuser/b/a
      -rw-r--r--   1 someuser somegroup      92934 2008-08-28 13:35 /user/someuser/b/a/part-00000
      

Adding this dependency would also help prevent casual errors and potentially serious mistakes if the Trash is disabled.

  • It might help to always add a message about FsShell failing, and set the cause rather than:
    +            } catch(Exception e) {
    +              throw e instanceof IOException? (IOException)e: new IOException(e);
    +            }
    
  • When -delete is specified, the client is doing a lot of work to recursively list the destination, then to delete individual files there. In the future it might make sense to leave it to the maps to delete entries, since the source list is sorted. The client (or a reduce) would have to do some work on the boundaries, but it should scale well. The current patch is clearer given distcp's current organization, though.
  • The fix to FileStatus makes sense, but when is the Path null?
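
For the FsShell point above, a sketch of what "add a message and set the cause" could look like. The DeleteAction interface and the deleteOrThrow helper are hypothetical stand-ins for the FsShell call, and the message wording is an assumption rather than what the patch uses.

```java
import java.io.IOException;

final class DeleteFailureSketch {
    /** Hypothetical stand-in for the FsShell delete invocation. */
    interface DeleteAction {
        void run() throws Exception;
    }

    /**
     * Runs the delete and, on any failure, always throws an IOException
     * that names the path and keeps the original exception as its cause.
     */
    static void deleteOrThrow(String path, DeleteAction action) throws IOException {
        try {
            action.run();
        } catch (Exception e) {
            throw new IOException("FsShell failed to delete " + path, e);
        }
    }
}
```

Unlike the instanceof check, this wraps an IOException from the shell as well, so the caller always sees which path failed.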

Tsz Wo (Nicholas), SZE added a comment - 29/Aug/08 12:06 AM
> Would it make sense to require either -update or -overwrite if -delete is specified?

We should enforce that.
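
A rough sketch of such a check, assuming the options are available as a list of argument strings. The class name, method name, and error message are placeholders for illustration and do not necessarily match the patch.

```java
import java.util.List;

final class DeleteOptionCheckSketch {
    /**
     * Rejects -delete unless -update or -overwrite is also given.
     * (Hypothetical helper; DistCp's real argument parsing is not shown here.)
     */
    static void checkDeleteOption(List<String> args) {
        boolean delete = args.contains("-delete");
        boolean updateOrOverwrite =
            args.contains("-update") || args.contains("-overwrite");
        if (delete && !updateOrOverwrite) {
            throw new IllegalArgumentException(
                "-delete must be specified with -update or -overwrite");
        }
    }
}
```

With this in place, both of the example invocations in Chris's comment would be rejected before anything is listed or deleted.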

> The fix to FileStatus makes sense, but when is the Path null?

I hit this when creating a FileStatus with the default constructor and then putting it in some data structure (I forgot which one). The current implementation does not need this operation, so I will revert this change.


Tsz Wo (Nicholas), SZE added a comment - 29/Aug/08 12:18 AM
3939_20080828.patch: incorporated all comments from Chris.

Chris Douglas added a comment - 29/Aug/08 12:53 AM
+1

Tsz Wo (Nicholas), SZE added a comment - 29/Aug/08 01:49 AM
submit again.

Hadoop QA added a comment - 29/Aug/08 06:42 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389133/3939_20080828.patch
against trunk revision 690096.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3141/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3141/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3141/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3141/console

This message is automatically generated.


Tsz Wo (Nicholas), SZE added a comment - 29/Aug/08 06:13 PM
3939_20080829.patch: fixed a bug for path checking.

Tsz Wo (Nicholas), SZE added a comment - 29/Aug/08 10:38 PM
3939_20080829b.patch: updated the new unit test.

Tsz Wo (Nicholas), SZE added a comment - 29/Aug/08 11:02 PM
Tested locally. 3939_20080829b.patch is ready to be committed.

Hadoop QA added a comment - 01/Sep/08 07:08 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389205/3939_20080829b.patch
against trunk revision 690641.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3150/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3150/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3150/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3150/console

This message is automatically generated.


Chris Douglas added a comment - 01/Sep/08 08:45 PM
I just committed this. Thanks, Nicholas.

Hudson added a comment - 02/Sep/08 01:02 PM

Hudson added a comment - 03/Oct/08 02:31 PM

Tsz Wo (Nicholas), SZE added a comment - 13/Jan/09 12:02 AM
3939_20080829b_0.18+3873_20080811b_0.18.patch: for 0.18. It also includes HADOOP-3873. This patch won't be committed.