>This sounds rather ad-hoc. What is the use case?
One use case is backing up a number of directories, say /user1/data, /user2/data, /user3/data, etc., during off-peak hours every day. Each of these directories may contain a large number of files/bytes. If we simply run distcp, it cannot finish copying everything within a single day. Also, since DistCp currently copies files sequentially, files in /user1/data will be copied first, and the other users will be unhappy. If distcp supported a limit option, we could do something like

These commands will be executed every day. Suppose /user1/data contains 5 files as follows

/user1/data/file1 50GB

Then distcp will copy file1 and file2 on the first day. On the second day, since file1 and file2 already exist, distcp will copy file3 and file4. User1 should expect it to take 3 days to finish copying all files.

> In most cases, the total size to be copied can be determined up front, before the copying begins, no?
Yes, you are right that we can pre-compute the lists of files to be copied and impose whatever constraints we want. The new option is meant to automate that pre-computation. DistCp currently computes a list of files before copying. I am planning to change the computation so that the list satisfies the file/size limit constraints.

> What might be better is a mechanism to stop a DistCp job. E.g., one could provide a "stop" file name. When this is non-null, copying will stop as soon as the named file exists. Might that meet the need here?

This is a good idea for stopping a DistCp job nicely. Let me see whether it could solve the backup use case described above.

Okay, sounds like a reasonable use case.
Your initial description sounded like you intended to count the files copied as the job runs, and terminate it when it crosses a limit. That would be tricky, and is perhaps not what you meant anyway. Rather, all we need to do to implement this is to count bytes and files as files are listed in the client, before the job is created. If that's all you mean, then +1, this seems like a fine feature. The implementation would be much cleaner if listStatus accepted a StatusFilter. Then the filter can count bytes and files and stop returning new files once its limit is exceeded. The existing code would hardly change.

How about we add two new options "-filelimit n" and "-sizelimit n" to distcp?
> How about we add two new options "-filelimit n" and "-sizelimit n" to distcp?
Sounds fine to me. This is a useful feature. Hopefully documentation clearly defines what users can expect.
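
To make the expected behaviour concrete, here is a minimal sketch of the "count bytes and files while listing" idea discussed above. It is not the patch: the class LimitedListing, the Entry type, and the rule that selection stops at the first file that would cross either limit are all assumptions made for illustration, and the paths and sizes are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LimitedListing {
    /** Hypothetical stand-in for a file status entry: a path and its length in bytes. */
    public static final class Entry {
        public final String path;
        public final long bytes;

        public Entry(String path, long bytes) {
            this.path = path;
            this.bytes = bytes;
        }
    }

    /**
     * Walks the listing in order and keeps entries until adding the next one
     * would exceed the file limit or the size limit.
     * Assumption: both limits are hard caps and selection stops at the first
     * entry that does not fit.
     */
    public static List<Entry> select(List<Entry> listing, int fileLimit, long sizeLimit) {
        List<Entry> selected = new ArrayList<>();
        long totalBytes = 0;
        for (Entry e : listing) {
            boolean tooManyFiles = selected.size() + 1 > fileLimit;
            boolean tooManyBytes = totalBytes + e.bytes > sizeLimit;
            if (tooManyFiles || tooManyBytes) {
                break; // stop returning new files once a limit would be exceeded
            }
            selected.add(e);
            totalBytes += e.bytes;
        }
        return selected;
    }

    public static void main(String[] args) {
        // Illustrative paths and sizes only.
        List<Entry> listing = Arrays.asList(
                new Entry("/data/a", 10),
                new Entry("/data/b", 20),
                new Entry("/data/c", 30));
        for (Entry e : select(listing, 2, 100)) {
            System.out.println(e.path + " " + e.bytes);
        }
    }
}
```

Because the counting happens while the list is built, the job itself never needs to be interrupted. Whether a file that only partly fits under the size limit should be skipped or should end the selection is exactly the kind of detail the documentation would need to spell out.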
One question: extending the example in the 2nd comment above, what happens if /user1/data/file1 is deleted on the source before the second day? Will it be deleted on the destination? If yes, maybe an option like "-sync" would make this clearer to the user (of course "-sizelimit" etc. would still apply). In the long term, once we can preserve modification times and other metadata while copying, it might be better to add an "rsync" mode to distcp.

> One question: extending the example in the 2nd comment above, what happens if /user1/data/file1 is deleted on the source before the second day? Will it be deleted on the destination? If yes, maybe an option like "-sync" would make this clearer to the user (of course "-sizelimit" etc. would still apply).
+1 "-sync" is probably our next step.

Is that a "yes" for deleting the file? Thanks.
I don't plan to implement "-sync" or file deletion in this issue. So the answer to your question is: "no, the file will remain in the destination." I will file another issue for that. Sorry for not being clear.
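
The day-by-day behaviour described in this thread can be sketched as a small simulation. This is only an illustration under stated assumptions: DailyBackupSketch and runOnce are hypothetical names, files are modelled as plain strings, files already present at the destination are assumed not to count toward the limit, and nothing is ever deleted from the destination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DailyBackupSketch {
    /**
     * One simulated run: copy up to fileLimit source files that are not
     * already at the destination. Existing files are skipped and do not
     * count toward the limit (assumption). The destination is only added to.
     */
    public static List<String> runOnce(List<String> source, Set<String> destination, int fileLimit) {
        List<String> copied = new ArrayList<>();
        for (String name : source) {
            if (copied.size() >= fileLimit) {
                break;
            }
            if (destination.contains(name)) {
                continue; // already copied on an earlier day
            }
            copied.add(name);
        }
        destination.addAll(copied);
        return copied;
    }

    public static void main(String[] args) {
        List<String> source = new ArrayList<>(
                Arrays.asList("file1", "file2", "file3", "file4", "file5"));
        Set<String> destination = new LinkedHashSet<>();

        System.out.println("day 1: " + runOnce(source, destination, 2));
        source.remove("file1"); // file1 is deleted on the source before day 2
        System.out.println("day 2: " + runOnce(source, destination, 2));
        System.out.println("day 3: " + runOnce(source, destination, 2));
        System.out.println("destination: " + destination);
    }
}
```

With a limit of two files per run, five files take three runs, and file1 is still present at the destination after it disappears from the source, which matches the "no deletion" answer above.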
3873_20080808b.patch: this is a first patch supporting the new "-filelimit" and "-sizelimit" options. The shell messages still need rewriting, and new tests still need to be added.
3873_20080811b.patch: this is a complete patch
Passed all tests locally; trying Hudson.
Created
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12388005/3873_20080811b.patch
against trunk revision 685425.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated 1 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3053/testReport/

This message is automatically generated.

The javadoc warnings have nothing to do with the patch. Before I did "svn update" today, there were no javadoc warnings. See

The failed tests are TestMapRed and TestMiniMRDFSSort, but they also fail on trunk (they did not fail before I did "svn update"). See

+1 looks good.
I just committed this. Thanks, Nicholas!

Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )
Integrated in Hadoop-trunk #622 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/622/ )
3873_20080811b_0.18.patch: for 0.18 (this won't be committed.)
In most cases, the total size to be copied can be determined up front, before the copying begins, no?
What might be better is a mechanism to stop a DistCp job. E.g., one could provide a "stop" file name. When this is non-null, copying will stop as soon as the named file exists. Might that meet the need here?