Key: HADOOP-3873
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Tsz Wo (Nicholas), SZE
Reporter: Tsz Wo (Nicholas), SZE
Votes: 0
Watchers: 2
Hadoop Common

DistCp should have an option for limiting the number of files/bytes being copied

Created: 30/Jul/08 08:45 PM   Updated: 08/Jul/09 04:51 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
3873_20080808b.patch (14 kB), 2008-08-09 01:15 AM, Tsz Wo (Nicholas), SZE, licensed for inclusion in ASF works
3873_20080811b.patch (31 kB), 2008-08-11 10:44 PM, Tsz Wo (Nicholas), SZE, licensed for inclusion in ASF works
3873_20080811b_0.18.patch (31 kB), 2009-01-09 09:59 PM, Tsz Wo (Nicholas), SZE, licensed for inclusion in ASF works
Issue Links:
Dependants
 
Reference
 

Hadoop Flags: Reviewed
Release Note: Added two new options -filelimit <n> and -sizelimit <n> to DistCp for limiting the total number of files and the total size in bytes, respectively.
Resolution Date: 13/Aug/08 11:35 PM


Description
A single DistCp command may copy a huge number of files/bytes. In such a case, DistCp will run for a long time and there is no way to stop it cleanly. It would be good if DistCp had an option to limit the number of files/bytes being copied. Once the limit is reached, DistCp will terminate and return success. All copied files are guaranteed to be complete; no file is left partially copied.

Doug Cutting added a comment - 30/Jul/08 09:13 PM
This sounds rather ad-hoc. What is the use case?

In most cases, the total size to be copied can be determined up front, before the copying begins, no?

What might be better is a mechanism to stop a DistCp job. E.g., one could provide a "stop" file name. When this is non-null, copying will stop as soon as the named file exists. Might that meet the need here?


Tsz Wo (Nicholas), SZE added a comment - 30/Jul/08 10:31 PM
>This sounds rather ad-hoc. What is the use case?

One use case is backing up a number of directories, say /user1/data, /user2/data, /user3/data, etc., during off-peak hours every day. Each of these directories may contain a large number of files/bytes. If we simply run distcp, it cannot finish copying everything within a single day.

Also, since DistCp currently copies files sequentially, files in /user1/data will be copied first. The other users will be unhappy.

If distcp supported a limit option, we could do something like
distcp /user1/data limit 100GB, 1000000 files
distcp /user2/data limit 100GB, 1000000 files
...

These commands would be executed every day. Suppose /user1/data contains 5 files as follows:

/user1/data/file1 50GB
/user1/data/file2 50GB
/user1/data/file3 50GB
/user1/data/file4 50GB
/user1/data/file5 50GB

Then distcp will copy file1 and file2 on the first day. On the second day, since file1 and file2 already exist at the destination, distcp will copy file3 and file4. User1 can expect it to take 3 days to finish copying all files.
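
The day-by-day behavior in this example can be sketched as a small simulation. This is only an illustration, not DistCp's code: `plan_copy` is a hypothetical helper, and it assumes files are taken in listing order and that selection stops at the first file that would push a run past either limit.

```python
GB = 1024 ** 3

def plan_copy(source, dest, size_limit, file_limit):
    """Pick files, in listing order, that are missing at the destination,
    stopping before either limit would be exceeded (assumed semantics)."""
    selected, total = [], 0
    for name, size in source:
        if name in dest:
            continue  # already copied by an earlier run
        if len(selected) + 1 > file_limit or total + size > size_limit:
            break
        selected.append(name)
        total += size
    return selected

# The five 50GB files from the example above.
source = [(f"/user1/data/file{i}", 50 * GB) for i in range(1, 6)]
dest = set()
days = []
while len(dest) < len(source):
    batch = plan_copy(source, dest, size_limit=100 * GB, file_limit=1_000_000)
    if not batch:
        break  # nothing fits under the limits; avoid looping forever
    dest.update(batch)
    days.append(batch)

for day, batch in enumerate(days, start=1):
    print(f"day {day}: {batch}")
```

With a 100GB limit per run, the simulation needs three runs: two files on each of the first two days and the last file on the third.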


Tsz Wo (Nicholas), SZE added a comment - 30/Jul/08 10:36 PM
> In most cases, the total size to be copied can be determined up front, before the copying begins, no?

Yes, you are right that we can pre-compute the list of files to be copied and impose whatever constraints we like. The new option automates that pre-computation. DistCp currently computes a list of files before copying. I am planning to change the computation so that the list satisfies the file/size limit constraints.

> What might be better is a mechanism to stop a DistCp job. E.g., one could provide a "stop" file name. When this is non-null, copying will stop as soon as the named file exists. Might that meet the need here?

This is a good way to stop a DistCp job nicely. Let me see whether it could solve the backup use case described above.


Doug Cutting added a comment - 30/Jul/08 10:59 PM
Okay, sounds like a reasonable use case.

Your initial description sounded like you intended to count the files copied as the job runs, and terminate it when it crosses a limit. That would be tricky, and is perhaps not what you meant anyway. Rather, all we need to do to implement this is to count bytes and files as files are listed in the client before the job is created. If that's all you mean, then +1, this seems like a fine feature.

The implementation would be much cleaner if listStatus accepted a StatusFilter. Then the filter can count bytes and files and stop returning new files once its limit is exceeded. The existing code would hardly change.
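
A counting filter of the kind suggested here might look like the sketch below. It is written in Python for brevity rather than against Hadoop's Java API; `LimitFilter` and `accept` are hypothetical names, and rejecting the file that would cross a limit (rather than accepting it as the last one) is an assumption.

```python
class LimitFilter:
    """Stateful filter: counts accepted files and bytes, and rejects
    everything once a limit would be exceeded."""

    def __init__(self, file_limit, size_limit):
        self.file_limit = file_limit
        self.size_limit = size_limit
        self.files = 0
        self.bytes = 0
        self.exhausted = False

    def accept(self, size):
        if self.exhausted:
            return False
        over_files = self.files + 1 > self.file_limit
        over_bytes = self.bytes + size > self.size_limit
        if over_files or over_bytes:
            # Once a limit is hit, stop returning any further files.
            self.exhausted = True
            return False
        self.files += 1
        self.bytes += size
        return True

# File sizes as they would be seen while listing the source.
sizes = [40, 30, 50, 10]
limit_filter = LimitFilter(file_limit=10, size_limit=100)
accepted = [s for s in sizes if limit_filter.accept(s)]
print(accepted, limit_filter.files, limit_filter.bytes)
```

Because the filter holds the running totals, the listing code only needs to ask `accept` for each file, which matches the point that the existing code would hardly change.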


Tsz Wo (Nicholas), SZE added a comment - 08/Aug/08 06:28 PM
How about we add two new options "-filelimit n" and "-sizelimit n" to distcp?

Doug Cutting added a comment - 08/Aug/08 07:38 PM
> How about we add two new options "-filelimit n" and "-sizelimit n" to distcp?

Sounds fine to me.


Raghu Angadi added a comment - 08/Aug/08 07:57 PM
This is a useful feature. Hopefully documentation clearly defines what users can expect.

One question: extending the example in the 2nd comment above, what happens if /user1/data/file1 is deleted on the source before the second day? Will it be deleted on the destination? If yes, maybe some option like "-sync" will make it clearer to the user (of course "-sizelimit" etc. still apply).

In the long term, once we can preserve modification times and other metadata while copying, it might be better to add an "rsync" mode to distcp.


Tsz Wo (Nicholas), SZE added a comment - 08/Aug/08 08:37 PM
> One question: extending the example in the 2nd comment above, what happens if /user1/data/file1 is deleted on the source before the second day? Will it be deleted on the destination? If yes, maybe some option like "-sync" will make it clearer to the user (of course "-sizelimit" etc. still apply).

+1 "-sync" is probably our next step.


Raghu Angadi added a comment - 08/Aug/08 08:50 PM
Is that a "yes" for deleting the file? Thanks.

Tsz Wo (Nicholas), SZE added a comment - 08/Aug/08 08:56 PM
I don't plan to implement "-sync" or file deletion in this issue. So the answer to your question is: "no, the file will remain at the destination." I will file another issue for that. Sorry for not being clear.

Tsz Wo (Nicholas), SZE added a comment - 09/Aug/08 01:15 AM
3873_20080808b.patch: this is a first patch supporting the new "-filelimit" and "-sizelimit" options. The shell messages still need rewriting and new tests are needed.

Tsz Wo (Nicholas), SZE added a comment - 11/Aug/08 10:44 PM
3873_20080811b.patch: this is a complete patch
  • -filelimit <n> and -sizelimit <n> support symbolic representation. For example:
    1230k = 1230 * 1024 = 1259520
    891g = 891 * 1024^3 = 956703965184
  • Compares file sizes during setup
  • Rewrote shell messages
  • Added a few tests
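
The symbolic representation above can be illustrated with a short parser. This is a sketch, not the code in the patch: `parse_limit` is a hypothetical name, only the `k` and `g` suffixes are confirmed by the examples in this comment, and the `m` and `t` entries and the case-insensitive matching are assumptions.

```python
# Binary multipliers. Only "k" and "g" appear in the comment above;
# "m" and "t" are assumed by analogy.
SUFFIXES = {
    "k": 1024,
    "m": 1024 ** 2,
    "g": 1024 ** 3,
    "t": 1024 ** 4,
}

def parse_limit(text):
    """Convert a value such as '1230k' or '891g' to a plain integer."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)

print(parse_limit("1230k"))  # 1259520
print(parse_limit("891g"))   # 956703965184
```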

Tsz Wo (Nicholas), SZE added a comment - 12/Aug/08 12:09 AM
Passed all tests locally; trying Hudson.

Tsz Wo (Nicholas), SZE added a comment - 12/Aug/08 07:59 PM
Created HADOOP-3939 for the -sync issue mentioned before.

Hadoop QA added a comment - 13/Aug/08 09:51 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12388005/3873_20080811b.patch
against trunk revision 685425.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3053/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3053/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3053/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3053/console

This message is automatically generated.


Tsz Wo (Nicholas), SZE added a comment - 13/Aug/08 06:23 PM
The javadoc warnings have nothing to do with the patch. Before I did "svn update" today, there were no javadoc warnings. See HADOOP-3949

The failed tests are TestMapRed and TestMiniMRDFSSort, but they also fail on trunk (they did not fail before I did "svn update"). See HADOOP-3950


Chris Douglas added a comment - 13/Aug/08 11:35 PM
+1 looks good.

I just committed this. Thanks, Nicholas


Hudson added a comment - 22/Aug/08 12:34 PM

Hudson added a comment - 03/Oct/08 02:31 PM

Tsz Wo (Nicholas), SZE added a comment - 09/Jan/09 09:59 PM
3873_20080811b_0.18.patch: for 0.18 (this won't be committed.)