Hadoop Common / HADOOP-3873

DistCp should have an option for limiting the number of files/bytes being copied

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Added two new options -filelimit <n> and -sizelimit <n> to DistCp for limiting the total number of files and the total size in bytes, respectively.

      Description

      A single DistCp command may copy a huge number of files/bytes. In such a case, DistCp will run for a long time and there is no way to stop it nicely. It would be good if DistCp had an option to limit the number of files/bytes being copied. Once the limit is reached, DistCp will terminate and return success. All files copied are guaranteed to be good and there are no partially copied files.

      1. 3873_20080811b.patch
        31 kB
        Tsz Wo Nicholas Sze
      2. 3873_20080811b_0.18.patch
        31 kB
        Tsz Wo Nicholas Sze
      3. 3873_20080808b.patch
        14 kB
        Tsz Wo Nicholas Sze

          Activity

          Doug Cutting added a comment -

          This sounds rather ad-hoc. What is the use case?

          In most cases, the total size to be copied can be determined up front, before the copying begins, no?

          What might be better is a mechanism to stop a DistCp job. E.g., one could provide a "stop" file name. When this is non-null, copying will stop as soon as the named file exists. Might that meet the need here?

          Tsz Wo Nicholas Sze added a comment -

          >This sounds rather ad-hoc. What is the use case?

          One use case is backing up a number of directories, say /user1/data, /user2/data, /user3/data, etc., during off-peak hours every day. Each of these directories may contain a large number of files/bytes. If we simply run distcp, it cannot finish copying everything within a single day.

          Also, since DistCp currently copies files sequentially, files in /user1/data will be copied first. The other users will be unhappy.

          If distcp supported a limit option, we could do something like
          distcp /user1/data limit 100GB, 1000000 files
          distcp /user2/data limit 100GB, 1000000 files
          ...

          These commands will be executed every day. Suppose /user1/data contains 5 files as follows:

          /user1/data/file1 50GB
          /user1/data/file2 50GB
          /user1/data/file3 50GB
          /user1/data/file4 50GB
          /user1/data/file5 50GB

          Then distcp will copy file1 and file2 on the first day. On the second day, since file1 and file2 already exist at the destination, distcp will copy file3 and file4. User1 should expect it to take 3 days to finish copying all files.
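
The schedule described above can be sketched as a small simulation. This is only an illustrative model, not DistCp's code: the class `LimitedCopyPlanner`, the method `planRun`, and the rule that a file is selected only if it keeps both totals within their limits are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of a file/size-limited copy run; not DistCp's actual code. */
final class LimitedCopyPlanner {
    /**
     * Picks, in listing order, the source files that are missing from the
     * destination, stopping at the first file that would cross either limit.
     */
    static List<String> planRun(Map<String, Long> source, Set<String> destination,
                                long fileLimit, long sizeLimit) {
        List<String> selected = new ArrayList<>();
        long bytes = 0;
        for (Map.Entry<String, Long> e : source.entrySet()) {
            if (destination.contains(e.getKey())) {
                continue; // already copied by an earlier run
            }
            // Assumed rule: a file is taken only if both totals stay within the limits.
            if (selected.size() + 1 > fileLimit || bytes + e.getValue() > sizeLimit) {
                break;
            }
            selected.add(e.getKey());
            bytes += e.getValue();
        }
        return selected;
    }

    public static void main(String[] args) {
        final long gb = 1024L * 1024 * 1024;
        Map<String, Long> source = new LinkedHashMap<>();
        for (int i = 1; i <= 5; i++) {
            source.put("/user1/data/file" + i, 50 * gb);
        }
        Set<String> destination = new LinkedHashSet<>();
        int day = 0;
        while (destination.size() < source.size()) {
            List<String> copied = planRun(source, destination, 1_000_000, 100 * gb);
            if (copied.isEmpty()) {
                break; // nothing fits under the limits; avoid looping forever
            }
            destination.addAll(copied);
            System.out.println("day " + (++day) + ": " + copied);
        }
    }
}
```

With five 50GB files and a 100GB limit, each run selects at most two files, so the loop finishes on the third run, matching the three days mentioned in the comment.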

          Tsz Wo Nicholas Sze added a comment -

          > In most cases, the total size to be copied can be determined up front, before the copying begins, no?

          Yes, you are right that we can pre-compute lists of files being copied and impose whatever constraints. The new option is to automate the pre-computation. DistCp currently computes a list of files before copying. I am planning to change the computation so that the list will satisfy the file/size limit constraints.

          > What might be better is a mechanism to stop a DistCp job. E.g., one could provide a "stop" file name. When this is non-null, copying will stop as soon as the named file exists. Might that meet the need here?

          This is a good idea for stopping a DistCp job nicely. Let me see whether it could solve the backup use case described above.

          Doug Cutting added a comment -

          Okay, sounds like a reasonable use case.

          Your initial description sounded like you intended to count the files copied as the job runs, and terminate it when it crosses a limit. That would be tricky, and is perhaps not what you meant anyway. Rather, all we need to do to implement this is to count bytes and files as files are listed in the client before the job is created. If that's all you mean, then +1, this seems like a fine feature.

          The implementation would be much cleaner if listStatus accepted a StatusFilter. Then the filter can count bytes and files and stop returning new files once its limit is exceeded. The existing code would hardly change.
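
A minimal sketch of the counting-filter idea follows. It is self-contained and does not use Hadoop classes: the `StatusFilter` interface here is a hypothetical stand-in named after the suggestion above, and `Entry`, `LimitFilter`, and `list` are illustrative names, not existing APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a stateful listing filter; none of these types are Hadoop APIs. */
final class LimitingFilterSketch {
    /** Stand-in for a file status: just a path and a length in bytes. */
    static final class Entry {
        final String path;
        final long length;

        Entry(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Assumed shape of the filter that a listing call could accept. */
    interface StatusFilter {
        boolean accept(Entry status);
    }

    /** Counts files and bytes, and rejects everything once a limit would be crossed. */
    static final class LimitFilter implements StatusFilter {
        private final long fileLimit;
        private final long sizeLimit;
        private long files;
        private long bytes;
        private boolean exhausted;

        LimitFilter(long fileLimit, long sizeLimit) {
            this.fileLimit = fileLimit;
            this.sizeLimit = sizeLimit;
        }

        @Override
        public boolean accept(Entry status) {
            if (exhausted || files + 1 > fileLimit || bytes + status.length > sizeLimit) {
                exhausted = true; // stop returning new files from here on
                return false;
            }
            files++;
            bytes += status.length;
            return true;
        }
    }

    /** Applies the filter to a listing, the way a filtered list call might. */
    static List<Entry> list(List<Entry> all, StatusFilter filter) {
        List<Entry> accepted = new ArrayList<>();
        for (Entry e : all) {
            if (filter.accept(e)) {
                accepted.add(e);
            }
        }
        return accepted;
    }
}
```

Keeping the counters inside the filter means the listing code itself does not need to know about limits, which is the point of the suggestion; the trade-off is that the filter is stateful and must not be reused across runs.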

          Tsz Wo Nicholas Sze added a comment -

          How about we add two new options "-filelimit n" and "-sizelimit n" to distcp?

          Doug Cutting added a comment -

          > How about we add two new options "-filelimit n" and "-sizelimit n" to distcp?

          Sounds fine to me.

          Raghu Angadi added a comment -

          This is a useful feature. Hopefully documentation clearly defines what users can expect.

          One question: extending the example in the 2nd comment above, what happens if /user1/data/file1 is deleted on the source before the second day? Will it be deleted on the destination? If yes, maybe some option like "-sync" will make it more clear to the user (of course "-sizelimit" etc. still apply).

          In the long term, once we can preserve modification times and other metadata while copying, it might be better to add "rsync" mode to distcp.

          Tsz Wo Nicholas Sze added a comment -

          > One question: extending the example in the 2nd comment above, what happens if /user1/data/file1 is deleted on the source before the second day? Will it be deleted on the destination? If yes, maybe some option like "-sync" will make it more clear to the user (of course "-sizelimit" etc. still apply).

          +1 "-sync" is probably our next step.

          Raghu Angadi added a comment -

          Is that a "yes" for deleting the file? Thanks.

          Tsz Wo Nicholas Sze added a comment -

          I don't plan to implement "-sync" or file deletion in this issue. So the answer to your question is: "no, the file will remain in the destination." I will file another issue for that. Sorry for not being clear.

          Tsz Wo Nicholas Sze added a comment -

          3873_20080808b.patch: this is a first patch supporting the new "-filelimit" and "-sizelimit" options. The shell messages still need rewriting and new tests need to be added.

          Tsz Wo Nicholas Sze added a comment -

          3873_20080811b.patch: this is a complete patch

          • -filelimit <n> and -sizelimit <n> support symbolic representation. For example,
            1230k = 1230 * 1024 = 1259520
            891g = 891 * 1024^3 = 956703965184
          • Compares file sizes during setup
          • Rewrote shell messages
          • Added a few tests
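
The symbolic representation in the first bullet can be illustrated with a small parser. This is a sketch under assumptions, not the code from the patch: the class and method names are hypothetical, and only the `k` and `g` suffixes appear in the comment above; `m` and `t` are added here by analogy.

```java
import java.util.Locale;

/** Hypothetical parser for values such as "1230k" or "891g"; not the patch's code. */
final class SizeSuffixParser {
    /**
     * Parses a decimal number with an optional binary suffix.
     * k = 1024, g = 1024^3 (both from the comment); m = 1024^2 and t = 1024^4 are assumed.
     */
    static long parse(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty value");
        }
        int shift;
        switch (s.charAt(s.length() - 1)) {
            case 'k': shift = 10; break;
            case 'm': shift = 20; break;
            case 'g': shift = 30; break;
            case 't': shift = 40; break;
            default:
                return Long.parseLong(s); // plain number, no suffix
        }
        long n = Long.parseLong(s.substring(0, s.length() - 1));
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(n, 1L << shift);
    }

    public static void main(String[] args) {
        System.out.println(parse("1230k"));
        System.out.println(parse("891g"));
    }
}
```

Running `main` prints the two values worked out in the comment, 1259520 and 956703965184.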
          Tsz Wo Nicholas Sze added a comment -

          Passed all tests locally; trying Hudson.

          Tsz Wo Nicholas Sze added a comment -

          Created HADOOP-3939 for the -sync issue mentioned before.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12388005/3873_20080811b.patch
          against trunk revision 685425.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3053/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3053/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3053/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3053/console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          The javadoc warnings have nothing to do with the patch. Before I did "svn update" today, there were no javadoc warnings. See HADOOP-3949

          The failed tests are TestMapRed and TestMiniMRDFSSort, but they also fail on trunk (they did not fail before I did "svn update"). See HADOOP-3950

          Chris Douglas added a comment -

          +1 looks good.

          I just committed this. Thanks, Nicholas

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

          Hudson added a comment -

          Integrated in Hadoop-trunk #622 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/622/ )

          Tsz Wo Nicholas Sze added a comment -

          3873_20080811b_0.18.patch: for 0.18 (this won't be committed.)


            People

            • Assignee: Tsz Wo Nicholas Sze
            • Reporter: Tsz Wo Nicholas Sze
            • Votes: 0
            • Watchers: 2
