Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-642

distcp could have an option to preserve the full source path

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: distcp
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      DistCp now has a "-basedir" option that allows you to set the sufix of the source path that will be copied to the destination.

      Description

      It would be helpful to have an option that preserves the full source path when copying files from one location to another. This is specially important when archiving/moving files from one cluster to another one.

      1. MAPREDUCE-642.patch
        7 kB
        Rodrigo Schmidt
      2. HADOOP-5826.patch
        6 kB
        Rodrigo Schmidt
      3. HADOOP-5826.4.patch
        7 kB
        Rodrigo Schmidt
      4. HADOOP-5826.3.patch
        7 kB
        Rodrigo Schmidt
      5. HADOOP-5826.2.patch
        7 kB
        Rodrigo Schmidt

        Issue Links

          Activity

          Hide
          dhruba borthakur added a comment -

          distcp has options to preserve the modification time, replication factor, permissions, etc from the source file to the destination. One approach would be to introduce a command line parameter to distcp that "preserves" source-path.

          Show
          dhruba borthakur added a comment - distcp has options to preserve the modification time, replication factor, permissions, etc from the source file to the destination. One approach would be to introduce a command line parameter to distcp that "preserves" source-path.
          Hide
          Rodrigo Schmidt added a comment -

          Indeed, using the "-p" options sounds like a great idea.

          Show
          Rodrigo Schmidt added a comment - Indeed, using the "-p" options sounds like a great idea.
          Hide
          Raghu Angadi added a comment -

          What does "preserving full source path" specifically mean?

          Show
          Raghu Angadi added a comment - What does "preserving full source path" specifically mean?
          Hide
          Rodrigo Schmidt added a comment -

          Assuming option -p s did that, if we ran

          hadoop distcp -p s /a/b/c /d

          file/directory /d/a/b/c would be created instead of /d/c.

          Show
          Rodrigo Schmidt added a comment - Assuming option -p s did that, if we ran hadoop distcp -p s /a/b/c /d file/directory /d/a/b/c would be created instead of /d/c.
          Hide
          Raghu Angadi added a comment -

          That seems to be doable by user of distcp like :
          hadoop fs -mkdir dest:/d/a/b (copying properties of a and b from source, if possible)
          hadoop distcp src:/a/b/c dest:/d/a/b

          If we want to add this feature, I think it is better not to use '-p' for this. '-p' by convention implies preserving properties of each file.. it would be confusing if it also created these paths.

          Show
          Raghu Angadi added a comment - That seems to be doable by user of distcp like : hadoop fs -mkdir dest:/d/a/b (copying properties of a and b from source, if possible) hadoop distcp src:/a/b/c dest:/d/a/b If we want to add this feature, I think it is better not to use '-p' for this. '-p' by convention implies preserving properties of each file.. it would be confusing if it also created these paths.
          Hide
          Rodrigo Schmidt added a comment -

          Another possibility is something like a basedir (-b) option that tells distcp what is the prefix of the source path that should be used as the base directory for copying.

          distcp -b /a /a/b/c /d

          would create file /d/b/c

          distcp -b / /a/b/c /d

          would creaet file /d/a/b/c

          this is more general than the original proposition.

          Show
          Rodrigo Schmidt added a comment - Another possibility is something like a basedir (-b) option that tells distcp what is the prefix of the source path that should be used as the base directory for copying. distcp -b /a /a/b/c /d would create file /d/b/c distcp -b / /a/b/c /d would creaet file /d/a/b/c this is more general than the original proposition.
          Hide
          Raghu Angadi added a comment -

          The basedir option sounds better. This would be a special option and I think it better to use a longer option name.

          Show
          Raghu Angadi added a comment - The basedir option sounds better. This would be a special option and I think it better to use a longer option name.
          Hide
          Rodrigo Schmidt added a comment -

          What about

          -basedir <dir>

          or

          -base <dir>

          ?

          Show
          Rodrigo Schmidt added a comment - What about -basedir <dir> or -base <dir> ?
          Hide
          Raghu Angadi added a comment -

          -basedir with one or two examples in help should do I think.

          Show
          Raghu Angadi added a comment - -basedir with one or two examples in help should do I think.
          Hide
          Rodrigo Schmidt added a comment -

          What should be the behavior in case basedir is not a correct prefix for one or more sources?

          I think it should throw an exception and cancel the copy.

          Show
          Rodrigo Schmidt added a comment - What should be the behavior in case basedir is not a correct prefix for one or more sources? I think it should throw an exception and cancel the copy.
          Hide
          Raghu Angadi added a comment -

          I think it is better for someone with more familiarity with distcp to review.

          My comments from a brief look at the patch :

          • it might throw exception if used like 'distcp -basedir /a/b /a/b /dst'
          • does simple 'distcp /dir0/dir1 /dst' create /dst/dir1 directory or move contents for dir1 into /dst? If it is former, then test case does not seem to test feature.
          • The implementation changes the destination path but does not explicitly create the extra directories. Will it satisfy preserving the properties of those directories? (ie. '-basedir /a /a/b/c /dst' may not preserve properties for /dst/b and /dst/b/c).
          Show
          Raghu Angadi added a comment - I think it is better for someone with more familiarity with distcp to review. My comments from a brief look at the patch : it might throw exception if used like 'distcp -basedir /a/b /a/b /dst' does simple 'distcp /dir0/dir1 /dst' create /dst/dir1 directory or move contents for dir1 into /dst? If it is former, then test case does not seem to test feature. The implementation changes the destination path but does not explicitly create the extra directories. Will it satisfy preserving the properties of those directories? (ie. '-basedir /a /a/b/c /dst' may not preserve properties for /dst/b and /dst/b/c).
          Hide
          Rodrigo Schmidt added a comment -

          Thanks a lot for the comments!

          1) I'm throwing an exception on purpose when <basedir> equals some input source as I don't see why we should allow this.

          2) if /dst doesn't exist, /dst/dir1 is not created in the default behavior. As this is the case in the unit test, the feature is being tested (you can compare it to the unit test that copies files from dfs to dfs).

          3) You are definitely right about the extra directories' permissions. I didn't check anything about them and I should have. I'll change it and upload a new patch.

          Show
          Rodrigo Schmidt added a comment - Thanks a lot for the comments! 1) I'm throwing an exception on purpose when <basedir> equals some input source as I don't see why we should allow this. 2) if /dst doesn't exist, /dst/dir1 is not created in the default behavior. As this is the case in the unit test, the feature is being tested (you can compare it to the unit test that copies files from dfs to dfs). 3) You are definitely right about the extra directories' permissions. I didn't check anything about them and I should have. I'll change it and upload a new patch.
          Hide
          Raghu Angadi added a comment -

          > 1) I'm throwing an exception on purpose when <basedir> equals some input source as I don't see why we should allow this.

          Even though it does not seem useful, it is logically correct way to use the option. Sometimes it might be required, say in automated scripts where argument to -basedir is not provided by human.

          Show
          Raghu Angadi added a comment - > 1) I'm throwing an exception on purpose when <basedir> equals some input source as I don't see why we should allow this. Even though it does not seem useful, it is logically correct way to use the option. Sometimes it might be required, say in automated scripts where argument to -basedir is not provided by human.
          Hide
          Rodrigo Schmidt added a comment -

          Ok! I think that will be automatically solved by my solution to point 3.

          Show
          Rodrigo Schmidt added a comment - Ok! I think that will be automatically solved by my solution to point 3.
          Hide
          Rodrigo Schmidt added a comment -

          New patch!

          Show
          Rodrigo Schmidt added a comment - New patch!
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12408779/HADOOP-5762.2.patch
          against trunk revision 777594.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/380/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/380/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/380/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/380/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12408779/HADOOP-5762.2.patch against trunk revision 777594. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/380/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/380/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/380/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/380/console This message is automatically generated.
          Hide
          dhruba borthakur added a comment -

          The test failures in streaming do not seem to be related to distcp at all.

          Show
          dhruba borthakur added a comment - The test failures in streaming do not seem to be related to distcp at all.
          Hide
          Rodrigo Schmidt added a comment -

          New patch

          Show
          Rodrigo Schmidt added a comment - New patch
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409138/HADOOP-5826.2.patch
          against trunk revision 779807.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/423/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/423/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/423/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/423/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12409138/HADOOP-5826.2.patch against trunk revision 779807. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/423/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/423/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/423/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/423/console This message is automatically generated.
          Hide
          Rodrigo Schmidt added a comment -

          Minor bug in the previous patch

          Show
          Rodrigo Schmidt added a comment - Minor bug in the previous patch
          Hide
          Rodrigo Schmidt added a comment -

          New patch.

          Show
          Rodrigo Schmidt added a comment - New patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409552/HADOOP-5826.3.patch
          against trunk revision 780875.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/448/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/448/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/448/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/448/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12409552/HADOOP-5826.3.patch against trunk revision 780875. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/448/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/448/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/448/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/448/console This message is automatically generated.
          Hide
          Rodrigo Schmidt added a comment - - edited

          Failed test seems to be completely unrelated to distcp.

          Show
          Rodrigo Schmidt added a comment - - edited Failed test seems to be completely unrelated to distcp.
          Hide
          Rodrigo Schmidt added a comment -

          Anyone wants to review the latest patch?

          Show
          Rodrigo Schmidt added a comment - Anyone wants to review the latest patch?
          Hide
          dhruba borthakur added a comment -

          Code looks good. One comment:

          1. If basedir does not exist, then emit an error message saying "The directory specified by basedir does not exist".

          Show
          dhruba borthakur added a comment - Code looks good. One comment: 1. If basedir does not exist, then emit an error message saying "The directory specified by basedir does not exist".
          Hide
          Rodrigo Schmidt added a comment -

          There are already two tests and two error messages that cover this case:

          • If basedir is not a prefix of the source (which will probably happen if it doesn't exist), it'll print: "Basedir [dir] is not a prefix of source path [src]"
          • If basedir is a prefix but is not a directory, it'll print: "Basedir [dir] is not a directory"

          Adding a new error message for the case where the directory doesn't exist looks like a redundant test given the first error case.

          Show
          Rodrigo Schmidt added a comment - There are already two tests and two error messages that cover this case: If basedir is not a prefix of the source (which will probably happen if it doesn't exist), it'll print: "Basedir [dir] is not a prefix of source path [src] " If basedir is a prefix but is not a directory, it'll print: "Basedir [dir] is not a directory" Adding a new error message for the case where the directory doesn't exist looks like a redundant test given the first error case.
          Hide
          Rodrigo Schmidt added a comment -

          Updated patch.

          Show
          Rodrigo Schmidt added a comment - Updated patch.
          Hide
          Rodrigo Schmidt added a comment -

          Passed unit tests and ant test-patch returned the following:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]

          Show
          Rodrigo Schmidt added a comment - Passed unit tests and ant test-patch returned the following: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
          Hide
          Tom White added a comment -

          Unfortunately the patch no longer applies due to the project split. Would you be able to regenerate it for the MapReduce project please?

          Show
          Tom White added a comment - Unfortunately the patch no longer applies due to the project split. Would you be able to regenerate it for the MapReduce project please?
          Hide
          Rodrigo Schmidt added a comment -

          New patch. Should work with the project split.

          Show
          Rodrigo Schmidt added a comment - New patch. Should work with the project split.
          Hide
          Rodrigo Schmidt added a comment -

          Passed unit tests (some failures, but all unrelated to this patch) and ant test-patch returned the following:

          [exec]
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]
          [exec]

          Show
          Rodrigo Schmidt added a comment - Passed unit tests (some failures, but all unrelated to this patch) and ant test-patch returned the following: [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec]
          Hide
          dhruba borthakur added a comment -

          I just committed this. Thanks Rodrigo.

          Rodrigo: Please update the Release Notes in this JIRA. I am going to open a new JIRA that requests updating the distcp guide.

          Show
          dhruba borthakur added a comment - I just committed this. Thanks Rodrigo. Rodrigo: Please update the Release Notes in this JIRA. I am going to open a new JIRA that requests updating the distcp guide.
          Hide
          dhruba borthakur added a comment -

          I just committed this. Thanks Rodrigo.
          Rodrigo: Please update the Release Notes in this JIRA. Distcp Guide to be updated via MAPREDUCE-689

          Show
          dhruba borthakur added a comment - I just committed this. Thanks Rodrigo. Rodrigo: Please update the Release Notes in this JIRA. Distcp Guide to be updated via MAPREDUCE-689
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Hi Dhruba, we already have MAPREDUCE-647 for updating distcp doc.

          Also, thanks for reviewing the patch.

          Show
          Tsz Wo Nicholas Sze added a comment - Hi Dhruba, we already have MAPREDUCE-647 for updating distcp doc. Also, thanks for reviewing the patch.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #15 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/15/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #15 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/15/ )

            People

            • Assignee:
              Rodrigo Schmidt
              Reporter:
              Rodrigo Schmidt
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development