Hadoop Common
  1. Hadoop Common
  2. HADOOP-5620

discp can preserve modification times of files

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      New DistCp option -pt preserves last modification and last access times of copied files.

      Description

      It will be helpful if distcp can preserve the modification time and access time of files. This helps to archive/unarchive hdfs files.

      1. HADOOP-5620.patch
        5 kB
        Rodrigo Schmidt
      2. HADOOP-5620.2.patch
        4 kB
        Rodrigo Schmidt
      3. HADOOP-5620.3.patch
        5 kB
        Rodrigo Schmidt

        Issue Links

          Activity

          dhruba borthakur created issue -
          dhruba borthakur made changes -
          Field Original Value New Value
          Link This issue blocks HADOOP-4058 [ HADOOP-4058 ]
          dhruba borthakur made changes -
          Assignee Rodrigo Schmidt [ rschmidt ]
          Hide
          Rodrigo Schmidt added a comment -

          distcp itself updates the access times of the files it copies. How should this option work in such case?

          Show
          Rodrigo Schmidt added a comment - distcp itself updates the access times of the files it copies. How should this option work in such case?
          Hide
          dhruba borthakur added a comment -

          After distcp copies a file from src to dest, the dest file should have the same access time as the one in the src file.

          Show
          dhruba borthakur added a comment - After distcp copies a file from src to dest, the dest file should have the same access time as the one in the src file.
          Hide
          dhruba borthakur added a comment -

          The method FileSystem.setTime() allows setting the access time of the file.

          Show
          dhruba borthakur added a comment - The method FileSystem.setTime() allows setting the access time of the file.
          Hide
          Rodrigo Schmidt added a comment -

          let's assume we have a file /src/src.file, whose access time is Mar 30, 2005, and then we run

          $ bin/hadoop dstcp /src/src.file /dst

          By the end of the execution, since distcp reads the source, the access time of /src/src.file will be close to now().

          My question is more fundamental: Does it make sense to preserve the access time, if the copying itself will update the access time of the source? What are we trying to preserve with it?

          Show
          Rodrigo Schmidt added a comment - let's assume we have a file /src/src.file, whose access time is Mar 30, 2005, and then we run $ bin/hadoop dstcp /src/src.file /dst By the end of the execution, since distcp reads the source, the access time of /src/src.file will be close to now(). My question is more fundamental: Does it make sense to preserve the access time, if the copying itself will update the access time of the source? What are we trying to preserve with it?
          Hide
          Raghu Angadi added a comment -

          Of course, modification time is the more often used stat. But setting access time does not cost more and seems like the right thing to do. Even though access times for both source and dest files (usually) changes during copy, it would not be same unless set explicitly.

          Alternately we could just do what ever Linux 'cp -a' command does. If it preserves access time, we should.

          Also some file systems might not configured to update access times all the time.

          Show
          Raghu Angadi added a comment - Of course, modification time is the more often used stat. But setting access time does not cost more and seems like the right thing to do. Even though access times for both source and dest files (usually) changes during copy, it would not be same unless set explicitly. Alternately we could just do what ever Linux 'cp -a' command does. If it preserves access time, we should. Also some file systems might not configured to update access times all the time.
          Hide
          Rodrigo Schmidt added a comment -

          The main reason we needed this option was to preserve the properties of files that are being migrated from one cluster to another one, without losing last access and modification times (and without interfering with it). In that sense, I think the best option is to set the destination with last modification and access times of the source file immediately before it is copied.

          Show
          Rodrigo Schmidt added a comment - The main reason we needed this option was to preserve the properties of files that are being migrated from one cluster to another one, without losing last access and modification times (and without interfering with it). In that sense, I think the best option is to set the destination with last modification and access times of the source file immediately before it is copied.
          Rodrigo Schmidt made changes -
          Attachment HADOOP-5620.patch [ 12408734 ]
          Rodrigo Schmidt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 0.21.0 [ 12313563 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12408734/HADOOP-5620.patch
          against trunk revision 777761.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/390/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/390/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/390/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/390/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12408734/HADOOP-5620.patch against trunk revision 777761. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/390/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/390/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/390/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/390/console This message is automatically generated.
          Hide
          Tsz Wo Nicholas Sze added a comment -
          • Should discp also preserve modification times of directories? Otherwise, the new -t option seems confusing.
          • It might be better to get mtime and atime inside the if-statement
            if ( preserve_status && preseved.contains(FileAttribute.TIMES)) {
              ...
            }
            

            since srcstat remains unchanged in the method.

          • Marking this as an incompatible change because it change the meaning of -p.
          Show
          Tsz Wo Nicholas Sze added a comment - Should discp also preserve modification times of directories? Otherwise, the new -t option seems confusing. It might be better to get mtime and atime inside the if-statement if ( preserve_status && preseved.contains(FileAttribute.TIMES)) { ... } since srcstat remains unchanged in the method. Marking this as an incompatible change because it change the meaning of -p.
          Tsz Wo Nicholas Sze made changes -
          Hadoop Flags [Incompatible change]
          Hide
          Rodrigo Schmidt added a comment -
          • Dhruba told me modification times of directories are not persistent, that is, on namenode restart they are set to the latest modification time amongst the files they contain.
          • If we get atime inside the if, it will be the copy time (last access after copying the file) instead of the latest access time before copying, which is what we need for migration.
          Show
          Rodrigo Schmidt added a comment - Dhruba told me modification times of directories are not persistent, that is, on namenode restart they are set to the latest modification time amongst the files they contain. If we get atime inside the if, it will be the copy time (last access after copying the file) instead of the latest access time before copying, which is what we need for migration.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > Dhruba told me modification times of directories are not persistent, that is, on namenode restart they are set to the latest modification time amongst the files they contain.
          I just have checked the codes. It seems not true.

          Also, DistCp works on general FileSystem. It should not depend on a particular implementation.

          > If we get atime inside the if, it will be the copy time (last access after copying the file) instead of the latest access time before copying, which is what we need for migration.
          FileStatus is a local object. Once it has been obtained from a FileSystem it remains unchanged even the actual status of the file is changed. So the atime inside the if-statement will be the latest access time before copying since getFileStatus is called before copying.

          BTW, there is a white space change in the patch, could you remove it?

          Show
          Tsz Wo Nicholas Sze added a comment - > Dhruba told me modification times of directories are not persistent, that is, on namenode restart they are set to the latest modification time amongst the files they contain. I just have checked the codes. It seems not true. Also, DistCp works on general FileSystem. It should not depend on a particular implementation. > If we get atime inside the if, it will be the copy time (last access after copying the file) instead of the latest access time before copying, which is what we need for migration. FileStatus is a local object. Once it has been obtained from a FileSystem it remains unchanged even the actual status of the file is changed. So the atime inside the if-statement will be the latest access time before copying since getFileStatus is called before copying. BTW, there is a white space change in the patch, could you remove it?
          Hide
          Rodrigo Schmidt added a comment -

          Thanks for the comments! The new patch has a much more elegant solution that works for both files and directories.

          Show
          Rodrigo Schmidt added a comment - Thanks for the comments! The new patch has a much more elegant solution that works for both files and directories.
          Rodrigo Schmidt made changes -
          Attachment HADOOP-5620.2.patch [ 12409095 ]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Yes, the new patch is much more elegant.

          Could you rename the method updatePermissions(..) to something like updateStatus(..)? Otherwise, patch looks good.

          Show
          Tsz Wo Nicholas Sze added a comment - Yes, the new patch is much more elegant. Could you rename the method updatePermissions(..) to something like updateStatus(..)? Otherwise, patch looks good.
          Hide
          Rodrigo Schmidt added a comment -

          Changed it to updateDestStatus since method updateStatus already exists.

          Show
          Rodrigo Schmidt added a comment - Changed it to updateDestStatus since method updateStatus already exists.
          Rodrigo Schmidt made changes -
          Attachment HADOOP-5620.3.patch [ 12409098 ]
          Tsz Wo Nicholas Sze made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          +1 new patch looks good.

          Show
          Tsz Wo Nicholas Sze added a comment - +1 new patch looks good.
          Tsz Wo Nicholas Sze made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop Flags [Incompatible change] [Incompatible change, Reviewed]
          Rodrigo Schmidt made changes -
          Status Patch Available [ 10002 ] In Progress [ 3 ]
          Rodrigo Schmidt made changes -
          Status In Progress [ 3 ] Patch Available [ 10002 ]
          Hide
          Rodrigo Schmidt added a comment -

          $ ant -Dpatch.file=/.../HADOOP-5620.3.patch -Dforrest.home=/.../apache-forrest-0.8/ -Dfindbugs.home=/.../findbugs-1.3.8/ -Djava5.home=/.../jdk1.5.0_07/ test-patch

          [exec]
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 4 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]

          Show
          Rodrigo Schmidt added a comment - $ ant -Dpatch.file=/.../ HADOOP-5620 .3.patch -Dforrest.home=/.../apache-forrest-0.8/ -Dfindbugs.home=/.../findbugs-1.3.8/ -Djava5.home=/.../jdk1.5.0_07/ test-patch [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 4 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Hi Rodrigo, have you also run the unit tests in your local machine?

          BTW, since this is an incompatible change (which changes the distcp command options), please add release note in this jira. Could you also update the forrest doc (in this or a separated jira)? .

          Show
          Tsz Wo Nicholas Sze added a comment - Hi Rodrigo, have you also run the unit tests in your local machine? BTW, since this is an incompatible change (which changes the distcp command options), please add release note in this jira. Could you also update the forrest doc (in this or a separated jira)? .
          Rodrigo Schmidt made changes -
          Release Note DistCp can now preserve last modification and last access times of copied files with option -pt
          Hide
          Rodrigo Schmidt added a comment -

          I ran TestCopyFiles manually. I was expecting "ant ... test-patch" to run all the unit tests, but I was surprised to see it didn't. What's the option to run the exact same tests performed by Hadoop QA?

          I'm changing the forrest doc...

          Show
          Rodrigo Schmidt added a comment - I ran TestCopyFiles manually. I was expecting "ant ... test-patch" to run all the unit tests, but I was surprised to see it didn't. What's the option to run the exact same tests performed by Hadoop QA? I'm changing the forrest doc...
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > What's the option to run the exact same tests performed by Hadoop QA?
          You may do "ant test" for running all unit tests.

          > I'm changing the forrest doc...
          Thanks.

          Show
          Tsz Wo Nicholas Sze added a comment - > What's the option to run the exact same tests performed by Hadoop QA? You may do "ant test" for running all unit tests. > I'm changing the forrest doc... Thanks.
          Rodrigo Schmidt made changes -
          Link This issue blocks HADOOP-5927 [ HADOOP-5927 ]
          Hide
          Rodrigo Schmidt added a comment -

          I realize it makes more sense to change the forrest doc only once, with all the new features we are introducing. I created HADOOP-5927 for that.

          Show
          Rodrigo Schmidt added a comment - I realize it makes more sense to change the forrest doc only once, with all the new features we are introducing. I created HADOOP-5927 for that.
          Hide
          Rodrigo Schmidt added a comment -

          I ran the unit tests and the three following errors appeared. They all seem to be unrelated to this patch:

          Testsuite: org.apache.hadoop.mapred.TestJobTrackerRestartWithLostTracker
          Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
          
          Testcase: testRestartWithLostTracker took 0.003 sec
          	Caused an ERROR
          Timeout occurred. Please note the time in the report does not reflect the time until the timeout.
          junit.framework.AssertionFailedError: Timeout occurred. Please note the time in the report does not reflect the time until the timeout.
          

          Testsuite: org.apache.hadoop.mapred.TestReduceFetch
          Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 282.598 sec
          
          Testcase: testReduceFromPartialMem took 27.634 sec
          	FAILED
          Expected at least 1MB fewer bytes read from local (21159650) than written to HDFS (21036680)
          junit.framework.AssertionFailedError: Expected at least 1MB fewer bytes read from local (21159650) than written to HDFS (21036680)
          	at org.apache.hadoop.mapred.TestReduceFetch.testReduceFromPartialMem(TestReduceFetch.java:276)
          	at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
          	at junit.extensions.TestSetup$1.protect(TestSetup.java:23)
          	at junit.extensions.TestSetup.run(TestSetup.java:27)
          

          Testsuite: org.apache.hadoop.mapred.TestTaskFail
          Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 92.65 sec
          
          Testcase: testWithDFS took 92.629 sec
          	FAILED
          expected:<FAILED> but was:<KILLED>
          junit.framework.AssertionFailedError: expected:<FAILED> but was:<KILLED>
          	at org.apache.hadoop.mapred.TestTaskFail.validateJob(TestTaskFail.java:139)
          	at org.apache.hadoop.mapred.TestTaskFail.testWithDFS(TestTaskFail.java:170)
          
          Show
          Rodrigo Schmidt added a comment - I ran the unit tests and the three following errors appeared. They all seem to be unrelated to this patch: – Testsuite: org.apache.hadoop.mapred.TestJobTrackerRestartWithLostTracker Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec Testcase: testRestartWithLostTracker took 0.003 sec Caused an ERROR Timeout occurred. Please note the time in the report does not reflect the time until the timeout. junit.framework.AssertionFailedError: Timeout occurred. Please note the time in the report does not reflect the time until the timeout. – Testsuite: org.apache.hadoop.mapred.TestReduceFetch Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 282.598 sec Testcase: testReduceFromPartialMem took 27.634 sec FAILED Expected at least 1MB fewer bytes read from local (21159650) than written to HDFS (21036680) junit.framework.AssertionFailedError: Expected at least 1MB fewer bytes read from local (21159650) than written to HDFS (21036680) at org.apache.hadoop.mapred.TestReduceFetch.testReduceFromPartialMem(TestReduceFetch.java:276) at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24) at junit.extensions.TestSetup$1.protect(TestSetup.java:23) at junit.extensions.TestSetup.run(TestSetup.java:27) – Testsuite: org.apache.hadoop.mapred.TestTaskFail Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 92.65 sec Testcase: testWithDFS took 92.629 sec FAILED expected:<FAILED> but was:<KILLED> junit.framework.AssertionFailedError: expected:<FAILED> but was:<KILLED> at org.apache.hadoop.mapred.TestTaskFail.validateJob(TestTaskFail.java:139) at org.apache.hadoop.mapred.TestTaskFail.testWithDFS(TestTaskFail.java:170)
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Rodrigo!

          Show
          Tsz Wo Nicholas Sze added a comment - I have committed this. Thanks, Rodrigo!
          Tsz Wo Nicholas Sze made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409098/HADOOP-5620.3.patch
          against trunk revision 779656.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/421/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/421/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/421/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/421/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12409098/HADOOP-5620.3.patch against trunk revision 779656. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/421/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/421/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/421/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/421/console This message is automatically generated.
          dhruba borthakur made changes -
          Link This issue relates to MAPREDUCE-689 [ MAPREDUCE-689 ]
          Owen O'Malley made changes -
          Component/s tools/distcp [ 12312387 ]
          Hide
          Robert Chansler added a comment -

          Editorial pass over all release notes prior to publication of 0.21.

          Show
          Robert Chansler added a comment - Editorial pass over all release notes prior to publication of 0.21.
          Robert Chansler made changes -
          Release Note DistCp can now preserve last modification and last access times of copied files with option -pt New DistCp option -pt preserves last modification and last access times of copied files.
          Tom White made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Gavin made changes -
          Link This issue blocks MAPREDUCE-647 [ MAPREDUCE-647 ]
          Gavin made changes -
          Link This issue is depended upon by MAPREDUCE-647 [ MAPREDUCE-647 ]

            People

            • Assignee:
              Rodrigo Schmidt
              Reporter:
              dhruba borthakur
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development