Hadoop Map/Reduce: MAPREDUCE-1548

Hadoop archives should be able to preserve times and other properties from original files

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: harchive
    • Labels: None
    • Hadoop Flags:
      Reviewed

      Description

      Files inside hadoop archives don't keep their original:

      • modification time
      • access time
      • permission
      • owner
      • group

      All these properties are currently taken from the file storing the archive index, not from the archived files themselves. This doesn't look correct.

      It should be possible to preserve the original properties of the stored files.
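      The metadata the description asks archives to preserve can be illustrated with a short sketch. This uses plain java.nio on a POSIX filesystem, not Hadoop's APIs, and the `FileProps`/`describe` names are hypothetical; it only shows the kind of per-file record an archiver would need to read and store.

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.*;

// Illustration (not Hadoop code): the per-file metadata an archive index
// would have to record so that it can be restored later.
public class FileProps {
    // Collect the properties the issue asks archives to preserve.
    // Assumes a POSIX filesystem (owner/group/permission attributes).
    static String describe(Path p) throws IOException {
        PosixFileAttributes a =
            Files.readAttributes(p, PosixFileAttributes.class);
        return String.format("mtime=%s owner=%s group=%s perms=%s",
            a.lastModifiedTime(),
            a.owner().getName(),
            a.group().getName(),
            PosixFilePermissions.toString(a.permissions()));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("har-demo", ".txt");
        System.out.println(describe(tmp));
        Files.delete(tmp);
    }
}
```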

      1. MAPREDUCE-1548.5.patch
        11 kB
        Rodrigo Schmidt
      2. MAPREDUCE-1548.4.patch
        10 kB
        Rodrigo Schmidt
      3. MAPREDUCE-1548.3.patch
        9 kB
        Rodrigo Schmidt
      4. MAPREDUCE-1548.2.patch
        9 kB
        Rodrigo Schmidt
      5. MAPREDUCE-1548.1.patch
        9 kB
        Rodrigo Schmidt
      6. MAPREDUCE-1548.0.patch
        9 kB
        Rodrigo Schmidt

        Issue Links

          Activity

          Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Patch Available → Open | 94d 18h 29m | 6 | Rodrigo Schmidt | 17/Sep/10 00:33
          Open → Patch Available | 103d 19h 53m | 7 | Rodrigo Schmidt | 17/Sep/10 00:33
          Patch Available → Resolved | 6d 4h 57m | 1 | dhruba borthakur | 23/Sep/10 05:31
          Resolved → Closed | 445d 1h 47m | 1 | Konstantin Shvachko | 12/Dec/11 06:19
          Ravi Prakash made changes -
          Link This issue is related to HADOOP-11436 [ HADOOP-11436 ]
          Konstantin Shvachko made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #523 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/)

          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          dhruba borthakur added a comment -

          I just committed this. Thanks Rodrigo.

          Ramkumar Vadali added a comment -

          ant test-patch results with the latest patch:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 8 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]
          [exec] +1 system tests framework. The patch passed system tests framework compile.
          [exec]
          [exec]
          [exec]
          [exec]
          [exec] ======================================================================
          [exec] ======================================================================
          [exec] Finished build.
          [exec] ======================================================================
          [exec] ======================================================================
          [exec]
          [exec]

          BUILD SUCCESSFUL

          Tsz Wo Nicholas Sze added a comment -

          Mahadev, please take a look at the new patch.

          Rodrigo Schmidt added a comment -

          @Mahadev, Nicholas: The new patch is good to go.

          Rodrigo Schmidt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Rodrigo Schmidt made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Rodrigo Schmidt made changes -
          Attachment MAPREDUCE-1548.5.patch [ 12454814 ]
          Rodrigo Schmidt added a comment -

          New patch!

          Mahadev was right! I fixed the XML file that controls the unit test and everything worked.

          The only unit test that failed was a contrib test that also failed on a clean trunk:

          [junit] Running org.apache.hadoop.streaming.TestUlimit
          [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 125.025 sec
          [junit] Test org.apache.hadoop.streaming.TestUlimit FAILED

          All the other tests were good.

          Mahadev konar added a comment -

          I think this has mostly to do with

                 if (globPaths.isEmpty()) {
          -        throw new IOException("The resolved paths is empty."
          +        throw new IOException("The resolved paths set is empty."
                       + "  Please check whether the srcPaths exist, where srcPaths = "
                       + srcPaths);

          Not sure though.

          Ramkumar Vadali added a comment -

          Thanks Nicholas.

          ant test failed in TestMRCLI. From build/test/TEST-org.apache.hadoop.cli.TestMRCLI.txt,

          2010-09-15 15:03:33,645 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(225)) - Failing tests:
          2010-09-15 15:03:33,645 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(226)) - --------------
          2010-09-15 15:03:33,645 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(232)) - 4: Archive: Archive does not show null message
          2010-09-15 15:03:33,645 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(232)) - 16: Archive: Invalid Source is specified

          Rodrigo, could you take a look?

          Tsz Wo Nicholas Sze added a comment -

          The "test-patch" target runs Findbugs, javadoc, etc., but not the unit tests. For unit tests, we need to run "ant test".

          Ramkumar Vadali added a comment -

          I used:

          ant -Dpatch.file=~/local/patches/MAPREDUCE-1548.4.patch -Dforrest.home=/home/rvadali/forrest/ -Dfindbugs.home=/home/rvadali/findbugs -Dscratch.dir=/tmp/rvadali.hadoopQA -Dsvn.cmd=/usr/local/bin/svn -Dgrep.cmd=/bin/grep -Dpatch.cmd=/usr/bin/patch -Djava5.home=/home/rvadali/local/external/jdk1.5 test-patch

          Tsz Wo Nicholas Sze added a comment -

          Just want to double check: what command did you use for running the unit tests?

          Ramkumar Vadali added a comment -

          @Nicholas, Mahadev, here are the results of the unit-tests:

          [exec]
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]
          [exec] +1 system tests framework. The patch passed system tests framework compile.
          [exec]

          Mahadev konar added a comment -

          +1 ...

          @rodrigo/ramkumar, can you please post the results of ant test? It looks like the Hudson patch build is broken.

          Tsz Wo Nicholas Sze added a comment -

          @Ramkumar, have you run all the unit tests?

          Ramkumar Vadali added a comment -

          @Mahadev, If you like the latest patch, could you please commit it? Thanks!

          Rodrigo Schmidt added a comment -

          @Mahadev: Yes, the propsplits part is a little cryptic. Adding more comments makes a lot of sense. I just uploaded a new version. Let me know if it is fine that way.

          Cheers!

          Rodrigo Schmidt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Rodrigo Schmidt made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Rodrigo Schmidt made changes -
          Attachment MAPREDUCE-1548.4.patch [ 12454280 ]
          Mahadev konar added a comment -

          +1, the patch looks good.

          Rodrigo,
          Just a minor nit:

          • Can you add some comments noting that propsplits occupies a different column depending on whether the entry in the index is a directory or a file? It's a little confusing when you read the code. Just add a comment in HarFilesystem that propsplits is stored in different columns for directories and files.

          Other than that patch looks good. I will go ahead and commit as soon as you make the change.

          thanks

          Rodrigo Schmidt added a comment -

          No, it's not. It creates a new version (3), but it can still read old ones.

          It's similar to what we did to allow spaces in filenames (HADOOP-6591).
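          The backward-compatibility scheme Rodrigo describes (a version-3 writer whose reader still accepts older archives) can be sketched roughly as below. The index line format and all names here are simplified and hypothetical, not the real HarFileSystem layout; the point is only that a versioned reader can skip the property columns when they are absent.

```java
// Hypothetical sketch of backward-compatible index reading (illustrative
// names and a simplified line format, not the real har index parser):
// a version-3 reader that still accepts older entries, which lack the
// property columns added by this patch.
public class VersionedIndex {
    static final int CURRENT_VERSION = 3;

    // Parse "name length" (old format) or
    // "name length mtime owner" (new format, simplified).
    static String describe(int version, String line) {
        String[] f = line.split(" ");
        String base = "name=" + f[0] + " len=" + f[1];
        if (version < CURRENT_VERSION) {
            return base;                      // old archive: no properties
        }
        return base + " mtime=" + f[2] + " owner=" + f[3];
    }

    public static void main(String[] args) {
        // An old-version entry and a new-version entry read by one parser.
        System.out.println(describe(2, "a.txt 42"));
        System.out.println(describe(3, "a.txt 42 1000 rodrigo"));
    }
}
```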

          Tsz Wo Nicholas Sze added a comment -

          Hi Rodrigo, I have not had a chance to take a look at the patch yet. Is it an incompatible change?

          Ramkumar Vadali added a comment -

          +1
          Looks good. I ran test-patch:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]
          [exec] +1 system tests framework. The patch passed system tests framework compile.
          [exec]
          [exec]

          Rodrigo Schmidt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Rodrigo Schmidt made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Rodrigo Schmidt made changes -
          Attachment MAPREDUCE-1548.3.patch [ 12454082 ]
          Ramkumar Vadali added a comment -

          Yes, that should be fine. Just leave a comment on how to get those if needed.

          Rodrigo Schmidt added a comment -

          Are you fine if I just remove those fields?

          Rodrigo Schmidt added a comment -

          Sure!

          Ramkumar Vadali added a comment -

          The warnings are caused by removing the accessor methods for the group, owner, and permission fields.

          Ramkumar Vadali added a comment -

          Rodrigo, there are 3 new findbugs warnings introduced by the patch. Could you fix those?

          Code Warning
          UrF Unread field: org.apache.hadoop.fs.HarFileSystem$HarStatus.group
          Bug type URF_UNREAD_FIELD (click for details)
          In class org.apache.hadoop.fs.HarFileSystem$HarStatus
          Field org.apache.hadoop.fs.HarFileSystem$HarStatus.group
          At HarFileSystem.java:[line 558]
          UrF Unread field: org.apache.hadoop.fs.HarFileSystem$HarStatus.owner
          Bug type URF_UNREAD_FIELD (click for details)
          In class org.apache.hadoop.fs.HarFileSystem$HarStatus
          Field org.apache.hadoop.fs.HarFileSystem$HarStatus.owner
          At HarFileSystem.java:[line 557]
          UrF Unread field: org.apache.hadoop.fs.HarFileSystem$HarStatus.permission
          Bug type URF_UNREAD_FIELD (click for details)
          In class org.apache.hadoop.fs.HarFileSystem$HarStatus
          Field org.apache.hadoop.fs.HarFileSystem$HarStatus.permission
          At HarFileSystem.java:[line 556]

          Ramkumar Vadali added a comment -

          +1
          Patch looks good

          Rodrigo Schmidt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Rodrigo Schmidt made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Rodrigo Schmidt made changes -
          Attachment MAPREDUCE-1548.2.patch [ 12453670 ]
          Rodrigo Schmidt added a comment -

          New patch

          Rodrigo Schmidt added a comment -

          I was actually indifferent about removing or keeping them, but now I feel that removing them will help keep the code cleaner. We can add them later if they are needed.

          I'll remove them and send a new diff later today.

          Ramkumar Vadali added a comment -

          I thought you intended to remove those based on your comment ( https://issues.apache.org/jira/browse/MAPREDUCE-1548?focusedCommentId=12878480&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12878480 ).
          Rodrigo Schmidt added a comment -

          That's part of the new HarStatus object. The information is stored and I thought of having methods to make them accessible. Would you prefer to remove the methods?

          Ramkumar Vadali added a comment -

          @Rodrigo, the latest patch shows new methods in HarFileSystem: getPermission, getOwner, getGroup. Those should be removed, right?

          Rodrigo Schmidt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Rodrigo Schmidt made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Rodrigo Schmidt made changes -
          Attachment MAPREDUCE-1548.1.patch [ 12453567 ]
          Rodrigo Schmidt added a comment -

          New patch with Scott's recommendation.

          Rodrigo Schmidt added a comment -

          Thanks Scott!

          Indeed, I've been doing too much PHP and JavaScript lately. I'll change that.

          The space-separated metadata is fine because I'm URL-encoding all the string properties. I kept " " as the standard separator because it is already used in other parts of Hadoop Archives.

          Using URL-encoding of strings makes it safe.
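          Rodrigo's point can be sketched as follows (illustrative class and method names, not the actual HarFileSystem code): once every string property is URL-encoded, the encoded values contain no raw spaces, so a space-separated index line can be split apart and decoded safely even when the original values contain spaces.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of why URL-encoding makes a space-separated metadata line safe.
public class SpaceSafeProps {
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String owner = "some user";      // value containing a space
        String group = "hadoop devs";
        // Encoded values contain no raw spaces, so " " stays a safe separator.
        String line = encode(owner) + " " + encode(group);
        String[] parts = line.split(" ");
        System.out.println(decode(parts[0]) + " / " + decode(parts[1]));
    }
}
```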

          Scott Chen added a comment -

          @Rodrigo: Can you change this to modificationTime? Doing too much PHP?

          +    long modification_time = 0;

          The other thing is that metadata are separated by " " (it was like that).
          Is this safe enough?

          Other than these the patch looks good. Thanks.

          Rodrigo Schmidt added a comment -

          Yes, if this proposal sounds good to you, I think it's ready for review.

          Mahadev konar added a comment -

          Sounds good, Rodrigo!

          Is this jira up for review? Seems like it's PA!

          Rodrigo Schmidt added a comment -

          Mahadev, Nicholas, Do you agree with this proposal?

          Thanks,
          Rodrigo

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12447000/MAPREDUCE-1548.0.patch
          against trunk revision 954364.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/242/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/242/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/242/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/242/console

          This message is automatically generated.

          Rodrigo Schmidt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Rodrigo Schmidt made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Rodrigo Schmidt made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Affects Version/s 0.22.0 [ 12314184 ]
          Fix Version/s 0.22.0 [ 12314184 ]
          Rodrigo Schmidt made changes -
          Attachment MAPREDUCE-1548.0.patch [ 12447000 ]
          Rodrigo Schmidt added a comment -

          It's time to restart this discussion.

          This is something that contrib/raid relies on, and it has to be fixed if we plan to make raid an integral part of Hadoop.

          The one thing raid needs is for hadoop archives to preserve modification times.

          This patch adds all properties (modTime, permissions, owner, group) to the har index file, but it only exposes modification times on HarFileSystem. Making HarFileSystem respect the access permissions and other properties would make the patch too complicated and harder to review.

          I propose that we create a separate JIRA that will concentrate on making sure HarFileSystem respects access permissions.

          Rodrigo Schmidt added a comment -

          I think most file properties should be stored inside the index file and HadoopArchives should have an option to tell whether we want to preserve the original file properties or assign new ones at creation time. On a personal note, I think the default should be to keep the original file properties.

          As for HarFileSystem, it should assign properties in the following way:

          Filename, Length, isDir, ModificationTime, Permission, Owner, Group: from the information stored inside the index file

          Replication, BlockSize, AccessTime: from the part-xxx file storing the file data.

          I think this is as consistent as one can get.
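          As a rough illustration of this split (all class, record, and method names here are hypothetical stand-ins, not Hadoop's real types), a combined file status could be assembled by taking some fields from the index entry and the rest from the part file:

```java
// Hypothetical sketch of the proposal above: name/length/isDir/mtime/
// permission/owner/group come from the index entry, while replication/
// blockSize/accessTime come from the part-xxx file storing the data.
public class CombinedStatus {
    record IndexEntry(String name, long len, boolean isDir, long mtime,
                      String perm, String owner, String group) {}
    record PartFileStatus(short replication, long blockSize, long atime) {}
    record FileStatus(String name, long len, boolean isDir, long mtime,
                      String perm, String owner, String group,
                      short replication, long blockSize, long atime) {}

    static FileStatus combine(IndexEntry e, PartFileStatus p) {
        // Index-derived fields first, part-file-derived fields second.
        return new FileStatus(e.name(), e.len(), e.isDir(), e.mtime(),
            e.perm(), e.owner(), e.group(),
            p.replication(), p.blockSize(), p.atime());
    }

    public static void main(String[] args) {
        IndexEntry e = new IndexEntry("/dir/a.txt", 42, false, 1_000L,
            "rw-r--r--", "rodrigo", "hadoop");
        PartFileStatus p = new PartFileStatus((short) 3, 64L << 20, 2_000L);
        System.out.println(combine(e, p));
    }
}
```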

          Hide
          Sanjay Radia added a comment -

          >I think liststatus should show the information stored in index file and not the part file permissions.

          I disagree here.
          har -t should show what is in the index.
          However, the har filesystem should show the properties it is implementing:

          • the replication factor of the har - this is the replication factor of the part file.
          • the file permissions of the har - this can be the permissions of the directory or the index or the part file (we need to pick one).

          The har file system should pass the following unit test:

          • listStatus
          • check that the permissions returned are enforced.
          Rodrigo Schmidt made changes -
          Link This issue is blocked by MAPREDUCE-1578 [ MAPREDUCE-1578 ]
          Mahadev konar added a comment -

          HADOOP-6591 has been created for this. We can fix that with Avro, or we can fix it with URL encoding of the filenames (both would require bumping the version).

          Rodrigo Schmidt added a comment -

          I was working on this patch and I just realized hadoop archives don't work with files that have spaces in their names. Avro should fix that, or we could try to patch the current code to escape the filename before we store it. What do you guys think? Should we create a separate JIRA or should I just make the fix together with this code?

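          As a rough sketch of the URL-encoding option (the class and method names here are hypothetical, and the actual fix was tracked in separate JIRAs), escaping a name before writing it into the space-delimited index keeps embedded spaces from breaking the parse:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class IndexNameCodec {
    // Encode a file name before writing it into the space-delimited _index
    // line; spaces become '+', so the line can still be split on ' '.
    static String escape(String name) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8);
    }

    // Decode a name read back from the _index line.
    static String unescape(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }
}
```

          For example, `escape("my file.txt")` yields `my+file.txt`, which round-trips back through `unescape`.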
          Mahadev konar added a comment -

          I would vote for the access time of the index file, because that would involve only one lookup for an ls of a har file. We won't really need to access the part files and their access times to get the FileStatus of a file in hadoop archives. We should just make it explicit in the docs what metadata we return. Also, it's consistent with saying that all har file accesses go through the index file, so we just return the access time of the index file.

          Rodrigo Schmidt added a comment -

          What about the access time? I wasn't thinking of storing it in the index file. We could take the access time from the part file, but that might not be correct. Imagine two files are stored inside the same part file and we access only one of them. If we query the access time of the untouched file, it will be wrong (showing it has been accessed when it has not). Should we just keep the original last access time from before the har, the har creation time, and provide this information always, or should we do something else?

          Mahadev konar added a comment -

          > I think it is very useful to store the information, which can possibly be used in untar, but ls should return the meta data from the part files in order to avoid inconsistency.

          I think listStatus should show the information stored in the index file and not the part file permissions. It's a tricky question, but I think the information stored in the index file should be returned. Also, there are multiple files that need to be read for reading a real archived file (_index, _masterindex, part-***), so we cannot really return part-** file permissions and assume that it will always work. There could be a case where the _index file is not readable.

          We will have to throw the appropriate exception about files not being readable in case they are not. But the FileStatus information stored in the index file definitely seems the right thing to return.

          Tsz Wo Nicholas Sze added a comment -

          > ... yes I would like to store this information and not apply it to the part files (as you pointed out, this would probably lead to inconsistencies). ...

          The inconsistency may lead to some undesirable behaviors: ls may return 777 on a har:// path but some user actually cannot access it because the permission on the part file is different. Users will definitely be confused.

          > Keeping the properties does not really prevent someone who has access to the har directory and files from accessing the part file that contains the data, but it would help a lot if we wanted to unhar the files at some point and keep their original properties. That's exactly what I'm proposing in this JIRA.

          I think it is very useful to store the information, which can possibly be used in untar, but ls should return the meta data from the part files in order to avoid inconsistency.

          Tsz Wo Nicholas Sze added a comment -

          > ... Currently, the code creates FileStatus objects taking part of the information (e.g., size) from inside index file and the rest (replication, owner, times) from the properties of the index file itself, which is not correct. ...
          That's true. I filed HADOOP-6601 earlier.

          Tsz Wo Nicholas Sze made changes -
          Link This issue relates to HADOOP-6601 [ HADOOP-6601 ]
          Rodrigo Schmidt added a comment -

          Nicholas, yes, I would like to store this information and not apply it to the part files (as you pointed out, that would probably lead to inconsistencies). However, I would like the correct ownership, permissions, and times to show up when someone does an ls on a har:// path. Currently, the code creates FileStatus objects taking part of the information (e.g., size) from inside the index file and the rest (replication, owner, times) from the properties of the index file itself, which is not correct. If you do an ls on a har:// path today, the contents will show up as having a replication factor of 10 (the default for the index file) although the part file containing the data will probably have a replication factor of 3 (the hdfs default).

          Keeping the properties does not really prevent someone who has access to the har directory and files from accessing the part file that contains the data, but it would help a lot if we wanted to unhar the files at some point and keep their original properties. That's exactly what I'm proposing in this JIRA.

          In short, I would like to store the properties in the index file, list them on an ls command, and return them correctly from getFileStatus(), not much more than that. I think this would be a good start for future, more complicated extensions.

          Tsz Wo Nicholas Sze added a comment -

          Hi Rodrigo, it seems that you are suggesting to store the information but not actually apply them to the part files. Then, what will be returned for ls on har:// paths?

          Allen Wittenauer added a comment -

          FWIW, I actually had to debug tar once eons ago. Basically, it tries to set the user/group/acl/etc. information on every write. It just silently ignores those types of errors.

          Rodrigo Schmidt added a comment -

          Nicholas, I think the replication factor is not preserved. It takes the default replication factor for the new part... files.

          Tsz Wo Nicholas Sze added a comment -

          > ... Its similar to how you would create a tar!
          It sounds good to follow tar.

          A similar question: what is the replication number of the part file, given file1 and file2 have replication 3 and 10, respectively?

          Allen Wittenauer added a comment -

          Actually no.

          If tar is run as a super user, it can restore user and group information.

          Mahadev konar added a comment -

          Shouldn't we be using the default umask for the part files? And the owner would be the one who runs the archive command, no? It's similar to how you would create a tar!

          Tsz Wo Nicholas Sze added a comment -

          > you are right nicholas,that the file permissions that would be stored in the index file can be surpassed with using the underlying file system ...

          Mahadev, you have not answered my question "how would you set the permission on the part file?" Suppose we have two files:

          file1 owner1 700
          file2 owner2 070

          Then, what is the permission of the part file? 770? And who is the owner of the part file?

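          To make the dilemma concrete: the only single mode that preserves both owners' access is the union of the two modes, which over-grants. A minimal sketch of the arithmetic (plain Java, not Hadoop code):

```java
public class PartFilePermissions {
    // Union of the modes: the least restrictive single mode that still
    // grants each original owner's bits, at the cost of over-granting.
    static int union(int mode1, int mode2) {
        return mode1 | mode2;
    }
}
```

          Here union(0700, 0070) yields 0770, so owner1's group suddenly gains access it never had on file1, and there is still no single correct owner for the part file.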
          Tsz Wo Nicholas Sze added a comment -

          > ... but foo cannot set ownver of file2 to bar unless foo is not a superuser.

          Typo correction: ... but foo cannot set owner of file2 to bar unless foo is a superuser.

          Tsz Wo Nicholas Sze added a comment -

          There is one more problem on permission: Suppose the owners of file1 and file2 are foo and bar, respectively. foo creates an archive and the owner information are stored in the index file. Now, foo un-archive it but foo cannot set ownver of file2 to bar unless foo is not a superuser.

          Rodrigo Schmidt added a comment -

          I'm working on a patch for hadoop 0.20, so I sort of prefer to postpone the use of Avro to a future JIRA. I also agree with Mahadev that we should make it backwards compatible.

          Mahadev konar added a comment -

          You are right, Nicholas, that the file permissions that would be stored in the index file can be surpassed by using the underlying file system, but I think that's fine. I think it's the same case with a tarball, since you can read through a tarball file directly (surpassing the permissions of the files stored in the tar) and read data which might not be readable as per the permissions stored in the tarball.

          > I wonder if it's time to convert the index file to Avro? That should make it easier to add fields.

          I think it's a good idea to move to a binary format for sure. Though we need to maintain backwards compatibility, with the new version being able to read older versions.

          Tsz Wo Nicholas Sze added a comment -

          > Does anybody see a problem in doing that?

          Multiple small files may be stored in a single archive part file. If the original files have different permissions, then how would you set the permission on the part file?

          Doug Cutting added a comment -

          > My idea is to extend the index file with this information.

          Sounds reasonable to me.

          I wonder if it's time to convert the index file to Avro? That should make it easier to add fields.

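          If the index did move to Avro, an entry could be described by a record schema along these lines (a hypothetical sketch; the field names are illustrative and no such schema was committed under this JIRA):

```json
{
  "type": "record",
  "name": "HarIndexEntry",
  "fields": [
    {"name": "path",             "type": "string"},
    {"name": "isDir",            "type": "boolean"},
    {"name": "partFile",         "type": "string"},
    {"name": "offset",           "type": "long"},
    {"name": "length",           "type": "long"},
    {"name": "modificationTime", "type": "long"},
    {"name": "permission",       "type": "int"},
    {"name": "owner",            "type": "string"},
    {"name": "group",            "type": "string"}
  ]
}
```

          Adding a field later would then be an ordinary schema evolution step rather than a hand-rolled change to the text index format.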
          Rodrigo Schmidt added a comment -

          My idea is to extend the index file with this information. Does anybody see a problem in doing that?

          Rodrigo Schmidt made changes -
          Field Original Value New Value
          Assignee Rodrigo Schmidt [ rschmidt ]
          Rodrigo Schmidt created issue -

            People

            • Assignee:
              Rodrigo Schmidt
              Reporter:
              Rodrigo Schmidt