Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-739

Allow relative paths to be created inside archives.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: harchive
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Allow creating archives with relative paths with a -p option on the command line.

      Description

      Archives currently stores the full path from the input sources – since it allows multiple sources and regular expressions as inputs. So the created archives have the full path of the input sources. This is un intuitive and a user hassle. We should get rid of it and allow users to say that the created archive should be relative to some absolute path and throw an excpetion if the input does not confirm to the relative absolute path.

      1. MAPREDUCE-739-0.20.patch
        26 kB
        Mahadev konar
      2. MAPREDUCE-739.yhadoop.patch
        31 kB
        Mahadev konar
      3. MAPREDUCE-739.patch
        20 kB
        Mahadev konar
      4. MAPREDUCE-739.patch
        32 kB
        Mahadev konar
      5. MAPREDUCE-739.patch
        33 kB
        Mahadev konar
      6. HADOOP-3663.patch
        6 kB
        Mahadev konar
      7. HADOOP-3663.patch
        15 kB
        Mahadev konar
      8. HADOOP-3663.patch
        15 kB
        Mahadev konar

        Issue Links

          Activity

          Hide
          Mahadev konar added a comment -

          patch for yhadoop branch...

          Show
          Mahadev konar added a comment - patch for yhadoop branch...
          Hide
          Mahadev konar added a comment -

          attaching a patch that applies to the 0.20 branch. In case folks are interested in using it.

          Show
          Mahadev konar added a comment - attaching a patch that applies to the 0.20 branch. In case folks are interested in using it.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #29 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/29/)

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #29 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/29/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #25 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/25/)
          . Allow relative paths to be created in archives. Contributed by Mahadev Konar

          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #25 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/25/ ) . Allow relative paths to be created in archives. Contributed by Mahadev Konar
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #29 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/29/)
          HADOOP-6142. Update documentation and use of harchives for relative paths added
          in . Contributed by Mahadev Konar

          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk #29 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/29/ ) HADOOP-6142 . Update documentation and use of harchives for relative paths added in . Contributed by Mahadev Konar
          Hide
          Chris Douglas added a comment -

          I committed this. Thanks, Mahadev!

          Show
          Chris Douglas added a comment - I committed this. Thanks, Mahadev!
          Hide
          Chris Douglas added a comment -
               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 34 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          

          The new patch makes no functional changes, so the test result for the last run (unrelated streaming test failures) is sufficient

          Show
          Chris Douglas added a comment - [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 34 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. The new patch makes no functional changes, so the test result for the last run (unrelated streaming test failures) is sufficient
          Hide
          Mahadev konar added a comment -

          trying hudson again.

          Show
          Mahadev konar added a comment - trying hudson again.
          Hide
          Mahadev konar added a comment -

          attcched missing license to an old xml file - src/test/mapred/org/apache/hadoop/cli/testMRConf.xml

          Show
          Mahadev konar added a comment - attcched missing license to an old xml file - src/test/mapred/org/apache/hadoop/cli/testMRConf.xml
          Hide
          Todd Lipcon added a comment -

          Woops, I just mixed up two browser tabs and commented on the wrong JIRA. The "Delete" button seems to be gone. Please disregard the above.

          Show
          Todd Lipcon added a comment - Woops, I just mixed up two browser tabs and commented on the wrong JIRA. The "Delete" button seems to be gone. Please disregard the above.
          Hide
          Todd Lipcon added a comment -

          -1 release audit. The applied patch generated 317 release audit warnings (more than the trunk's current 315 warnings).

          The release audit diff is:

          121d120
          < [java] !????? /home/hudson/hudson-slave/workspace/Mapreduce-Patch-vesta.apache.org/trunk/build/hadoop-mapred-794101_MAPREDUCE-739_PATCH-12413485/docs/jdiff/changes/org.apache.hadoop.tools.HadoopArchives.html
          129d127
          < [java] !????? /home/hudson/hudson-slave/workspace/Mapreduce-Patch-vesta.apache.org/trunk/build/hadoop-mapred-794101_MAPREDUCE-739_PATCH-12413485/docs/jdiff/changes/pkg_org.apache.hadoop.tools.html

          This patch doesn't touch those packages, so I think we can ignore this:

           .../org/apache/hadoop/fs/ChecksumFileSystem.java   |    5 +-
           src/java/org/apache/hadoop/util/DataChecksum.java  |   10 +-
           src/java/org/apache/hadoop/util/PureJavaCrc32.java |  351 ++++++++++++++++++++
           .../org/apache/hadoop/util/TestPureJavaCrc32.java  |  171 ++++++++++
          

          -1 contrib tests. The patch failed contrib unit tests.

          Only failed Streaming tests which are failing on trunk.

          Show
          Todd Lipcon added a comment - -1 release audit. The applied patch generated 317 release audit warnings (more than the trunk's current 315 warnings). The release audit diff is: 121d120 < [java] !????? /home/hudson/hudson-slave/workspace/Mapreduce-Patch-vesta.apache.org/trunk/build/hadoop-mapred-794101_ MAPREDUCE-739 _PATCH-12413485/docs/jdiff/changes/org.apache.hadoop.tools.HadoopArchives.html 129d127 < [java] !????? /home/hudson/hudson-slave/workspace/Mapreduce-Patch-vesta.apache.org/trunk/build/hadoop-mapred-794101_ MAPREDUCE-739 _PATCH-12413485/docs/jdiff/changes/pkg_org.apache.hadoop.tools.html This patch doesn't touch those packages, so I think we can ignore this: .../org/apache/hadoop/fs/ChecksumFileSystem.java | 5 +- src/java/org/apache/hadoop/util/DataChecksum.java | 10 +- src/java/org/apache/hadoop/util/PureJavaCrc32.java | 351 ++++++++++++++++++++ .../org/apache/hadoop/util/TestPureJavaCrc32.java | 171 ++++++++++ -1 contrib tests. The patch failed contrib unit tests. Only failed Streaming tests which are failing on trunk.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12413485/MAPREDUCE-739.patch
          against trunk revision 794101.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 34 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 317 release audit warnings (more than the trunk's current 315 warnings).

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12413485/MAPREDUCE-739.patch against trunk revision 794101. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 34 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 release audit. The applied patch generated 317 release audit warnings (more than the trunk's current 315 warnings). +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/artifact/trunk/current/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/393/console This message is automatically generated.
          Hide
          Mahadev konar added a comment -

          updated the patch to

          • fix the test failure in TestCLI for mapreduce
          • made minor changes to error if sources are not relative to parent path
          • made an unnecessary public method package private

          not sure why the hudson build failed the other tests. I will try hudson again for this patch.

          Show
          Mahadev konar added a comment - updated the patch to fix the test failure in TestCLI for mapreduce made minor changes to error if sources are not relative to parent path made an unnecessary public method package private not sure why the hudson build failed the other tests. I will try hudson again for this patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12413168/MAPREDUCE-739.patch
          against trunk revision 793457.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 317 release audit warnings (more than the trunk's current 315 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12413168/MAPREDUCE-739.patch against trunk revision 793457. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 release audit. The applied patch generated 317 release audit warnings (more than the trunk's current 315 warnings). -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/artifact/trunk/current/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/382/console This message is automatically generated.
          Hide
          Mahadev konar added a comment -

          this patch adds an argument

          archiveName <> -p paretnPath <src>* <dest>
          

          here all the files will be archived relative to parentPath. If the user has only one directory and wants the archived files relative to that then

          archiveName <> -p paretnPath <dest> can be used.
          
          • the new argument is mandatory and is required. This is a change from old versions but given that this will make usability much easier it will makes things easy to understand.
          • the old archives is still readalbe from this new code (the old archives that were created pre 0.21 will be readable still)
          • also incorpiorated hairong's comments to make the code much easier to read.
          • update forrest docs with better examples in common jira HADOOP-6142.
          Show
          Mahadev konar added a comment - this patch adds an argument archiveName <> -p paretnPath <src>* <dest> here all the files will be archived relative to parentPath. If the user has only one directory and wants the archived files relative to that then archiveName <> -p paretnPath <dest> can be used. the new argument is mandatory and is required. This is a change from old versions but given that this will make usability much easier it will makes things easy to understand. the old archives is still readalbe from this new code (the old archives that were created pre 0.21 will be readable still) also incorpiorated hairong's comments to make the code much easier to read. update forrest docs with better examples in common jira HADOOP-6142 .
          Hide
          Mahadev konar added a comment -

          cancelling the patch for 0.19... ill fix the patch with review comments for a lter release.

          Show
          Mahadev konar added a comment - cancelling the patch for 0.19... ill fix the patch with review comments for a lter release.
          Hide
          Hairong Kuang added a comment -

          My point is that those paths not belonging to the archive should not be there in the first place. Then there is no need to filter them later. Although there are only a few lines, they made me to wonder why they are there and had to talk to you to understand them. I do not think the fix needs a lot of time , why complaining about bandwidth?

          Show
          Hairong Kuang added a comment - My point is that those paths not belonging to the archive should not be there in the first place. Then there is no need to filter them later. Although there are only a few lines, they made me to wonder why they are there and had to talk to you to understand them. I do not think the fix needs a lot of time , why complaining about bandwidth?
          Hide
          Mahadev konar added a comment -

          hmm... the code in the mapper just filters any paths that should not be part of the archive. It just calls a function and there are 3 line changes for that. Neway if its cofusing for someone there must be somethign wrong with the code . I dont have the bandwidth to fix this (as mentioned above) by the code freeze date. Ill leave it as it is. If people think that this needs to go into 0.19 I think this patch should be good...

          Show
          Mahadev konar added a comment - hmm... the code in the mapper just filters any paths that should not be part of the archive. It just calls a function and there are 3 line changes for that. Neway if its cofusing for someone there must be somethign wrong with the code . I dont have the bandwidth to fix this (as mentioned above) by the code freeze date. Ill leave it as it is. If people think that this needs to go into 0.19 I think this patch should be good...
          Hide
          Hairong Kuang added a comment -

          For (2), although the code complexity might be the same, I think at least not to write the parent directory structure to srcWriter. The code in the mapper that filters those directories is really confusing.

          Show
          Hairong Kuang added a comment - For (2), although the code complexity might be the same, I think at least not to write the parent directory structure to srcWriter. The code in the mapper that filters those directories is really confusing.
          Hide
          Mahadev konar added a comment -

          in response to hairong's comments -
          1) fixed
          2) i think the code complexity would be the same in both the cases.
          3) you right that using strnigs migt be more efficient but then the code would not be robust.

          Show
          Mahadev konar added a comment - in response to hairong's comments - 1) fixed 2) i think the code complexity would be the same in both the cases. 3) you right that using strnigs migt be more efficient but then the code would not be robust.
          Hide
          Hairong Kuang added a comment -

          The code looks good. I have a few comments:
          1. In usage, relative src should be src because absolution paths are still supported.
          2. Will the code be cleaner if the archive method takes relative paths as the second parameter? The structure of the parent directory does not need to write to srcWriter. Only relative paths need to be passed to mappers too.
          3. Will it be more efficient if the relPathToRoot works on Strings?

          Show
          Hairong Kuang added a comment - The code looks good. I have a few comments: 1. In usage, relative src should be src because absolution paths are still supported. 2. Will the code be cleaner if the archive method takes relative paths as the second parameter? The structure of the parent directory does not need to write to srcWriter. Only relative paths need to be passed to mappers too. 3. Will it be more efficient if the relPathToRoot works on Strings?
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389800/HADOOP-3663.patch
          against trunk revision 693705.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3228/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3228/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3228/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3228/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12389800/HADOOP-3663.patch against trunk revision 693705. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3228/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3228/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3228/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3228/console This message is automatically generated.
          Hide
          Mahadev konar added a comment -

          This patch adds tests, documentation chnages and a minor bugfix tp the previous one.

          Show
          Mahadev konar added a comment - This patch adds tests, documentation chnages and a minor bugfix tp the previous one.
          Hide
          Mahadev konar added a comment -

          this is preliminary patch which implements the above. Ill update the patch to include changes in docs and some tests.

          Show
          Mahadev konar added a comment - this is preliminary patch which implements the above. Ill update the patch to include changes in docs and some tests.
          Hide
          Mahadev konar added a comment -

          so here is a proposal after talking to owen and sanjay...

          the archive command would look like

          archive -p /foo/bar <relatives paths to /foo/bar>

          example
          archive -p /foo/bar a b c

          where /foo/bar/a is a directory, /foo/bar/b, /foo/bar/c is a directory.

          if no -p is specified then the archives will work as it is now... the paths inside the archive will be absolute.

          this will allow user to access archives that are created using relative paths.

          as of 0.18
          if you create an archive with

          archive /foo/bar/a /foo/bar/b /foo/bar/c

          then to access these directories in the archives you will have to use

          har://somepath.har/foo/bar/a
          har://somepath.har/foo/bar/b
          har://somepath.har/foo/bar/c

          with the above proposal the user would create an archive using
          archive -p /foo/bar a b c

          and to access the directories would use

          har://somepath.har/a
          har://somepath.har/b
          har://somepath.har/c

          comments are welcome....

          Show
          Mahadev konar added a comment - so here is a proposal after talking to owen and sanjay... the archive command would look like archive -p /foo/bar <relatives paths to /foo/bar> example archive -p /foo/bar a b c where /foo/bar/a is a directory, /foo/bar/b, /foo/bar/c is a directory. if no -p is specified then the archives will work as it is now... the paths inside the archive will be absolute. this will allow user to access archives that are created using relative paths. as of 0.18 if you create an archive with archive /foo/bar/a /foo/bar/b /foo/bar/c then to access these directories in the archives you will have to use har://somepath.har/foo/bar/a har://somepath.har/foo/bar/b har://somepath.har/foo/bar/c with the above proposal the user would create an archive using archive -p /foo/bar a b c and to access the directories would use har://somepath.har/a har://somepath.har/b har://somepath.har/c comments are welcome....

            People

            • Assignee:
              Mahadev konar
              Reporter:
              Mahadev konar
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development