Hadoop Common / HADOOP-2019

DistributedFileCache should support .tgz files in addition to jars and zip files

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.3
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Added support for .tar, .tgz and .tar.gz files in DistributedCache. File sizes are limited to 2GB.

      Description

      Currently the distributed file cache only works with zip and jar archives, which do not work for archives larger than 2 GB. We should also support .tgz archives.

      1. patch-2019.txt
        13 kB
        Amareshwari Sriramadasu
      2. patch-2019.txt
        8 kB
        Amareshwari Sriramadasu
      3. patch-2019.txt
        6 kB
        Amareshwari Sriramadasu
      4. test.tar.gz
        0.2 kB
        Amareshwari Sriramadasu
      5. test.tgz
        0.2 kB
        Amareshwari Sriramadasu
      6. test.tar
        10 kB
        Amareshwari Sriramadasu
      7. patch-2019.txt
        7 kB
        Amareshwari Sriramadasu

          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #484 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/484/).

          Devaraj Das added a comment -

          I just committed this. Thanks, Amareshwari!

          Amareshwari Sriramadasu added a comment -

          Looks like Hudson is not able to run Findbugs, maybe because of the dependency on the tar files in build.xml.
          But I ran Findbugs on my machine, and no new Findbugs warnings are introduced.

          Mahadev konar added a comment -

          Patch looks good. Can you take care of the Findbugs warnings?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381483/patch-2019.txt
          against trunk revision 653638.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 20 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment -

          Added documentation

          Amareshwari Sriramadasu added a comment -

          Regarding the suggestion that "we could use existing 'hadoop-14-dfs-dir.tgz' since the contents of the tar file don't seem to matter (I'm not so sure)":

          No, we cannot use hadoop-14-dfs-dir.tgz for the test TestMiniMRDFSCaching, because the test also reads the contents of the tar file for its assertions.

          Raghu Angadi added a comment -

          +1 for the changes. Regarding Hudson, you could either include the binary files in the patch (which will probably work) or, just for Hudson, we could use the existing "hadoop-14-dfs-dir.tgz", since the contents of the tar file don't seem to matter (I'm not so sure).

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381418/patch-2019.txt
          against trunk revision 653264.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 20 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment -

          Here is a patch doing the untar as suggested in HADOOP-1717.
          I moved the untarring code to FileUtil.untar() and call it from TestDFSUpgradeFromImage and from DistributedCache. The patch also uses the ShellCommandExecutor to run the command.
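
          The approach described in this comment (one shared helper that shells out to tar) can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: it uses the JDK's ProcessBuilder in place of Hadoop's ShellCommandExecutor so that it is self-contained, and the names UnTarSketch, tarCommand and unTar are hypothetical. It assumes a tar on the PATH that accepts the -z and -C options.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not the FileUtil code from the patch.
class UnTarSketch {

    /** Builds the tar argument list for extracting inFile into untarDir. */
    static List<String> tarCommand(File inFile, File untarDir) {
        String name = inFile.getName();
        // .tgz and .tar.gz are gzip-compressed; plain .tar is not.
        boolean gzipped = name.endsWith(".gz") || name.endsWith(".tgz");
        List<String> cmd = new ArrayList<>();
        cmd.add("tar");
        cmd.add(gzipped ? "-xzf" : "-xf");
        cmd.add(inFile.getPath());
        cmd.add("-C"); // assumption: this tar supports -C <dir>
        cmd.add(untarDir.getPath());
        return cmd;
    }

    /** Runs tar and fails with an IOException on a non-zero exit code. */
    static void unTar(File inFile, File untarDir)
            throws IOException, InterruptedException {
        if (!untarDir.mkdirs() && !untarDir.isDirectory()) {
            throw new IOException("Could not create directory " + untarDir);
        }
        Process process = new ProcessBuilder(tarCommand(inFile, untarDir))
                .inheritIO() // let tar write its messages to our stdout/stderr
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tar exited with code " + exitCode
                    + " while extracting " + inFile);
        }
    }

    public static void main(String[] args) {
        // Only prints the commands; nothing is executed here.
        for (String n : new String[] {"test.tar", "test.tgz", "test.tar.gz"}) {
            System.out.println(tarCommand(new File(n), new File("out")));
        }
    }
}
```

          Keeping the command construction separate from the process launch makes the part that differs per archive type easy to check without a tar binary being present.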

          Amareshwari Sriramadasu added a comment -

          Cancelling patch to address Raghu's comments.

          Raghu Angadi added a comment -

          Essentially, we can port (or merge) the untar part of TestDFSUpgradeFromImage.unpackStorage() to FileUtils.unTar() and use it in both places.

          Raghu Angadi added a comment -

          Also, it is better to use ShellCommandExecutor to run the command since it takes care of various errors.

          Raghu Angadi added a comment -

          This will most likely fail on Solaris, since its 'tar' does not support the '-z' option. See HADOOP-1717 for a possible workaround.
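
          One common way around a tar without '-z' is to decompress with gzip and pipe the result into tar. The sketch below only builds such a shell command. Whether this matches the workaround discussed in HADOOP-1717 is an assumption, and the names GzipPipeSketch, shellQuote and pipelineCommand are hypothetical.

```java
// Hypothetical helper illustrating a gzip-to-tar pipeline; not taken from the patch.
class GzipPipeSketch {

    /** Wraps a string in single quotes for a POSIX shell, escaping embedded quotes. */
    static String shellQuote(String s) {
        return "'" + s.replace("'", "'\\''") + "'";
    }

    /**
     * Returns an argv that decompresses the archive with gzip and feeds the
     * plain tar stream to "tar -xf -" inside the destination directory,
     * so tar itself never needs the -z option.
     */
    static String[] pipelineCommand(String archive, String destDir) {
        String script = "gzip -dc " + shellQuote(archive)
                + " | (cd " + shellQuote(destDir) + " && tar -xf -)";
        return new String[] {"sh", "-c", script};
    }

    public static void main(String[] args) {
        // Prints the command line only; nothing is executed.
        System.out.println(String.join(" ", pipelineCommand("test.tar.gz", "work dir")));
    }
}
```

          The resulting array can be handed to any process launcher. The pipe requires a shell, which is why the script is passed to sh -c rather than run as a plain argument list.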

          Mahadev konar added a comment -

          +1 patch looks good.

          steve_l added a comment -

          1. If you set your <javac> task up with includeantruntime=false you don't get any version conflict, but then you have to make 100% sure your classpath contains every JAR you need to build.

          2. It would be good to give lib/ant.jar a name like lib/ant-1.7.jar, so people can see at a glance what version to use.

          Amareshwari Sriramadasu added a comment -

          I ran Findbugs on my machine on the trunk and also with the patch; there are no new Findbugs warnings introduced.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381171/patch-2019.txt
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 17 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to cause Findbugs to fail.

          core tests -1. The patch failed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment -

          All the tests passed on my machine.

          Amareshwari Sriramadasu added a comment -

          Looks like ant.jar was removed from lib because it causes problems if it disagrees with the version of Ant that people are using (HADOOP-1726).
          Here is a patch that untars files using the tar executable.
          The test.tar, test.tar.gz and test.tgz files should be put in src/test/org/apache/hadoop/mapred/.

          Mahadev konar added a comment -

          +1 on Owen's comment about using the tar executable.

          Owen O'Malley added a comment -

          Why don't we just run the real tar executable? I think that pulling in the Ant dependency is much more problematic.

          Owen O'Malley added a comment -

          I'm really concerned about including the ant.jar in hadoop. We've had a lot of problems in the past with conflicting versions of ant.jar.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381075/test.tar.gz
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included -1. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          patch -1. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2340/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment -

          This patch would not run through the Hudson tests, since it requires ant.jar in lib/ and the test.tar, test.tgz and test.tar.gz files in src/test/org/apache/hadoop/mapred/. I'm attaching the files separately, since jar and tar files cannot be part of the patch.

          Amareshwari Sriramadasu added a comment -

          Here is a patch supporting .tar, .tgz and .tar.gz files in DistributedCache. I pulled out ant-1.7.0 and used org.apache.tools.tar.TarInputStream and org.apache.tools.tar.TarEntry for untarring. Also updated testcase TestMiniMRWithDFSCaching to add .tar, .tgz and .tar.gz to cache archive and run job.
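
          Supporting several archive types means choosing an unpacking routine from the file name. A minimal sketch of that dispatch step is shown below, assuming the decision is made purely on the suffix. The class ArchiveKind and its classify method are hypothetical and are not part of DistributedCache.

```java
import java.util.Locale;

// Hypothetical suffix-based classifier; not the DistributedCache code.
class ArchiveKind {

    enum Kind { JAR, ZIP, TAR, TAR_GZ, UNKNOWN }

    /** Maps a file name to the kind of unpacking it would need. */
    static Kind classify(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".jar")) {
            return Kind.JAR;
        }
        if (name.endsWith(".zip")) {
            return Kind.ZIP;
        }
        // Check the gzip-compressed forms before the plain ".tar" suffix.
        if (name.endsWith(".tar.gz") || name.endsWith(".tgz")) {
            return Kind.TAR_GZ;
        }
        if (name.endsWith(".tar")) {
            return Kind.TAR;
        }
        return Kind.UNKNOWN; // leave the file as it is
    }

    public static void main(String[] args) {
        String[] names = {"test.tar", "test.tgz", "test.tar.gz", "lib.jar", "data.zip", "notes.txt"};
        for (String n : names) {
            System.out.println(n + " -> " + classify(n));
        }
    }
}
```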

          steve_l added a comment -

          I'd recommend you just pull in the ant-1.7 or (soon) the ant-1.7.1 jar and use the classes directly. That is where the classes originate, and they are designed to work outside Ant builds. Creating and releasing snapshots is bad because:
          - The ASF doesn't like projects releasing code that uses other projects' snapshots (it's related to signoff). Certainly, were I on your PMC, I'd be vetoing any 1.0 release that was still using the commons-cli snapshot.
          - You can't build Maven/Ivy dependency metadata XML files that don't refer to the unstable snapshot repository, which makes it impossible for downstream users to reliably recreate your execution environment.

          Andrzej Bialecki added a comment -

          The Tar version available from commons sandbox has a subtle bug when creating tar files (it appends only one null block at the end of archive, instead of two empty blocks as expected by GNU tar). Ant <tar> task contains a fixed copy of the same class.
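
          The end-of-archive marker mentioned here can be checked directly: a tar archive is expected to end with at least two all-zero 512-byte blocks. The sketch below counts the trailing zero blocks of an uncompressed .tar file. The class and method names are hypothetical, and reading the whole file into memory is only reasonable for small archives.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical checker for the tar end-of-archive marker.
class TarTrailerCheck {

    static final int BLOCK = 512; // tar block size in bytes

    /** Counts the all-zero 512-byte blocks at the end of the data. */
    static int trailingZeroBlocks(byte[] data) {
        int count = 0;
        for (int end = data.length; end >= BLOCK; end -= BLOCK) {
            boolean allZero = true;
            for (int i = end - BLOCK; i < end; i++) {
                if (data[i] != 0) {
                    allZero = false;
                    break;
                }
            }
            if (!allZero) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java TarTrailerCheck <file.tar>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int blocks = trailingZeroBlocks(data);
        System.out.println(blocks + " trailing zero block(s); "
                + (blocks >= 2 ? "end-of-archive marker present" : "marker too short"));
    }
}
```

          Archives written by GNU tar are usually padded with more than two zero blocks, so the check is for at least two rather than exactly two.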

          Mahadev konar added a comment -

          I would suggest exec'ing a tar -zxf process to untar the files.
          It might be problematic on Cygwin, where tar might not be installed by default, but it is
          an easier solution and works for most systems.

          Amareshwari Sriramadasu added a comment -

          Apache Commons Compress (http://commons.apache.org/sandbox/compress/apidocs/) has TarArchive, TarInputStream, etc. classes. But the project is a sandbox component and no releases are available. So shall we create a snapshot and use it?

          Yiping Han added a comment -

          Is the .tgz file support available now?

          Milind Bhandarkar added a comment -

          +1


            People

            • Assignee:
              Amareshwari Sriramadasu
              Reporter:
              Owen O'Malley
            • Votes:
              0
              Watchers:
              1
