Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.3
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: Added support for .tar, .tgz and .tar.gz files in DistributedCache. File sizes are limited to 2GB.

    Description

      Currently the distributed file cache only works with zip and jar archives, which don't work for files larger than 2 GB. We should also support .tgz archives.
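      For illustration only, a minimal usage sketch (the HDFS path and class name are hypothetical, not part of this issue) of how a job could register a .tgz archive with the DistributedCache once archives of this type are supported:

        // Hypothetical usage sketch: register a .tgz archive with the DistributedCache.
        // The archive is assumed to already exist on HDFS; it is unpacked on each task node.
        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.filecache.DistributedCache;

        public class CacheArchiveExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            DistributedCache.addCacheArchive(new URI("/user/example/data/test.tgz"), conf);
            // ... configure and submit the MapReduce job with this conf ...
          }
        }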

      Attachments

        1. patch-2019.txt
          7 kB
          Amareshwari Sriramadasu
        2. test.tar
          10 kB
          Amareshwari Sriramadasu
        3. test.tgz
          0.2 kB
          Amareshwari Sriramadasu
        4. test.tar.gz
          0.2 kB
          Amareshwari Sriramadasu
        5. patch-2019.txt
          6 kB
          Amareshwari Sriramadasu
        6. patch-2019.txt
          8 kB
          Amareshwari Sriramadasu
        7. patch-2019.txt
          13 kB
          Amareshwari Sriramadasu

        Activity

          hudson Hudson added a comment -

          Integrated in Hadoop-trunk #484 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/484/ )
          ddas Devaraj Das added a comment -

          I just committed this. Thanks, Amareshwari!


          amareshwari Amareshwari Sriramadasu added a comment -

          Looks like Hudson is not able to run findbugs, maybe because of the dependency on the tar files in build.xml.
          But I ran findbugs on my machine, and there are no new findbugs warnings introduced.
          mahadev Mahadev Konar added a comment -

          Patch looks good. Can you take care of the findbugs warnings?

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381483/patch-2019.txt
          against trunk revision 653638.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 20 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/console

          This message is automatically generated.


          amareshwari Amareshwari Sriramadasu added a comment -

          Added documentation.

          amareshwari Amareshwari Sriramadasu added a comment -

          > we could use existing "hadoop-14-dfs-dir.tgz" since the contents of the tar file don't seem to matter (I'm not so sure).

          No, we cannot use hadoop-14-dfs-dir.tgz for the test TestMiniMRDFSCaching, because we also read the contents of the tar file for assertions.
          rangadi Raghu Angadi added a comment -

          +1 for the changes. Regarding Hudson, you could either include the binary files in the patch (that might mostly work) or, just for Hudson, we could use the existing "hadoop-14-dfs-dir.tgz" since the contents of the tar file don't seem to matter (I'm not so sure).

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381418/patch-2019.txt
          against trunk revision 653264.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 20 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/console

          This message is automatically generated.


          amareshwari Amareshwari Sriramadasu added a comment -

          Here is a patch doing the untar as suggested in HADOOP-1717.
          I moved the code for untarring to FileUtil.untar() and call it in TestDFSUpgradeFromImage and also in DistributedCache. Also used the ShellCommandExecutor to run the command.
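          For reference, a rough sketch of this approach (not the committed patch; the helper name and flags here are illustrative) of untarring through the tar executable via ShellCommandExecutor:

            // Illustrative sketch only: unpack a tar archive by shelling out to the
            // tar executable through Hadoop's ShellCommandExecutor.
            import java.io.File;
            import java.io.IOException;
            import org.apache.hadoop.util.Shell.ShellCommandExecutor;

            public class UnTarSketch {
              public static void unTar(File inFile, File untarDir) throws IOException {
                if (!untarDir.mkdirs() && !untarDir.isDirectory()) {
                  throw new IOException("Mkdirs failed to create " + untarDir);
                }
                // Add -z only for gzipped archives (.tgz / .tar.gz).
                String gzipFlag = inFile.getName().endsWith("gz") ? "z" : "";
                String[] cmd = {"tar", "-x" + gzipFlag + "f", inFile.getAbsolutePath()};
                ShellCommandExecutor shexec = new ShellCommandExecutor(cmd, untarDir);
                shexec.execute();  // runs in untarDir; throws on a non-zero exit code
                if (shexec.getExitCode() != 0) {
                  throw new IOException("Error untarring file " + inFile);
                }
              }
            }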

          amareshwari Amareshwari Sriramadasu added a comment -

          Cancelling patch to address Raghu's comments.
          rangadi Raghu Angadi added a comment -

          Essentially, we can port (or merge) untar part of TestDFSUpgradeFromImage.unpackStorage() to FileUtils.unTar() and use it in both places.

          rangadi Raghu Angadi added a comment -

          Also, it is better to use ShellCommandExecutor to run the command since it takes care of various errors.

          rangadi Raghu Angadi added a comment -

          This will mostly fail on Solaris since 'tar' does not support '-z' option. See HADOOP-1717 for possible work around.

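          For the Solaris case, one possible shape of the workaround (a sketch only; HADOOP-1717 describes the actual fix) is to build a command that pipes gzip -dc into plain tar instead of relying on tar -z:

            // Sketch: build a portable untar command. Solaris tar lacks -z, so for
            // gzipped archives decompress with 'gzip -dc' and pipe into 'tar -xf -'.
            // Paths are placeholders; the command would be run via bash -c, e.g. with
            // new ShellCommandExecutor(new String[] {"bash", "-c", cmd}).
            public class PortableUnTarCommand {
              static String buildCommand(String archive, String targetDir) {
                if (archive.endsWith(".gz") || archive.endsWith(".tgz")) {
                  return "gzip -dc '" + archive + "' | (cd '" + targetDir + "' && tar -xf -)";
                }
                return "cd '" + targetDir + "' && tar -xf '" + archive + "'";
              }
            }
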
          mahadev Mahadev Konar added a comment -

          +1 patch looks good.

          steve_l Steve Loughran added a comment -

          1. If you set your <javac> task up with includeantruntime=false you don't get any version conflict, but then you have to make 100% sure your classpath contains every JAR you need to build.

          2. It would be good to give lib/ant.jar a name like lib/ant-1.7.jar, so people can see at a glance what version to use.


          amareshwari Amareshwari Sriramadasu added a comment -

          I ran findbugs on my machine on the trunk and also with the patch; there are no new findbugs warnings introduced.
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381171/patch-2019.txt
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 17 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to cause Findbugs to fail.

          core tests -1. The patch failed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/console

          This message is automatically generated.


          amareshwari Amareshwari Sriramadasu added a comment -

          All the tests passed on my machine.

          amareshwari Amareshwari Sriramadasu added a comment -

          Looks like ant.jar was removed from lib because it causes problems if it disagrees with the version of ant that people are using (HADOOP-1726).
          Here is a patch doing the untarring of files using the tar executable.
          The test.tar, test.tar.gz and test.tgz files should be put in src/test/org/apache/hadoop/mapred/ .
          mahadev Mahadev Konar added a comment -

          +1 on Owen's comment of using the tar executable.

          omalley Owen O'Malley added a comment -

          Why don't we just run the real tar executable? I think that pulling in the ant dependency is much more problematic.

          omalley Owen O'Malley added a comment -

          I'm really concerned about including the ant.jar in hadoop. We've had a lot of problems in the past with conflicting versions of ant.jar.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381075/test.tar.gz
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included -1. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          patch -1. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2340/console

          This message is automatically generated.


          amareshwari Amareshwari Sriramadasu added a comment -

          This patch would not run through the Hudson tests, since it requires ant.jar in lib/ and the test.tar, test.tgz and test.tar.gz files in src/test/org/apache/hadoop/mapred/ . I'm attaching the files separately, since jar and tar files cannot be part of the patch.

          amareshwari Amareshwari Sriramadasu added a comment -

          Here is a patch supporting .tar, .tgz and .tar.gz files in DistributedCache. I pulled out ant-1.7.0 and used org.apache.tools.tar.TarInputStream and org.apache.tools.tar.TarEntry for untarring. Also updated the testcase TestMiniMRWithDFSCaching to add .tar, .tgz and .tar.gz to the cache archives and run a job.
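          As a rough illustration of this pure-Java approach (a sketch assuming ant.jar on the classpath, not the exact code in the attached patch):

            // Sketch: untar an archive with Ant's tar classes; gzipped archives
            // (.tgz / .tar.gz) are first wrapped in a GZIPInputStream.
            import java.io.*;
            import java.util.zip.GZIPInputStream;
            import org.apache.tools.tar.TarEntry;
            import org.apache.tools.tar.TarInputStream;

            public class TarExtractSketch {
              public static void extract(File archive, File destDir) throws IOException {
                InputStream in = new FileInputStream(archive);
                if (archive.getName().endsWith("gz")) {
                  in = new GZIPInputStream(in);
                }
                TarInputStream tin = new TarInputStream(in);
                try {
                  TarEntry entry;
                  while ((entry = tin.getNextEntry()) != null) {
                    File out = new File(destDir, entry.getName());
                    if (entry.isDirectory()) {
                      out.mkdirs();
                      continue;
                    }
                    out.getParentFile().mkdirs();
                    OutputStream os = new FileOutputStream(out);
                    try {
                      tin.copyEntryContents(os);  // copy the current entry's bytes
                    } finally {
                      os.close();
                    }
                  }
                } finally {
                  tin.close();
                }
              }
            }
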
          steve_l Steve Loughran added a comment -

          I'd recommend you just pull in the ant-1.7 or (soon) the ant-1.7.1 jar and use them directly. That is where the classes originate, and they are designed to work outside Ant builds. Creating and releasing snapshots is bad because:
          - The ASF doesn't like projects releasing code using other projects' snapshots (it's related to signoff). Certainly, were I on your PMC, I'd be vetoing any 1.0 release that was still using the commons-cli snapshot.
          - You can't build maven/ivy dependency metadata XML files that don't refer to the unstable snapshot repository, which makes it impossible for downstream users to reliably recreate your execution environment.

          ab Andrzej Bialecki added a comment -

          The Tar version available from the commons sandbox has a subtle bug when creating tar files (it appends only one null block at the end of the archive, instead of the two empty blocks expected by GNU tar). The Ant <tar> task contains a fixed copy of the same class.
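          For context, GNU tar expects an archive to end with two consecutive 512-byte zero-filled records; a writer that emits only one (as described above) produces archives that some tools reject. A toy illustration of the expected trailer, not code from any of the libraries discussed:

            // Toy illustration: the tar end-of-archive marker is two 512-byte blocks
            // of zeros; writing only one is the bug described above.
            import java.io.IOException;
            import java.io.OutputStream;

            public class TarTrailer {
              private static final int RECORD_SIZE = 512;

              static void writeEndOfArchive(OutputStream out) throws IOException {
                byte[] zeros = new byte[RECORD_SIZE];
                out.write(zeros);  // first empty record
                out.write(zeros);  // second empty record, also required
              }
            }
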
          mahadev Mahadev Konar added a comment -

          I would suggest exec'ing a process running tar -zxf for untarring the files.
          It might be problematic on cygwin, where tar might not be installed by default, but it is
          an easier solution and works for most systems.

          amareshwari Amareshwari Sriramadasu added a comment -

          Apache commons compress (http://commons.apache.org/sandbox/compress/apidocs/) has TarArchive, TarInputStream, etc. classes. But the project is a sandbox component and no releases are available. So shall we create a snapshot and use it?
          yhan Yiping Han added a comment -

          Is the .tgz file support available now?

          milindb Milind Barve added a comment -

          +1


          People

            Assignee: amareshwari Amareshwari Sriramadasu
            Reporter: omalley Owen O'Malley