  1. Hadoop Common
  2. HADOOP-2019

DistributedFileCache should support .tgz files in addition to jars and zip files

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.3
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: Added support for .tar, .tgz and .tar.gz files in DistributedCache. File sizes are limited to 2GB.

    Description

      Currently the distributed file cache only works with zip and jar archives, which don't work for archives larger than 2 GB. We should support .tgz archives also.

      Attachments

        1. patch-2019.txt
          7 kB
          Amareshwari Sriramadasu
        2. test.tar
          10 kB
          Amareshwari Sriramadasu
        3. test.tgz
          0.2 kB
          Amareshwari Sriramadasu
        4. test.tar.gz
          0.2 kB
          Amareshwari Sriramadasu
        5. patch-2019.txt
          6 kB
          Amareshwari Sriramadasu
        6. patch-2019.txt
          8 kB
          Amareshwari Sriramadasu
        7. patch-2019.txt
          13 kB
          Amareshwari Sriramadasu

        Issue Links

          Activity

            milindb Milind Barve added a comment -

            +1

            yhan Yiping Han added a comment -

            Is the .tgz file support available now?


            amareshwari Amareshwari Sriramadasu added a comment -

            Apache Commons Compress (http://commons.apache.org/sandbox/compress/apidocs/) has TarArchive, TarInputStream, and related classes. But the project is a sandbox component and no releases are available. So shall we create a snapshot and use it?
            mahadev Mahadev Konar added a comment -

            I would suggest exec'ing a process running tar -zxf to untar the files.
            It might be problematic on Cygwin, where tar might not be installed by default, but it is
            an easier solution and works for most systems.
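
A minimal sketch of what exec'ing tar from Java might look like, using only the JDK's ProcessBuilder. The class name ExecTarSketch, its method names, and the use of the -C option to select the target directory are assumptions made for illustration; they are not taken from any patch on this issue.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class ExecTarSketch {

    /** Builds the argument vector for extracting a gzipped tar archive into destDir. */
    static List<String> untarCommand(File archive, File destDir) {
        // -z: gunzip, -x: extract, -f: archive file, -C: change to destDir before extracting
        return Arrays.asList("tar", "-zxf", archive.getPath(), "-C", destDir.getPath());
    }

    /** Runs tar and fails loudly if it cannot be started or exits with a non-zero status. */
    static void untar(File archive, File destDir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(untarCommand(archive, destDir))
                .inheritIO()
                .start(); // throws IOException if no tar executable is found on the PATH
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("tar exited with status " + exit + " for " + archive);
        }
    }
}
```

The start() call is where a missing tar (the Cygwin concern above) would surface, as an IOException rather than a silent failure.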


            ab Andrzej Bialecki added a comment -

            The Tar version available from commons sandbox has a subtle bug when creating tar files (it appends only one null block at the end of archive, instead of two empty blocks as expected by GNU tar). Ant <tar> task contains a fixed copy of the same class.
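
To make the end-of-archive rule concrete: a tar archive is a sequence of 512-byte blocks, and its end is marked by two consecutive all-zero blocks. The sketch below counts the trailing zero blocks of an in-memory tar image and pads it when fewer than two are present. The class TarTrailerSketch and its methods are hypothetical and are not part of commons-compress or Ant.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class TarTrailerSketch {
    static final int BLOCK_SIZE = 512; // tar block size in bytes

    /** Counts the all-zero 512-byte blocks at the end of a tar image. */
    static int trailingZeroBlocks(byte[] tar) {
        if (tar.length % BLOCK_SIZE != 0) {
            throw new IllegalArgumentException("tar image is not a multiple of 512 bytes");
        }
        int count = 0;
        // Walk backwards one block at a time until a block with data is found.
        for (int start = tar.length - BLOCK_SIZE; start >= 0; start -= BLOCK_SIZE) {
            if (!isZeroBlock(tar, start)) {
                break;
            }
            count++;
        }
        return count;
    }

    static boolean isZeroBlock(byte[] data, int start) {
        for (int i = start; i < start + BLOCK_SIZE; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }

    /** Returns a copy that ends with at least two zero blocks (copyOf pads with zeros). */
    static byte[] withEndOfArchive(byte[] tar) {
        int missing = Math.max(0, 2 - trailingZeroBlocks(tar));
        return Arrays.copyOf(tar, tar.length + missing * BLOCK_SIZE);
    }
}
```

This is only a heuristic check: a member file whose data happens to end in a block of zeros is indistinguishable from a trailer block when looking at raw bytes alone.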
            steve_l Steve Loughran added a comment -

            I'd recommend you just pull in the ant-1.7 or (soon) the ant-1.7.1 jar and use the classes directly. That is where the classes originate, and they are designed to work outside Ant builds. Creating and releasing snapshots is bad because:
            - The ASF doesn't like projects releasing code that uses other projects' snapshots (it's related to signoff). Certainly, were I on your PMC, I'd be vetoing any 1.0 release that was still using the commons-cli snapshot.
            - You can't build Maven/Ivy dependency metadata XML files that don't refer to the unstable snapshot repository, which makes it impossible for downstream users to reliably recreate your execution environment.

            amareshwari Amareshwari Sriramadasu added a comment -

            Here is a patch supporting .tar, .tgz and .tar.gz files in DistributedCache. I pulled in ant-1.7.0 and used org.apache.tools.tar.TarInputStream and org.apache.tools.tar.TarEntry for untarring. I also updated the test case TestMiniMRWithDFSCaching to add .tar, .tgz and .tar.gz files as cache archives and run a job.
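
Supporting several archive types implies dispatching on the file extension before unpacking. The sketch below shows one way that decision could be written. It is an illustration, not code from patch-2019.txt: the class ArchiveKindSketch, the Kind enum, and the case-insensitive matching are all assumptions.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
class ArchiveKindSketch {
    enum Kind { ZIP, TAR, GZIPPED_TAR, PLAIN_FILE }

    /** Decides how a cached file would be unpacked, based only on its name. */
    static Kind classify(String fileName) {
        // Assumption: extensions are matched case-insensitively.
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".jar") || name.endsWith(".zip")) {
            return Kind.ZIP; // jar files are zip archives
        }
        if (name.endsWith(".tar.gz") || name.endsWith(".tgz")) {
            return Kind.GZIPPED_TAR; // checked before ".tar" is irrelevant here, but keeps intent clear
        }
        if (name.endsWith(".tar")) {
            return Kind.TAR;
        }
        return Kind.PLAIN_FILE; // anything else is left as-is
    }
}
```

For the three test attachments on this issue, classify("test.tar") yields TAR, while classify("test.tgz") and classify("test.tar.gz") both yield GZIPPED_TAR.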

            amareshwari Amareshwari Sriramadasu added a comment -

            This patch would not run through the Hudson tests, since it requires ant.jar in lib/ and the test.tar, test.tgz and test.tar.gz files in src/test/org/apache/hadoop/mapred/ . I'm attaching the files separately, since jar and tar files cannot be part of the patch.
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12381075/test.tar.gz
            against trunk revision 645773.

            @author +1. The patch does not contain any @author tags.

            tests included -1. The patch doesn't appear to include any new or modified tests.
            Please justify why no tests are needed for this patch.

            patch -1. The patch command could not apply the patch.

            Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2340/console

            This message is automatically generated.

            omalley Owen O'Malley added a comment -

            I'm really concerned about including the ant.jar in hadoop. We've had a lot of problems in the past with conflicting versions of ant.jar.

            omalley Owen O'Malley added a comment -

            Why don't we just run the real tar executable? I think that pulling in the ant dependency is much more problematic.

            mahadev Mahadev Konar added a comment -

            +1 on Owen's comment about using the tar executable.


            amareshwari Amareshwari Sriramadasu added a comment -

            It looks like ant.jar was removed from lib because it causes problems if it disagrees with the version of ant that people are using (HADOOP-1726).
            Here is a patch that untars files using the tar executable.
            The test.tar, test.tar.gz and test.tgz files should be put in src/test/org/apache/hadoop/mapred/ .

            amareshwari Amareshwari Sriramadasu added a comment -

            All the tests passed on my machine.
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12381171/patch-2019.txt
            against trunk revision 645773.

            @author +1. The patch does not contain any @author tags.

            tests included +1. The patch appears to include 17 new or modified tests.

            javadoc +1. The javadoc tool did not generate any warning messages.

            javac +1. The applied patch does not generate any new javac compiler warnings.

            release audit +1. The applied patch does not generate any new release audit warnings.

            findbugs -1. The patch appears to cause Findbugs to fail.

            core tests -1. The patch failed core unit tests.

            contrib tests -1. The patch failed contrib unit tests.

            Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/testReport/
            Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/artifact/trunk/build/test/checkstyle-errors.html
            Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/console

            This message is automatically generated.


            amareshwari Amareshwari Sriramadasu added a comment -

            I ran Findbugs on my machine on trunk and also with the patch; there are no new Findbugs warnings introduced.
            steve_l Steve Loughran added a comment -

            1. If you set your <javac> task up with includeantruntime=false, you don't get any version conflict, but then you have to make 100% sure your classpath contains every JAR you need to build.

            2. It would be good to give lib/ant.jar a name like lib/ant-1.7.jar, so people can see at a glance what version to use.
            mahadev Mahadev Konar added a comment -

            +1 patch looks good.

            rangadi Raghu Angadi added a comment -

            This will most likely fail on Solaris, since its 'tar' does not support the '-z' option. See HADOOP-1717 for a possible workaround.
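
One portable way to avoid relying on tar's '-z' option is to decompress with gzip and pipe the result into a plain tar -xf. The sketch below builds such a shell pipeline and runs it with the JDK's ProcessBuilder. It is written under the assumption that this is the kind of workaround meant above; the class PortableUntarSketch and its methods are hypothetical, and it deliberately uses ProcessBuilder rather than any Hadoop shell utility so that it stays self-contained.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, for illustration only.
class PortableUntarSketch {

    /** Wraps a string in single quotes for sh, escaping embedded single quotes. */
    static String shellQuote(String s) {
        return "'" + s.replace("'", "'\\''") + "'";
    }

    /** Builds a shell command that extracts the archive without using tar -z. */
    static String untarScript(File archive, File destDir) {
        // Absolute paths, because the command changes directory before extracting.
        String in = shellQuote(archive.getAbsolutePath());
        String cd = "cd " + shellQuote(destDir.getAbsolutePath()) + " && ";
        if (archive.getName().endsWith("gz")) {
            // gzip writes the plain tar stream to stdout; tar reads it from stdin ("-").
            return "gzip -dc " + in + " | (" + cd + "tar -xf -)";
        }
        return cd + "tar -xf " + in;
    }

    /** Runs the pipeline through sh; the exit status reflects the tar side of the pipe. */
    static void untar(File archive, File destDir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("sh", "-c", untarScript(archive, destDir))
                .inheritIO()
                .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("untar failed with status " + exit + " for " + archive);
        }
    }
}
```

Quoting the paths matters here because the command is interpreted by a shell; an unquoted path containing spaces would be split into several arguments.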

            rangadi Raghu Angadi added a comment -

            Also, it is better to use ShellCommandExecutor to run the command since it takes care of various errors.

            rangadi Raghu Angadi added a comment -

            Essentially, we can port (or merge) untar part of TestDFSUpgradeFromImage.unpackStorage() to FileUtils.unTar() and use it in both places.


            amareshwari Amareshwari Sriramadasu added a comment -

            Cancelling patch to address Raghu's comments.

            amareshwari Amareshwari Sriramadasu added a comment -

            Here is a patch doing the untar as suggested in HADOOP-1717.
            I moved the code for untarring to FileUtil.untar() and call it in TestDFSUpgradeFromImage and also in DistributedCache. I also used the ShellCommandExecutor to run the command.
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12381418/patch-2019.txt
            against trunk revision 653264.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 20 new or modified tests.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            -1 findbugs. The patch appears to cause Findbugs to fail.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            -1 core tests. The patch failed core unit tests.

            -1 contrib tests. The patch failed contrib unit tests.

            Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/testReport/
            Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/artifact/trunk/build/test/checkstyle-errors.html
            Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/console

            This message is automatically generated.

            rangadi Raghu Angadi added a comment -

            +1 for the changes. Regarding Hudson, you could either include the binary files in the patch (which might mostly work) or, just for Hudson, we could use the existing "hadoop-14-dfs-dir.tgz", since the contents of the tar file don't seem to matter (I'm not so sure).


            amareshwari Amareshwari Sriramadasu added a comment -

            we could use existing "hadoop-14-dfs-dir.tgz" since the contents of the tar file don't seem to matter (I'm not so sure).

            No, we cannot use hadoop-14-dfs-dir.tgz for the test TestMiniMRDFSCaching, because the test also reads the contents of the tar file for its assertions.

            amareshwari Amareshwari Sriramadasu added a comment -

            Added documentation.
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12381483/patch-2019.txt
            against trunk revision 653638.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 20 new or modified tests.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            -1 findbugs. The patch appears to cause Findbugs to fail.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            -1 core tests. The patch failed core unit tests.

            -1 contrib tests. The patch failed contrib unit tests.

            Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/testReport/
            Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/artifact/trunk/build/test/checkstyle-errors.html
            Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/console

            This message is automatically generated.

            mahadev Mahadev Konar added a comment -

            Patch looks good. Can you take care of the findbugs warnings?


            amareshwari Amareshwari Sriramadasu added a comment -

            It looks like Hudson is not able to run Findbugs, maybe because of the dependency on the tar files in build.xml.
            But I ran Findbugs on my machine, and there are no new Findbugs warnings introduced.
            ddas Devaraj Das added a comment -

            I just committed this. Thanks, Amareshwari!

            hudson Hudson added a comment -

            Integrated in Hadoop-trunk #484 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/484/ )

            People

              amareshwari Amareshwari Sriramadasu
              omalley Owen O'Malley
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved: