Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 0.14.3
- Component/s: None
- Labels: None
- Hadoop Flags: Reviewed
- Release Note: Added support for .tar, .tgz and .tar.gz files in DistributedCache. File sizes are limited to 2GB.
Description
Currently the distributed file cache only works with zip and jar archives, which don't work for files larger than 2 GB. We should also support .tgz archives.
Attachments
- patch-2019.txt (7 kB, Amareshwari Sriramadasu)
- test.tar (10 kB, Amareshwari Sriramadasu)
- test.tgz (0.2 kB, Amareshwari Sriramadasu)
- test.tar.gz (0.2 kB, Amareshwari Sriramadasu)
- patch-2019.txt (6 kB, Amareshwari Sriramadasu)
- patch-2019.txt (8 kB, Amareshwari Sriramadasu)
- patch-2019.txt (13 kB, Amareshwari Sriramadasu)
Issue Links
- is related to: HBASE-196 remove ant.jar from lib directory (Closed)
Activity
Apache Commons Compress (http://commons.apache.org/sandbox/compress/apidocs/) has classes such as TarArchive and TarInputStream. But the project is a sandbox component and no releases are available. So shall we create a snapshot and use it?
I would suggest exec'ing a process running tar -zxf for untarring files.
It might be problematic on Cygwin, where tar might not be installed by default, but it is
an easier solution and works for most systems.
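A minimal sketch of that approach, assuming a hypothetical helper that shells out to the system tar; the class name, method name and error handling below are illustrative and are not part of any patch attached to this issue:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

public class TarExecUntar {
  /**
   * Untars a .tgz / .tar.gz archive into targetDir by exec'ing the system tar.
   * Hypothetical helper; assumes a tar with -z support (e.g. GNU tar) is on the PATH.
   */
  public static void untar(File archive, File targetDir)
      throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(
        "tar", "-zxf", archive.getAbsolutePath());
    pb.directory(targetDir);        // extract relative to the target directory
    pb.redirectErrorStream(true);   // merge stderr into stdout
    Process p = pb.start();

    // Drain output so the child cannot block on a full pipe buffer.
    InputStream in = p.getInputStream();
    byte[] buf = new byte[4096];
    while (in.read(buf) != -1) { /* discard */ }

    int exitCode = p.waitFor();
    if (exitCode != 0) {
      throw new IOException("tar exited with code " + exitCode
          + " while untarring " + archive);
    }
  }
}
```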
The Tar version available from commons sandbox has a subtle bug when creating tar files (it appends only one null block at the end of archive, instead of two empty blocks as expected by GNU tar). Ant <tar> task contains a fixed copy of the same class.
I'd recommend you just pull in the ant-1.7 jar or (soon) the ant-1.7.1 jar and use the classes directly. That is where they originate, and they are designed to work outside Ant builds. Creating and releasing snapshots is bad because:
- The ASF doesn't like projects releasing code that uses other projects' snapshots (it's related to sign-off). Certainly, were I on your PMC, I'd be vetoing any 1.0 release that was still using the commons-cli snapshot.
- You can't build Maven/Ivy dependency metadata XML files that don't refer to the unstable snapshot repository, which makes it impossible for downstream users to reliably recreate your execution environment.
Here is a patch supporting .tar, .tgz and .tar.gz files in DistributedCache. I pulled in ant-1.7.0 and used org.apache.tools.tar.TarInputStream and org.apache.tools.tar.TarEntry for untarring. Also updated the testcase TestMiniMRWithDFSCaching to add .tar, .tgz and .tar.gz archives to the cache and run a job.
This patch would not run through the Hudson tests, since it requires ant.jar in lib/ and the test.tar, test.tgz and test.tar.gz files in src/test/org/apache/hadoop/mapred/. I'm attaching those files separately, since jar and tar files cannot be part of the patch.
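For reference, a rough sketch of untarring with Ant's tar classes, assuming ant.jar is on the classpath; the helper below is illustrative only and is not the code in the attached patch (it also trusts entry names, with no path-traversal checks):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

import org.apache.tools.tar.TarEntry;
import org.apache.tools.tar.TarInputStream;

public class AntTarUntar {
  /** Illustrative helper: expands a .tar / .tgz / .tar.gz archive under targetDir. */
  public static void untar(File archive, File targetDir) throws IOException {
    boolean gzipped = archive.getName().endsWith(".tgz")
        || archive.getName().endsWith(".tar.gz");
    InputStream raw = new BufferedInputStream(new FileInputStream(archive));
    TarInputStream tin =
        new TarInputStream(gzipped ? new GZIPInputStream(raw) : raw);
    try {
      TarEntry entry;
      while ((entry = tin.getNextEntry()) != null) {
        File out = new File(targetDir, entry.getName());
        if (entry.isDirectory()) {
          out.mkdirs();
          continue;
        }
        out.getParentFile().mkdirs();
        OutputStream os = new FileOutputStream(out);
        try {
          byte[] buf = new byte[8192];
          int n;
          while ((n = tin.read(buf)) != -1) {  // reads only the current entry's data
            os.write(buf, 0, n);
          }
        } finally {
          os.close();
        }
      }
    } finally {
      tin.close();
    }
  }
}
```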
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381075/test.tar.gz
against trunk revision 645773.
@author +1. The patch does not contain any @author tags.
tests included -1. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.
patch -1. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2340/console
This message is automatically generated.
I'm really concerned about including the ant.jar in hadoop. We've had a lot of problems in the past with conflicting versions of ant.jar.
Why don't we just run the real tar executable? I think that pulling in the ant dependency is much more problematic.
Looks like ant.jar was removed from lib because it causes problems when it disagrees with the version of Ant that people are using (HADOOP-1726).
Here is a patch that untars the files using the tar executable.
The test.tar, test.tar.gz and test.tgz files should be put in src/test/org/apache/hadoop/mapred/.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381171/patch-2019.txt
against trunk revision 645773.
@author +1. The patch does not contain any @author tags.
tests included +1. The patch appears to include 17 new or modified tests.
javadoc +1. The javadoc tool did not generate any warning messages.
javac +1. The applied patch does not generate any new javac compiler warnings.
release audit +1. The applied patch does not generate any new release audit warnings.
findbugs -1. The patch appears to cause Findbugs to fail.
core tests -1. The patch failed core unit tests.
contrib tests -1. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2350/console
This message is automatically generated.
I ran Findbugs on my machine on trunk and also with the patch; there are no new Findbugs warnings introduced.
1. If you set your <javac> task up with includeantruntime=false you don't get any version conflict, but then you have to make 100% sure your classpath contains every JAR you need to build.
2. It would be good to give lib/ant.jar a name like lib/ant-1.7.jar, so people can see at a glance what version is in use.
This will mostly fail on Solaris, since its 'tar' does not support the '-z' option. See HADOOP-1717 for a possible workaround.
Also, it is better to use ShellCommandExecutor to run the command, since it takes care of various errors.
Essentially, we can port (or merge) the untar part of TestDFSUpgradeFromImage.unpackStorage() into FileUtil.unTar() and use it in both places.
Here is a patch doing the untar as suggested in HADOOP-1717.
I moved the code for untarring to FileUtil.unTar() and call it in TestDFSUpgradeFromImage and also in DistributedCache. Also used the ShellCommandExecutor to run the command.
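A hedged sketch of that approach, in the spirit of FileUtil.unTar and the HADOOP-1717 workaround: decompress with gzip and pipe into tar so the -z flag is not needed, and run the pipeline through Shell.ShellCommandExecutor. The exact command string and quoting below are illustrative assumptions, not the attached patch:

```java
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.util.Shell;

public class ShellUntar {
  /** Illustrative sketch: untar inFile into untarDir without relying on tar -z. */
  public static void unTar(File inFile, File untarDir) throws IOException {
    StringBuilder cmd = new StringBuilder();
    String name = inFile.getName();
    if (name.endsWith(".tgz") || name.endsWith(".tar.gz")) {
      // gunzip through a pipe so Solaris tar (which lacks -z) also works.
      cmd.append("gzip -dc '").append(inFile.getAbsolutePath()).append("'")
         .append(" | (cd '").append(untarDir.getAbsolutePath()).append("' && tar -xf -)");
    } else {
      cmd.append("cd '").append(untarDir.getAbsolutePath()).append("'")
         .append(" && tar -xf '").append(inFile.getAbsolutePath()).append("'");
    }
    String[] shellCmd = { "bash", "-c", cmd.toString() };
    Shell.ShellCommandExecutor shexec = new Shell.ShellCommandExecutor(shellCmd);
    shexec.execute();  // runs the pipeline; throws IOException if it cannot be run
    int exitCode = shexec.getExitCode();
    if (exitCode != 0) {
      throw new IOException("Error untarring file " + inFile
          + ": tar process exited with code " + exitCode);
    }
  }
}
```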
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381418/patch-2019.txt
against trunk revision 653264.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 20 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to cause Findbugs to fail.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2396/console
This message is automatically generated.
+1 for the changes. Regarding Hudson, you could either include the binary files in the patch (that might mostly work) or, just for Hudson, we could use the existing "hadoop-14-dfs-dir.tgz", since the contents of the tar file don't seem to matter (I'm not so sure).
we could use existing "hadoop-14-dfs-dir.tgz" since the contents of the tar file don't seem to matter (I'm not so sure).
No, we cannot use hadoop-14-dfs-dir.tgz for the test TestMiniMRDFSCaching, because we also read the contents of the tar file for assertions.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381483/patch-2019.txt
against trunk revision 653638.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 20 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to cause Findbugs to fail.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2409/console
This message is automatically generated.
Looks like Hudson is not able to run Findbugs, maybe because of the dependency on the tar files in build.xml.
But I ran Findbugs on my machine, and there are no new Findbugs warnings introduced.
Integrated in Hadoop-trunk #484 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/484/)
+1