I have a patch here to speed up the BZip2 task (50-90%). At the same time this patch achieves lower memory usage (35%). It's well tested. Please note this is the first time I'm trying to contribute a patch to apache...hoping I can create an attachment on the next page. If not, I will send the patch to Keiron Liddle (keiron@aftexsw.com), the author of the current version.
Created attachment 9184 [details] Patch for package org.apache.tools.bzip2 to enhance performance
Unfortunately I have a couple of problems with this patch. First and most important, if I make them apply, the Ant tasks no longer work. (1) The patch to Input makes <bunzip> fail because the class now expects the stream to start with the BZip magic BZ - which the old didn't. The tasks strips the first two bytes from the stream and so the task now fails. I certainly could change the task as well, but there may be other clients that rely on the old code. This will lead me to problem number (3) below. (2) The patch to Output causes an ArrayIndexOOBException in write0 when I run <bzip2> - I haven't investigated further. (3) is the time I had to spend on making the patch apply. You've changed the published API of the classes, which we can't do. In particular we can't turn BZip2Constants into a class, rename the constants of that interface or turn protected methods/attributes into private ones. In addition the diff is much larger than needed since you moved methods around, changed the indentation and removed braces from one-line blocks. I don't know whether you are still interested in getting this into Ant, given that your patch is almost eighteen months old. Part of the reason that I haven't touched this earlier was (3) above since it looked like way to much work to get the patch applied (and it took me sevaral hours yesterday, I should have run my tests before I changed the coding style to Ant's). I should have said so sooner and I'm sorry about this. If you are no longer interested, then let us close this bug and keep in mind that our BZip2 streams are probably slower than they could be.
Created attachment 14564 [details] path against Ant CVS as of 2005-03-25 New patch, tested against Ant CVS 2005-03-25
Just created an attachment with a new patch. A change in the testcase org.apache.tools.ant.taskdefs.BZip2Test was also neccessary. The previous version did compare the compressed bytes. This may produce false errors. Different BZip2 implementations may produce different but valid compressed output. The new code compares the decompressed bytes.
Thanks for the new patch, Jörg. I'll try to free up some time this week to look into it.
I've moved around some code, fixed indentation and brace placement and removed some trailing spaces - apart from that your newer patch has gone into Ant's CVS HEAD. I can confirm the speed has improved (in my limited tests the improvement for bzip2 was around 50% while the bunzip2 case was less impressive, but still there). At least for my tests Ant consumed slightly more memory than before, but not enough to worry about. Thanks.
I have to put this optimization as there are cases where corrupt bz2 files are created. See: http://issues.apache.org/bugzilla/show_bug.cgi?id=41596
I left in CBZip2InputStream as there as been no reports of problems in reading bzip2 files.
given that we don't have a patch for the output case that doesn't create corrupt archives, I'm closing this.
So there is no update on this? The speedup and reduced memory usage seems very very impressive.
The problem is that some generated bzip files were corrupt and so the change had to be removed. The code is complex (both the original and the patched version) and hard to understand with no unit tests.
correction, there are some unit tests..
The buggy code was being used in Hadoop and we came across the same file corruption issues. I have debugged the code, found, and corrected the problem: https://issues.apache.org/jira/browse/HADOOP-5326 After extensive testing on hadoop, I'm quite confident the bug was removed. Maybe it's time we reopened this.
Created attachment 23382 [details] SVN patch with the corrected version of CBZip2OutputStream.java I checked the other files in the bzip2 package and they are all okay. The change to the old (buggy) version CBZip2OutputStream file is just a +1 that was forgotten during a series of optimizations.
The performance improvement of the new BZip2 code is very significant (as Described by Joerg Wassmer in 2003). Since Rodrigo has found, fixed the bug, and verified the result, we should commit this whenever possible. This will give a big performance improvement to the BZip2 tasks.
will be looking into merging it into Ant trunk as well as commons-compress soon.
svn revision 759138 I'll also merge the changes to the "now proper" compress component in Apache Commons which is bound to have 1.0 release pretty soon. https://issues.apache.org/jira/browse/COMPRESS-58 Many, many thanks!