Bug 24798 - BZip2: faster, less memory
Summary: BZip2: faster, less memory
Status: RESOLVED FIXED
Alias: None
Product: Ant
Classification: Unclassified
Component: Other (show other bugs)
Version: nightly
Hardware: All All
: P3 enhancement (vote)
Target Milestone: 1.8.0
Assignee: Ant Notifications List
URL:
Keywords:
Depends on:
Blocks: 40919
  Show dependency tree
 
Reported: 2003-11-18 21:13 UTC by Joerg Wassmer
Modified: 2009-03-27 06:50 UTC (History)
1 user (show)



Attachments
Patch for package org.apache.tools.bzip2 to enhance performance (141.73 KB, patch)
2003-11-18 21:15 UTC, Joerg Wassmer
Details | Diff
path against Ant CVS as of 2005-03-25 (35.68 KB, patch)
2005-03-25 21:21 UTC, Joerg Wassmer
Details | Diff
SVN patch with the corrected version of CBZip2OutputStream.java (102.03 KB, patch)
2009-03-13 18:20 UTC, Rodrigo Schmidt
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Joerg Wassmer 2003-11-18 21:13:58 UTC
I have a patch here to speed up the BZip2 task (50-90%). At the same time this 
patch achieves lower memory usage (35%). 
It's well tested. 
 
Please note this is the first time I'm trying to contribute a patch to 
apache...hoping I can create an attachment on the next page. 
If not, I will send the patch to Keiron Liddle (keiron@aftexsw.com), the 
author of the current version.
Comment 1 Joerg Wassmer 2003-11-18 21:15:43 UTC
Created attachment 9184 [details]
Patch for package org.apache.tools.bzip2 to enhance performance
Comment 2 Stefan Bodewig 2005-03-24 09:48:26 UTC
Unfortunately I have a couple of problems with this patch.

First and most important, if I make them apply, the Ant tasks no longer work.

(1) The patch to Input makes <bunzip> fail because the class now expects the
stream to start with the BZip magic BZ - which the old didn't.  The tasks strips
the first two bytes from the stream and so the task now fails.  I certainly
could change the task as well, but there may be other clients that rely on the
old code.  This will lead me to problem number (3) below.

(2) The patch to Output causes an ArrayIndexOOBException in write0 when I run
<bzip2> - I haven't investigated further.

(3) is the time I had to spend on making the patch apply.  You've changed the
published API of the classes, which we can't do.  In particular we can't turn
BZip2Constants into a class, rename the constants of that interface or turn
protected methods/attributes into private ones.

In addition the diff is much larger than needed since you moved methods around,
changed the indentation and removed braces from one-line blocks.

I don't know whether you are still interested in getting this into Ant, given
that your patch is almost eighteen months old.  Part of the reason that I haven't
touched this earlier was (3) above since it looked like way to much work to
get the patch applied (and it took me sevaral hours yesterday, I should have
run my tests before I changed the coding style to Ant's).

I should have said so sooner and I'm sorry about this.

If you are no longer interested, then let us close this bug and keep in mind
that our BZip2 streams are probably slower than they could be.
Comment 3 Joerg Wassmer 2005-03-25 21:21:41 UTC
Created attachment 14564 [details]
path against Ant CVS as of 2005-03-25

New patch, tested against Ant CVS 2005-03-25
Comment 4 Joerg Wassmer 2005-03-25 21:27:32 UTC
Just created an attachment with a new patch. 
A change in the testcase org.apache.tools.ant.taskdefs.BZip2Test was also 
neccessary. The previous version did compare the compressed bytes. This may 
produce false errors. Different BZip2 implementations may produce different 
but valid compressed output.  
The new code compares the decompressed bytes. 
Comment 5 Stefan Bodewig 2005-03-30 09:11:56 UTC
Thanks for the new patch, Jörg.  I'll try to free up some time this week to look
into it.
Comment 6 Stefan Bodewig 2005-04-14 11:31:41 UTC
I've moved around some code, fixed indentation and brace placement and removed
some trailing spaces - apart from that your newer patch has gone into Ant's CVS
HEAD.

I can confirm the speed has improved (in my limited tests the improvement for bzip2
was around 50% while the bunzip2 case was less impressive, but still there).  At
least for my tests Ant consumed slightly more memory than before, but not enough
to worry about.

Thanks.
Comment 7 Peter Reilly 2007-05-30 04:11:51 UTC
I have to put this optimization as there are cases
where corrupt bz2 files are created.
See: http://issues.apache.org/bugzilla/show_bug.cgi?id=41596
Comment 8 Peter Reilly 2007-05-30 06:26:54 UTC
I left in CBZip2InputStream as there as been no reports of problems
in reading bzip2 files.
Comment 9 Stefan Bodewig 2008-07-10 23:45:11 UTC
given that we don't have a patch for the output case that doesn't create corrupt archives, I'm closing this.
Comment 10 Zheng Shao 2009-02-25 01:41:35 UTC
So there is no update on this?

The speedup and reduced memory usage seems very very impressive.
Comment 11 Peter Reilly 2009-02-25 02:19:36 UTC
The problem is that some generated bzip files were corrupt and so the change had to be removed. The code is complex (both the original and the patched version) and hard to understand with no unit tests.
Comment 12 Peter Reilly 2009-02-25 02:27:33 UTC
correction, there are some unit tests..
Comment 13 Rodrigo Schmidt 2009-03-11 14:48:06 UTC
The buggy code was being used in Hadoop and we came across the same file corruption issues. I have debugged the code, found, and corrected the problem:

https://issues.apache.org/jira/browse/HADOOP-5326

After extensive testing on hadoop, I'm quite confident the bug was removed.

Maybe it's time we reopened this.
Comment 14 Rodrigo Schmidt 2009-03-13 18:20:58 UTC
Created attachment 23382 [details]
SVN patch with the corrected version of CBZip2OutputStream.java

I checked the other files in the bzip2 package and they are all okay. The change to the old (buggy) version CBZip2OutputStream file is just a +1 that was forgotten during a series of optimizations.
Comment 15 Zheng Shao 2009-03-19 18:38:06 UTC
The performance improvement of the new BZip2 code is very significant (as Described by Joerg Wassmer in 2003).

Since Rodrigo has found, fixed the bug, and verified the result, we should commit this whenever possible. This will give a big performance improvement to the BZip2 tasks.
Comment 16 Stefan Bodewig 2009-03-27 03:20:56 UTC
will be looking into merging it into Ant trunk as well as commons-compress soon.
Comment 17 Stefan Bodewig 2009-03-27 06:50:21 UTC
svn revision 759138

I'll also merge the changes to the "now proper" compress component in Apache
Commons which is bound to have 1.0 release pretty soon.

https://issues.apache.org/jira/browse/COMPRESS-58

Many, many thanks!