Bug 45396 - org.apache.tools.zip is 20x slower than java.util.zip when compressing big files
Summary: org.apache.tools.zip is 20x slower than java.util.zip when compressing big files
Status: RESOLVED FIXED
Alias: None
Product: Ant
Classification: Unclassified
Component: Other
Version: 1.7.1
Hardware: All Linux
Importance: P2 enhancement
Target Milestone: 1.8.0
Assignee: Ant Notifications List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-07-14 20:47 UTC by TAMURA Kent
Modified: 2008-07-16 18:53 UTC



Attachments
performance comparison code (2.79 KB, application/octet-stream)
2008-07-14 20:47 UTC, TAMURA Kent
extended performance comparison code (5.44 KB, text/plain)
2008-07-15 23:16 UTC, Stefan Bodewig

Description TAMURA Kent 2008-07-14 20:47:05 UTC
Created attachment 22257
performance comparison code

Environment:
 Ubuntu Linux/amd64
 x86 jre 1.5.0_13-b05
 ant.jar in ant-1.7.1

I'd like to use org.apache.tools.zip instead of java.util.zip because of java.util.zip's filename encoding problem, but I have run into a performance problem with org.apache.tools.zip.

The attached Java code compresses two files (3 MiB and 2 MiB) with org.apache.tools.zip and with java.util.zip.  It shows that org.apache.tools.zip is 20x slower than java.util.zip.

Output:
% java -cp .:ant-1.7.1.jar ZipPerformance -apache -jdk
==> Benchmarking
Apache: 95832 [ms]
JDK: 4717 [ms]
Comment 1 TAMURA Kent 2008-07-14 21:36:21 UTC
I looked at the source code.
When ZipOutputStream.write(byte[]) is called with a large byte array, the two implementations behave differently:

* org.apache.tools.zip
  calls Deflater.setInput() once for the whole array.

* java.util.zip
  calls Deflater.setInput() multiple times.  Each call handles a 512-byte chunk of the array.
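
To make the difference concrete, here is a minimal sketch of feeding a java.util.zip.Deflater either the whole array or fixed-size chunks. The class and method names (DeflaterFeeding, deflate, inflate) are made up for illustration and are not taken from Ant or the JDK sources. ZIP streams use raw deflate (nowrap); the default zlib wrapper is used here only to keep the round trip simple. Whether the timings show the slowdown reported above will depend on the JDK and platform.

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflaterFeeding {

    /** Compresses data, passing at most chunkSize bytes to each setInput() call. */
    static byte[] deflate(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Deflater deflater = new Deflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        try {
            for (int off = 0; off < data.length; off += chunkSize) {
                int len = Math.min(chunkSize, data.length - off);
                deflater.setInput(data, off, len);
                // Drain the compressor until it asks for more input.
                while (!deflater.needsInput()) {
                    int n = deflater.deflate(buf, 0, buf.length);
                    out.write(buf, 0, n);
                }
            }
            deflater.finish();
            while (!deflater.finished()) {
                int n = deflater.deflate(buf, 0, buf.length);
                out.write(buf, 0, n);
            }
        } finally {
            deflater.end();
        }
        return out.toByteArray();
    }

    /** Decompresses a stream produced by deflate(); used only to check the round trip. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            inflater.setInput(compressed);
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[3 * 1024 * 1024]; // same size as the larger file in the report
        new Random(1).nextBytes(data);
        // Whole array in one setInput() call, then 8 kB and 512-byte chunks.
        for (int chunk : new int[] {data.length, 8192, 512}) {
            long start = System.nanoTime();
            byte[] compressed = deflate(data, chunk);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("chunk=" + chunk + ": " + ms + " ms, "
                    + compressed.length + " bytes");
        }
    }
}
```

Passing data.length as the chunk size corresponds to the single setInput() call described for org.apache.tools.zip, and 512 corresponds to the behaviour described for java.util.zip.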
Comment 2 Stefan Bodewig 2008-07-15 23:16:18 UTC
Created attachment 22263
extended performance comparison code
Comment 3 Stefan Bodewig 2008-07-15 23:21:31 UTC
I've extended the test code, which compressed two big files (2 and 3 MB), to also cover the case of many small files (2000 files of 2 or 3 kB) and to cover reading as well.

The big file compression case is actually worse on my machine (WinXP) where java.util.zip is more like 40 times faster.  OTOH Ant wins in the small file case.

Ant is slower when reading the ZIPs, but the performance difference isn't as bad.

==> Benchmarking big files
Apache write warmup done
Apache write: 147640 [ms]
JDK write warmup done
JDK write: 3219 [ms]
Apache read warmup done
Apache read: 453 [ms]
JDK Warmup done
JDK read: 125 [ms]
==> Benchmarking small files
Apache write warmup done
Apache write: 4406 [ms]
JDK write warmup done
JDK write: 6531 [ms]
Apache read warmup done
Apache read: 1859 [ms]
JDK Warmup done
JDK read: 1312 [ms]

I made the code compile on JDK 1.4 because I wanted to compare different JDKs.  In the end the differences were so small that I didn't include them here (JDK6 was a bit faster for java.util.zip as well as in the Ant case).

For reference, this is against Ant's subversion revision 677166.
Comment 4 Stefan Bodewig 2008-07-16 06:06:07 UTC
Same machine, svn revision 677272:

==> Benchmarking big files
Apache write warmup done
Apache write: 3407 [ms]
JDK write warmup done
JDK write: 3297 [ms]
Apache read warmup done
Apache read: 422 [ms]
JDK Warmup done
JDK read: 125 [ms]
==> Benchmarking small files
Apache write warmup done
Apache write: 4438 [ms]
JDK write warmup done
JDK write: 6563 [ms]
Apache read warmup done
Apache read: 1844 [ms]
JDK Warmup done
JDK read: 1359 [ms]

Deflater seems to copy its input around since I can see bigger memory consumption during the Ant code tests.  There is no hint in the Javadocs and I have no idea why chunking the original input should help - other than that it helps the native implementation of Sun's Deflater class.

I've searched through the zlib and InfoZIP code bases to find any reference to good byte chunk sizes to pass to the compression library and found that InfoZIP's zip will use between 2 kB (SMALL_MEM) and 16 kB (LARGE_MEM).  I've changed the code to use 8 kB blocks, which has the side effect of changing nothing when ZipOutputStream is used via <zip> and friends.
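
For callers who are stuck on an unfixed version and already hold a large array in memory, the same idea can be applied from the outside. The sketch below is not the committed change; BlockWrites and writeInBlocks are hypothetical names, and the ByteArrayOutputStream merely stands in for a ZipOutputStream that has an open entry.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BlockWrites {

    /** Hypothetical helper: forwards data to out in slices of at most blockSize bytes. */
    static void writeInBlocks(OutputStream out, byte[] data, int blockSize) throws IOException {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        for (int off = 0; off < data.length; off += blockSize) {
            out.write(data, off, Math.min(blockSize, data.length - off));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[20_000];
        // Stand-in sink; in real use this would be the zip stream after putNextEntry().
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeInBlocks(sink, data, 8 * 1024);
        System.out.println(sink.size() + " bytes forwarded");
    }
}
```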

Ant's tasks have always read the file content in 8kB chunks and written those blocks to the ZipOutputStream - so Ant's tasks have never seen the poor performance for big files.
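
The read-and-write pattern described here can be sketched roughly as follows. This is not the code of Ant's tasks; ChunkedZipWrite, addEntry, zipOne and unzipFirst are illustrative names. The sketch uses java.util.zip so that it runs without ant.jar; the same loop should carry over to org.apache.tools.zip.ZipOutputStream with its own ZipEntry class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ChunkedZipWrite {

    static final int BLOCK_SIZE = 8 * 1024;

    /** Adds one entry, copying from in to zip in blocks of at most BLOCK_SIZE bytes. */
    static void addEntry(ZipOutputStream zip, String name, InputStream in) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        byte[] block = new byte[BLOCK_SIZE];
        int n;
        while ((n = in.read(block)) != -1) {
            zip.write(block, 0, n); // never hands the stream more than 8 kB at once
        }
        zip.closeEntry();
    }

    /** Convenience wrapper: builds an in-memory archive with a single entry. */
    static byte[] zipOne(String name, byte[] content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            addEntry(zip, name, new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back the first entry of an in-memory archive, to check the round trip. */
    static byte[] unzipFirst(byte[] archive) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            if (zip.getNextEntry() == null) {
                throw new IllegalStateException("empty archive");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] block = new byte[BLOCK_SIZE];
            int n;
            while ((n = zip.read(block)) != -1) {
                out.write(block, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = new byte[2 * 1024 * 1024]; // size of the smaller file in the report
        byte[] archive = zipOne("big.bin", content);
        System.out.println("archive size: " + archive.length + " bytes");
    }
}
```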
Comment 5 TAMURA Kent 2008-07-16 18:53:12 UTC
Thank you for the quick fix!