[COMPRESS-438] ZipFile should create a buffered input stream for decoders inside getInputStream(ZipArchiveEntry) - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.16
Component/s: Compressors
Labels:
- zip

Description

When decoders read from a raw underlying stream (such as a file channel), the performance can degrade an order of magnitude compared to the case when there's a simple buffer in between the physical data source and the codec.

See ~~COMPRESS-380~~ for an example of this.

The API of ZipFile is straightforward and tempting enough that blocks of code such as:

    try (ZipFile zfile = new ZipFile("/path/to/zip.zip")) { 
      Enumeration<ZipArchiveEntry> entries = zfile.getEntries();
      while (entries.hasMoreElements()) {
        ZipArchiveEntry e = entries.nextElement();

        try (InputStream is = zfile.getInputStream(e)) {
          // do something with is
        }
      }
    }

seem perfectly justified. The above code suffers from severe performance degradation compared to prebuffered input. Severe means severe. Here are some stats from running a snippet of code similar to the above to "just decompress" the same input (~80mb) compressed with different methods.

# Input compressed with 7za.
7za a -mm=deflate   archive-deflate.zip   input
7za a -mm=deflate64 archive-deflate64.zip input
7za a -mm=bzip2     archive-bzip2.zip     input

Current master branch:

# Direct BoundedInputStream
archive-deflate.zip 0.39 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 26.98 sec., 37,856,848 archived => 81,502,941 decompressed
archive-bzip2.zip 49.16 sec., 38,166,089 archived => 81,502,941 decompressed

And a simple patch wrapping BoundedInputStream with a BufferedInputStream (deflate64 and bzip2 only, deflate uses java's internal inflater and it prebuffers stuff internally).

# wrapped with BufferedInputStream, bufferSize = 512
archive-deflate.zip 0.38 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 0.95 sec., 37,856,848 archived => 81,504,783 decompressed
archive-bzip2.zip 3.00 sec., 38,166,089 archived => 81,502,941 decompressed

# wrapped with BufferedInputStream, bufferSize = 1024
archive-deflate.zip 0.41 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 0.97 sec., 37,856,848 archived => 81,504,783 decompressed
archive-bzip2.zip 2.95 sec., 38,166,089 archived => 81,502,941 decompressed

# wrapped with BufferedInputStream, bufferSize = 4096
archive-deflate.zip 0.42 sec., 37,893,785 archived => 81,502,941 decompressed
archive-deflate64.zip 0.89 sec., 37,856,848 archived => 81,504,783 decompressed
archive-bzip2.zip 2.97 sec., 38,166,089 archived => 81,502,941 decompressed

The difference should be evident, even with a tiny buffer of 512 bytes. To put this into perspective on a larger archive:

archive-deflate.zip 7.68 sec., 1,235,209,977 archived => 1,339,916,520 decompressed
archive-deflate64.zip 784.87 sec., 1,233,470,780 archived => 1,339,916,520 decompressed

deflate64 improves by ~4900%...

archive-deflate.zip 8.12 sec., 1,235,209,977 archived => 1,339,916,520 decompressed
archive-deflate64.zip 16.24 sec., 1,233,470,780 archived => 1,339,981,595 decompressed

I also see that ExplodingInputStream is already wrapping bis in a buffered input stream, so I don't see any reason why this shouldn't be done for other compressor streams. An even better patch (to me) would be to modify the constructors of Deflate64CompressorInputStream and BZip2CompressorInputStream and add a boolean parameter unbuffered: then people would know what they're doing when they pass some input stream and true to such a constructor. The default, single-argument constructor would simply delegate to constructor(inputStream, false) to ensure an input buffer in between the decoder and the raw stream.

The patch is trivial, so I don't attach it?

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

Check.java
09/Jan/18 21:23
2 kB
Dawid Weiss

Issue Links

is related to

COMPRESS-333 bz2 stream decompressor is 10x slower than it could be

Resolved

relates to

COMPRESS-380 Support for ENHANCED_DEFLATED (Deflate64) in ZIP files

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Dawid Weiss

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 09/Jan/18 20:28

Updated:: 10/Jan/18 09:48

Resolved:: 10/Jan/18 09:10