COMPRESS-333: bz2 stream decompressor is 10x slower than it could be

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11
    • Component/s: None
    • Labels: None

    Description

      This is related to COMPRESS-291. In short: decompressing 7z archives was an order of magnitude slower in Java than with native tooling.

      My investigation showed that the problematic archive used bz2 streams inside. I then did a quick hack-experiment: I took the bz2 decompressor from the Apache Hadoop project (the Java version, not the native one) and used it to replace the default bz2 stream decompressor in the 7z archiver in Commons Compress.

      I then ran a quick benchmark on this file:

      https://archive.org/download/stackexchange/english.stackexchange.com.7z
      

      The decompression speeds are (SSD, the file was essentially fully cached in memory, so everything is CPU bound):

      native 7za: 13 seconds
      Commons (original): 222 seconds
      Commons (patched w/ Hadoop bz2): 30 seconds
      Commons (patched w/ BufferedInputStream): 28 seconds
      

      Yes, it's still 3 times slower than native code, but it's no longer glacially slow...

      My patch is a quick and dirty proof of concept (not committable, see [1]), but it passes the tests. Some notes:

      • Hadoop's stream isn't suited to handling concatenated bz2 streams; it would have to be either patched in the code or (better) decorated at a level above the low-level decoder (a rough sketch of such a decorator follows this list).
      • I only substituted the decompressor in 7z, but this could obviously benefit other places (zip, etc.); essentially, I'd remove BZip2CompressorInputStream entirely.
      • While toying with the above idea I noticed a really annoying thing: all streams are required to extend CompressorInputStream, which only adds one method to count the number of consumed bytes. This complicates the code and makes plugging in other InputStream implementations more cumbersome. I could get rid of CompressorInputStream entirely with a few minor changes to the code, but obviously that would be backward incompatible (see [2]).
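
      As a rough sketch of what such a decorator could look like (the class and factory interface below are hypothetical, not existing Commons Compress or Hadoop API; it assumes the per-member decoder consumes exactly its member's bytes and does not read ahead into the next member):

      import java.io.IOException;
      import java.io.InputStream;
      import java.io.PushbackInputStream;

      // Hypothetical sketch: exposes a sequence of concatenated compressed members
      // as one continuous stream by opening a fresh single-member decoder whenever
      // the previous one reports end-of-stream.
      public class ConcatenatedDecoderInputStream extends InputStream {

          /** Opens a decoder over one compressed member of the raw stream. */
          public interface MemberDecoderFactory {
              InputStream open(InputStream raw) throws IOException;
          }

          private final PushbackInputStream raw;
          private final MemberDecoderFactory factory;
          private InputStream current;

          public ConcatenatedDecoderInputStream(InputStream raw, MemberDecoderFactory factory)
                  throws IOException {
              this.raw = new PushbackInputStream(raw, 1);
              this.factory = factory;
              this.current = factory.open(this.raw);
          }

          // Single-byte read only; read(byte[], int, int) and close() omitted for brevity.
          @Override
          public int read() throws IOException {
              int b = current.read();
              while (b < 0) {
                  int next = raw.read();          // peek: is another member waiting behind the current one?
                  if (next < 0) {
                      return -1;                  // real end of the underlying stream
                  }
                  raw.unread(next);               // hand the peeked byte back to the next decoder
                  current = factory.open(raw);    // start decoding the next member
                  b = current.read();
              }
              return b;
          }
      }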

      References:
      [1] GitHub fork, bzip2 branch: https://github.com/dweiss/commons-compress/tree/bzip2
      [2] Removal and cleanup of CompressorInputStream: https://github.com/dweiss/commons-compress/commit/6948ed371e8ed6e6b69b96ee936d1455cbfd6458

          Activity

            Dawid Weiss added a comment - edited

            It's even better than I thought... So I wondered why the hell the slowdown. Looked at both implementations – they differ, but they're clearly derivatives of the same initial code by Keiron Liddle.

            So... why the slowdown?

            Turns out it's the buffered vs. unbuffered reads, as is typical. This one-liner patch brings decompression speed of 7z files on par with my Hadoop fork...

            diff --git a/src/main/java/org/apache/commons/compress/archivers/sevenz/SevenZFile.java b/src/main/java/org/apache/commons/compress/archivers/sevenz/SevenZFile.java
            index 398783f..809a9dc 100644
            --- a/src/main/java/org/apache/commons/compress/archivers/sevenz/SevenZFile.java
            +++ b/src/main/java/org/apache/commons/compress/archivers/sevenz/SevenZFile.java
            @@ -17,6 +17,7 @@
              */
             package org.apache.commons.compress.archivers.sevenz;
            
            +import java.io.BufferedInputStream;
             import java.io.ByteArrayInputStream;
             import java.io.Closeable;
             import java.io.DataInput;
            @@ -853,8 +854,10 @@ public class SevenZFile implements Closeable {
                 private InputStream buildDecoderStack(final Folder folder, final long folderOffset,
                             final int firstPackStreamIndex, SevenZArchiveEntry entry) throws IOException {
                     file.seek(folderOffset);
            -        InputStream inputStreamStack = new BoundedRandomAccessFileInputStream(file,
            -                archive.packSizes[firstPackStreamIndex]);
            +        InputStream inputStreamStack =
            +            new BufferedInputStream(
            +              new BoundedRandomAccessFileInputStream(file,
            +                  archive.packSizes[firstPackStreamIndex]));
                     LinkedList<SevenZMethodConfiguration> methods = new LinkedList<SevenZMethodConfiguration>();
                     for (final Coder coder : folder.getOrderedCoders()) {
                         if (coder.numInStreams != 1 || coder.numOutStreams != 1) {
            

            RandomAccessFile is in general very slow for byte-by-byte access (the same applies to channels). An in-Java buffer of some sort speeds things up by an order of magnitude. I'd review the remaining code too to make sure no direct RAF.read() accesses take place.
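
            As a minimal sketch of the effect (a standalone micro-benchmark, not Commons Compress code; the file path is a placeholder for any large local file), byte-by-byte reads straight from a RandomAccessFile can be compared with the same reads going through a BufferedInputStream:

            import java.io.BufferedInputStream;
            import java.io.FileInputStream;
            import java.io.IOException;
            import java.io.RandomAccessFile;

            public class BufferedVsUnbuffered {
                public static void main(String[] args) throws IOException {
                    String path = args.length > 0 ? args[0] : "large-test-file.bin";

                    long checksum1 = 0;
                    long start = System.nanoTime();
                    RandomAccessFile raf = new RandomAccessFile(path, "r");
                    int b;
                    while ((b = raf.read()) >= 0) {     // one native call per byte
                        checksum1 += b;
                    }
                    raf.close();
                    long rafMillis = (System.nanoTime() - start) / 1000000;

                    long checksum2 = 0;
                    start = System.nanoTime();
                    BufferedInputStream in = new BufferedInputStream(new FileInputStream(path));
                    while ((b = in.read()) >= 0) {      // served from an 8 KiB in-heap buffer
                        checksum2 += b;
                    }
                    in.close();
                    long bufferedMillis = (System.nanoTime() - start) / 1000000;

                    System.out.println("RandomAccessFile:    " + rafMillis + " ms (checksum " + checksum1 + ")");
                    System.out.println("BufferedInputStream: " + bufferedMillis + " ms (checksum " + checksum2 + ")");
                }
            }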

            ASF GitHub Bot added a comment -

            GitHub user dweiss opened a pull request:

            https://github.com/apache/commons-compress/pull/7

            COMPRESS-333: adds buffering on top of RandomAccessFile.

            Speeds up 7Z handling an order of magnitude as a result.

            You can merge this pull request into a Git repository by running:

            $ git pull https://github.com/dweiss/commons-compress COMPRESS-333

            Alternatively you can review and apply these changes as the patch at:

            https://github.com/apache/commons-compress/pull/7.patch

            To close this pull request, make a commit to your master/trunk branch
            with (at least) the following in the commit message:

            This closes #7


            commit 369c3165501681ec823cd84957a07b8825321862
            Author: Dawid Weiss <dawid.weiss@carrotsearch.com>
            Date: 2016-02-03T13:43:29Z

            COMPRESS-333: adds buffering on top of RandomAccessFile. Speeds up 7z handling an order of magnitude as a result.


            ASF GitHub Bot added a comment -

            Github user tcurdt commented on the pull request:

            https://github.com/apache/commons-compress/pull/7#issuecomment-179259396

            Small change, big impact - that's how we like it
            Thanks for digging into this!

            ASF GitHub Bot added a comment -

            Github user asfgit closed the pull request at:

            https://github.com/apache/commons-compress/pull/7

            Stefan Bodewig added a comment -

            Great, thanks. Let us know when you are done so we can close this issue.

            WRT the Hadoop code, we've had some bad experience with (an early version of?) it in Ant land, where the current bz2 code originates from. Ant used the then-current Hadoop code for a few releases but had to revert to Keiron's original code since the Hadoop code failed to decompress certain files - and we didn't understand it well enough to fix it ourselves. If you are still interested in going that route, I can try to dig out an example file just to ensure the problem has been fixed by now.

            Sebb added a comment -

            I can try to dig out an example file just to ensure the problem has been fixed by now.

            Seems like that file should be added as a test case to Compress anyway?

            Dawid Weiss added a comment -

            I couldn't see much difference between the code you currently have and the one from Hadoop (speed-wise) once I switched buffered input on – this was the core of the issue, not the decompression implementation itself. Perhaps (very likely!) there are still nuances in the code to improve upon, but I won't have the time to look into it (I'd rather look into implementing random-access to 7z archive entries).

            I'm glad I tracked down the buffered/unbuffered issue, though, as the previous difference in performance was quite shocking.

            Stefan Bodewig added a comment -

            Thanks, then I'm closing this now. WRT Random-Access in 7ZIP, see COMPRESS-327.

            @sebb, you are right, I should be able to find the Bugzilla issue.

            Stefan Bodewig added a comment -

            My memory has been failing me (it's been nine years): the actual problem was that it created corrupt output. https://bz.apache.org/bugzilla/show_bug.cgi?id=41596

            And even later it was fixed, and Ant (and thus Commons Compress) went back to using a fork of the Hadoop code.

            Dawid Weiss added a comment -

            For completeness, COMPRESS-327 talks about seeking through the file – this wouldn't yield random-access to 7zip entries if they're block-compressed (I recall explaining why this is a problem in a comment on another issue, but can't recall which one it was). In any case, if the blocks are large, providing random-access to any entry could (and very likely eventually would for somebody) lead to exponential decompression times because you'd decompress the same information over and over. The simplest case is reading files "in reverse" with solid blocks containing many, many files (or very large files).
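
            For a rough sense of that cost (assuming every entry can only be reached by decompressing its solid block from the beginning): with one solid block holding N entries of similar size, reading the entries last-to-first decompresses a prefix of N entries, then N-1, then N-2, and so on, i.e. N + (N-1) + ... + 1 = N(N+1)/2 entry decompressions in total. For 1,000 entries that is roughly 500,000 entry decompressions instead of 1,000, so the repeated work grows quadratically with the number of entries.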


            People

              Assignee: Unassigned
              Reporter: Dawid Weiss
              Votes: 0
              Watchers: 3
