Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 1.0
- Labels: None
Description
Adding support for compressing and decompressing files with the LZMA algorithm (Lempel-Ziv-Markov chain algorithm).
(see http://markmail.org/search/?q=list%3Aorg.apache.commons.users/#query:list%3Aorg.apache.commons.users%2F+page:1+mid:syn4uuvbzusevtko+state:results)
Attachments
- compress-trunk-lzmaRev0.patch (702 kB) by maurel jean francois
- compress-trunk-lzmaRev1.patch (136 kB) by maurel jean francois
Issue Links
- is related to:
  - COMPRESS-54 Add 7zip archive support (Resolved)
  - COMPRESS-373 support writing the "old" lzma format (Resolved)
Activity
Separate from my replies in LEGAL-72, I'm concerned that we're forking an active codebase.
None of the changes above look like good reasons to fork, so we should be sending those changes back to the original project instead. Obviously we might have to do something in the short term if any of the changes are a release blocker.
Ideally this should be a maven dependency on an lzma-sdk-java-9.12.jar.
FWIW I'm with Hen here.
I think Commons Compress should provide adapters (i.e. the Compressor*Stream classes and maybe later support inside the zip package) to the lzma-sdk but not the lzma codebase itself.
I agree.
However, we'll still need to find a way to include the 7Zip Java code in the build.
At present, the code only seems to be available as source files in a zip file, with top-level package SevenZip.
As far as I can tell, the code is not available from a Maven repository currently.
Possible options:
- download the zip, extract the files, and compile them to a jar during the build
- get the code added to Maven
- include the source code (unchanged) in SVN
Unless anybody manages to get the code into the mvn repo, I'd prefer the download-and-build option. Then again, I have no idea whether that would be painful to do with Maven.
It's possible to do this using Maven+Ant.
Initially could perhaps be just Ant on its own, as a pre-requisite step for building the code.
Another question: should the compiled code be put in a separate Maven artifact?
If we do that, then this is akin to publishing the code on Maven.
By the way, I've had a look at the source, and two of the files (LzmaAlone and LzmaBench) don't compile as they refer to non-existent classes.
These are main (command-line) classes and are not necessary.
A new version of the patch, correcting a bug in uncompress (decode).
Known issue: it is possible to read the lzma signature but not to write it.
Someone could simply take their code, jar it up and send it to the Maven repo.
Forking's not bad if we have to do it (though I assume we've already tried and failed to get this accepted by the original project); but forking inside Compress is bad. As we don't plan to turn it into an actively developed component, it's probably easier for someone to knock up a code.google project.
FWIW, here is a pure Java implementation of the xz tools: http://git.tukaani.org/?p=xz-java.git;a=summary. It doesn't look complete yet, but it could be a viable base for commons-compress support.
There are plans to add support for the .xz format which supports LZMA2 compression. Once there is .xz support, is there a need for .lzma support? For most purposes, the .lzma format is a legacy format superseded by the .xz format.
I think support for the "old" lzma format would be beneficial for all people having to deal with legacy archives, so at least read-only support would be great.
Then again adding write support won't hurt either. We could recommend people use XZ instead inside the docs, of course.
http://jponge.github.com/lzma-java/ seems to have an Apache-licensed version of the LZMA compression (and streams) based on the original 7z SDK. We could incorporate that into commons-compress and would then just need to write a compatible archiver.
Not sure how alive this issue still is...

The fundamental problem is that Commons Compress does decompression via CompressorInputStream's read() methods, which are a pull-model interface, while the LZMA SDK (in the public domain) does it with Decoder.code(): a method that takes a compressed input stream and an output stream to decompress to, then reads, decompresses, and writes, returning only when the entire file is decompressed. There is no direct way to convert this to a pull-model CompressorInputStream: either you pull in one thread while pushing from another, or you push everything into a ByteArrayInputStream (which needs O(n) memory!) and then pull from that afterwards. Both are really ugly solutions: a thread per stream is heavyweight, and creating new threads is not allowed in some environments (e.g. unsigned applets and Java EE servers), while trying to allocate O(n) memory can OutOfMemoryError the entire JVM.
The Java LZMA attempts out there rate as follows:
Maurel’s patch here uses O(n) memory, and decompresses the entire stream in the constructor and stores it in a ByteArrayInputStream which is then copied from on each read().
http://jponge.github.io/lzma-java/ is licensed ASLv2 and states how it solved the push/pull problem: “Although not a derivate work, the streaming api classes were inspired from the work of Christopher League. I reused his technique of fake streams and working threads to pass the data around between encoders/decoders and "normal" Java streams.” In other words, it pushes in one thread and pulls in another. Actual decompression in the other thread is still done with the LZMA SDK, which it just wraps into an InputStream subclass.
http://contrapunctus.net/league/haques/lzmajio/ was done by Christopher League; it's under "LGPL or the Common Public License" and takes the same push-in-one-thread, pull-in-another approach. It, too, is just a wrapper around the LZMA SDK.
http://tukaani.org/xz/java.html is in the public domain and is already used by Commons Compress to provide XZ compression support. It supports XZ and LZMA2 only, and supports them well: a proper pull-model InputStream with no O(n) memory or background threads. LZMA2 is a different file format from LZMA, but then again LZMA2 uses LZMA internally. I'll have to investigate in detail.
We've decided to use the "store in memory" approach in the pack200 case, which has the same problem. As an alternative there is a temp-file solution: write the stream to disk and re-read it from there. This can be used for big inputs and is offered for pack200 as well. We briefly discussed the thread model back then; it could probably be added as a third strategy.
We already depend on XZ for Java for the XZ streams, so any solution based on Lasse's code wouldn't require any additional external dependency, which would be great. Lasse responded to this issue about two years ago, so he'll likely see our comments.
Hi,
For your information, the MinGW project uses lzma compression:
http://sourceforge.net/projects/mingw/files/MinGW/Base/mingw-rt/mingwrt-4.0/
Regards
Compared to the normal way of extracting a file from an archive (read -> decompress -> write), the temp-file solution requires read -> decompress -> write-temp -> read-temp -> write, increasing I/O time proportionally to the size of the decompressed file (i.e. at least doubling it), which is why I didn't even consider it.
It seems like LZMA2 breaks up the stream to be compressed into blocks and can (de)compress the blocks independently of each other (which has the benefit of allowing fast, multi-threaded decompression). In Lasse's code, LZMA2InputStream uses O(n) memory per block in the method RangeDecoder.prepareInputBuffer(), called from LZMA2InputStream.decodeChunkHeader(). For LZMA, however, the "block" is the entire file. Luckily it seems pretty easy to patch RangeDecoder to read incrementally. LZMA2InputStream probably also has to be modified, as I don't think LZMA has a chunk header. I don't know what else may be necessary.
Oh, and even if LZMA is a legacy format, we still need it for reading .7z files: their header compression (enabled by default) always uses LZMA.
Read support for LZMA has been added inside the LZMA branch - it relies on an unreleased version of XZ for Java.
I prepared a patch proposal based on the 7zip SDK, assuming that the 7zip SDK is compatible with the ASF licence.
I tried to follow what guidance I could find for this patch; however,
I am not familiar with the process of writing and submitting patches, so any further guidance is welcome.