[COMPRESS-111] support for lzma files - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0
Fix Version/s: 1.6
Component/s: Compressors
Labels: None

Description

Adding support for compressing and decompressing files with the LZMA algorithm (Lempel-Ziv-Markov chain algorithm)
(see http://markmail.org/search/?q=list%3Aorg.apache.commons.users/#query:list%3Aorg.apache.commons.users%2F+page:1+mid:syn4uuvbzusevtko+state:results)

Attachments

- compress-trunk-lzmaRev0.patch, 15/May/10 08:57, 702 kB, maurel jean francois
- compress-trunk-lzmaRev1.patch, 17/May/10 20:13, 136 kB, maurel jean francois

Issue Links

is related to

- COMPRESS-54 Add 7zip archive support (Resolved)
- COMPRESS-373 support writing the "old" lzma format (Resolved)

Activity

maurel jean francois added a comment - 15/May/10 08:54

I prepared a patch proposal based on the 7zip SDK, assuming that the 7zip SDK is compatible with the ASF licence.

- code style has been changed to make it more like Sun standards
- scope of variables and methods has been restricted as much as possible
- a simple test case with resources is added

I tried to follow the guidance I could find for this patch; however, I am not familiar with the process of writing and submitting patches, so any further guidance is welcome.

maurel jean francois added a comment - 15/May/10 08:57

proposal for a patch adding lzma support based on 7zip sdk

Henri Yandell added a comment - 16/May/10 17:39

Separate from my replies in LEGAL-72, I'm concerned that we're forking an active codebase.

None of the changes above look like good reasons to fork, so we should be sending those changes back to the original project instead. Obviously we might have to do something in the short term if any of the changes are a release blocker.

Ideally this should be a maven dependency on an lzma-sdk-java-9.12.jar.

Stefan Bodewig added a comment - 17/May/10 09:36

FWIW I'm with Hen here.

I think Commons Compress should provide adapters (i.e. the Compressor*Stream classes and maybe later support inside the zip package) to the lzma-sdk but not the lzma codebase itself.

Sebb added a comment - 17/May/10 10:04

I agree.

However, we'll still need to find a way to include the 7Zip Java code in the build.

At present, the code only seems to be available as source files in a zip file, with top-level package SevenZip.
As far as I can tell, the code is not available from a Maven repository currently.

Possible options:

- download the zip, extract the files and compile to jar during build
- get the code added to Maven
- include source code (unchanged) in SVN

Stefan Bodewig added a comment - 17/May/10 10:19

Unless anybody manages to get the code into the mvn repo I'd prefer the download and build option. Then again I have no idea if that would be painful to do with Maven.

Sebb added a comment - 17/May/10 11:10

It's possible to do this using Maven+Ant.

Initially it could perhaps be just Ant on its own, as a prerequisite step for building the code.

Another question: should the compiled code be put in a separate Maven artifact?
If we do that, then this is akin to publishing the code on Maven.

By the way, I've had a look at the source, and two of the files (LzmaAlone and LzmaBench) don't compile as they refer to non-existent classes.
These classes are main classes and are not necessary.

maurel jean francois added a comment - 17/May/10 20:13

a new version of the patch correcting a bug in uncompress (decode)
known issue: possible to read lzma signature but not to write it

Henri Yandell added a comment - 22/May/10 07:00

Someone randomly take their code, jar it up and send it to the Maven repo.

Forking's not bad if we have to do it (though I assume we've failed to get this in communication with the original project); but forking inside compress is bad. As we don't plan to turn it into an actively developed component, it's probably easier for someone to knock up a code.google project.

Alexander Kurtakov added a comment - 03/May/11 15:53

FWIW, http://git.tukaani.org/?p=xz-java.git;a=summary is a pure Java implementation of the xz tools. It does not look complete yet, but it could be a viable base for commons-compress support.

Lasse Collin added a comment - 20/Aug/11 11:42

There are plans to add support for the .xz format which supports LZMA2 compression. Once there is .xz support, is there a need for .lzma support? For most purposes, the .lzma format is a legacy format superseded by the .xz format.

Stefan Bodewig added a comment - 07/Nov/11 15:43

I think support for the "old" lzma format would be beneficial for all people having to deal with legacy archives, so at least read-only support would be great.

Then again adding write support won't hurt either. We could recommend people use XZ instead inside the docs, of course.

Adam Kunicki added a comment - 14/Sep/12 15:01

http://jponge.github.com/lzma-java/ seems to have an Apache-licensed version of the LZMA compression (and streams) based on the original 7z SDK. We could incorporate that into commons-compress and would then just need to write a compatible archiver.

Not sure how alive this issue still is.

Damjan Jovanovic added a comment - 07/May/13 21:01

The fundamental problem is that Commons Compress does decompression via CompressorInputStream’s read() methods, which are a pull-model interface, while the LZMA SDK (in the public domain) does it with Decoder.code(), a method that takes a compressed input stream and an output stream to decompress to, then reads, decompresses, and writes, only returning when the entire file is decompressed. There is no way to convert this to a pull-model CompressorInputStream: either you have to pull in one thread while pushing from another, or push everything into a ByteArrayInputStream (which needs O(n) memory!!) and then pull from that afterwards. Both are really ugly solutions: thread per stream is heavy and creating new threads is not allowed in some environments (eg. unsigned Applets and Java EE servers), while trying to allocate O(n) memory can OutOfMemoryError the entire JVM.

The Java LZMA attempts out there rate as follows:

Maurel’s patch here uses O(n) memory, and decompresses the entire stream in the constructor and stores it in a ByteArrayInputStream which is then copied from on each read().

http://jponge.github.io/lzma-java/ is licensed ASLv2 and states how it solved the push/pull problem: “Although not a derivate work, the streaming api classes were inspired from the work of Christopher League. I reused his technique of fake streams and working threads to pass the data around between encoders/decoders and "normal" Java streams.” In other words, it pushes in one thread and pulls in another. Actual decompression in the other thread is still done with the LZMA SDK, which it just wraps into an InputStream subclass.

http://contrapunctus.net/league/haques/lzmajio/ was done by Christopher League, it’s under “LGPL or the Common Public License” and has the same push in one thread pull in another story. It’s also just a wrapper of the LZMA SDK.

http://tukaani.org/xz/java.html is in the public domain and is already used by Commons Compress to provide XZ compression support. It supports XZ and LZMA2 only and supports them well - proper pull-model InputStream with no O(n) memory or background threads. LZMA2 is a different file format from LZMA. But then again LZMA2 uses LZMA internally. I’ll have to investigate in detail.
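
The thread-based workaround described above (push in one thread, pull from another) can be sketched roughly as follows. This is only an illustration under stated assumptions: `PushDecoder`, `ThreadedPullBridge` and the identity `copy` "decoder" are hypothetical names invented for this sketch, not the LZMA SDK's or lzma-java's actual classes. The bridge uses the JDK's `PipedInputStream`/`PipedOutputStream`, which may differ from the "fake streams" those projects implement.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

/**
 * Hypothetical stand-in for a push-model decoder: decode() only returns once
 * the whole input has been decoded and written to the output stream.
 */
interface PushDecoder {
    void decode(InputStream compressed, OutputStream decompressed) throws IOException;
}

final class ThreadedPullBridge {
    private ThreadedPullBridge() {}

    /**
     * Returns a pull-model InputStream. A worker thread runs the push-model
     * decoder and writes into a pipe; the caller reads from the other end.
     * Memory use is bounded by the pipe buffer, at the cost of one thread
     * per stream.
     */
    static InputStream open(PushDecoder decoder, InputStream compressed) throws IOException {
        PipedInputStream pullSide = new PipedInputStream(8192);
        PipedOutputStream pushSide = new PipedOutputStream(pullSide);
        Thread worker = new Thread(() -> {
            // Closing the pipe's write end signals end-of-stream to the reader.
            try (OutputStream out = pushSide) {
                decoder.decode(compressed, out);
            } catch (IOException e) {
                // Sketch only: the failure is swallowed and the reader just sees
                // end-of-stream. A real bridge would have to hand the exception
                // over to the reading thread.
            }
        }, "push-decoder-worker");
        worker.setDaemon(true);
        worker.start();
        return pullSide;
    }

    /** Identity "decoder" used only to exercise the bridge; it is not LZMA. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }
}
```

A caller would use `ThreadedPullBridge.open(decoder, compressedStream)` and then call `read()` on the result as on any other `InputStream`. The extra thread is exactly what makes this approach unusable in environments that forbid thread creation.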

Stefan Bodewig added a comment - 08/May/13 04:24

We've decided to use the "store in memory" approach in the pack200 case which has the same problem. As an alternative there is a temp-file solution - write the stream to disk and re-read it from there. This can be used for big inputs and is offered for pack200 as well. We briefly discussed the thread model back then, it could probably be added as a third strategy.

We already depend on XZ for Java for the XZ streams, so any solution based on Lasse's code wouldn't require us to use any other external dependency, which would be great. Lasse responded to this issue about two years ago, so he'll likely see our comments.
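
The "store in memory" and temp-file strategies mentioned above could look roughly like this for a generic push-model decoder. This is a sketch under stated assumptions, not the pack200 code in Commons Compress: `PushDecoder`, `BufferedPullBridge`, `viaMemory`, `viaTempFile` and the identity `copy` "decoder" are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical push-model decoder: returns only after decoding everything. */
interface PushDecoder {
    void decode(InputStream compressed, OutputStream decompressed) throws IOException;
}

final class BufferedPullBridge {
    private BufferedPullBridge() {}

    /** Decodes everything up front into the heap: simple, but O(n) memory. */
    static InputStream viaMemory(PushDecoder decoder, InputStream compressed) throws IOException {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        decoder.decode(compressed, all);
        return new ByteArrayInputStream(all.toByteArray());
    }

    /**
     * Decodes everything up front into a temporary file and re-reads it:
     * bounded heap use, but the decoded data is written and read one extra time.
     */
    static InputStream viaTempFile(PushDecoder decoder, InputStream compressed) throws IOException {
        Path tmp = Files.createTempFile("decoded", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            decoder.decode(compressed, out);
        } catch (IOException | RuntimeException e) {
            Files.deleteIfExists(tmp); // do not leave a partial file behind
            throw e;
        }
        // Delete the temporary file once the caller closes the stream.
        return new FilterInputStream(Files.newInputStream(tmp)) {
            @Override
            public void close() throws IOException {
                try {
                    super.close();
                } finally {
                    Files.deleteIfExists(tmp);
                }
            }
        };
    }

    /** Identity "decoder" used only to exercise the bridges; it is not LZMA. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }
}
```

Both methods finish decoding before the first `read()` returns, so neither needs a background thread. The trade-off is heap size in one case and extra disk I/O in the other.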

maurel jean francois added a comment - 08/May/13 06:47

Hi,
For your information MinGW project uses lzma compression.
http://sourceforge.net/projects/mingw/files/MinGW/Base/mingw-rt/mingwrt-4.0/

Regards

Damjan Jovanovic added a comment - 08/May/13 07:41 - edited

Compared to the normal way of extracting a file from an archive (read->decompress->write), the temp-file solution requires read->decompress->write-temp->read-temp->write, increasing I/O time proportionally to the size of the decompressed file (ie. at least doubling it), which is why I didn't even consider it.

It seems like LZMA2 breaks up the stream to be compressed into blocks, and can (de)compress the blocks independently of each other (which has the benefit of allowing fast, multi-threaded decompression). In Lasse's code, LZMA2InputStream uses O(n) memory per block in the method RangeDecoder.prepareInputBuffer() called from LZMA2InputStream.decodeChunkHeader(). For LZMA however, the "block" is the entire file. Luckily it seems pretty easy to patch RangeDecoder to read incrementally. LZMA2InputStream probably has to also be modified, as I don't think LZMA has a chunk header. I don't know what else may be necessary.

Oh and even if LZMA is a legacy format, we still need it for reading .7z files, which always use LZMA for header compression (which is enabled by default).

Stefan Bodewig added a comment - 09/Jun/13 11:50

Read support for LZMA has been added inside the LZMA branch - it relies on an unreleased version of XZ for Java.

Stefan Bodewig added a comment - 29/Sep/13 06:15

branch has been merged with svn revision 1525353

Stefan Bodewig added a comment - 29/Nov/16 19:53

write support will become available with COMPRESS-373

People

Assignee: Unassigned

Reporter: maurel jean francois

Votes: 6

Watchers: 9

Dates

Created: 10/May/10 14:36

Updated: 30/Nov/16 21:16

Resolved: 29/Sep/13 06:15