Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: Compressors
    • Labels:

      Description

      GZIP is not a compression algorithm "as such". The de facto (and currently the only supported) compression algorithm it uses is DEFLATE.
      GZIP adds a header of at least 10 bytes and a footer of 8 bytes to a "deflated" data stream. Find out more here: http://en.wikipedia.org/wiki/Gzip#File_format

      I have no problem with the current GZIP support, but it would be nice if Commons Compress also had compression and decompression support for "raw" DEFLATE streams and DEFLATE streams with the zlib header.

      As with the GZIP support in Commons Compress, this functionality can be implemented very easily using the standard java.util.zip package, as done in the provided patch.
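To make the three framings concrete, here is a minimal sketch that uses only java.util.zip. It is not the code from the attached patch; the class name DeflateFraming and its helper methods are made up for illustration. The `nowrap` flag of `Deflater` and `Inflater` selects raw DEFLATE (true) or zlib-wrapped DEFLATE (false).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical helper class, not part of Commons Compress.
final class DeflateFraming {

    /** Compresses to raw DEFLATE (nowrap = true) or zlib-wrapped DEFLATE (nowrap = false). */
    static byte[] deflate(byte[] data, boolean nowrap) throws IOException {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(sink, deflater)) {
            out.write(data);
        } finally {
            deflater.end(); // a caller-supplied Deflater is not released by close()
        }
        return sink.toByteArray();
    }

    /** Compresses to the GZIP container (header + DEFLATE data + CRC-32/size trailer). */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(data);
        }
        return sink.toByteArray();
    }

    /** Decompresses raw (nowrap = true) or zlib-wrapped (nowrap = false) DEFLATE data. */
    static byte[] inflate(byte[] compressed, boolean nowrap) throws IOException {
        Inflater inflater = new Inflater(nowrap);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed), inflater)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "DEFLATE DEFLATE DEFLATE DEFLATE".getBytes(StandardCharsets.UTF_8);
        System.out.println("raw  bytes: " + deflate(data, true).length);
        System.out.println("zlib bytes: " + deflate(data, false).length);
        System.out.println("gzip bytes: " + gzip(data).length);
    }
}
```

Without a preset dictionary, the zlib wrapper defined in RFC 1950 is a 2-byte header plus a 4-byte Adler-32 trailer, so the zlib output should be 6 bytes longer than the raw output for the same input and compression level.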

      1. COMPRESS-263_DeflateSupport.patch
        26 kB
        Matthias Stevens
      2. COMPRESS-263_DeflateSupport_v1.1.patch
        29 kB
        Matthias Stevens
      3. bla.tar.deflate
        0.5 kB
        Matthias Stevens
      4. bla.tar.deflatez
        0.5 kB
        Matthias Stevens

        Activity

        Matthias Stevens added a comment -

        Implements DEFLATE compressor and decompressor. Uses java.util.zip.
        Also includes a JUnit test case to test the new functionality.

        Matthias Stevens added a comment -

        Added patch which implements this feature. Please consider including this in the next official release.
        Thanks!

        Stefan Bodewig added a comment -

        Thanks, Matthias. Could you please separately attach the archive you use in the test case, it hasn't become part of the patch.

        Matthias Stevens added a comment -

        Ok, I must have forgotten to include it in the patch.

        Matthias Stevens added a comment -

        It's weird, the test archive was included when I created the patch. But anyway, I'll upload it separately as well.

        In the meantime I've expanded the test case a bit and it now uses 2 different test archives. I'll upload the new patch + the 2 archives now.

        Matthias Stevens added a comment -

        Improved test case. Now uses 2 test archives, also uploaded here.

        Stefan Bodewig added a comment -

        Thanks, I'll probably commit this during the weekend (minus the changes to the POM).

        One thing I might quibble about is the name of isZlibHeaderPresent - this reads the wrong way when applied to the writing side, where should...BePresent would be more appropriate. How about withZlibHeader?

        Also, do you think it is possible to auto-detect the format, at least in the case where the stream contains a ZLIB header?
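The naming point can be illustrated with a small sketch in which one direction-neutral option drives both the reading and the writing side. The class DeflateOptions and its factory methods are hypothetical and only stand in for whatever parameter object the patch uses. The option maps to the inverse of the `nowrap` flag in java.util.zip.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical options holder, used here only to show how the name reads on both sides.
final class DeflateOptions {
    private boolean withZlibHeader = true;

    /** True if streams carry (writing) or are expected to carry (reading) the zlib wrapper. */
    boolean withZlibHeader() {
        return withZlibHeader;
    }

    DeflateOptions setWithZlibHeader(boolean value) {
        this.withZlibHeader = value;
        return this;
    }

    /** Writing side: emit the zlib header and trailer only when requested. */
    OutputStream compressing(OutputStream out) {
        final Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, !withZlibHeader);
        return new DeflaterOutputStream(out, deflater) {
            @Override
            public void close() throws IOException {
                try {
                    super.close();
                } finally {
                    deflater.end(); // release native memory of the custom Deflater
                }
            }
        };
    }

    /** Reading side: expect the zlib header and trailer only when requested. */
    InputStream decompressing(InputStream in) {
        final Inflater inflater = new Inflater(!withZlibHeader);
        return new InflaterInputStream(in, inflater) {
            @Override
            public void close() throws IOException {
                try {
                    super.close();
                } finally {
                    inflater.end();
                }
            }
        };
    }
}
```

With this shape, `new DeflateOptions().setWithZlibHeader(false)` reads naturally whether it is passed to the compressing or the decompressing side.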

        Stefan Bodewig added a comment -

        I've committed your patch unchanged as svn revision 1602546

        Apart from the isZlibHeaderPresent name already mentioned there are three things I will change but you may want to discuss or provide a patch for:

        • we need docs
        • the count-invocations in input stream are counting uncompressed bytes where they should be counting the compressed amount. I think wrapping the original stream in a CountingInputStream is the way I'd go.
        • add counting to the output stream
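The second bullet, counting compressed rather than uncompressed bytes, can be sketched by placing a counting wrapper underneath the inflating stream. The nested CountingStream below is a hypothetical stdlib-only stand-in for the CountingInputStream mentioned above, and CompressedByteCounting is an illustrative name, not code from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class, not part of Commons Compress.
final class CompressedByteCounting {

    /** Minimal counting wrapper: counts bytes actually pulled from the wrapped stream. */
    static final class CountingStream extends FilterInputStream {
        private long count;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b != -1) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }

        long getCount() {
            return count;
        }
    }

    /** Compresses with the default zlib-wrapped DEFLATE settings. */
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(sink)) {
            out.write(data);
        }
        return sink.toByteArray();
    }

    /** Returns { compressed bytes consumed, uncompressed bytes produced }. */
    static long[] countBytes(byte[] compressed) throws IOException {
        // The counter sits below the inflater, so it sees compressed bytes.
        CountingStream counter = new CountingStream(new ByteArrayInputStream(compressed));
        long uncompressed = 0;
        try (InputStream in = new InflaterInputStream(counter)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                uncompressed += n; // counting here would measure decompressed bytes
            }
        }
        return new long[] { counter.getCount(), uncompressed };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000]; // zero-filled, highly compressible
        long[] counts = countBytes(compress(data));
        System.out.println("compressed read: " + counts[0] + ", uncompressed read: " + counts[1]);
    }
}
```

Because the inflater reads ahead in buffer-sized chunks, the compressed count can run ahead of what has been decoded so far; it is only exact once the stream has been read to the end.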
        Matthias Stevens added a comment -

        Thanks for committing the patch!

        Oops, I didn't mean to include the changes to the POM in my patch. Sorry, I'm quite new to all of this.

        You're probably right about the isZlibHeaderPresent name, I did consider a few alternatives, but they all were either reading (e.g. "expectZlibHeader") or writing (e.g. "putZlibHeader") specific. Your "withZlibHeader" sounds good to me though, go for it.

        I think auto-detection of a headerless zlib stream is not possible, but perhaps it is possible with the zlib header. I'll look into that.

        As for your other remarks: I'll look into providing docs, but I might not have time in the coming days/weeks. I'll also look at the byte counting stuff. Can you perhaps point me to an example in Compress where this is used the right way?
        Thanks!

        Matthias Stevens added a comment -

        I've done some more research regarding auto-detection of DEFLATE streams with a zlib header.
        It turns out the "ZLIB Compressed Data Format Specification" (http://tools.ietf.org/html/rfc1950) unfortunately does not define a "magic number"-like identifier. However, the low 4 bits of the first header byte (a field named "CM", for "Compression Method") should effectively always have the value 8, which indicates the compression method is DEFLATE. The zlib format specification mentions that "Other compressed data formats [besides DEFLATE] are not specified in this version of the zlib specification".
        So based on these specs alone (I haven't verified it yet in practice), it appears safe to assume that all valid DEFLATEd files with a zlib header will start with CM = 8.
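As a rough sketch of what such a check could look like, the method below tests the first two bytes against what RFC 1950 specifies: CM in bits 0-3 of the first byte (CMF) must be 8, CINFO in bits 4-7 must not exceed 7, and CMF*256 + FLG must be a multiple of 31 (the FCHECK rule). The class and method names are hypothetical, and this is a heuristic only, not a reliable signature.

```java
// Hypothetical heuristic, not part of Commons Compress.
final class ZlibHeaderCheck {

    /** Returns true if the two bytes are a plausible zlib (RFC 1950) header. */
    static boolean looksLikeZlibHeader(int cmf, int flg) {
        cmf &= 0xFF;
        flg &= 0xFF;
        int cm = cmf & 0x0F;   // compression method, 8 = DEFLATE
        int cinfo = cmf >>> 4; // log2(window size) - 8, at most 7 for DEFLATE
        return cm == 8 && cinfo <= 7 && ((cmf << 8) | flg) % 31 == 0;
    }

    public static void main(String[] args) {
        // Count how many of the 65536 possible two-byte prefixes pass each test.
        int cmOnly = 0;
        int fullCheck = 0;
        for (int cmf = 0; cmf < 256; cmf++) {
            for (int flg = 0; flg < 256; flg++) {
                if ((cmf & 0x0F) == 8) {
                    cmOnly++;
                }
                if (looksLikeZlibHeader(cmf, flg)) {
                    fullCheck++;
                }
            }
        }
        System.out.println("CM == 8 only: " + cmOnly + " of 65536");
        System.out.println("CM, CINFO and FCHECK: " + fullCheck + " of 65536");
    }
}
```

Checking CM alone accepts one in sixteen arbitrary first bytes. Adding the CINFO and FCHECK rules narrows that to roughly 0.1% of two-byte prefixes, which is still far weaker than a real magic number.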

        Stefan Bodewig added a comment -

        Well, that's only four bits of the first byte, this is bound to cause a lot of false positives so we'd better not go that route IMHO. Thanks for your investigation.

        Counting on the output side should be easy, but interestingly the other compression formats don't seem to do this either (only the archivers do). So let's skip that for deflate. I also only now realized XZCompressorInputStream's count is doing just what you have implemented as well, I'll look into it myself. There is org.apache.commons.compress.utils.CountingInputStream that used to be wrapped around the original input for some formats but no longer seems to be used.

        Re: withZlibHeader - see svn revision 1603054

        Stefan Bodewig added a comment -

        Counting isn't used consistently in Compress, gzip counts the decompressed bytes and only counts inside the input stream as well, so I won't bother you with it here. All that's missing is the documentation, I'll look into it.

        Stefan Bodewig added a comment -

        docs added with svn revision 1609881

        Stefan Bodewig added a comment -

        looking at the coverage report some additional rather trivial tests might help, I'll see to slipping them in.

        Matthias Stevens added a comment -

        Sorry for my lack of responsiveness the last 10 weeks, been extremely busy on the job.
        Is there still something I can help you with w.r.t. this feature?

        Stefan Bodewig added a comment -

        Thanks, I've already increased the test coverage a few weeks ago, so I think all is well.

        The vote for Compress 1.9 is currently under way; if all goes well it will be released in a few days and include DEFLATE support.


          People

          • Assignee:
            Unassigned
            Reporter:
            Matthias Stevens
          • Votes:
            0
            Watchers:
            2