Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: io
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

      1. HADOOP-6837-lzma-java-20100623.patch
        127 kB
        Nicholas Carlini
      2. HADOOP-6837-lzma-c-20100719.patch
        177 kB
        Nicholas Carlini
      3. HADOOP-6837-lzma-1-20100722.patch
        302 kB
        Nicholas Carlini
      4. HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch
        102 kB
        Greg Roelofs
      5. HADOOP-6837-lzma-2-20100806.patch
        371 kB
        Nicholas Carlini
      6. HADOOP-6837-lzma-3-20100809.patch
        362 kB
        Nicholas Carlini
      7. HADOOP-6837-lzma-4-20100811.patch
        361 kB
        Nicholas Carlini

        Activity

        Nicholas Carlini added a comment -

        Attached a patch. It includes an LzmaCodec, LzmaInputStream, and LzmaOutputStream. The LZMA compression/decompression uses the LZMA SDK from http://www.7-zip.org/sdk.html. The code has been tested minimally – when used in io.SequenceFile.java, it passes the TestSetFile/TestArrayFile tests. I will attach another patch later once I have fully-tested code. I will also be working on a native version of the code written in C, also based on the LZMA SDK; it is significantly faster than the Java code.
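
        For illustration, here is a rough sketch of how such a codec is typically driven through Hadoop's generic CompressionCodec API. The class name org.apache.hadoop.io.compress.LzmaCodec and the file names are assumptions for the example, not taken from the patch:

        import java.io.FileInputStream;
        import java.io.FileOutputStream;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.CompressionInputStream;
        import org.apache.hadoop.io.compress.CompressionOutputStream;
        import org.apache.hadoop.util.ReflectionUtils;

        public class LzmaCodecExample {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Instantiate the codec by name, the same way SequenceFile does.
            CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(
                conf.getClassByName("org.apache.hadoop.io.compress.LzmaCodec"), conf);

            // Compress some bytes to a file with the codec's default extension.
            CompressionOutputStream out = codec.createOutputStream(
                new FileOutputStream("data" + codec.getDefaultExtension()));
            out.write("hello, lzma".getBytes("UTF-8"));
            out.close();

            // Decompress it again.
            CompressionInputStream in = codec.createInputStream(
                new FileInputStream("data" + codec.getDefaultExtension()));
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
              System.out.write(buf, 0, n);
            }
            in.close();
          }
        }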

        Allen Wittenauer added a comment -

        The 7z SDK license is "Public Domain" and 7z LZMA is LGPL. Is that compatible with the APL?

        Greg Roelofs added a comment -

        7-Zip is LGPL; the LZMA SDK is not:

        "License

        LZMA SDK is placed in the public domain."

        Given that both packages are hosted at the same site, with links to each other on the left bar, I think we can safely assume they know the difference between the two and have made a conscious decision to release them accordingly.

        Eli Collins added a comment -

        Hey Nicholas,

        Cool stuff. Unfortunately the LGPL is incompatible with the APL, so we couldn't check this in. See more at http://www.apache.org/legal/resolved.html

        The LGPL is ineligible primarily due to the restrictions it places on larger works, violating the third license criterion. Therefore, LGPL-licensed works must not be included in Apache products

        Do you need to use this particular codec or are you just looking for something better than gzip/bzip2? If the latter, HADOOP-6349 (support for FastLZ) would be a great place to direct your efforts; it's got a compatible license and, like LZMA, is significantly faster than gzip/bzip (and faster than the open source version of lzo).

        Thanks,
        Eli

        Eli Collins added a comment -

        If LZMA is public domain then it should be safe to include. It would be good to have clarification from the author.

        Greg Roelofs added a comment -

        See the last line of their FAQ.

        Greg Roelofs added a comment -

        LZMA is not faster than gzip/bzip2; it compresses better. FastLZ (next item on Nicholas's plate) is faster than LZO but compresses more poorly than everything else (except maybe LZW). They're both useful, but they address different parts of the problem domain.

        Nicholas Carlini added a comment -

        Per the FAQ:

        "You can also read about the LZMA SDK, which is available under a more liberal license."

        http://www.7-zip.org/faq.html

        Scott Carey added a comment -

        Isn't there a new variant of LZMA (file extension xz) that uses LZMA2 and is block based (and therefore splittable)? We should definitely make sure that is the variant we want to support.

        LZMA is slower than gzip, but compresses better than both bzip2 and gzip. It is also optimized for fast decompression – it decompresses significantly faster than bzip2 (but not as fast as gzip).

        This link is useful for understanding the performance / compression ratio differences across the various compression levels provided for each:

        http://tukaani.org/lzma/benchmarks.html

        LZO, FastLZ, LZF, and the like are all faster than the above three but compress at a lower ratio. With LZMA support (hopefully .xz files, not the older 7zip) there is little reason to use bzip2 anymore – lzma level 2 compresses as fast as bzip2 level 1, but has a compression ratio as high as bzip2 level 9. lzma always decompresses 2 to 7 times as fast as bzip2 (only ~ half the decompression speed of gzip).

        It is the ideal archival storage format.

        Nicholas Carlini added a comment -

        The Java code from the SDK hasn't been updated since version 4.61 (released 23 November 2008), so support for LZMA2 would need to rely on C code or be ported to Java.

        The compression ratios of LZMA and LZMA2 are nearly identical (+/- .01% from the tests I did). It does look like LZMA2 is block based and is splittable, so that would be a major plus for it.

        On the differences between LZMA and LZMA2:

                  LZMA2 is an extension on top of the original LZMA. LZMA2 uses
                  LZMA internally, but adds support for flushing the encoder,
                  uncompressed chunks, eases stateful decoder implementations,
                  and improves support for multithreading.

        http://tukaani.org/xz/xz-file-format.txt

        I did have to add support for flushing the encoder to the Java code (flushing the encoder still produces valid lzma-compressed output).

        Greg Roelofs added a comment -

        Scott Carey wrote:

        lzma always decompresses 2 to 7 times as fast as bzip2 (only ~ half the decompression speed of gzip).

        I didn't see that in my tests. My measurements (last column) are in terms of compressed MB/sec, i.e., scaled by the compression ratio, but the ratios are close enough that that isn't a big factor:

        bzip2-1: text = 78.9% (1.1),   1.464 (0.028) ucMB/sec,   1.189 (0.037) cMB/sec
                 bin  = 50.1% (3.4),   1.395 (0.021) ucMB/sec,   2.170 (0.036) cMB/sec
        bzip2-9: text = 80.5% (1.0),   1.415 (0.028) ucMB/sec,   1.135 (0.037) cMB/sec
                 bin  = 51.6% (3.6),   1.340 (0.020) ucMB/sec,   1.878 (0.032) cMB/sec
        
        xz-1:    text = 79.6% (1.0),   2.705 (0.097) ucMB/sec,   1.457 (0.049) cMB/sec
                 bin  = 53.3% (3.5),   1.820 (0.031) ucMB/sec,   2.93  (0.20)  cMB/sec
        xz-9:    text = 82.4% (0.8),   0.240 (0.011) ucMB/sec,   1.433 (0.051) cMB/sec
                 bin  = 57.2% (3.6),   0.351 (0.010) ucMB/sec,   2.73  (0.17)  cMB/sec
        

        So xz/LZMA is definitely faster to decompress, but not immensely so. (This was all C code. The "text" and "bin" measurements are averages across roughly 350 files of each type, various sizes. Not a perfect corpus, but it should be varied enough to draw some reasonable conclusions. On the other hand, the file sizes are definitely much smaller than is typical in Hadoop jobs.)

        Btw, I didn't see Nicholas mention it, but all of the LZMA variants he tested appear to be stream-compatible--that is, any of the tools can decompress any of the others' streams, possibly modulo some header-parsing.

        Scott Carey added a comment -

        What happens if you compress a tarball of those files instead?

        Here are my results, on a directory with 4.1GB of ~64MB files. The content is mixed binary/text (key/value data, binary keys, mixed binary/text values).

        This is on CentOS 5.5, with the 'xz' and 'bzip2' packages installed via yum.

        Compression / decompression speed. The disk is capable of 200 MB/sec read/write; 16 GB RAM; Nehalem-based processor (Xeon E5620, 2.4 GHz).
        Tests were confirmed to be CPU bound with no iowait. Measurements are in MB/sec of uncompressed data.
        Source tarball, 4130 MB (100%)

        type       size      % of orig   compress time   compress rate   decompress time   decompress rate
        gzip -1    1430 MB   34.6%       105 s           39.3 MB/sec     42 s              98.3 MB/sec
        gzip -6    1240 MB   30.0%       251 s           16.5 MB/sec     41.5 s            99.5 MB/sec
        bzip2 -2   1003 MB   24.3%       656 s           6.3 MB/sec      168 s             24.6 MB/sec
        bzip2 -6   942 MB    22.8%       725 s           5.7 MB/sec      176 s             23.5 MB/sec
        bzip2 -9   926 MB    22.4%       763 s           5.4 MB/sec      181 s             22.8 MB/sec
        xz -2      993 MB    24.0%       429 s           9.63 MB/sec     95 s              43.5 MB/sec
        xz -6      794 MB    19.2%       2861 s          1.44 MB/sec     83 s              49.7 MB/sec

        Note that on today's newest processors, gzip decompresses at gigabit ethernet speeds. xz is half that, and bzip2 about half that again. Gzip and xz decompress faster at higher compression ratios, while bzip2 decompresses slower at higher ratios. All compress more slowly the higher the ratio, but bzip2 only slows down by ~20% from the fast to the slow settings, while gzip and xz slow down by a factor of 10+ (I did not do -9 tests here for those; they are very slow).

        IMO, since xz -2 is almost 2x as fast at compression and decompression as bzip2, with a similar compression ratio, it leaves little room for bzip2's use.
        At higher compression levels, xz is very slow to compress, but achieves compression ratios significantly better than anything else and still decompresses very fast, so it's great for archival storage.

        For faster compression, gzip -1 or lzo and other compression types without an entropy coder are the only options.

        The link I provided above has several cases where xz is 3 or more times faster than bzip2 at decompression, but my data doesn't behave that way.

        Raw Data:

        $ time cat packed.tar | gzip -c1 > packed.gz1
        real 1m44.938s
        user 1m42.200s
        sys 0m5.300s

        $ time cat packed.tar | gzip -c6 > packed.gz6
        real 4m11.051s
        user 4m8.438s
        sys 0m5.317s

        $ time cat packed.tar | bzip2 -2 > packed.bz2-2
        real 10m55.795s
        user 10m52.989s
        sys 0m5.030s

        $ time cat packed.tar | bzip2 -6 > packed.bz2-6
        real 12m4.847s
        user 12m2.049s
        sys 0m5.345s

        $ time cat packed.tar | bzip2 -9 > packed.bz2-9
        real 12m43.063s
        user 12m40.353s
        sys 0m4.797s

        $ time cat packed.tar | xz -zv -2 - > packed.xz
        100.0 % 991.1 MiB / 4,125.0 MiB = 0.240 9.6 MiB/s 7:09
        real 7m9.369s
        user 7m6.985s
        sys 0m7.140s

        $ time cat packed.tar | xz -zv -6 - > packed.xz6
        100.0 % 792.6 MiB / 4,125.0 MiB = 0.192 1.4 MiB/s 47:41
        real 47m41.033s
        user 47m37.794s
        sys 0m8.371s

        ------
        Tests of decompression:

        $ time cat packed.gz1 | gunzip > /dev/null
        real 0m42.081s
        user 0m41.814s
        sys 0m1.361s

        $ time cat packed.gz6 | gunzip > /dev/null
        real 0m41.512s
        user 0m41.021s
        sys 0m1.086s

        $ time cat packed.bz2-2 | bunzip2 > /dev/null
        real 2m48.528s
        user 2m48.014s
        sys 0m1.455s

        $ time cat packed.bz2-6 | bunzip2 > /dev/null
        real 2m56.511s
        user 2m55.999s
        sys 0m1.302s

        $ time cat packed.bz2-9 | bunzip2 > /dev/null
        real 3m1.064s
        user 3m0.559s
        sys 0m1.409s

        $ time cat packed.xz | xz -dc > /dev/null
        real 1m35.239s
        user 1m34.873s
        sys 0m1.301s

        $ time cat packed.xz6 | xz -dc > /dev/null
        real 1m23.219s
        user 1m22.771s
        sys 0m1.126s

        Nicholas Carlini added a comment -

        Uploaded C code with LzmaNativeInputStream and LzmaNativeOutputStream. Testing is the same as that for the Java code. The documentation is limited on the C side, and there are still (commented out) debug statements scattered all over.

        Hong Tang added a comment -

        @nicolas, per our offline conversation last week, have you looked into whether the licensing of liblzma is suitable for inclusion in Hadoop? Liblzma seems better in the sense that its API closely resembles the APIs of other compression libraries like bzip or zlib, which should shrink the amount of coding work needed to support C (and Java over JNI).

        Nicholas Carlini added a comment -

        I spoke with Greg about it just now and he said it would probably be better for me to work on FastLZ first and come back to this later.

        Nicholas Carlini added a comment -

        Attached a patch merging the java and c code together. Fixed two bugs in the C code, and removed some unneeded methods/arguments. Ran TestCodec on both of them for several hours each, and didn't find any bugs. Tested (again with TestCodec) all possible dictionary sizes from 1<<12 to 1<<23. Also added more (much needed) documentation.

        Nicholas Carlini added a comment -

        Attached an updated patch.

        Fixed the checksum mismatch. It was possible for the decompressor to run out of input after reading the header bytes but not notice if the block ID was 1. So if there were fewer than 26 bytes in the input (but more than 16) and the block ID was 1, it wouldn't notice and would just use whatever happened to be in the buffer at the time.

        Fixed a bug in the decompressor where it would incorrectly indicate it was finished if, at the end of decompressing a block, there was no more input left to decompress and decompress() was then called again (TestCodec seeds 1333275328, 2011623221, -1402938700, or -1990280158; generate 50,000 records). Actually, the decompressor never returns finished now. This is because the only time it should return true is when it somehow knows the end of the stream has been reached, and it doesn't know that; it just guesses that if it has read all the bytes it currently has then it's done, which is not necessarily the case.

        Implemented getRemaining().

        Removed iOff from both the compressor and decompressor. It was initialized to zero from the start and was only ever modified after that by setting it to 0.

        Modified TestCodec to accept a seed as an argument.

        Removed the rest of the carriage returns.

        I will be adding a native version over the next few days and will upload that patch when it's done.

        Nicholas Carlini added a comment -

        ... that was supposed to go on HADOOP-6349, not here. Ignore that.

        Greg Roelofs added a comment -

        Hand-edited pseudo-patch for easier reviewing.

        The "new" files from the LZMA SDK were diffed against the SDK itself (version 9.12), and patch hunks containing only trivial changes (Java package, formatting, Apache boilerplate, whitespace, etc.) were replaced with a corresponding meta-comment. Note that the 9.12 diffs were "diff -ruw" (i.e., ignoring whitespace-only changes on single lines) and only one directory level above the relevant code ("LZMA-SDK" vs. "LZMA"). The Java and C diffs were then combined, and the truly new files (e.g., LzmaCodec.java) were added back from the original patch.

        3700 lines vs. the original 10400+.

        Greg Roelofs added a comment -

        First, apologies for having steered Nicholas wrong on the liblzma issue. As
        Hong noted, it provides a much saner (that is, zlib-like) API for this sort
        of thing, but I mistakenly thought it shared the GPL license of (parts of)
        xz, so we ignored it and he worked on the LZMA SDK code instead. (The latter
        did include a Java port, however; liblzma does not.)

        Overall, the 20100722 patch looks pretty decent (given the starting material),
        but it does include some less-than-beautiful workarounds to cope with the
        impedance mismatch between push- and pull-style I/O models. In light of the
        fact that liblzma is, in fact, public-domain software (every file under
        xz-4.999.9beta-143-g3e49/src/liblzma is either explicitly in the public
        domain or has been automatically generated by such a file), I'm going to ask
        that Nicholas redo the native-code version to use liblzma rather than the
        SDK. (Unfortunately, it looks like the transformation from C SDK to liblzma
        was a significant amount of work, so it doesn't appear that a trivial "liblzma-
        ification" of the Java SDK code is likely. If Nicholas concurs with that
        assessment, we can instead file a separate JIRA to port the liblzma C code
        to Java.) Note that liblzma includes an LZMA2 codec, so Scott Carey's
        splittable-codec suggestion is within reach, too.

        OK, enough preamble. There were a number of relatively minor style issues,
        which I'll simply list below, but the five main concerns were:

        • FakeInputStream.java, FakeOutputStream.java: the linked lists of byte
          arrays are tough to swallow, even given the push/pull problem, even given
          our previous discussions on the matter. It would be good to know what the
          stats are on these things in "typical" cases--how frequently does overflow
          occur in LzmaInputStream, for example, and how many buffers are used?
        • Is the Code(..., len) call in LzmaInputStream guaranteed to produce len
          bytes if it returns true? The calling read() function assumes this, but
          it's not at all obvious to me; the parameter is outSize in Code(), and I
          don't see that it's decremented or even really used at all (other than
          being stored in oldOutSize), unless it's buried inside macros defined
          elsewhere.

        The next two (or perhaps three) are no longer directly relevant, but they're
        general things to watch out for:

        • The return value from inStream.read() in LzmaNativeInputStream.java is
          ignored even though there's no guarantee the call will return the requested
          number of bytes. A comment ("never have to call ... again") reiterates
          this error.
        • There's no need for recursion in LzmaNativeOutputStream.java's write()
          method; iterative solutions tend to be far cleaner, I think. (Probably
          ditto for LzmaNativeInputStream.java's read() method.)
        • LzmaCompressor.c has a pair of memleaks (state->outStream, state->inStream).

        Here are the minor readability/maintainability/cosmetic/style issues:

        • add LZMA SDK version (apparently 9.12) and perhaps its release date to the
          boilerplate
        • tabs/formatting of LZMA SDK code (CRC.java, BinTree, etc.): I think
          tabs are frowned upon in Hadoop, though I might be wrong; at any rate,
          they seem to be rarely used
          • for easy Hadoop-style formatting, "indent -i2 -br -l80" is a start (though
            it's sometimes confused by Java/C++ constructs)
        • reuse existing CRC implementation(s)? (JDK has one, Hadoop has another)
        • prefer lowercase "lzma" for subdirs
        • use uppercase "LZMA" when referring to codec/algorithm (e.g., comments)
        • add README mentioning anything special/weird/etc. (e.g., weird hashtable
          issue); summary of changes made for Hadoop; potential Java/C diffs; binary
          compatibility between various output formats (other than trivial headers/
          footers); LZMA2 (splittable, not yet implemented); liblzma (much cleaner,
          more zlib-like implementation, still PD); etc.
        • "ant javadoc" run yet? (apparently not recently)
        • line lengths, particularly comments (LzmaInputStream.java, etc.): should
          be no more than 80 columns in general (Hadoop guidelines)
        • avoid generic variable names for globals and class members; use existing
          conventions where possible (e.g., look at gzip/zlib and bzip2 code)
        • LzmaCodec.java:
          • uppercase "LZMA" when referring to codec/algorithm in general
          • "funcionality" x 4
          • "throws ... {" continuation line: don't further indent
        • LZ/InWindow.java
          • leftover debug code at end
        • RangeCoder/Encoder.java
          • spurious blank line (else just boilerplate)
        • FakeOutputStream.java:
          • "stuffeed"
          • "ammount"
          • isOverflow() -> didOverflow()
        • LzmaInputStream.java:
          • [uses FakeOutputStream]
          • "bufferd"
          • "we 've"
          • index -> overflowIndex (or similar): too generic
          • hasRead -> initialized
          • doInit(): use LzmaCodec getDefaultExtension() (or simply change text to
            "LZMA") instead of hardcoded ".lzma"
            • if higher-level code extended to support .xz, would this code even know?
          • read(): do doInit() after args-checking/throws
          • 3rd read() has bogus @return comment
          • overflow approach: data on how frequent, how many buffers?
            • no reuse: creating new FakeOutputStream on every read()
        • LzmaOutputStream.java
          • [uses FakeInputStream]
          • hasWritten -> initialized
          • 6 -> static final int defaultCompressionLevel
          • ctor: level -> this.level; l -> level (broken javadoc currently)
          • "may cause a lower compression ratio": measured? by how much?
        • LzmaNativeInputStream.java
          • line lengths
          • hasRead -> initialized
          • tmp -> tmpBuf
          • max -> maxBlockSize? ...?
          • index -> ...?
          • buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
          • what -> something better
          • 1<<16, 1<<12 -> static final whatever ints, and use N*1024 instead
          • allocateDirect() should be using config'd buffer size, I think (check
            zlib/similar impls for use of io.file.buffer.size)
          • giveMore(): return val not used, but if it were, true/false seems backward
            (expect: if !giveMore() => nothing available)
          • read(): point of extras just to support recursion? (extras, what: what??)
          • "decompresed"
        • LzmaNativeOutputStream.java
          • buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
          • 1<<16, 1<<17 -> static final whatever int (and 64*1024, etc.)
          • hasWritten -> initialized
          • tmp -> tmpBuf
        • LZMA/Encoder.java
          • "their" -> "original SDK" or similar
          • "Setup the encoder" -> "Set up the encoder"
          • "Do one encoding of a block" -> "Encode one block" ?
        • LzmaCompressor.c
          • buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
        • LzmaDecompressor.c
          • inPos vs. posInInBuf: ???
          • buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
          • 1<<16, 1<<12, difference -> macros or static const ints
        • Makefile.am
          • zlib -> LZMA
          • "All these are setup" -> "All these are set up"
          • "are also done" -> "is also done"
        • LzFind.c
          • remove commented-out stuff
          • (1<<15)+(1<<14) -> 48*1024 (etc.)
        • LZMA/LzmaEnc.c
          • tabs
        Nicholas Carlini added a comment -

        Responding to the major comments – will upload a patch that fixes these and the smaller comments soon.

        FakeInputStream LinkedList:
        This LinkedList can get fairly long, depending on how write is called. In the worst case it can have upwards of 12 million elements, which is far beyond acceptable; this happens if write(single_byte) is called over and over, since each call adds a new link. Looking back at this, a linked list probably wasn't the best way to go.

        There are two (obvious) ways write() could have worked. One is using linked lists, as I did. The other would be to create a byte array that can hold forceWriteLen bytes and just copy into it; however, that array can be as large as 12MB. There are then two ways to make it work. The first is allocating the 12MB right up front. The other is to start with maybe just 64k and grow it (by powers of two) until it reaches 12MB, although this would end up arraycopying a little under 12MB more in total than the other solution. I will implement one of these for the patch.
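
        A minimal sketch of the start-small-and-double variant; the field and method names here (pending, ensureCapacity) are placeholders, not identifiers from the patch:

        // Sketch of the grow-by-doubling buffer described above; not the patch code.
        class GrowableBuffer {
          private byte[] pending = new byte[64 * 1024];   // start at 64 KB
          private int pendingLen = 0;

          private void ensureCapacity(int extra) {
            int needed = pendingLen + extra;
            if (needed > pending.length) {
              int newSize = pending.length;
              while (newSize < needed) {
                newSize *= 2;                             // grow by powers of two
              }
              byte[] bigger = new byte[newSize];
              System.arraycopy(pending, 0, bigger, 0, pendingLen);
              pending = bigger;
            }
          }

          public void write(int b) {
            ensureCapacity(1);
            pending[pendingLen++] = (byte) b;             // no per-byte link allocation
          }

          public void write(byte[] b, int off, int len) {
            ensureCapacity(len);
            System.arraycopy(b, off, pending, pendingLen, len);
            pendingLen += len;
          }
        }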

        FakeOutputStream LinkedList:
        This linked list has a more reasonable use. Its purpose is to hold extra bytes just in case the input stream gives too many. I am fairly confident that at most 272 bytes (maximum number of fast bytes - 1) can be written to it. The reason I used a linked list, however, is that I couldn't formally prove this after going through the code. I wanted to be safe: just in case the SDK code doesn't behave as it should, everything will still work on the OutputStream end.

        Code(..., len)
        I think I remember figuring out that Code(...) will produce at least (but possibly more than) len bytes, with the one exception that when the end of the stream is reached it will only read up to the end of the stream. I will modify the decompressor to no longer assume this and to use the actual number of bytes read instead.

        Fixed the inStream.read() bug (it will be in the patch I upload). Added a while loop to read until EOF is reached so the assumptions hold.
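
        Roughly, the loop looks like the following; inStream, readFully, and the wrapper class are placeholders for the example, not the exact patch code:

        import java.io.IOException;
        import java.io.InputStream;

        // Read until either len bytes arrive or EOF is hit, so later code can
        // safely assume a short read only happens at end of stream.
        class FullReader {
          private final InputStream inStream;

          FullReader(InputStream inStream) {
            this.inStream = inStream;
          }

          int readFully(byte[] buf, int off, int len) throws IOException {
            int total = 0;
            while (total < len) {
              int n = inStream.read(buf, off + total, len - total);
              if (n == -1) {
                break;                 // EOF before len bytes were available
              }
              total += n;
            }
            return total;              // less than len only at end of stream
          }
        }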

        Converted the tail-recursive methods to while loops. (Java really should perform tail-call optimization when a method only calls itself recursively, since that would require no changes to the bytecode.)
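
        As an illustration of the recursion-to-loop change, a hypothetical iterative write(); the field and helper names (uncompressedDirectBuf, compressChunk) are placeholders, and the real buffer handling in the patch may differ:

        import java.io.IOException;
        import java.nio.ByteBuffer;

        // Hypothetical shape of an iterative write(); not the actual patch code.
        class IterativeWriteSketch {
          private final ByteBuffer uncompressedDirectBuf = ByteBuffer.allocateDirect(64 * 1024);

          public void write(byte[] b, int off, int len) throws IOException {
            while (len > 0) {
              int chunk = Math.min(len, uncompressedDirectBuf.remaining());
              uncompressedDirectBuf.put(b, off, chunk);
              if (!uncompressedDirectBuf.hasRemaining()) {
                compressChunk();                 // was the tail-recursive call site
              }
              off += chunk;
              len -= chunk;
            }
          }

          private void compressChunk() throws IOException {
            uncompressedDirectBuf.flip();
            // ... hand the full buffer to the native encoder here ...
            uncompressedDirectBuf.clear();
          }
        }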

        Fixed memory leaks.

        Nicholas Carlini added a comment -

        I found a bug where the Java compression would set a very, very wrong dictionary size. Instead of setting the dictionary size to, say, 2^24, it would set it to 24 (which would then be forced up to the minimum size, but still, very wrong).
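
        For illustration, the difference amounts to something like the following, using the SDK encoder's SetDictionarySize setter; the wrapper method is hypothetical and not part of the patch:

        import SevenZip.Compression.LZMA.Encoder;

        // Illustrative only: the bug passed the log2 of the dictionary size where
        // the size in bytes was expected.
        class DictionarySizeFix {
          static void configure(Encoder encoder, int dictionaryLogSize) {
            // encoder.SetDictionarySize(dictionaryLogSize);   // bug: asks for 24 bytes
            encoder.SetDictionarySize(1 << dictionaryLogSize); // fix: asks for 2^24 bytes
          }
        }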

        I also added a fairly long package.html (~3000 words) documenting how the implementation works, so anyone else who wants to modify it hopefully won't need to spend forever exploring the code to figure it out.

        And also, I was both right and wrong about giveMore(). I was right when I first wrote it, and wrong when I said I fixed it by using the return value. The return values were actually left over from when I was checking for end-of-stream on the Java side, but I realized it was possible (because of the semi-circular buffer) for Java to indicate an EOF that wasn't really true. So I had moved that check into the C code and just never removed the old code from the Java side.

        Fixed the linked list stuff.

        Also a fairly significant directory restructure. The modified SDK code now lives in src/contrib/SevenZip, with Java code under src/java and C code under src/native. I removed all of my re-formatting of their code, so should a future version of the SDK be released it shouldn't be as hard to diff and apply the patch to update this code. The makefile there builds into the same build tree as the rest of the Java code.

        In order to get it building correctly, I had to modify the base build.xml and contrib/build.xml. compile-core-classes now also depends on compile-contrib-before, and compile-contrib-before calls compile-before on contrib/build.xml. From there, contrib/build.xml calls compile-before on contrib/*/build.xml, with failonerror set to false so this change will not break any other build scripts. (This change is required because Lzma{Input,Output}Stream requires those classes to be built first.)

        I also cleaned up the code and fixed all the review comments.

        There will be at least one more version of this patch for things I didn't catch.

        Greg Roelofs added a comment -

        20100806 version looks pretty good. The first couple of issues are the main ones:

        • FakeOutputStream LinkedList of 1 KB buffers (BLOCKSIZE = 1024): supposed to be 128 KB buffers (package.html)
        • directory structure = unholy mix of 7zip, Hadoop (expected directory structure similar to LZMA SDK tarball contents, e.g., lzma912/Java/SevenZip/Compression/LZMA/Decoder.java and lzma912/C/LzFind.c)
          • OK to trim some of tarball's levels out (and to be different for Java, C), and top level need not be "lzma912" (though it makes connection to download more obvious)--I know I suggested "SevenZip", but I thought that appeared in both the Java and the C paths: suggest lzma912 or lzma-9.12 instead
        • kDefaultDictionaryLogSize no longer changed to 16: OK?
        • apparently bogus files:
          • src/contrib/SevenZip/ivy/libraries.properties
          • src/contrib/ec2/bin/hadoop-ec2-env.sh
        • "LZMA SDK" -> "LZMA SDK 9.12" in all boilerplate
        • CRC: prefer to reuse existing (e.g., PureJavaCrc32); should be compatible
        • LzmaNativeInputStream.java:
          • "circularwould"
          • read(): fast busy-loop not thread-friendly...and not necessary: read(b[]) (InputStream) blocks until at least 1 byte available--zero returned only if b[] has length zero, which is not true of oneByte[]
          • read(): t unnecessary; just do: return (int)oneByte[0] & 0xFF;
            or even: return (ret == -1)? -1 : (int)oneByte[0] & 0xFF;
          • 1<<13
        • LzmaNativeOutputStream.java:
          • buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf as above
          • 1<<16
        • LzmaOutputStream.java:
          • 1<<17
        • Makefile.am
          • "All these are setup" -> "All these are set up"
          • "are also done" -> "is also done"

        (Per previous feedback, please change all "1<<N" to the Java equivalent of static const int kSomeBufferSize = M * 1024: easier to read, easier to change later.)
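
        For example, something like the following; the constant names are only illustrative:

        // Java equivalent of the requested "static const int" style.
        class BufferSizes {
          static final int kUncompressedBufSize = 64 * 1024;   // instead of 1 << 16
          static final int kCompressedBufSize = 128 * 1024;    // instead of 1 << 17
        }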

        Btw, it's always better to follow the existing style consistently than to use your own for your changes (with the possible exception of the boilerplate comment). Perhaps Emacs hides the issue, but with 8-space tabs, your changes to the contrib LZMA files are a complete mismatch to their style.

        Be sure to run "ant javadoc" (and fix any new issues) before the next patch, and give "ant test" a shot, too (over the weekend if you happen to see this--it takes several hours to run). I'll work with you to get "ant test-patch" going.

        Greg Roelofs added a comment -

        On second thought, if you put the version into the src/contrib path as I suggested, there's no need to add it to the boilerplate text, too. That will make future forward-ports simpler (i.e., they can use the same boilerplate text).

        Nicholas Carlini added a comment -

        I'm going to respond to each one this time to make sure I don't miss anything again...

        FakeOutputStream isn't the one I'm talking about in package.html. That's for the OutputStream/FakeInputStream. FakeOutputStream is just the one where I couldn't justify that the maximum would hold (writing at most 273 bytes extra), so I added the linked list in case anything goes wrong.
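
        Roughly the shape of that safety net (the class name, block size, and fields below are illustrative only, not the actual FakeOutputStream code):

          import java.io.OutputStream;
          import java.util.LinkedList;

          // Sketch: a fixed primary buffer plus a linked list of small spill blocks,
          // so a write that runs past the expected maximum (273 extra bytes) is
          // captured instead of silently lost.
          class OverflowBufferSketch extends OutputStream {
            private static final int BLOCK_SIZE = 1024;

            private final byte[] primary;
            private int pos;
            private final LinkedList<byte[]> spill = new LinkedList<byte[]>();
            private byte[] spillBlock;
            private int spillPos;

            OverflowBufferSketch(int capacity) {
              primary = new byte[capacity];
            }

            @Override
            public void write(int b) {
              if (pos < primary.length) {
                primary[pos++] = (byte) b;
                return;
              }
              // Past the expected maximum: spill into 1 KB blocks rather than fail.
              if (spillBlock == null || spillPos == BLOCK_SIZE) {
                spillBlock = new byte[BLOCK_SIZE];
                spill.add(spillBlock);
                spillPos = 0;
              }
              spillBlock[spillPos++] = (byte) b;
            }
          }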

        Changed directory structure as suggested – src/contrib/lzma912/src/SevenZip/* has the java files and src/contrib/lzma912/src/native has the native code.

        The default dictionary size doesn't matter, because it gets set to something else on initialization. I left it there to keep the minimum set of changes.

        I needed libraries.properties there for it to build correctly: I have no idea what it does, but if I delete it then it doesn't build, and when it's empty it doesn't complain, so that's why it's there.

        Sorry about the ec2 file; I didn't realize that went into the patch! It's not supposed to be there.

        Deleted CRC.java ... apparently it was never used.

        Removed the read() while loop and rewrote it as suggested.

        Fixed the names (buffered/sendData) again; I had fixed them before but then reverted to an older version when I introduced a bug.

        Changed the 1<<x's to y*1024 constants.

        Makefile.am, fixed grammar.

        Yeah, sorry. I didn't realize that happened with the off-indentation. Fixed it.

        I'll talk to you about ant test-patch. I'm running ant test right now and if anything goes wrong I'll push another patch. ant javadoc has no warnings.

        Greg Roelofs added a comment -

        FakeOutputStream isn't the one I'm talking about in package.html. That's for the OutputStream/FakeInputStream. FakeOutputStream is just the one where I couldn't justify that the maximum would hold (writing at most 273 bytes extra), so I added the linked list in case anything goes wrong.

        Argh, yes, of course...we discussed that at least twice already. Sorry for spacing out.

        Did you ever instrument it to emit a warning if it did go above 273 extra bytes?

        Nicholas Carlini added a comment -

        I did. The average number of overflow bytes is 24. I never saw it go above 120. A quick sed/dc script tells me the standard deviation is 18. So I'm fairly sure that I am correct in that it will never go above 273. Setting the number of fast bytes to 273 gives an average of 37 and a standard deviation of 26.

        Pelle Nilsson added a comment -

        Do I read these comments correctly that LZMA2/xz is not included in the current patch, and might not be included as part of this issue since the LZMA Java lib does not support it?

        Nicholas Carlini added a comment -

        The current patch does not support LZMA2, and correct, it probably won't be included in this patch. There will probably be another JIRA filed to swap out the current C code with the LibLzma library to make the C side of things cleaner. LibLzma also has LZMA2 support, so that would come for free. It should be trivial to add LibLzma, which has a very zlib-like interface. And then there will probably be another JIRA to port LibLzma to Java, and make everything nice and clean.

        Nicholas Carlini added a comment -

        Fixed two FindBugs warnings. Made one variable static and removed another.

        Also I fixed build.xml to not crash when 'ant package' is run (I think: it works correctly when doing 'ant package' on src/contrib now, and it didn't with the previous patch). I had to override all of the build-contrib targets with empty ones; otherwise it'd go looking for directories which didn't exist (because I don't need them to exist).

        I also changed build-contrib.xml (which is included by the other build.xml's) to have a compile-before target in it so there won't be any errors (even if I did set failonerror to false). Safer that way.

        FindBugs fails because of four warnings in the original SevenZip code. I could change those classes to static to make FindBugs happy, but that would make it harder to apply upstream patches to the Java code, so unless there's a particularly good reason to do so I won't. (A minimal sketch of what that change would look like follows the list below.)
        SIC Should SevenZip.Compression.LZMA.Decoder$LenDecoder be a static inner class?
        SIC Should SevenZip.Compression.LZMA.Decoder$LiteralDecoder be a static inner class?
        SIC Should SevenZip.Compression.LZMA.Encoder$LiteralEncoder be a static inner class?
        SIC Should SevenZip.Compression.LZMA.Encoder$Optimal be a static inner class?
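
        The sketch, for reference (simplified names, not the actual SevenZip classes):

          // Flagged form: a non-static inner class keeps an implicit reference to its
          // enclosing instance even when it never uses it.
          class OuterEncoder {
            class Optimal {
              int price;
            }

            // FindBugs-preferred form: a static nested class carries no hidden reference.
            static class OptimalStatic {
              int price;
            }
          }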

        And JavaDoc finds an error:
        [javadoc] javadoc: error - Illegal package name: ""
        However, I cannot reproduce this on my local machine. Let's see if it happens with Hudson.

        Hong Tang added a comment -

        Nicholas has done some interesting work. But unfortunately I will -1 marking it "patch available". The current patch carries a modified version of the LZMA SDK. This is a huge maintenance overhead going forward where a much simpler solution clearly exists. We should explore the liblzma route first, as I mentioned in http://bit.ly/cDz2Pk.

        Jakob Homan added a comment -

        It may be good to address the code style issue now, since this patch diverges significantly from our standard (http://wiki.apache.org/hadoop/CodeReviewChecklist). Eclipse can re-format everything into Hadoop's style pretty well, if that will save time.

        Hong Tang added a comment -

        Canceling the patch.

        Greg Roelofs added a comment -

        The current patch carries a modified version of the LZMA SDK. This is a huge maintenance overhead going forward where a much simpler solution clearly exists.

        If the modifications are rolled into the SDK, this issue goes away. Nicholas, can you create a current diff of the src/contrib part relative to lzma912's original path structure (i.e., so it applies cleanly to a stock lzma912 codebase)? Then we can send it off to the 7Zip folks and see if they're willing to incorporate it.

        (And it only partly exists: there's no Java in liblzma. On the other hand, consensus around here seems to be that built-in Java support isn't necessary.)

        It may be good to address the code style issue now, since this patch diverges significantly from our standard

        Only the src/contrib portion does, and that was intentional. bzip2 is no longer actively developed, so an in-tree, heavily modified port is no big deal. LZMA, however, is still a very active project, and if we ever wanted to upgrade to a newer release (e.g., for performance or correctness reasons), we do not want a lot of whitespace noise hiding the real diffs.

        But this issue also largely disappears if the substantive modifications are accepted upstream; then the formatting is fairly irrelevant, though still a pain for diffs and patches. Either way, I don't think style rules are or should necessarily be applicable to contrib code (in the outside-the-core-codebase sense of "contrib").

        Joydeep Sen Sarma added a comment -

        thanks to everyone on getting lzma into hadoop. it seems to be very promising.

        i have tried applying the latest patch to both hadoop-0.20 (yahoo/facebook branch) and common-trunk. in both cases - when i try running TestCodec after compiling the native codec - i get a sigsegv:

        [junit] Running org.apache.hadoop.io.compress.TestCodec
        [junit] #
        [junit] # An unexpected error has been detected by Java Runtime Environment:
        [junit] #
        [junit] # SIGSEGV (0xb) at pc=0x00002aaad5215659, pid=16028, tid=1076017472
        [junit] #
        [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode linux-amd64)
        [junit] # Problematic frame:
        [junit] # C [libhadoop.so.1.0.0+0x5659] thisRead+0x49
        [junit] #

        separate from this - i had a question about tuning the compression level. in my testing on internal data using the lzma utility built from the SDK - i found a bunch of interesting options that provided a more suitable compromise between compression ratio/cpu (-a0 -mfhc4 -d24 -fbxxx) than the default. eyeing the 'level' based normalization - it seems i won't be able to quite achieve the settings i want by specifying a level. so it seems that being able to configure these options separately would be very useful.
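
        for instance - something along these lines is what i have in mind (the configuration key names below are made up for illustration; only the SDK encoder setters are real, and iirc the java encoder exposes fewer match finders than the c code):

          import org.apache.hadoop.conf.Configuration;
          import SevenZip.Compression.LZMA.Encoder;

          // illustrative only: these property names do not exist in the current patch.
          public class LzmaTuningSketch {
            public static Encoder newEncoder(Configuration conf) {
              Encoder encoder = new Encoder();
              encoder.SetAlgorithm(conf.getInt("io.compress.lzma.algorithm", 0));           // like -a0
              encoder.SetDictionarySize(1 << conf.getInt("io.compress.lzma.dict.log", 24)); // like -d24
              encoder.SetNumFastBytes(conf.getInt("io.compress.lzma.fast.bytes", 64));      // like -fbxxx
              encoder.SetMatchFinder(conf.getInt("io.compress.lzma.match.finder", 1));      // SDK match-finder id
              return encoder;
            }
          }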

        Erik Forsberg added a comment -

        Any progress on getting the new patch based on liblzma ready?

        Greg Roelofs added a comment -

        Any progress on getting the new patch based on liblzma ready?

        Not that I'm aware of. Nicholas has been concentrating entirely on his studies, AFAIK, and I've been working on other MR issues.

        Harsh J added a comment -

        FB's hadoop-0.20 seems to have a working implementation of this, although I do not know results of its stability yet: https://github.com/facebook/hadoop-20/blob/master/src/core/org/apache/hadoop/io/compress/LzmaCodec.java (and others).

        Joydeep Sen Sarma added a comment -

        yes - the fb-hadoop tree has a working implementation. most of the original code came from Baidu.

        we tried to convert many petabytes to lzma (switching from gzip compressed rcfile to lzma compressed). aside from speed issues (writes are very slow in spite of trying our best to fiddle around with different lzma settings directly in code) - the problem is we got rare corruptions every once in a while. these didn't seem to have anything to do with hadoop code - but with the lzma codec itself. certain blocks would be unreadable. we had to abandon the conversion project at that point.

        my gut is that for small scale uses - the lzma stuff as implemented in fb-hadoop-20 works.

        across petabytes of data - where every rcfile block (1MB) has multiple compressed streams (1 per column) - and we are literally opening and closing billions of compressed streams - there are latent bugs in lzma (that were well beyond our capability to debug - let alone reproduce accurately).

        we never had the same issues with gzip obviously (so the problem cannot be in hadoop components like HDFS).


          People

          • Assignee:
            Nicholas Carlini
            Reporter:
            Nicholas Carlini
          • Votes:
            2
            Watchers:
            19
