  1. Hadoop Common
  2. HADOOP-9785

LZ4 code may need an upgrade (the lz4.c embedded in libHadoop is r43, about 18 months old, while the latest version is r98)



    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.0.4-alpha, 3.0.0-alpha1
    • Fix Version/s: 2.3.0
    • Component/s: io, native
    • Labels: None


      While analyzing compression performance of different Hadoop codecs I noticed that the LZ4 code was taken from revision 43 of https://code.google.com/p/lz4/. The latest version is r98 and there may be extra performance benefits we can gain from using r98.

      We may want to involve the original LZ4 author, Yann Collet, in these discussions, as the current LZ4 code includes additional algorithms and parameters.

      To start the investigation, I ran preliminary experiments with the Silesia corpus. The latest release appears to improve both compression and decompression throughput compared with r43 (I have not done enough analysis to draw a statistically sound conclusion, but the results look good).

      Here is the raw output using LZ4 r43 on a SUBSET of the Silesia corpus (http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia):

      File: silesia/dickens

            *** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
            Compressed 10192446 bytes into 6433123 bytes ==> 63.12%
            Done in 0.07 s ==> 138.86 MB/s
            *** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
            Successfully decoded 10192446 bytes
            Done in 0.02 s ==> 486.01 MB/s

      File: silesia/mozilla

            *** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
            Compressed 51220480 bytes into 26379814 bytes ==> 51.50%
            Done in 0.25 s ==> 195.39 MB/s
            *** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
            Successfully decoded 51220480 bytes
            Done in 0.12 s ==> 407.06 MB/s

      File: silesia/mr

            *** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
            Compressed 9970564 bytes into 5669268 bytes ==> 56.86%
            Done in 0.04 s ==> 237.72 MB/s
            *** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
            Successfully decoded 9970564 bytes
            Done in 0.02 s ==> 475.43 MB/s

      File: silesia/nci

            *** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
            Compressed 33553445 bytes into 5880292 bytes ==> 17.53%
            Done in 0.08 s ==> 399.99 MB/s
            *** Compression CLI using LZ4 algorithm , by Yann Collet (Jul 29 2013) ***
            Successfully decoded 33553445 bytes
            Done in 0.06 s ==> 533.32 MB/s
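
      As a reading aid, the ratio and throughput figures printed above can be re-derived from the byte counts and elapsed times. The sketch below is not part of the original report. It assumes that the tool's "MB" means 2**20 bytes, which is inferred from the fact that the printed numbers match under that assumption; the function and variable names are hypothetical.

```python
MIB = 1 << 20  # assumption: the CLI's "MB" is 2**20 bytes (inferred from the figures)


def throughput_mib_s(num_bytes, seconds):
    """Throughput over the uncompressed size, in units of 2**20 bytes per second."""
    return num_bytes / MIB / seconds


def ratio_percent(original_bytes, compressed_bytes):
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed_bytes / original_bytes


# (file, original bytes, compressed bytes, compress seconds, decode seconds),
# copied from the r43 output above.
R43_RUNS = [
    ("dickens", 10192446, 6433123, 0.07, 0.02),
    ("mozilla", 51220480, 26379814, 0.25, 0.12),
    ("mr", 9970564, 5669268, 0.04, 0.02),
    ("nci", 33553445, 5880292, 0.08, 0.06),
]

for name, original, compressed, comp_s, decomp_s in R43_RUNS:
    print(
        f"{name:8s} ratio {ratio_percent(original, compressed):6.2f}%  "
        f"compress {throughput_mib_s(original, comp_s):7.2f} MB/s  "
        f"decode {throughput_mib_s(original, decomp_s):7.2f} MB/s"
    )
```

      Because the elapsed times are printed with only two decimal places, runs as short as 0.02 s give coarse throughput values, which is one more reason to treat these figures as preliminary.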

      And here is the raw output of LZ4 from the latest release, r98:

      File: silesia/dickens

            *** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
            Loading silesia/dickens...
            1-LZ4_compress : 10192446 -> 6434313 (63.13%), 172.3 MB/s
            LZ4_decompress_fast : 10192446 -> 676.0 MB/s

      File: silesia/mozilla

            *** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
            Loading silesia/mozilla...
            1-LZ4_compress : 51220480 -> 26382113 (51.51%), 281.7 MB/s
            LZ4_decompress_fast : 51220480 -> 1003.1 MB/s

      File: silesia/mr

            *** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
            Loading silesia/mr...
            1-LZ4_compress : 9970564 -> 5669255 (56.86%), 268.3 MB/s
            LZ4_decompress_fast : 9970564 -> 788.7 MB/s

      File: silesia/nci

            *** Full LZ4 speed analyzer , by Yann Collet (Jul 29 2013) ***
            Loading silesia/nci...
            1-LZ4_compress : 33553445 -> 5883923 (17.54%), 584.9 MB/s
            LZ4_decompress_fast : 33553445 -> 1208.3 MB/s
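
      To put the two sets of numbers side by side, the per-file speedup of r98 over r43 can be computed directly from the throughput figures reported above. This is only a rough sketch, not part of the original measurements: the r43 numbers come from the compression CLI and the r98 numbers from the speed analyzer, so the two runs may not be strictly comparable, and the nci compression figure for r98 is assumed to be in MB/s like the others. The names used are hypothetical.

```python
# (file, r43 compress, r98 compress, r43 decompress, r98 decompress), all in MB/s,
# copied from the raw output in this report. The r98 nci compress value is
# assumed to be MB/s, like every other figure.
RESULTS = [
    ("dickens", 138.86, 172.3, 486.01, 676.0),
    ("mozilla", 195.39, 281.7, 407.06, 1003.1),
    ("mr", 237.72, 268.3, 475.43, 788.7),
    ("nci", 399.99, 584.9, 533.32, 1208.3),
]


def speedup(old_mb_s, new_mb_s):
    """Ratio of new throughput to old throughput (values above 1.0 mean faster)."""
    return new_mb_s / old_mb_s


for name, c_old, c_new, d_old, d_new in RESULTS:
    print(
        f"{name:8s} compress x{speedup(c_old, c_new):.2f}  "
        f"decompress x{speedup(d_old, d_new):.2f}"
    )
```

      With only one run per file and coarse timings on the r43 side, these ratios indicate a direction rather than a measured effect size.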



              Assignee: Unassigned
              Reporter: German Florez-Larrahondo (gflarrahondo)

