Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

Description

      LZ4 is like LZO, but with better decompression rates, and it is BSD licensed, which means we can incorporate it in svn. Information about it can be found at http://code.google.com/p/lz4/ . Additionally, there is a JNI library for it (and for snappy, see ACCUMULO-139) at https://github.com/decster/jnicompressions . I did not find the license for that library, but it is a potential option.
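      For a sense of what the integration would lean on, here is a minimal sketch (not from this ticket) that compresses a buffer with Hadoop's org.apache.hadoop.io.compress.Lz4Codec. The class name is Hadoop's, but the buffer size is illustrative only, and depending on the Hadoop version the codec may require the native Hadoop library to be loaded.

      import java.io.ByteArrayOutputStream;
      import java.io.OutputStream;
      import java.util.Random;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.compress.Lz4Codec;

      public class Lz4Sketch {
        public static void main(String[] args) throws Exception {
          Lz4Codec codec = new Lz4Codec();
          codec.setConf(new Configuration()); // the codec needs a Hadoop Configuration

          byte[] data = new byte[1 << 20];    // 1 MB of random data (illustrative only)
          new Random().nextBytes(data);

          ByteArrayOutputStream compressed = new ByteArrayOutputStream();
          try (OutputStream out = codec.createOutputStream(compressed)) {
            out.write(data);
          }
          System.out.println("raw=" + data.length + " compressed=" + compressed.size());
        }
      }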


Activity

          decster Binglin Chang added a comment -

          It is the Apache license: https://github.com/decster/jnicompressions/blob/master/LICENSE

          vines John Vines added a comment -

          As a quick test to see if this is still worthwhile, I made a 574M RFile with the following code:

          // Imports and class wrapper added for completeness; the package names are
          // those of the Accumulo-internal classes of that era and may differ by version.
          import java.util.Random;

          import org.apache.accumulo.core.data.Key;
          import org.apache.accumulo.core.data.Value;
          import org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile;
          import org.apache.accumulo.core.file.rfile.RFile;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class RFileSizeTest {
            public static void main(String[] args) throws Exception {
              // Write an uncompressed ("none") RFile to the local filesystem.
              CachableBlockFile.Writer _cbw = new CachableBlockFile.Writer(
                  FileSystem.getLocal(new Configuration()).create(new Path("/tmp/bigTest.rf"), false, 4096, (short) -1, 1 << 26),
                  "none", new Configuration());
              RFile.Writer writer = new RFile.Writer(_cbw, 100 * 1024, 128 * 1024);

              Random r = new Random();
              byte[] colfb = new byte[128];
              byte[] colqb = new byte[128];
              byte[] value = new byte[128];

              String colf, colq;
              Value val = new Value();
              writer.startDefaultLocalityGroup();
              // One million entries: a 128-character row plus 128 random bytes each
              // for column family, column qualifier, and value.
              for (int i = 0; i < 1000000; i++) {
                r.nextBytes(colfb);
                r.nextBytes(colqb);
                colf = new String(colfb);
                colq = new String(colqb);
                Key k = new Key(String.format("%128d", i), colf, colq);

                r.nextBytes(value);
                val.set(value);
                writer.append(k, val);
              }

              writer.close();
            }
          }
          

          So these are uncompressed RFiles.

          I then ran a few different compression algorithms over it for an easy comparison:
          Gzip - 265M compressed (2.166 ratio), compression time 50.79s, decompression time 4.57s
          lz4 fast compression - 435M compressed (1.319 ratio), compression time 1.98s, decompression time 0.41s
          lz4 high compression - 352M compressed (1.630 ratio), compression time 29.66s, decompression time 0.32s
          lzo default compression - 398M compressed (1.442 ratio), compression time 2.24s, decompression time 1.36s
          lzo fast compression - 400M compressed (1.435 ratio), compression time 2.12s, decompression time 0.21s
          Snappy - 418M compressed (1.373 ratio), compression time 4.06s, decompression time 2.18s

          Compared to the others, lz4 has the lowest compression ratio, for starters. At its fastest, it compresses a negligible amount faster than lzo but takes almost double the time to decompress, though the timings are at a low enough resolution that they may not be accurate. All in all, I'd say the difference is negligible enough that I'm not going to bother, but it would be a good exercise for a first-time contributor.
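          (As an aside on methodology: the comment does not say how these times were measured, presumably with the command-line tools for each algorithm. Below is a rough, hedged sketch of how one such compression timing could instead be taken in Java with Hadoop's CompressionCodec API against the RFile produced above; the codec choice and buffer size here are assumptions, not what was actually used.)

          import java.io.InputStream;
          import java.io.OutputStream;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.IOUtils;
          import org.apache.hadoop.io.compress.GzipCodec;

          public class CodecTiming {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.getLocal(conf);

              GzipCodec codec = new GzipCodec(); // swap in Lz4Codec, SnappyCodec, etc.
              codec.setConf(conf);

              Path in = new Path("/tmp/bigTest.rf"); // the file written by the test above
              Path out = new Path("/tmp/bigTest.rf" + codec.getDefaultExtension());

              long start = System.currentTimeMillis();
              try (InputStream is = fs.open(in);
                   OutputStream os = codec.createOutputStream(fs.create(out, true))) {
                // Stream-copy the file through the compressor.
                IOUtils.copyBytes(is, os, conf);
              }
              System.out.println("compress ms: " + (System.currentTimeMillis() - start));
            }
          }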

          mdrob Mike Drob added a comment -

          Are the options for compression codecs pluggable? Seems like it would be nice to just have a compressor/decompressor interface and tell people that they can just throw implementations in lib/ext.

          Does Accumulo do much with the compressors other than toss them over to HDFS for getting codecs?

          kturner Keith Turner added a comment -

          Are the options for compression codecs pluggable?

          Not really with Accumulo's copy of BCFile. See org.apache.accumulo.core.file.rfile.bcfile.Compression.Algorithm. It would be nice to change this, or to check whether a newer version of BCFile already has.
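          (For illustration only, a purely hypothetical sketch of the kind of pluggable provider Mike Drob describes, wrapping Hadoop's CompressionCodec; none of these names exist in Accumulo, where the algorithms are hard-coded in Compression.Algorithm.)

          import org.apache.hadoop.io.compress.CompressionCodec;

          // Hypothetical SPI: implementations could be dropped into lib/ext instead
          // of being added to the hard-coded enum. All names are made up.
          public interface CompressionProvider {
            /** Name used in table configuration, e.g. "lz4". */
            String name();

            /** Default file extension for the algorithm, e.g. ".lz4". */
            String defaultExtension();

            /** The Hadoop codec that performs the actual compression. */
            CompressionCodec codec();
          }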

          mdrob Mike Drob added a comment -

          Closing as Won't Fix based on John Vines's assessment of negligible benefit. If somebody wants to pick this up, feel free to re-open.


People

    • Assignee: Unassigned
    • Reporter: vines John Vines
    • Votes: 1
    • Watchers: 8
