Hive
HIVE-1838

Add quickLZ compression codec for Hive.

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Activity

      He Yongqiang added a comment -

      Just found that there is already a jira on Hadoop side:

      https://issues.apache.org/jira/browse/HADOOP-6349

      Ashutosh Chauhan added a comment -

      No. I mean compression codec for Hive. It could be used to compress intermediate data.

      Oh, I see: you mean map outputs (of MR jobs). I confused it with the eventual job output, which gets written out in RCFile.

      Here are some results:

      Thanks for sharing that; that's useful. From your command, it seems you only ran the test once. To discount statistical variance (because of FS caches, OS cache, JVM warm-up, etc.), running the tests multiple times and averaging the results may be a good idea.
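The "run multiple times and average" advice can be sketched like this (a minimal Python sketch, not part of the original benchmark; zlib at level 1 stands in for a BEST_SPEED codec, and the payload and run count are made up):

```python
# Time the same compression job several times and average, so FS/OS
# caches and warm-up effects are smoothed out rather than dominating
# a single measurement.
import time
import zlib

def timed_compress(data: bytes, level: int = 1) -> float:
    start = time.perf_counter()
    zlib.compress(data, level)
    return time.perf_counter() - start

data = b"some moderately repetitive payload " * 100_000

runs = [timed_compress(data) for _ in range(5)]
avg = sum(runs) / len(runs)
print(f"avg over {len(runs)} runs: {avg:.4f}s (min {min(runs):.4f}s, max {max(runs):.4f}s)")
```

The same idea applies to the shell commands above: wrap each `time ...` invocation in a loop and average the `real` times.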

      He Yongqiang added a comment -

      Since the compression ratio is not good, it is not a good choice for compressing the final data (RCFile/SequenceFile/TextFile) if space is a concern.
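The tradeoff behind this comment can be illustrated with a small sketch (zlib only, since QuickLZ has no standard Python binding; this is an analogy on synthetic data, not a measurement of the codecs above): a fast setting compresses worse, which suits short-lived intermediate data better than final storage.

```python
# Compare a speed-oriented setting against a ratio-oriented one on the
# same input: the fast level trades compressed size for throughput.
import zlib

data = (b"row_value_" * 50 + b"\n") * 20_000

fast = zlib.compress(data, 1)   # analogous to BEST_SPEED
best = zlib.compress(data, 9)   # analogous to BEST_COMPRESSION

print(f"original: {len(data)}, fast: {len(fast)}, best: {len(best)}")
assert len(best) <= len(fast) < len(data)
```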

      He Yongqiang added a comment -

      No. I mean compression codec for Hive. It could be used to compress intermediate data.

      Here are some results:

      5. Hadoop compression with native library (COMPRESSLEVEL=BEST_SPEED)
      time java -Djava.library.path=/data/users/heyongqiang/hadoop-0.20/build/native/Linux-amd64-64/lib/ CompressFile

      real 0m34.179s
      user 0m29.031s
      sys 0m1.607s

      compressed size: 275M

      6. LZF
      [heyongqiang@dev782 compress_test]$ time lzf -c 000000_0

      real 0m39.031s
      user 0m8.727s
      sys 0m2.231s
      compressed size: 393M

      7. FastLZ
      time fastlz/6pack -1 000000_0 000000_0.fastlz
      real 0m19.020s
      user 0m18.083s
      sys 0m0.935s

      compressed size: 391M

      8. QuickLZ
      time ./compress_file ../000000_0 ../000000_0.quicklz

      real 0m15.652s
      user 0m14.047s
      sys 0m1.603s

      compressed size: 334M

      I modified QuickLZ's compress_file code to use a buffer for fairness. It turns out the result is very close to FastLZ's; the modified version of QuickLZ is just one second faster.
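The buffering change described here can be sketched as follows (zlib stands in for QuickLZ, and the buffer size is an arbitrary choice for the example): stream the input through a fixed-size buffer instead of compressing it in one shot, which is closer to how a Hadoop codec consumes data.

```python
# Stream input through a fixed-size buffer using an incremental
# compressor, then verify the result round-trips to the original.
import zlib

def compress_buffered(data: bytes, bufsize: int = 64 * 1024) -> bytes:
    comp = zlib.compressobj(level=1)
    out = []
    for i in range(0, len(data), bufsize):
        out.append(comp.compress(data[i:i + bufsize]))
    out.append(comp.flush())
    return b"".join(out)

data = b"intermediate map output " * 100_000
buffered = compress_buffered(data)
whole = zlib.compress(data, 1)
assert zlib.decompress(buffered) == data
print(f"buffered: {len(buffered)} bytes, one-shot: {len(whole)} bytes")
```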

      Ashutosh Chauhan added a comment -

      Compression codec for Hive? I think you mean a compression codec for RCFile. In the default settings, it uses the built-in DefaultCodec from Hadoop. Did you run some tests and find that the performance is not as good as you would like? If so, it would be great if you could share the results of your experiments.
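For context, intermediate compression in Hive is wired up through configuration along these lines (a hedged sketch; `QuickLZCodec` is a hypothetical class name standing in for whatever this issue would add, and property names may vary by version):

```sql
-- Enable compression of intermediate data between MR stages,
-- and select the codec for map outputs.
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.QuickLZCodec;
```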

      He Yongqiang added a comment -

      FastLZ may be better because of its license. And FastLZ has almost the same compression speed as QuickLZ.


        People

        • Assignee:
          Unassigned
          Reporter:
          He Yongqiang
        • Votes:
          0
          Watchers:
          1
