Apache Hudi / HUDI-2928

Evaluate rebasing Hudi's default compression from Gzip to Zstd


Details

    Description

      With Gzip as the default, we currently prioritize compression ratio (storage cost) at the expense of:

      • Compute on the write path: in local benchmarks on the Amazon Reviews dataset, Gzip accounts for about 30% of the compute spent during bulk-insert (see the attachments below).
      • Compute on the read path, as well as query latency: queries scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput is 3-4x lower than that of Snappy, Zstd, etc.).
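
      The trade-off described above can be checked with a small harness. The following is a minimal sketch, not the benchmark used for this issue: it times a compress/decompress round trip with the standard-library gzip module and, only if the third-party `zstandard` package happens to be installed, with Zstd as well. The synthetic payload and the helper names (`measure`, `build_codecs`, `run_benchmark`) are assumptions made for illustration, and numbers will vary by machine and data.

```python
import gzip
import time


def measure(name, compress, decompress, payload):
    """Time one compress/decompress round trip; report ratio and MB/s."""
    t0 = time.perf_counter()
    packed = compress(payload)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == payload  # the round trip must be lossless
    megabytes = len(payload) / 1e6
    return {
        "codec": name,
        "ratio": len(payload) / len(packed),
        "compress_mb_s": megabytes / max(t1 - t0, 1e-9),
        "decompress_mb_s": megabytes / max(t2 - t1, 1e-9),
    }


def build_codecs():
    """Return {name: (compress, decompress)}; zstd is added only if available."""
    codecs = {"gzip": (gzip.compress, gzip.decompress)}
    try:
        import zstandard  # optional third-party package (assumption)
        codecs["zstd"] = (
            zstandard.ZstdCompressor().compress,
            zstandard.ZstdDecompressor().decompress,
        )
    except ImportError:
        pass  # fall back to gzip-only measurements
    return codecs


def run_benchmark(payload):
    """Measure every available codec on the same payload."""
    return [measure(name, c, d, payload) for name, (c, d) in build_codecs().items()]


# Hypothetical stand-in for review text; real data compresses differently.
sample = b"".join(
    f"review {i}: great product, would buy again\n".encode() for i in range(50_000)
)
for row in run_benchmark(sample):
    print(row)
```

      Running the same harness on a representative slice of the real dataset, rather than on synthetic text, gives a more realistic picture of both ratio and throughput.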

      P.S. Spark switched its default compression algorithm to Snappy a while ago.

       

      EDIT

      We should actually evaluate switching to Zstd rather than Snappy. Its compression ratios are comparable to Gzip's, while it offers much better performance:

      https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/
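
      Until the default changes, the codec can be chosen per write. The following is a minimal sketch of building the writer options, assuming the `hoodie.parquet.compression.codec` key accepts the value `zstd` in the Hudi version in use. The helper name `hudi_write_options` and the restricted codec set are hypothetical and only cover the codecs discussed in this issue.

```python
def hudi_write_options(table_name, codec="zstd"):
    """Build Hudi writer options with an explicit Parquet compression codec."""
    # Only the codecs discussed in this issue; Parquet itself supports more.
    considered = {"gzip", "snappy", "zstd"}
    if codec not in considered:
        raise ValueError(f"unexpected codec {codec!r}, expected one of {sorted(considered)}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "bulk_insert",
        # Overrides the Gzip default for base (Parquet) files.
        "hoodie.parquet.compression.codec": codec,
    }


options = hudi_write_options("amazon_reviews")  # table name is a placeholder
print(options)

# With a Spark session and the Hudi bundle on the classpath, the options
# would typically be passed to the DataFrame writer, for example:
#   df.write.format("hudi").options(**options).mode("append").save(base_path)
```

      Keeping the codec as an explicit argument makes it easy to rerun the same bulk-insert with `gzip`, `snappy`, and `zstd` and compare file sizes and write times side by side.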

      Attachments

        1. image-2021-12-03-13-13-02-892.png
          226 kB
          Alexey Kudinkin
        2. Screen Shot 2021-12-03 at 12.36.13 PM.png
          1.31 MB
          Alexey Kudinkin
        3. Screen Shot 2021-12-06 at 11.49.05 AM.png
          184 kB
          Alexey Kudinkin

            People

              alexey.kudinkin Alexey Kudinkin