Details
- Type: Improvement
- Status: Open
- Priority: Critical
- Resolution: Unresolved
Description
Currently, with Gzip as the default, we prioritize Compression/Storage cost at the expense of:
- Compute (on the write path): about 30% of the Compute burned during bulk-insert in local benchmarks on the Amazon Reviews dataset goes to Gzip (see below)
- Compute (on the read path), as well as query latencies: queries scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput is 3-4x lower than Snappy, Zstd, etc.); a rough measurement sketch follows this list
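To make the codec trade-off concrete, below is a minimal, self-contained sketch of the kind of local measurement behind the numbers above: compress one buffer with Gzip, Snappy and Zstd and compare throughput and ratio. The payload is synthetic (not the actual Amazon Reviews data), and the snappy-java / zstd-jni libraries are assumed on the classpath; treat it as an illustration, not the original benchmark.

{code:java}
import com.github.luben.zstd.Zstd;
import org.xerial.snappy.Snappy;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CodecComparison {

  // Gzip via the JDK's GZIPOutputStream.
  static byte[] gzip(byte[] input) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(input);
    }
    return bos.toByteArray();
  }

  static void report(String codec, byte[] original, byte[] compressed, long nanos) {
    double mbPerSec = (original.length / (1024.0 * 1024.0)) / (nanos / 1e9);
    double ratio = (double) original.length / compressed.length;
    System.out.printf("%-6s  ratio=%.2f  throughput=%.1f MB/s%n", codec, ratio, mbPerSec);
  }

  public static void main(String[] args) throws IOException {
    // Synthetic, mildly repetitive payload standing in for review text.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 200_000; i++) {
      sb.append("review-").append(i % 1000)
        .append(": the product arrived on time and works as described. ");
    }
    byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);

    long t0 = System.nanoTime();
    byte[] gz = gzip(input);
    report("gzip", input, gz, System.nanoTime() - t0);

    long t1 = System.nanoTime();
    byte[] sn = Snappy.compress(input);
    report("snappy", input, sn, System.nanoTime() - t1);

    long t2 = System.nanoTime();
    byte[] zs = Zstd.compress(input, 3); // level 3 is Zstd's default
    report("zstd", input, zs, System.nanoTime() - t2);
  }
}
{code}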
P.S. Spark switched its default compression algorithm to Snappy a while ago.
EDIT
We should actually evaluate using Zstd instead of Snappy: it offers compression ratios comparable to Gzip while delivering much better performance:
https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/
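If we do evaluate Zstd, one way to A/B the candidates on the same ingest workload is sketched below, assuming a Spark-based writer that honors Spark's Parquet codec setting; the input/output paths are hypothetical, and Zstd support in the underlying Parquet/Hadoop builds depends on their versions, so this is a sketch to validate rather than a drop-in change.

{code:java}
import org.apache.spark.sql.SparkSession;

public class CodecEvaluation {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("compression-codec-evaluation")
        .getOrCreate();

    // Candidate codecs to compare on the same write workload.
    for (String codec : new String[] {"gzip", "snappy", "zstd"}) {
      spark.conf().set("spark.sql.parquet.compression.codec", codec);
      // Hypothetical paths: re-run the same write per codec and record
      // wall-clock time, CPU time, and resulting file sizes.
      spark.read().json("/data/amazon_reviews_sample.json")
          .write()
          .mode("overwrite")
          .parquet("/tmp/codec_eval/" + codec);
    }
    spark.stop();
  }
}
{code}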