Details
- Type: Improvement
- Status: Open
- Priority: Critical
- Resolution: Unresolved
Description
Currently, with Gzip as the default, we prioritize Compression/Storage cost at the expense of:
- Compute (on the write path): about 30% of the Compute burned during bulk-insert in local benchmarks on the Amazon Reviews dataset goes to Gzip (see below)
- Compute (on the read path), as well as query latencies: queries scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput is 3-4x lower than Snappy, Zstd, etc.); a rough measurement sketch follows this list
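To make the codec trade-off concrete, below is a minimal, self-contained sketch of the kind of local measurement behind the numbers above: compress one buffer with Gzip, Snappy and Zstd and compare throughput and ratio. The payload is synthetic (not the actual Amazon Reviews data), and the snappy-java / zstd-jni libraries are assumed on the classpath; treat it as an illustration, not the original benchmark.

{code:java}
import com.github.luben.zstd.Zstd;
import org.xerial.snappy.Snappy;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CodecComparison {

  // Gzip via the JDK's GZIPOutputStream.
  static byte[] gzip(byte[] input) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(input);
    }
    return bos.toByteArray();
  }

  static void report(String codec, byte[] original, byte[] compressed, long nanos) {
    double mbPerSec = (original.length / (1024.0 * 1024.0)) / (nanos / 1e9);
    double ratio = (double) original.length / compressed.length;
    System.out.printf("%-6s  ratio=%.2f  throughput=%.1f MB/s%n", codec, ratio, mbPerSec);
  }

  public static void main(String[] args) throws IOException {
    // Synthetic, mildly repetitive payload standing in for review text.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 200_000; i++) {
      sb.append("review-").append(i % 1000)
        .append(": the product arrived on time and works as described. ");
    }
    byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);

    long t0 = System.nanoTime();
    byte[] gz = gzip(input);
    report("gzip", input, gz, System.nanoTime() - t0);

    long t1 = System.nanoTime();
    byte[] sn = Snappy.compress(input);
    report("snappy", input, sn, System.nanoTime() - t1);

    long t2 = System.nanoTime();
    byte[] zs = Zstd.compress(input, 3); // level 3 is Zstd's default
    report("zstd", input, zs, System.nanoTime() - t2);
  }
}
{code}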
P.S. Spark switched its default compression algorithm to Snappy a while ago.
EDIT
We should actually evaluate using Zstd instead of Snappy: it offers compression ratios comparable to Gzip while delivering much better performance:
https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/
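If we do evaluate Zstd, one way to A/B the candidates on the same ingest workload is sketched below, assuming a Spark-based writer that honors Spark's Parquet codec setting; the input/output paths are hypothetical, and Zstd support in the underlying Parquet/Hadoop builds depends on their versions, so this is a sketch to validate rather than a drop-in change.

{code:java}
import org.apache.spark.sql.SparkSession;

public class CodecEvaluation {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("compression-codec-evaluation")
        .getOrCreate();

    // Candidate codecs to compare on the same write workload.
    for (String codec : new String[] {"gzip", "snappy", "zstd"}) {
      spark.conf().set("spark.sql.parquet.compression.codec", codec);
      // Hypothetical paths: re-run the same write per codec and record
      // wall-clock time, CPU time, and resulting file sizes.
      spark.read().json("/data/amazon_reviews_sample.json")
          .write()
          .mode("overwrite")
          .parquet("/tmp/codec_eval/" + codec);
    }
    spark.stop();
  }
}
{code}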