Details
- Type: Improvement
- Status: Patch Available
- Priority: Major
- Resolution: Unresolved
- 2.7.2
- None
Description
I've been testing Brotli, a new LZ77-based compression library from Google. Google's Brotli benchmarks look really good, and in our own tests we're also seeing a significant improvement in compressed size, compression speed, or both.
[blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite
real 1m17.106s
user 1m30.804s
sys 0m4.404s
[blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite
real 1m16.640s
user 1m24.244s
sys 0m6.412s
[blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite
real 3m39.496s
user 3m48.736s
sys 0m3.880s
[blue@work Downloads]$ ls -l
-rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
-rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
-rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
In this test, Brotli at quality 1 is as fast as snappy (1m16.6s versus 1m17.1s wall-clock) and produces a smaller file than gzip-9 (about 1.07 GB versus 1.42 GB). Another test resulted in a slightly larger Brotli file than gzip produced, but Brotli was 4x faster. I'd like to get this compression codec into Hadoop.
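To make the comparison easier to read, here is a minimal sketch that turns the numbers from the runs above into ratios. The sizes and wall-clock times are copied from the output in this issue; the helper name `relative_to` and the choice of baselines (gzip for size, snappy for time) are only illustrative assumptions, not part of any patch.

```python
# Figures copied from the `time` runs and the `ls -l` listing in this issue.
sizes_bytes = {
    "snappy": 2265950833,
    "brotli": 1068821936,
    "gzip": 1421601880,
}
wall_seconds = {
    "snappy": 1 * 60 + 17.106,
    "brotli": 1 * 60 + 16.640,
    "gzip": 3 * 60 + 39.496,
}


def relative_to(baseline, values):
    """Return each value divided by the baseline codec's value (hypothetical helper)."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


# Output size relative to gzip, and wall-clock time relative to snappy.
size_vs_gzip = relative_to("gzip", sizes_bytes)
time_vs_snappy = relative_to("snappy", wall_seconds)

for codec in sizes_bytes:
    print(
        f"{codec:7s} size={sizes_bytes[codec] / 1e9:.2f} GB "
        f"size/gzip={size_vs_gzip[codec]:.2f} "
        f"time/snappy={time_vs_snappy[codec]:.2f}"
    )
```

On these figures the Brotli file comes out at roughly three quarters of the gzip file's size, while its wall-clock time is within one percent of snappy's. This is a single run on one input file, so the ratios should be read as indicative only.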
Brotli is licensed under the MIT license, and the JNI library jbrotli is licensed under the Apache License 2.0 (ALv2).
Attachments
Issue Links
- is related to: ORC-1463 Support brotli codec (Closed)
- relates to: PARQUET-521 Add Brotli compression to Parquet (Patch Available)
- links to