Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Won't Do
Description
The parquet-format spec doesn't currently specify whether LZ4-compressed data should use the LZ4 frame format or the raw block format. We should choose one and make it explicit in the spec, as the two are not interoperable. After some discussions with others [1], we think it would be beneficial to use the framed format, which adds a small header in exchange for more self-contained decompression as well as a richer feature set (checksums, parallel decompression, etc.).
The current Arrow implementation compresses using the LZ4 block format, so it would need to be updated when we add the spec clarification.
If backwards compatibility is a concern, I would suggest adding an additional LZ4_FRAMED compression type, but that may be more noise than anything.
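To make the framed/block distinction concrete, here is a minimal Python sketch that checks whether a buffer begins with the LZ4 frame magic number (0x184D2204, stored little-endian). The function name `looks_like_lz4_frame` is hypothetical and is not part of Parquet or Arrow. The check is only a heuristic: block-format data carries no header at all, so a raw block could in principle start with the same four bytes.

```python
import struct

# Magic number that opens an LZ4 frame, stored little-endian on the wire.
LZ4_FRAME_MAGIC = 0x184D2204


def looks_like_lz4_frame(buf: bytes) -> bool:
    """Return True if buf starts with the LZ4 frame magic number.

    Hypothetical helper for illustration only. A False result does not
    prove the data is block-format LZ4; it only means no frame header
    was found at offset 0.
    """
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_FRAME_MAGIC


if __name__ == "__main__":
    # The four magic bytes as they appear at the start of a framed stream.
    framed_prefix = bytes([0x04, 0x22, 0x4D, 0x18])
    print(looks_like_lz4_frame(framed_prefix))
    # Arbitrary bytes with no frame header.
    print(looks_like_lz4_frame(b"\x10hello"))
```

A reader that has to accept files from several writers could use a check like this to report a clearer error, but it is not a substitute for the spec stating which format is required.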
Issue Links
- is superseded by
  - PARQUET-1996 [Format] Add interoperable LZ4 codec, deprecate existing LZ4 codec (Resolved)
  - PARQUET-1998 [C++] Implement LZ4_RAW compression (Resolved)
- relates to
  - PARQUET-1878 [C++] lz4 codec is not compatible with Hadoop Lz4Codec (Resolved)
  - PARQUET-1118 Build a corpus of Parquet files that client implementations can use for validation (Open)