Description
Although SPARK-36879 (https://issues.apache.org/jira/browse/SPARK-36879) added implementations for the DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY were added as top-level, standalone column encodings. DELTA_LENGTH_BYTE_ARRAY is used only as a subcomponent of DELTA_BYTE_ARRAY, where it encodes the non-shared string/binary suffixes.
Although there apparently aren't many writers of the standalone DELTA_LENGTH_BYTE_ARRAY encoding, it is part of the core Parquet spec (https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6). It could also be more efficient for binary/string data whose values share few common prefixes and therefore gain little from incremental (prefix) encoding.
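To illustrate the relationship between the two encodings, here is a simplified sketch in Python. It is not Spark's or parquet-mr's code: all function names are hypothetical, and the length lists are kept as plain Python lists, whereas the Parquet spec stores them with DELTA_BINARY_PACKED.

```python
from typing import List, Tuple


def encode_delta_length(values: List[bytes]) -> Tuple[List[int], bytes]:
    """DELTA_LENGTH_BYTE_ARRAY layout: all lengths first, then the concatenated bytes.

    Simplification: the real format packs the lengths with DELTA_BINARY_PACKED.
    """
    lengths = [len(v) for v in values]
    return lengths, b"".join(values)


def decode_delta_length(lengths: List[int], data: bytes) -> List[bytes]:
    """Slice the concatenated byte buffer back into individual values."""
    out, pos = [], 0
    for n in lengths:
        out.append(data[pos:pos + n])
        pos += n
    return out


def encode_delta_byte_array(values: List[bytes]) -> Tuple[List[int], List[int], bytes]:
    """DELTA_BYTE_ARRAY layout: prefix lengths shared with the previous value,
    plus the remaining suffixes stored in the DELTA_LENGTH_BYTE_ARRAY layout."""
    prefix_lengths, suffixes = [], []
    prev = b""
    for v in values:
        n = 0
        limit = min(len(prev), len(v))
        while n < limit and prev[n] == v[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(v[n:])
        prev = v
    suffix_lengths, suffix_data = encode_delta_length(suffixes)
    return prefix_lengths, suffix_lengths, suffix_data


def decode_delta_byte_array(
    prefix_lengths: List[int], suffix_lengths: List[int], suffix_data: bytes
) -> List[bytes]:
    """Rebuild each value from the previous value's prefix and its own suffix."""
    suffixes = decode_delta_length(suffix_lengths, suffix_data)
    out, prev = [], b""
    for n, suffix in zip(prefix_lengths, suffixes):
        value = prev[:n] + suffix
        out.append(value)
        prev = value
    return out
```

When values rarely share prefixes, the prefix-length list in the second layout is mostly zeros and is pure overhead, which is the case where the standalone length-only layout can win.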
The problem can be reproduced by trying to load delta_length_byte_array.parquet, one of the test files in https://github.com/apache/parquet-testing.
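A minimal PySpark sketch of that reproduction follows. The helper name, the local checkout path, and the choice to toggle `spark.sql.parquet.enableVectorizedReader` are assumptions for illustration; the description does not say which reader path fails or what the error looks like.

```python
def try_read_delta_length_file(path: str, vectorized: bool = True):
    """Attempt to read a Parquet file and return its rows.

    Hypothetical helper: `path` should point at a local copy of
    delta_length_byte_array.parquet from the parquet-testing repository.
    """
    # Imported lazily so that defining this helper does not itself require Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("delta-length-byte-array-repro")
        # Assumption: the vectorized reader is the path of interest here,
        # going by the title of the linked issue SPARK-36879.
        .config("spark.sql.parquet.enableVectorizedReader", str(vectorized).lower())
        .getOrCreate()
    )
    return spark.read.parquet(path).collect()
```

It might be called as `try_read_delta_length_file("parquet-testing/data/delta_length_byte_array.parquet")`, where the path is an assumed location of a local clone.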
Attachments
Issue Links
- relates to: SPARK-36879 Support Parquet v2 data page encodings for the vectorized path (Resolved)
- links to