Description
Currently Spark supports only the Parquet V1 encodings (i.e., PLAIN, DICTIONARY, and RLE) in the vectorized path, and throws an exception otherwise:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
It would be good to support the V2 encodings too, i.e., DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, and DELTA_BYTE_ARRAY, as well as BYTE_STREAM_SPLIT, as listed in https://github.com/apache/parquet-format/blob/master/Encodings.md
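A minimal reproduction sketch, assuming a local Spark 3.x session with the vectorized Parquet reader enabled (the default); the object name and the path /tmp/parquet_v2_repro are illustrative only. Asking parquet-mr for V2 pages makes string columns DELTA_BYTE_ARRAY-encoded, so reading the file back through the vectorized path hits the exception above:

{code:scala}
import org.apache.spark.sql.SparkSession

object ParquetV2Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parquet-v2-repro")
      .getOrCreate()
    import spark.implicits._

    // Ask parquet-mr to write Parquet V2 pages; with the V2 writer,
    // string columns are typically DELTA_BYTE_ARRAY-encoded and
    // integer columns DELTA_BINARY_PACKED.
    spark.sparkContext.hadoopConfiguration
      .set("parquet.writer.version", "PARQUET_2_0")

    Seq(("a", 1), ("b", 2), ("c", 3)).toDF("s", "i")
      .write.mode("overwrite").parquet("/tmp/parquet_v2_repro")

    // The vectorized reader (on by default) cannot decode the V2
    // encodings and fails with:
    //   java.lang.UnsupportedOperationException:
    //     Unsupported encoding: DELTA_BYTE_ARRAY
    spark.read.parquet("/tmp/parquet_v2_repro").show()
  }
}
{code}

Until the V2 encodings are implemented, setting spark.sql.parquet.enableVectorizedReader=false works around the error by falling back to the non-vectorized parquet-mr record reader, at the cost of scan performance.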
Issue Links
- incorporates
  - SPARK-37975 Implement vectorized BYTE_STREAM_SPLIT encoding for Parquet V2 support (Open)
  - SPARK-37974 Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support (Resolved)
- is related to
  - SPARK-40128 Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader (Resolved)
- supersedes
  - SPARK-26509 Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader (Resolved)