Description
Currently Spark supports only the Parquet V1 encodings (i.e., PLAIN, DICTIONARY, and RLE) in the vectorized path, and throws an exception otherwise:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
It would be good to support the V2 encodings too, i.e., DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, and DELTA_BYTE_ARRAY, as well as BYTE_STREAM_SPLIT, as listed in https://github.com/apache/parquet-format/blob/master/Encodings.md
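A minimal reproduction sketch, assuming a local Spark 3.x session with the vectorized Parquet reader enabled (the default); the object name and the path /tmp/parquet_v2_repro are illustrative only. Asking parquet-mr for V2 pages makes string columns DELTA_BYTE_ARRAY-encoded, so reading the file back through the vectorized path hits the exception above:

{code:scala}
import org.apache.spark.sql.SparkSession

object ParquetV2Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parquet-v2-repro")
      .getOrCreate()
    import spark.implicits._

    // Ask parquet-mr to write Parquet V2 pages; with the V2 writer,
    // string columns are typically DELTA_BYTE_ARRAY-encoded and
    // integer columns DELTA_BINARY_PACKED.
    spark.sparkContext.hadoopConfiguration
      .set("parquet.writer.version", "PARQUET_2_0")

    Seq(("a", 1), ("b", 2), ("c", 3)).toDF("s", "i")
      .write.mode("overwrite").parquet("/tmp/parquet_v2_repro")

    // The vectorized reader (on by default) cannot decode the V2
    // encodings and fails with:
    //   java.lang.UnsupportedOperationException:
    //     Unsupported encoding: DELTA_BYTE_ARRAY
    spark.read.parquet("/tmp/parquet_v2_repro").show()
  }
}
{code}

Until the V2 encodings are implemented, setting spark.sql.parquet.enableVectorizedReader=false works around the error by falling back to the non-vectorized parquet-mr record reader, at the cost of scan performance.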
Issue Links
- incorporates
  - SPARK-37975 Implement vectorized BYTE_STREAM_SPLIT encoding for Parquet V2 support (Open)
  - SPARK-37974 Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support (Resolved)
- is related to
  - SPARK-40128 Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader (Resolved)
- supersedes
  - SPARK-26509 Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader (Resolved)