  1. Spark
  2. SPARK-40128

Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.4.0
    • Component/s: SQL
    • Labels: None
    • Added support for keeping vectorized reads enabled for Parquet files using the DELTA_LENGTH_BYTE_ARRAY encoding as a standalone column encoding. Previously, the related DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY encodings were accepted as column encodings, but DELTA_LENGTH_BYTE_ARRAY would still be rejected as "unsupported".

    Description

      Although https://issues.apache.org/jira/browse/SPARK-36879 added implementations for the DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY were added as top-level standalone column encodings. DELTA_LENGTH_BYTE_ARRAY is used only as a subcomponent of DELTA_BYTE_ARRAY, where it encodes the non-shared string/binary suffixes.
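
      To make the subcomponent relationship concrete, here is a minimal Python sketch of the prefix/suffix split that DELTA_BYTE_ARRAY performs according to the Parquet spec. It is an illustration only, not Spark's reader code: the function names are hypothetical, and the real format additionally packs the prefix lengths with DELTA_BINARY_PACKED and the suffixes with DELTA_LENGTH_BYTE_ARRAY, which this sketch leaves as plain Python lists.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def split_prefix_suffix(values):
    """Split each value into (shared prefix length, non-shared suffix).

    Hypothetical helper: in the real format the suffix list is what gets
    encoded with DELTA_LENGTH_BYTE_ARRAY.
    """
    prefix_lengths, suffixes = [], []
    previous = b""
    for value in values:
        p = common_prefix_len(previous, value)
        prefix_lengths.append(p)
        suffixes.append(value[p:])
        previous = value
    return prefix_lengths, suffixes


def join_prefix_suffix(prefix_lengths, suffixes):
    """Rebuild the original values from prefix lengths and suffixes."""
    values = []
    previous = b""
    for p, suffix in zip(prefix_lengths, suffixes):
        value = previous[:p] + suffix
        values.append(value)
        previous = value
    return values


if __name__ == "__main__":
    data = [b"axis", b"axle", b"babble", b"babyhood"]
    lengths, tails = split_prefix_suffix(data)
    print(lengths, tails)
    print(join_prefix_suffix(lengths, tails) == data)
```

      The suffix list produced here is the part that DELTA_BYTE_ARRAY hands to DELTA_LENGTH_BYTE_ARRAY, which is why an implementation of the latter already existed before it was accepted as a standalone column encoding.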

      Although few writers appear to emit the standalone DELTA_LENGTH_BYTE_ARRAY encoding, it is part of the core Parquet spec: https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6 . It could also be more efficient for binary/string data that gains little from sharing common prefixes for incremental encoding.
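
      As a rough illustration of the standalone layout described in the Parquet spec (all value lengths first, then the concatenated value bytes), the following Python sketch may help. It is a simplification under stated assumptions: the function names are hypothetical, and the lengths are kept as a plain list, whereas the real encoding stores them with DELTA_BINARY_PACKED.

```python
def encode_delta_length(values):
    """Return (lengths, data): every length up front, then all bytes joined.

    Simplified sketch: real pages encode `lengths` with DELTA_BINARY_PACKED.
    """
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data


def decode_delta_length(lengths, data):
    """Slice the concatenated bytes back into values using the lengths."""
    values = []
    offset = 0
    for n in lengths:
        values.append(data[offset:offset + n])
        offset += n
    return values


if __name__ == "__main__":
    lengths, data = encode_delta_length([b"Hello", b"World", b"Foobar", b"ABCDEF"])
    print(lengths, data)
    print(decode_delta_length(lengths, data))
```

      Because no prefix bookkeeping is involved, this layout carries no per-value prefix length, which is the reason it can suit data with few shared prefixes.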

      The problem can be reproduced by loading delta_length_byte_array.parquet, one of the test files in https://github.com/apache/parquet-testing.

            People

              Assignee: Dennis Huo (dennishuo)
              Reporter: Dennis Huo (dennishuo)