SPARK-36879 (sub-task of SPARK-35743 Improve Parquet vectorized reader)

Support Parquet v2 data page encodings for the vectorized path


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, Spark supports only the Parquet V1 encodings (i.e., PLAIN/DICTIONARY/RLE) in the vectorized path and throws an exception otherwise:

      java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
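
      For context, DELTA_BYTE_ARRAY stores each value as the number of leading bytes it shares with the previous value plus the remaining suffix. The following is a minimal sketch of that reconstruction step only, not Spark's reader: the class name, method name, and inputs are hypothetical, and it assumes the prefix lengths and suffixes have already been decoded (in the real format they are themselves stored with DELTA_BINARY_PACKED and DELTA_LENGTH_BYTE_ARRAY).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Spark or parquet-mr.
class DeltaByteArraySketch {
    // prefixLengths[i]: number of leading bytes value i shares with value i-1.
    // suffixes[i]: the remaining bytes of value i (given as strings for readability).
    static List<String> decode(int[] prefixLengths, String[] suffixes) {
        List<String> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (int i = 0; i < prefixLengths.length; i++) {
            byte[] suffix = suffixes[i].getBytes(StandardCharsets.UTF_8);
            byte[] current = new byte[prefixLengths[i] + suffix.length];
            // Copy the shared prefix from the previous value, then append the suffix.
            System.arraycopy(previous, 0, current, 0, prefixLengths[i]);
            System.arraycopy(suffix, 0, current, prefixLengths[i], suffix.length);
            out.add(new String(current, StandardCharsets.UTF_8));
            previous = current;
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [apple, applesauce, apply]
        System.out.println(decode(new int[]{0, 5, 3}, new String[]{"apple", "sauce", "ly"}));
    }
}
```

      Because every value depends on the one before it, a reader for this encoding has to carry the previous value across batch boundaries.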
      

      It would be good to support the v2 encodings as well, including DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, and BYTE_STREAM_SPLIT, as listed in https://github.com/apache/parquet-format/blob/master/Encodings.md
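
      As an illustration of the simplest of these encodings, BYTE_STREAM_SPLIT for FLOAT scatters byte k of every little-endian value into stream k and concatenates the streams. The sketch below is a rough round trip under that reading of the format document; the class and method names are hypothetical and it is not the code added to Spark.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of Spark or parquet-mr.
class ByteStreamSplitSketch {
    // Encode n floats into 4 concatenated streams of n bytes each.
    static byte[] encode(float[] values) {
        int n = values.length;
        byte[] out = new byte[n * 4];
        ByteBuffer le = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < n; i++) {
            le.putFloat(0, values[i]);
            for (int k = 0; k < 4; k++) {
                out[k * n + i] = le.get(k); // byte k of value i goes to stream k
            }
        }
        return out;
    }

    // Reverse the split: gather byte k of value i from stream k.
    static float[] decode(byte[] encoded) {
        int n = encoded.length / 4;
        float[] out = new float[n];
        ByteBuffer le = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < 4; k++) {
                le.put(k, encoded[k * n + i]);
            }
            out[i] = le.getFloat(0);
        }
        return out;
    }
}
```

      The strided access in decode (stepping by n bytes per stream) is the part a vectorized implementation would likely want to batch, since it touches four separate regions of the page for every value.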

            People

              Assignee: Parth Chandra (parthc)
              Reporter: Chao Sun (csun)