Spark / SPARK-38094

Parquet: enable matching schema columns by field id

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Field ID (field_id) is a native field of the Parquet schema definition (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398).

      After this PR, when the requested schema has field IDs, Parquet readers first use the field ID to determine which Parquet columns to read, and fall back to matching by column name as before. This enables matching columns by field ID for data warehouse systems that support it, such as Iceberg and Delta.

      This PR supports:

      • vectorized reader
      • parquet-mr reader
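
      The matching order described above can be illustrated with a small, self-contained sketch. It does not use Spark or Parquet APIs: the Field class and the resolve_columns function are hypothetical names introduced only for this illustration. It also assumes that a requested field carrying an ID is resolved by ID only, and that name matching applies to fields without an ID; the description does not spell out that detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """Simplified column descriptor (hypothetical, not a Spark or Parquet class)."""
    name: str
    field_id: Optional[int] = None


def resolve_columns(requested, file_columns):
    """Map each requested field name to the matching file column name, or None."""
    # Index the file's columns by ID (where present) and by name.
    by_id = {c.field_id: c for c in file_columns if c.field_id is not None}
    by_name = {c.name: c for c in file_columns}

    resolved = {}
    for field in requested:
        if field.field_id is not None:
            # Assumption: a requested field with an ID is matched by ID only.
            match = by_id.get(field.field_id)
        else:
            # No ID on the requested field: match by column name as before.
            match = by_name.get(field.name)
        resolved[field.name] = match.name if match else None
    return resolved


# Columns as stored in the file; "city" was written without a field ID.
file_columns = [Field("id", 1), Field("full_name", 2), Field("city")]
# Requested schema after "full_name" was renamed to "name".
requested = [Field("id", 1), Field("name", 2), Field("city")]

mapping = resolve_columns(requested, file_columns)
print(mapping)
```

      In this sketch the renamed column "name" still resolves to the stored column "full_name" because both carry field ID 2, which is the kind of rename-safe lookup that ID-based table formats rely on.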

            People

              Assignee: Jackie Zhang (jackierwzhang)
              Reporter: Jackie Zhang (jackierwzhang)
              Votes: 0
              Watchers: 3
