[SPARK-38094] Parquet: enable matching schema columns by field id - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.0
Fix Version/s: 3.3.0
Component/s: Spark Core
Labels: None

Description

Field ID is a native field in the Parquet schema (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398).

After this PR, when the requested schema has field IDs, Parquet readers first use the field ID to determine which Parquet columns to read, and fall back to matching by column name as before. This enables matching columns by field ID for systems that support it, such as Iceberg and Delta.
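The matching order described above can be illustrated with a minimal Python sketch. This is not Spark's implementation: the `Field` class and `match_columns` function are hypothetical names chosen for illustration. The sketch also assumes that a requested field carrying an ID is matched by ID only (an unmatched ID yields no column), and that name matching applies to requested fields without an ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """A column described by its name and an optional Parquet field ID (hypothetical type)."""
    name: str
    field_id: Optional[int] = None


def match_columns(requested, file_columns):
    """Map each requested field name to the matching file column name, or None."""
    by_id = {c.field_id: c for c in file_columns if c.field_id is not None}
    by_name = {c.name: c for c in file_columns}
    result = {}
    for req in requested:
        if req.field_id is not None:
            # Assumption: a requested field that carries an ID is matched by ID only.
            match = by_id.get(req.field_id)
        else:
            # No ID on the requested field: fall back to matching by column name.
            match = by_name.get(req.name)
        result[req.name] = match.name if match else None
    return result


# The file was written while column 1 was still named "user"; the reader now calls it "customer".
file_columns = [Field("user", 1), Field("ts", 2), Field("note")]
requested = [Field("customer", 1), Field("note"), Field("missing", 9)]
resolved = match_columns(requested, file_columns)
```

Here `customer` resolves to the file column `user` because both carry ID 1, which is why ID-based matching survives column renames. `note` resolves by name, and `missing` resolves to nothing because no file column has ID 9.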

This PR supports:

vectorized reader
Parquet-mr reader

Issue Links

Blocked: SPARK-39997 ParquetSchemaConverter fails match schema by id (In Progress)

links to

[Github] Pull Request #35385 (jackierwzhang)

[Github] Pull Request #35700 (jackierwzhang)

People

Assignee: Jackie Zhang

Reporter: Jackie Zhang

Votes: 0

Watchers: 3

Dates

Created: 03/Feb/22 00:39

Updated: 06/Aug/22 14:03

Resolved: 18/Feb/22 15:12