Details
- Improvement
- Status: Open
- Major
- Resolution: Unresolved
- 3.4.0
- None
- None
Description
https://github.com/apache/spark/pull/37228 introduces the ability to compute row indexes, which users can access through the `_metadata.row_index` column. Internally, this is achieved with the help of an extra column, `_tmp_metadata_row_index`. When this column is present in the schema sent to the Parquet reader, the reader populates it with row indexes, and the values are later placed in the `_metadata` struct.
While relatively unlikely, it is still possible that a user's data contains a column named `_tmp_metadata_row_index`. In that scenario, the column will be populated with row indexes rather than with the data read from the file.
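The collision can be illustrated with a toy model in plain Python. This is not Spark's reader: `read_rows` and the dict-based rows are hypothetical stand-ins, and only the column name `_tmp_metadata_row_index` is taken from the description above.

```python
ROW_INDEX_COL = "_tmp_metadata_row_index"  # internal column name from the description


def read_rows(file_rows, requested_columns):
    """Toy reader (hypothetical): return the requested columns for each row.

    Mirrors the behavior described above: a requested column named
    ROW_INDEX_COL is filled with the row's position in the file,
    even when the file stores its own values under that name.
    """
    result = []
    for index, row in enumerate(file_rows):
        out = {}
        for col in requested_columns:
            if col == ROW_INDEX_COL:
                out[col] = index  # generated value shadows the stored one
            else:
                out[col] = row[col]
        result.append(out)
    return result


# A made-up file whose user data happens to use the internal name.
file_rows = [
    {"id": 10, ROW_INDEX_COL: 500},
    {"id": 11, ROW_INDEX_COL: 600},
]

rows = read_rows(file_rows, ["id", ROW_INDEX_COL])
stored = [r[ROW_INDEX_COL] for r in file_rows]  # values written by the user
returned = [r[ROW_INDEX_COL] for r in rows]     # values the toy reader hands back
```

In this model `stored` holds the user's values while `returned` holds row positions, which is the silent substitution this ticket describes.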
To reproduce, search `FileMetadataStructRowIndexSuite.scala` for this Jira ticket number.
We could introduce some kind of countermeasure to handle this scenario.
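The ticket does not say which countermeasure is intended. As a rough sketch only, two possible approaches are shown below in plain Python: failing fast when the data already uses the internal name, or deriving an internal name that no data column uses. Both function names and the suffix scheme are assumptions made for illustration, not part of Spark.

```python
ROW_INDEX_COL = "_tmp_metadata_row_index"  # internal column name from the description


def check_no_collision(data_columns):
    """Option 1 (assumed): reject schemas that already use the internal name."""
    if ROW_INDEX_COL in data_columns:
        raise ValueError(
            f"Column name {ROW_INDEX_COL!r} is reserved for internal use"
        )


def pick_internal_name(data_columns, base=ROW_INDEX_COL):
    """Option 2 (assumed): derive an internal name that no data column uses.

    Appends a numeric suffix until the candidate no longer clashes.
    """
    taken = set(data_columns)
    candidate = base
    suffix = 0
    while candidate in taken:
        suffix += 1
        candidate = f"{base}_{suffix}"
    return candidate
```

The first option surfaces the problem to the user, while the second keeps such files readable at the cost of a variable internal name.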
Issue Links
- is related to SPARK-37980 Extend METADATA column to support row indices for file based data sources (Resolved)