[SPARK-40059] Row indexes can overshadow user-created data - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.4.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

https://github.com/apache/spark/pull/37228 introduces the ability to compute row indexes, which users can access through the `_metadata.row_index` column. Internally, this is achieved with an extra column, `_tmp_metadata_row_index`. When this column is present in the schema sent to the Parquet reader, the reader populates it with row indexes, and the values are later placed in the `_metadata` struct.

While relatively unlikely, it is still possible that a user's data includes a column named `_tmp_metadata_row_index`. In that scenario, the column is populated with generated row indexes rather than with the data read from the file.
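The shadowing can be illustrated with a small, self-contained Python model. This is only a sketch and not Spark's Parquet reader: `toy_read` and the dict-based rows are hypothetical stand-ins, and only the column name `_tmp_metadata_row_index` is taken from the description above.

```python
# Toy model of the behavior described above; this is not Spark code.
ROW_INDEX_TMP_COL = "_tmp_metadata_row_index"  # name taken from the issue text


def toy_read(file_rows, requested_columns):
    """Hypothetical reader: returns one dict per row for the requested columns.

    If the reserved name is requested, it is filled with the row's position
    in the file, whether or not the file stores a column with that name.
    """
    result = []
    for index, row in enumerate(file_rows):
        out = {}
        for name in requested_columns:
            if name == ROW_INDEX_TMP_COL:
                out[name] = index  # generated value shadows the stored data
            else:
                out[name] = row[name]
        result.append(out)
    return result


# A file whose user data happens to contain the reserved column name.
file_rows = [
    {"id": 10, ROW_INDEX_TMP_COL: 500},
    {"id": 20, ROW_INDEX_TMP_COL: 600},
]

rows = toy_read(file_rows, ["id", ROW_INDEX_TMP_COL])
print(rows)
```

In this model the stored values 500 and 600 never reach the caller; the reserved column comes back as the positions 0 and 1 instead.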

For a repro, search `FileMetadataStructRowIndexSuite.scala` for this Jira ticket number (SPARK-40059).

We could introduce some kind of countermeasure to handle this scenario.
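The ticket does not say which countermeasure is intended. As one possible option, sketched in plain Python rather than Spark, a validation step could reject a data schema that reuses the reserved name. `check_data_schema` and `ReservedColumnError` are hypothetical names introduced here for illustration only.

```python
# One possible countermeasure, sketched outside Spark; not the project's fix.
RESERVED_COLUMNS = frozenset({"_tmp_metadata_row_index"})  # name from the issue text


class ReservedColumnError(ValueError):
    """Raised when user data uses a name reserved for internal use."""


def check_data_schema(column_names):
    """Hypothetical validation step: reject schemas that reuse a reserved name."""
    clashes = sorted(set(column_names) & RESERVED_COLUMNS)
    if clashes:
        raise ReservedColumnError(
            "column name(s) reserved for row index generation: " + ", ".join(clashes)
        )
    return list(column_names)


check_data_schema(["id", "value"])  # passes, no reserved names

try:
    check_data_schema(["id", "_tmp_metadata_row_index"])
except ReservedColumnError as err:
    print("rejected:", err)
```

Failing fast turns silent data replacement into an explicit error, at the cost of refusing files that legitimately use the name.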

Issue Links

is related to

SPARK-37980 Extend METADATA column to support row indices for file based data sources

Resolved

People

Assignee: Unassigned

Reporter: Ala Luszczak

Votes: 0

Watchers: 1

Dates

Created: 12/Aug/22 15:11

Updated: 12/Aug/22 15:11