Description
Spark recently added hidden metadata column support for file-based data sources as part of SPARK-37273.
We should extend it to also support ROW_INDEX/ROW_POSITION.
Meaning of ROW_POSITION:
ROW_INDEX/ROW_POSITION is the index of a row within a file. E.g. the 5th row in a file will have ROW_INDEX 5.
Use cases:
Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple uniquely identifies a row in a table. This information can be used to mark individual rows, e.g. by an indexer.
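
A minimal sketch of the idea in plain Python (not Spark code), added for illustration. The file names, the rows, and the helper `rows_with_index` are hypothetical, and starting the index at 0 is an assumption of this sketch. It shows that the index restarts in each file, so only the (fileName, rowIndex) pair is unique across a table, and that a set of such pairs is enough to mark rows.

```python
from typing import Dict, Iterator, List, Set, Tuple

# (file_name, row_index): the pair that identifies one row of the table.
RowKey = Tuple[str, int]


def rows_with_index(files: Dict[str, List[str]]) -> Iterator[Tuple[RowKey, str]]:
    """Yield ((file_name, row_index), row) for every row of every file."""
    for file_name, rows in files.items():
        # The index restarts for each file. Starting at 0 is an assumption
        # of this sketch, not something the description specifies.
        for row_index, row in enumerate(rows):
            yield (file_name, row_index), row


# Hypothetical table stored as two files.
table = {
    "part-0000.parquet": ["alice", "bob", "carol"],
    "part-0001.parquet": ["dave", "erin"],
}

keyed = list(rows_with_index(table))
keys = [key for key, _ in keyed]

# Mark two rows by key, as an indexer or a delete marker might do.
marked: Set[RowKey] = {("part-0000.parquet", 1), ("part-0001.parquet", 0)}

# Rows that are not marked.
live_rows = [row for key, row in keyed if key not in marked]
```

A row index on its own is ambiguous here: index 0 occurs once per file, which is why the file name has to be part of the key.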
Issue Links
- causes
  - SPARK-39634 Allow file splitting in combination with row index generation (Resolved)
- is blocked by
  - SPARK-39806 Queries accessing METADATA struct crash on partitioned tables (Resolved)
- relates to
  - PARQUET-2117 Add rowPosition API in parquet record readers (Resolved)
  - SPARK-40059 Row indexes can overshadow user-created data (Open)