Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Currently, the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a Parquet file either in columnar fashion or record by record.
It would be great to extend them to also support a rowPosition API that reports the position of the current record within the Parquet file.
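To make the idea of a row position concrete, here is a minimal sketch of how a file-level position could be derived from row group sizes. The class `RowPositions` and its methods are hypothetical illustrations, not part of parquet-mr, and the sketch assumes that the row count of every row group in the file is known from the footer.

```java
import java.util.Arrays;

/** Hypothetical helper (not a parquet-mr class): derives file-level row positions. */
final class RowPositions {
    private RowPositions() {}

    /** Returns, for each row group, the file-level position of its first row. */
    static long[] rowGroupOffsets(long[] rowCounts) {
        long[] offsets = new long[rowCounts.length];
        long running = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            offsets[i] = running;      // rows in all earlier row groups
            running += rowCounts[i];
        }
        return offsets;
    }

    /** File-level position of the row at indexInGroup inside the given row group. */
    static long rowPosition(long[] rowCounts, int group, long indexInGroup) {
        if (indexInGroup < 0 || indexInGroup >= rowCounts[group]) {
            throw new IndexOutOfBoundsException(
                "row " + indexInGroup + " is not in row group " + group);
        }
        return rowGroupOffsets(rowCounts)[group] + indexInGroup;
    }

    public static void main(String[] args) {
        long[] rowCounts = {3, 2, 4};  // assumed row counts of three row groups
        System.out.println(Arrays.toString(rowGroupOffsets(rowCounts))); // [0, 3, 5]
        System.out.println(rowPosition(rowCounts, 2, 1));                // 6
    }
}
```

In this sketch the offsets are accumulated over every row group in the file, including row groups a reader might skip. Otherwise the positions would shift whenever a filter drops a row group.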
The rowPosition can serve as a unique row identifier within a file. This is useful for building an index (e.g. a B+ tree) over a Parquet file or a Parquet table (e.g. in Spark or Hive).
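As a rough illustration of the indexing use case, the sketch below maps a column value to the row positions where it occurs. It uses `java.util.TreeMap` (a red-black tree) as a stand-in for a B+ tree, and the class `RowPositionIndex` is hypothetical. It assumes the keys are supplied in file order, so that the list index of each key equals its row position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sorted index from a column value to the row positions holding it. */
final class RowPositionIndex {
    private final TreeMap<String, List<Long>> index = new TreeMap<>();

    /** Builds the index from keys listed in file order (position = list index). */
    static RowPositionIndex build(List<String> keysInFileOrder) {
        RowPositionIndex result = new RowPositionIndex();
        for (int i = 0; i < keysInFileOrder.size(); i++) {
            result.add(keysInFileOrder.get(i), i);
        }
        return result;
    }

    void add(String key, long rowPosition) {
        index.computeIfAbsent(key, k -> new ArrayList<>()).add(rowPosition);
    }

    /** Row positions whose key equals the given key, in file order. */
    List<Long> lookup(String key) {
        return index.getOrDefault(key, List.of());
    }

    /** Row positions whose key lies in [fromInclusive, toExclusive), ordered by key. */
    List<Long> range(String fromInclusive, String toExclusive) {
        List<Long> positions = new ArrayList<>();
        for (List<Long> p : index.subMap(fromInclusive, toExclusive).values()) {
            positions.addAll(p);
        }
        return positions;
    }

    public static void main(String[] args) {
        RowPositionIndex idx = build(List.of("d", "a", "c", "a"));
        System.out.println(idx.lookup("a"));     // [1, 3]
        System.out.println(idx.range("a", "d")); // [1, 3, 2]
    }
}
```

A reader that supports seeking by row position could then fetch only the matching rows instead of scanning the whole file.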
Multiple projects in the Parquet ecosystem can benefit from such functionality:
- Apache Iceberg needs this functionality. It already has its own implementation, built on the low-level Parquet APIs (Link1, Link2).
- Apache Spark can use this functionality: SPARK-37980
Issue Links
- is depended upon by
  - PARQUET-2145 Release 1.12.3 (Resolved)
- is related to
  - PARQUET-2161 Row positions are computed incorrectly when range or offset metadata filter is used (Resolved)
  - SPARK-37980 Extend METADATA column to support row indices for file based data sources (Resolved)