[PARQUET-2117] Add rowPosition API in parquet record readers - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.12.3
Component/s: parquet-mr
Labels: None

Description

Currently, the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a Parquet file either in columnar fashion or record by record.

It would be useful to extend them with a rowPosition API that reports the position of the current record within the Parquet file.

The rowPosition can serve as a unique identifier for a row within a file. This is useful for building an index (e.g. a B+ tree) over a Parquet file or a Parquet table (e.g. in Spark/Hive).
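To make the idea concrete, here is a minimal sketch of how a file-level row position could be derived, assuming a file is a sequence of row groups with known row counts. It is an illustration only, not the parquet-mr API: the class `RowPositionSketch` and its methods are hypothetical names. The position of a record is the number of rows in all preceding row groups plus its index within its own row group, so positions stay stable even when a reader skips some row groups.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of row positions; not part of parquet-mr. */
final class RowPositionSketch {

    /** Offset of the first row of each row group: the running sum of preceding row counts. */
    static long[] rowGroupOffsets(long[] rowCounts) {
        long[] offsets = new long[rowCounts.length];
        long running = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            offsets[i] = running;
            running += rowCounts[i];
        }
        return offsets;
    }

    /** File-level positions of the rows produced when only the selected row groups are read. */
    static List<Long> positionsRead(long[] rowCounts, int[] selectedGroups) {
        // Offsets are computed over ALL row groups, including the ones that are skipped,
        // so a skipped group still advances the positions of the groups after it.
        long[] offsets = rowGroupOffsets(rowCounts);
        List<Long> positions = new ArrayList<>();
        for (int g : selectedGroups) {
            for (long i = 0; i < rowCounts[g]; i++) {
                positions.add(offsets[g] + i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        long[] rowCounts = {3, 2, 4};
        // Read row groups 0 and 2, skipping row group 1.
        System.out.println(positionsRead(rowCounts, new int[] {0, 2}));
        // prints [0, 1, 2, 5, 6, 7, 8]
    }
}
```

Because each position depends only on the file layout, a (file, rowPosition) pair could act as the key of an external index in this model, regardless of which row groups a particular scan touches.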

Several projects in the Parquet ecosystem can benefit from this functionality:

Apache Iceberg needs this functionality. It already has its own implementation because it relies on low-level Parquet APIs - Link1, Link2
Apache Spark can use this functionality - ~~SPARK-37980~~

Issue Links

is depended upon by

PARQUET-2145 Release 1.12.3 (Resolved)

is related to

PARQUET-2161 Row positions are computed incorrectly when range or offset metadata filter is used (Resolved)

SPARK-37980 Extend METADATA column to support row indices for file based data sources (Resolved)

People

Assignee: Unassigned

Reporter: Prakhar Jain

Votes: 0

Watchers: 8

Dates

Created: 01/Feb/22 19:05

Updated: 23/Jun/24 03:32

Resolved: 08/Jun/22 18:16