Parquet / PARQUET-2117

Add rowPosition API in parquet record readers

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12.3
    • Component/s: parquet-mr
    • Labels: None

    Description

      Currently, the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a Parquet file in columnar fashion or record by record.

      It would be useful to extend them with a rowPosition API that reports the position of the current record in the Parquet file.
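
      As a rough illustration of what "position of the current record" could mean, the sketch below derives a file-level row position from per-row-group row counts. It is not the parquet-mr implementation: the class RowPositionTracker, its methods, and the boolean "selected" array (standing in for row groups skipped by a filter) are all hypothetical names introduced here. The point it tries to show is that the position is an offset into the whole file, so it differs from a simple count of records returned once any row group is skipped.

```java
// Hypothetical sketch, not part of parquet-mr: tracks the file-level
// position of the record most recently returned by a reader.
final class RowPositionTracker {
    private final long[] rowCounts;    // number of rows in each row group, in file order
    private final boolean[] selected;  // false = row group is skipped (e.g. by a filter)
    private int rowGroup = 0;          // index of the row group being read
    private long readInGroup = 0;      // records already returned from that row group
    private long groupStart = 0;       // file-level position of that row group's first row
    private long current = -1;         // position of the last returned record

    RowPositionTracker(long[] rowCounts, boolean[] selected) {
        if (rowCounts.length != selected.length) {
            throw new IllegalArgumentException("one flag per row group is required");
        }
        this.rowCounts = rowCounts.clone();
        this.selected = selected.clone();
    }

    /** Moves to the next record; returns false once the file is exhausted. */
    boolean advance() {
        // Step over skipped, empty, or fully consumed row groups,
        // still counting their rows toward the file-level offset.
        while (rowGroup < rowCounts.length
                && (!selected[rowGroup] || readInGroup == rowCounts[rowGroup])) {
            groupStart += rowCounts[rowGroup];
            rowGroup++;
            readInGroup = 0;
        }
        if (rowGroup == rowCounts.length) {
            return false;
        }
        current = groupStart + readInGroup;
        readInGroup++;
        return true;
    }

    /** Position in the file of the record returned by the last successful advance(). */
    long rowPosition() {
        if (current < 0) {
            throw new IllegalStateException("no record has been read yet");
        }
        return current;
    }
}
```

      With row groups of 3, 2, and 4 rows and the middle group skipped, successive calls to advance() would report positions 0, 1, 2 and then 5, 6, 7, 8.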

      The rowPosition can serve as a unique row identifier within a file. This is useful for building an index (e.g. a B+ tree) over a Parquet file or Parquet table (e.g. in Spark/Hive).
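
      To make the indexing use case concrete, here is a minimal sketch of a secondary index that maps a column value to the row positions where it occurs. It swaps in java.util.TreeMap (a red-black tree) for the B+ tree mentioned above, purely to stay within the standard library. The class RowPositionIndex and its methods are hypothetical, and the String key type is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: an ordered index from column value to row positions.
final class RowPositionIndex {
    private final NavigableMap<String, List<Long>> index = new TreeMap<>();

    /** Records that the row at rowPosition holds the given key. */
    void add(String key, long rowPosition) {
        index.computeIfAbsent(key, k -> new ArrayList<>()).add(rowPosition);
    }

    /** Row positions for an exact key, in insertion order; empty if the key is absent. */
    List<Long> lookup(String key) {
        return index.getOrDefault(key, Collections.emptyList());
    }

    /** Sorted row positions for all keys in [from, to]; requires from <= to. */
    List<Long> range(String from, String to) {
        List<Long> out = new ArrayList<>();
        for (List<Long> positions : index.subMap(from, true, to, true).values()) {
            out.addAll(positions);
        }
        Collections.sort(out);
        return out;
    }
}
```

      A reader that exposes row positions could populate such an index in a single pass, calling add(value, position) for each record; later lookups would then return positions that identify rows without rescanning the file.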

      Several projects in the Parquet ecosystem can benefit from this functionality:

      1. Apache Iceberg needs this functionality. It already has its own implementation because it relies on low-level Parquet APIs - Link1, Link2
      2. Apache Spark can use this functionality - SPARK-37980

            People

              Assignee: Unassigned
              Reporter: Prakhar Jain (prakharjain09)
              Votes: 0
              Watchers: 7
