  1. Apache Arrow
  2. ARROW-13518

Identify selected row when using filters

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++, Parquet, Python
    • Labels: None

    Description

      I created a proposed enhancement to speed up reading of specific rows: ARROW-13517 (https://issues.apache.org/jira/browse/ARROW-13517).

      This issue proposes extending the functions that support filtering, such as parquet.read_table (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table), so that they can return the actual row numbers of the selected rows (e.g., row_group and row_index).

      With the proposed enhancement, data can be read faster (e.g., by caching the returned indices and reading the full data only when needed).

      The proposed implementation is to add 2 pseudo columns that can be requested in the columns list, e.g., columns=['$row_group', '$row_index', 'dealid', …] or similar.

      • $row_group - 0-based row group index
      • $row_index - 0-based position within the row group
      • $row_file_index - 0-based position in the file (not critical; it can be constructed from the other two)
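
      The relationship between the three values above can be sketched in plain Python. This is only an illustration: the helper names row_file_index and locate are hypothetical, and the list of row group sizes is assumed to be known already (in pyarrow it could be collected from the file metadata, e.g. ParquetFile(path).metadata.row_group(i).num_rows).

```python
def row_file_index(row_group_sizes, row_group, row_index):
    """Return the 0-based file position for a 0-based (row_group, row_index) pair.

    row_group_sizes is a list holding the number of rows in each row group,
    in file order. All names here are hypothetical, for illustration only.
    """
    if not 0 <= row_group < len(row_group_sizes):
        raise IndexError("row_group out of range")
    if not 0 <= row_index < row_group_sizes[row_group]:
        raise IndexError("row_index out of range")
    # Every row stored in an earlier row group precedes this row in the file.
    return sum(row_group_sizes[:row_group]) + row_index


def locate(row_group_sizes, file_index):
    """Inverse mapping: 0-based file position -> (row_group, row_index)."""
    if file_index < 0:
        raise IndexError("file_index out of range")
    remaining = file_index
    for group, size in enumerate(row_group_sizes):
        if remaining < size:
            return group, remaining
        # Skip past this row group and keep looking.
        remaining -= size
    raise IndexError("file_index out of range")


if __name__ == "__main__":
    sizes = [3, 3, 2]  # example file with three row groups
    print(row_file_index(sizes, 1, 2))
    print(locate(sizes, 6))
```

      Because the mapping only needs the row group sizes, $row_file_index does not have to be stored or returned separately, which matches the "not critical" note above.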

       

      I am not sure whether this requires a change to the C++ interface, or only to the Python part of pyarrow.

       

            People

              Assignee: Unassigned
              Reporter: Yair Lenga
              Votes: 0
              Watchers: 3
