[ARROW-13518] Identify selected row when using filters - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: C++, Parquet, Python
Labels: None

External issue URL:
https://github.com/apache/arrow/issues/18768

Description

I filed a proposed enhancement to speed up reading of specific rows: ARROW-13517 (https://issues.apache.org/jira/browse/ARROW-13517).

This issue proposes extending the functions that accept filters, such as parquet.read_table (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table), so that they can return the actual row positions of the selected rows (e.g., row_group and row_index).

Together with the enhancement proposed in ARROW-13517, this would allow faster reads: for example, a caller could cache the returned indices and read the full rows only when they are needed.

The proposed implementation is to add two pseudo-columns that can be requested in the columns list, e.g. columns=['$row_group', '$row_index', 'dealid', …] or similar.

$row_group - 0-based row group index
$row_index - 0-based position within the row group
$row_file_index - 0-based position in the file (not critical; it can be constructed from the other two)
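The relationship between the three values can be sketched in a few lines of plain Python. The helper names below are hypothetical, and `row_group_sizes` is assumed to be the list of row counts per row group (in pyarrow these are available from the file metadata, e.g. `ParquetFile.metadata.row_group(i).num_rows`).

```python
from bisect import bisect_right
from itertools import accumulate


def group_offsets(row_group_sizes):
    """File position of the first row of each row group."""
    return [0, *accumulate(row_group_sizes)][:-1]


def to_file_index(row_group, row_index, row_group_sizes):
    """($row_group, $row_index) -> $row_file_index."""
    return group_offsets(row_group_sizes)[row_group] + row_index


def from_file_index(file_index, row_group_sizes):
    """$row_file_index -> ($row_group, $row_index)."""
    offsets = group_offsets(row_group_sizes)
    # Last row group whose first row is at or before file_index.
    row_group = bisect_right(offsets, file_index) - 1
    return row_group, file_index - offsets[row_group]
```

For a file whose row groups hold 2, 2 and 1 rows, `to_file_index(2, 0, [2, 2, 1])` gives 4 and `from_file_index(3, [2, 2, 1])` gives `(1, 1)`. Neither helper validates that the index is in range.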

I am not sure whether this requires a change to the C++ interface or only to the Python part of pyarrow.

Issue Links

Dependent: ARROW-13517 Selective reading of rows for parquet file (Open)

People

Assignee: Unassigned

Reporter: Yair Lenga

Votes: 0

Watchers: 3

Dates

Created: 01/Aug/21 12:21

Updated: 11/Jan/23 08:33