Details
- Type: New Feature
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
The current interface for selective reading is the filters argument: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html
This approach works well when the filters are simple (e.g. field in (v1, v2, v3, …)) and when the number of columns is small. It does not work well under the following conditions, which currently require reading the complete set into (Python) memory:
- when the condition is complex (e.g. a condition between attributes: field1 + field2 > field3)
- when the file has many columns (making it costly to create Python structures).
I have a repository with a large number of Parquet files (thousands of files, 500 MB each, 200 columns) from which specific records must be selected quickly based on a logical condition that cannot be expressed as a filter. Only a very small number of rows (<500) needs to be returned.
The proposed feature is to extend read_row_group to support passing an array of rows to read (a list of integers in ascending order):

    pq = pyarrow.parquet.ParquetFile(…)
    dd = pq.read_row_group(…, rows=[5, 35, …])
Using this method would enable complex filtering in two passes, eliminating the need to read all rows into memory:
- First pass: read only the attributes needed for filtering and collect the row numbers that match the (complex) condition.
- Second pass: build a Python table with the matching rows using the proposed rows= parameter of read_row_group.
I believe it is possible to achieve something similar using the C++ StreamReader (https://github.com/apache/arrow/blob/master/cpp/src/parquet/stream_reader.cc), which is not exposed to Python.
Attachments
Issue Links
- Dependent: ARROW-13518 Identify selected row when using filters (Open)