Parquet / PARQUET-128

Optimize the Parquet RecordReader implementation when: A. a filter predicate is pushed down, B. a filter predicate is pushed down on a flat schema


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: 1.6.0
    • Fix Version: 1.6.0
    • Component: parquet-mr
    • Labels: None

    Description

      The current RecordReader implementation reads all of the columns before applying the filter predicate and deciding whether to keep or discard the row.
      Instead, we can have a RecordReader that first assembles only the columns on which filters are applied (usually a few), applies the filter, and decides whether to keep the row; it then either assembles or skips the remaining columns accordingly.

      Also, for applications like Spark SQL, the schema is usually flat, with no repeated or nested columns. In such cases, it is better to have a lightweight, faster RecordReader.

      The performance improvement from this change is significant, and is greatest when filtering returns a small number of rows (the usual case) and the schema has many columns.
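      The two-phase assembly described above can be sketched with simple in-memory columns. All names below are illustrative stand-ins, not the parquet-mr API: a real implementation would operate on column chunks and skip remaining-column decoding per row group or per record.

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical sketch of a two-phase record reader: assemble only the
// filter column first, evaluate the predicate, and skip assembly of the
// remaining columns for rows that fail the filter.
public class FilteringRecordReaderSketch {
    // In-memory stand-in for column chunks: column name -> values per row.
    static final Map<String, int[]> COLUMNS = Map.of(
        "a", new int[]{1, 5, 2, 9},
        "b", new int[]{10, 20, 30, 40},
        "c", new int[]{100, 200, 300, 400});

    static List<Map<String, Integer>> read(String filterColumn,
                                           Predicate<Integer> predicate,
                                           List<String> otherColumns) {
        List<Map<String, Integer>> out = new ArrayList<>();
        int rowCount = COLUMNS.get(filterColumn).length;
        for (int row = 0; row < rowCount; row++) {
            // Phase 1: assemble only the predicate column.
            int v = COLUMNS.get(filterColumn)[row];
            if (!predicate.test(v)) {
                continue; // Phase 2 skipped: remaining columns never assembled.
            }
            // Phase 2: assemble the remaining columns for surviving rows only.
            Map<String, Integer> record = new LinkedHashMap<>();
            record.put(filterColumn, v);
            for (String col : otherColumns) {
                record.put(col, COLUMNS.get(col)[row]);
            }
            out.add(record);
        }
        return out;
    }

    public static void main(String[] args) {
        // Keep rows where a > 2; columns b and c are assembled only for those rows.
        List<Map<String, Integer>> rows =
            read("a", v -> v > 2, List.of("b", "c"));
        System.out.println(rows);
    }
}
```

      The win comes from the `continue` in phase 1: for discarded rows, the cost of decoding and assembling the non-filter columns is avoided entirely, which is why the gain grows with the number of columns and the selectivity of the filter.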


          People

            Assignee: Unassigned
            Reporter: Yash Datta (saucam)
            Votes: 1
            Watchers: 4
