[SPARK-17310] Disable Parquet's record-by-record filter in normal parquet reader and do it in Spark-side - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

Currently, we push filters down to the normal Parquet reader, which then also applies them record by record.

It seems that Spark-side codegen row-by-row filtering might be faster than Parquet's in general, because Parquet's filter incurs type boxing and virtual function calls, both of which Spark's generated code tries to avoid.
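The cost difference described above can be illustrated with a minimal, self-contained Java sketch. This is not Parquet's or Spark's actual code; the class and method names (`RecordFilterSketch`, `countBoxed`, `countPrimitive`) are hypothetical and exist only for illustration. The first method evaluates a generic predicate per record, which boxes each value and dispatches through an interface, while the second resembles a specialized loop that compares primitives directly.

```java
import java.util.function.Predicate;

public class RecordFilterSketch {
    // Generic path (hypothetical): every value is boxed from int to Integer
    // and the predicate is invoked through an interface method per record.
    public static int countBoxed(int[] values, Predicate<Integer> predicate) {
        int count = 0;
        for (int v : values) {
            if (predicate.test(v)) { // autoboxing plus interface dispatch
                count++;
            }
        }
        return count;
    }

    // Specialized path (hypothetical): the comparison is written directly
    // against primitive ints, with no boxing and no interface call.
    public static int countPrimitive(int[] values, int threshold) {
        int count = 0;
        for (int v : values) {
            if (v > threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] values = new int[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        int threshold = 500_000;
        Predicate<Integer> greaterThan = x -> x > threshold;

        // Both paths compute the same result; only the per-record cost differs.
        System.out.println(countBoxed(values, greaterThan));
        System.out.println(countPrimitive(values, threshold));
    }
}
```

Both methods return the same count. Whether the specialized path is actually faster in a given setup depends on the JIT and the workload, which is why a benchmark is still needed.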

Maybe we should run a benchmark and, if it confirms this, disable Parquet's record-by-record filter. This ticket originated from https://github.com/apache/spark/pull/14671

Please refer to the discussion in that PR.

Issue Links

links to

[Github] Pull Request #15049 (HyukjinKwon)

People

Assignee: Hyukjin Kwon

Reporter: Hyukjin Kwon

Votes: 0

Watchers: 3

Dates

Created: 30/Aug/16 10:00

Updated: 12/Dec/22 18:11

Resolved: 14/Nov/17 11:34