
[SPARK-37933] Limit push down for parquet datasource v2


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Based on SPARK-37020, we can support limit push-down in the Parquet datasource v2 reader. This lets the reader stop scanning Parquet files early, reducing network and disk I/O.
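
      SPARK-37020 added the SupportsPushDownLimit mix-in for DSv2 ScanBuilders; this ticket would wire it into the Parquet reader. A minimal sketch of the idea (MyParquetScanBuilder and its pushedLimit field are illustrative, not the actual Spark implementation):

      import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownLimit}

      class MyParquetScanBuilder extends ScanBuilder with SupportsPushDownLimit {
        // Remember the limit that the optimizer pushed down.
        private var pushedLimit: Option[Int] = None

        // Returning true tells Spark the source accepted the limit, so the
        // reader may stop after emitting `limit` rows per partition.
        override def pushLimit(limit: Int): Boolean = {
          pushedLimit = Some(limit)
          true
        }

        // Build a Scan whose partition readers stop early once pushedLimit
        // rows have been produced (left unimplemented in this sketch).
        override def build(): Scan = ???
      }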

      Current query plans for a limit on Parquet (the limit is not pushed into the scan):

      == Parsed Logical Plan ==
      GlobalLimit 10
      +- LocalLimit 10
         +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
      
      == Analyzed Logical Plan ==
      a: int, b: int
      GlobalLimit 10
      +- LocalLimit 10
         +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
      
      == Optimized Logical Plan ==
      GlobalLimit 10
      +- LocalLimit 10
         +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
      
      == Physical Plan ==
      CollectLimit 10
      +- *(1) ColumnarToRow
         +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int>, RuntimeFilters: []
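
      For reference, a plan like the one above can be produced as follows, assuming a SparkSession named spark and the example path from the plan:

      val df = spark.read.parquet("file:/datasources.db/test_push_down").limit(10)
      df.explain(true) // prints the parsed, analyzed, optimized and physical plans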


          People

            Jackey Lee Jackey Lee
            Jackey Lee Jackey Lee
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue
