Details
- Type: Improvement
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.4.0
- Fix Version/s: None
- Labels: None
Description
Following SPARK-37020, we can support limit push-down to the Parquet DataSource V2 reader. With the limit pushed into the scan, the reader can stop scanning Parquet files early, reducing network and disk I/O.
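A data source opts into this by implementing the `SupportsPushDownLimit` mix-in on its scan builder (`org.apache.spark.sql.connector.read.SupportsPushDownLimit`, introduced by SPARK-37020). Below is a minimal, self-contained sketch of that contract; the interface signature mirrors the real one, but `MockParquetScanBuilder` is a hypothetical stand-in, not Spark's actual Parquet implementation.

```java
// Simplified stand-in for Spark's SupportsPushDownLimit contract.
// The interface mirrors org.apache.spark.sql.connector.read.SupportsPushDownLimit;
// the scan builder below is a hypothetical mock for illustration only.
interface SupportsPushDownLimit {
    // Returns true if the data source guarantees it will produce at most
    // `limit` rows, so Spark may rely on the pushed limit.
    boolean pushLimit(int limit);
}

class MockParquetScanBuilder implements SupportsPushDownLimit {
    int pushedLimit = -1; // -1 means no limit has been pushed

    @Override
    public boolean pushLimit(int limit) {
        // A Parquet reader can honor the limit by stopping after `limit` rows,
        // skipping the remaining row groups and files entirely.
        this.pushedLimit = limit;
        return true;
    }
}

public class LimitPushDownDemo {
    public static void main(String[] args) {
        MockParquetScanBuilder builder = new MockParquetScanBuilder();
        boolean accepted = builder.pushLimit(10);
        System.out.println(accepted + " " + builder.pushedLimit); // prints "true 10"
    }
}
```

When `pushLimit` returns true, the optimizer can still keep a `CollectLimit` above the scan for correctness across partitions, but each scan task stops producing rows once the limit is reached.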
Current query plan for a limit over Parquet (note that no limit is pushed into the scan):
== Parsed Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down

== Analyzed Logical Plan ==
a: int, b: int
GlobalLimit 10
+- LocalLimit 10
   +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down

== Physical Plan ==
CollectLimit 10
+- *(1) ColumnarToRow
   +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int>, PushedFilters: [], PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: []
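The IO saving comes from terminating the file scan once the limit is satisfied. The sketch below illustrates the idea in isolation (not Spark code): row-group sizes and names are hypothetical, and "opening a row group" stands in for the real decode/IO work that a pushed limit avoids.

```java
// Illustrative sketch (not Spark's implementation): a pushed limit lets the
// scan stop before opening later row groups. Sizes here are hypothetical.
public class EarlyStopScan {
    // Each "row group" holds some number of rows, as in a Parquet file.
    static int[] rowGroupSizes = {100, 100, 100, 100};

    // Reads row groups until `limit` rows are collected, then stops.
    // Returns how many row groups had to be opened.
    static int scanWithLimit(int limit) {
        int rowsRead = 0;
        int groupsOpened = 0;
        for (int size : rowGroupSizes) {
            if (rowsRead >= limit) break; // limit reached: skip remaining groups
            groupsOpened++;
            rowsRead += Math.min(size, limit - rowsRead);
        }
        return groupsOpened;
    }

    public static void main(String[] args) {
        // With LIMIT 10, only the first row group is opened instead of all four.
        System.out.println(scanWithLimit(10)); // prints 1
    }
}
```

Without the push-down, the scan in the plan above reads every row group and relies on `CollectLimit` to discard the excess rows afterwards.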