
[SPARK-37933] Limit push down for parquet datasource v2


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Based on SPARK-37020, we can support limit push-down in the Parquet datasource v2 reader. This lets the reader stop scanning Parquet files early, reducing network and disk I/O.
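
      SPARK-37020 added the SupportsPushDownLimit mix-in for DSv2 ScanBuilders; this ticket would wire it into the Parquet reader. A minimal sketch of the idea (MyParquetScanBuilder and its pushedLimit field are illustrative, not the actual Spark implementation):

      import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownLimit}

      class MyParquetScanBuilder extends ScanBuilder with SupportsPushDownLimit {
        // Remember the limit that the optimizer pushed down.
        private var pushedLimit: Option[Int] = None

        // Returning true tells Spark the source accepted the limit, so the
        // reader may stop after emitting `limit` rows per partition.
        override def pushLimit(limit: Int): Boolean = {
          pushedLimit = Some(limit)
          true
        }

        // Build a Scan whose partition readers stop early once pushedLimit
        // rows have been produced (left unimplemented in this sketch).
        override def build(): Scan = ???
      }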

      Current query plans for a limit on Parquet (the limit is not pushed into the scan):

      == Parsed Logical Plan ==
      GlobalLimit 10
      +- LocalLimit 10
         +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
      
      == Analyzed Logical Plan ==
      a: int, b: int
      GlobalLimit 10
      +- LocalLimit 10
         +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
      
      == Optimized Logical Plan ==
      GlobalLimit 10
      +- LocalLimit 10
         +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down
      
      == Physical Plan ==
      CollectLimit 10
      +- *(1) ColumnarToRow
         +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int>, RuntimeFilters: []
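
      For reference, a plan like the one above can be produced as follows, assuming a SparkSession named spark and the example path from the plan:

      val df = spark.read.parquet("file:/datasources.db/test_push_down").limit(10)
      df.explain(true) // prints the parsed, analyzed, optimized and physical plans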


          People

            Jackey Lee Jackey Lee
            Jackey Lee Jackey Lee
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue
