Spark / SPARK-33207

Reduce the number of tasks launched after bucket pruning


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      We only need to read 1 bucket, but the scan still launches 200 tasks: the bucketed read creates one task per bucket, so 199 of the 200 tasks read no data.

      create table test_bucket using parquet clustered by (ID) sorted by (ID) into 200 buckets AS (SELECT id FROM range(1000) cluster by id)
      spark-sql> explain select * from test_bucket where id = 4;
      == Physical Plan ==
      *(1) Project [id#7L]
      +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
         +- *(1) ColumnarToRow
            +- FileScan parquet default.test_bucket[id#7L] Batched: true, DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 1 out of 200
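      For reference, the same setup can be reproduced through the DataFrame API. This is a minimal sketch, assuming a running SparkSession named spark (e.g. in spark-shell); the table and column names follow the SQL above.

      // Minimal sketch, assuming a SparkSession named `spark` is available.
      // Writes the same 200-bucket parquet table as the SQL above.
      import org.apache.spark.sql.SaveMode

      spark.range(1000)
        .write
        .mode(SaveMode.Overwrite)
        .format("parquet")
        .bucketBy(200, "id")
        .sortBy("id")
        .saveAsTable("test_bucket")

      val q = spark.table("test_bucket").where("id = 4")
      q.explain() // plan reports SelectedBucketsCount: 1 out of 200
      // Before the fix, q.collect() still schedules 200 tasks (one per
      // bucket), even though only one bucket can contain id = 4.
      q.collect()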
      

      Attachments

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Yuming Wang (yumwang)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
