SPARK-33207

Reduce the number of tasks launched after bucket pruning


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      We only need to read one bucket, but Spark still launches 200 tasks.

      create table test_bucket using parquet clustered by (ID) sorted by (ID) into 200 buckets AS (SELECT id FROM range(1000) cluster by id)
      spark-sql> explain select * from test_bucket where id = 4;
      == Physical Plan ==
      *(1) Project [id#7L]
      +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
         +- *(1) ColumnarToRow
            +- FileScan parquet default.test_bucket[id#7L] Batched: true, DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 1 out of 200
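
      The plan above already prunes to a single bucket (SelectedBucketsCount: 1 out of 200), yet one task is still scheduled per bucket. To observe the task count, assuming a running SparkSession named spark and the table above:

      val df = spark.sql("SELECT * FROM test_bucket WHERE id = 4")
      // One task per scan partition: 200 here, despite the pruning.
      println(df.queryExecution.toRdd.getNumPartitions)

      The idea behind the improvement is to stop creating a scan partition for every bucket id and instead create partitions only for the buckets that survive pruning. A minimal Scala sketch, with FileSlice, ScanPartition, and planBucketedScan as hypothetical stand-ins for Spark's internal PartitionedFile/FilePartition machinery (not the actual patch):

      case class FileSlice(path: String)                          // stand-in for PartitionedFile
      case class ScanPartition(index: Int, files: Seq[FileSlice]) // stand-in for FilePartition

      def planBucketedScan(
          numBuckets: Int,
          filesByBucket: Map[Int, Seq[FileSlice]],
          selectedBuckets: Option[Set[Int]]): Seq[ScanPartition] = {
        selectedBuckets match {
          // Pruning information available: one partition (and hence one task)
          // per surviving bucket, i.e. a single task for the query above.
          case Some(selected) =>
            selected.toSeq.sorted.zipWithIndex.map { case (bucketId, i) =>
              ScanPartition(i, filesByBucket.getOrElse(bucketId, Nil))
            }
          // No pruning information: fall back to one partition per bucket id,
          // 200 for this table.
          case None =>
            (0 until numBuckets).map(b => ScanPartition(b, filesByBucket.getOrElse(b, Nil)))
        }
      }

      With pruning applied, planBucketedScan returns a single partition for the example above, so only one task launches.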
      

      Attachments

        1. image-2020-10-22-15-17-01-389.png (58 kB, Yuming Wang)
        2. image-2020-10-22-15-17-26-956.png (163 kB, Yuming Wang)
        3. Screen Shot 2021-02-05 at 11.44.12 AM.png (83 kB, Cheng Su)


            People

              Assignee: Unassigned
              Reporter: Yuming Wang (yumwang)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: