Spark / SPARK-33207

Reduce the number of tasks launched after bucket pruning


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      We only need to read 1 bucket, but the scan still launches 200 tasks: the bucketed read creates one task per bucket, so 199 of the 200 tasks read no data.

      create table test_bucket using parquet clustered by (ID) sorted by (ID) into 200 buckets AS (SELECT id FROM range(1000) cluster by id)
      spark-sql> explain select * from test_bucket where id = 4;
      == Physical Plan ==
      *(1) Project [id#7L]
      +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
         +- *(1) ColumnarToRow
            +- FileScan parquet default.test_bucket[id#7L] Batched: true, DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 1 out of 200
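      For reference, the same setup can be reproduced through the DataFrame API. This is a minimal sketch, assuming a running SparkSession named spark (e.g. in spark-shell); the table and column names follow the SQL above.

      // Minimal sketch, assuming a SparkSession named `spark` is available.
      // Writes the same 200-bucket parquet table as the SQL above.
      import org.apache.spark.sql.SaveMode

      spark.range(1000)
        .write
        .mode(SaveMode.Overwrite)
        .format("parquet")
        .bucketBy(200, "id")
        .sortBy("id")
        .saveAsTable("test_bucket")

      val q = spark.table("test_bucket").where("id = 4")
      q.explain() // plan reports SelectedBucketsCount: 1 out of 200
      // Before the fix, q.collect() still schedules 200 tasks (one per
      // bucket), even though only one bucket can contain id = 4.
      q.collect()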
      

      Attachments

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Yuming Wang (yumwang)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
