SPARK-33207

Reduce the number of tasks launched after bucket pruning


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      We only need to read one bucket, but Spark still launches 200 tasks.

      create table test_bucket using parquet clustered by (ID) sorted by (ID) into 200 buckets AS (SELECT id FROM range(1000) cluster by id)
      spark-sql> explain select * from test_bucket where id = 4;
      == Physical Plan ==
      *(1) Project [id#7L]
      +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
         +- *(1) ColumnarToRow
            +- FileScan parquet default.test_bucket[id#7L] Batched: true, DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 1 out of 200
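
      The plan above already prunes to a single bucket (SelectedBucketsCount: 1 out of 200), yet one task is still scheduled per bucket. To observe the task count, assuming a running SparkSession named spark and the table above:

      val df = spark.sql("SELECT * FROM test_bucket WHERE id = 4")
      // One task per scan partition: 200 here, despite the pruning.
      println(df.queryExecution.toRdd.getNumPartitions)

      The idea behind the improvement is to stop creating a scan partition for every bucket id and instead create partitions only for the buckets that survive pruning. A minimal Scala sketch, with FileSlice, ScanPartition, and planBucketedScan as hypothetical stand-ins for Spark's internal PartitionedFile/FilePartition machinery (not the actual patch):

      case class FileSlice(path: String)                          // stand-in for PartitionedFile
      case class ScanPartition(index: Int, files: Seq[FileSlice]) // stand-in for FilePartition

      def planBucketedScan(
          numBuckets: Int,
          filesByBucket: Map[Int, Seq[FileSlice]],
          selectedBuckets: Option[Set[Int]]): Seq[ScanPartition] = {
        selectedBuckets match {
          // Pruning information available: one partition (and hence one task)
          // per surviving bucket, i.e. a single task for the query above.
          case Some(selected) =>
            selected.toSeq.sorted.zipWithIndex.map { case (bucketId, i) =>
              ScanPartition(i, filesByBucket.getOrElse(bucketId, Nil))
            }
          // No pruning information: fall back to one partition per bucket id,
          // 200 for this table.
          case None =>
            (0 until numBuckets).map(b => ScanPartition(b, filesByBucket.getOrElse(b, Nil)))
        }
      }

      With pruning applied, planBucketedScan returns a single partition for the example above, so only one task launches.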
      

      Attachments

        1. image-2020-10-22-15-17-01-389.png (58 kB, Yuming Wang)
        2. image-2020-10-22-15-17-26-956.png (163 kB, Yuming Wang)
        3. Screen Shot 2021-02-05 at 11.44.12 AM.png (83 kB, Cheng Su)


            People

              Assignee: Unassigned
              Reporter: Yuming Wang (yumwang)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: