Description
We only need to read 1 bucket, but the scan still launches 200 tasks.
create table test_bucket using parquet clustered by (ID) sorted by (ID) into 200 buckets AS (SELECT id FROM range(1000) cluster by id)

spark-sql> explain select * from test_bucket where id = 4;
== Physical Plan ==
*(1) Project [id#7L]
+- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.test_bucket[id#7L] Batched: true, DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 1 out of 200
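For reference, the task count can also be observed programmatically. Below is a minimal Scala sketch (not from the original report) that assumes a spark-shell session against the test_bucket table created above; the listener and variable names are illustrative only. It counts the tasks scheduled for the pruned scan, which comes out near 200 even though SelectedBucketsCount is 1 out of 200.

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Count every task that finishes while the query runs.
val taskCount = new AtomicInteger(0)
val listener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = taskCount.incrementAndGet()
}
spark.sparkContext.addSparkListener(listener)

spark.sql("SELECT * FROM test_bucket WHERE id = 4").collect()

// Listener events are delivered asynchronously; give the bus a moment to drain.
Thread.sleep(2000)
spark.sparkContext.removeSparkListener(listener)

// Prints roughly 200: one task per bucket, even though only the selected bucket returns rows.
println(s"Tasks launched: ${taskCount.get()}")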
Issue Links
- is fixed by SPARK-32985 Decouple bucket filter pruning and bucket table scan (Resolved)