[SPARK-40499] Spark 3.2.1 percentile_approx query much slower than Spark 2.4.0 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Invalid
Affects Version/s: 3.2.1
Fix Version/s: None
Component/s: Shuffle
Labels: None
Environment:
hadoop: 3.0.0
spark: 2.4.0 / 3.2.1
shuffle: spark 2.4.0

Description

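// GROUP BY Info is assumed here: the query as reported omits it, and Spark
// rejects a non-aggregated column next to aggregate functions without it.
spark.sql(
  """
    |SELECT
    |  Info,
    |  PERCENTILE_APPROX(cost, 0.5)   AS cost_p50,
    |  PERCENTILE_APPROX(cost, 0.9)   AS cost_p90,
    |  PERCENTILE_APPROX(cost, 0.95)  AS cost_p95,
    |  PERCENTILE_APPROX(cost, 0.99)  AS cost_p99,
    |  PERCENTILE_APPROX(cost, 0.999) AS cost_p999
    |FROM
    |  textData
    |GROUP BY
    |  Info
    |""".stripMargin)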
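
For comparison, `PERCENTILE_APPROX` also accepts an array of percentages, so the five values can be requested in a single call. The following is a minimal sketch and was not part of the original report. The local session, the sample rows standing in for `textData`, the `cost_percentiles` alias, and the `GROUP BY Info` clause are all assumptions made for illustration. Whether this form changes the shuffle behavior described in this issue is not established here.

```scala
import org.apache.spark.sql.SparkSession

object PercentileArraySketch {
  // One PERCENTILE_APPROX call with an array of percentages; the result
  // column is an array with one value per requested percentage.
  // GROUP BY Info is an assumption, it does not appear in the reported query.
  val query: String =
    """
      |SELECT
      |  Info,
      |  PERCENTILE_APPROX(cost, ARRAY(0.5, 0.9, 0.95, 0.99, 0.999)) AS cost_percentiles
      |FROM
      |  textData
      |GROUP BY
      |  Info
      |""".stripMargin

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the report ran on a cluster.
    val spark = SparkSession.builder()
      .appName("percentile-array-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the reporter's textData table.
    Seq(("a", 10L), ("a", 20L), ("b", 30L), ("b", 40L))
      .toDF("Info", "cost")
      .createOrReplaceTempView("textData")

    spark.sql(query).show(false)
    spark.stop()
  }
}
```

Individual percentiles can then be read from the array by position, for example `cost_percentiles[0]` for the median.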

With Spark 2.4.0, the aggregation used ObjectHashAggregate and stage 2 fetched its shuffle data very quickly. With Spark 3.2.1 running against the old shuffle service (the Spark 2.4.0 one listed under Environment), fetching 140M of shuffle data took 3 hours.

If we upgrade the shuffle service as well, will we still see this performance regression?
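
One way to compare the two versions is to print the physical plan on each and see which aggregate operator was chosen. The sketch below is an assumption-laden illustration, not the reporter's code: the helper `aggregateOperators`, the local session, the sample rows, and the `GROUP BY Info` clause are all hypothetical. It only scans the plan text for operator names; it does not measure shuffle time.

```scala
import org.apache.spark.sql.SparkSession

object PlanCheckSketch {
  // Returns the distinct aggregate operator names that appear in a physical
  // plan string, in order of first appearance. The word boundaries keep
  // "HashAggregate" from matching inside "ObjectHashAggregate".
  def aggregateOperators(plan: String): List[String] = {
    val pattern = """\b(ObjectHashAggregate|SortAggregate|HashAggregate)\b""".r
    pattern.findAllIn(plan).toList.distinct
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("plan-check-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the reporter's textData table.
    Seq(("a", 10L), ("a", 20L), ("b", 30L))
      .toDF("Info", "cost")
      .createOrReplaceTempView("textData")

    val df = spark.sql(
      "SELECT Info, PERCENTILE_APPROX(cost, 0.5) AS cost_p50 FROM textData GROUP BY Info")

    // The executed plan is rendered as a tree of physical operators.
    val plan = df.queryExecution.executedPlan.toString
    println(plan)
    println(s"Aggregate operators in plan: ${aggregateOperators(plan).mkString(", ")}")

    spark.stop()
  }
}
```

Running the same check on both Spark versions shows whether the operator choice differs between them, which helps separate a planning change from a shuffle-fetch problem.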

Attachments


Screenshot 2024-01-05 at 3.51.52 PM.png, 43 kB, 06/Jan/24 01:01, Joey Pereira
Screenshot 2024-01-05 at 3.53.10 PM.png, 28 kB, 06/Jan/24 01:01, Joey Pereira
spark2.4-shuffle-data.png, 16 kB, 20/Sep/22 08:59, xuanzhiang
spark3.2-shuffle-data.png, 39 kB, 20/Sep/22 09:13, xuanzhiang

Issue Links

is related to: SPARK-46706 percentile_approx regression since Spark 2.4 (Open)

People

Assignee: Unassigned

Reporter: xuanzhiang

Votes: 0

Watchers: 2

Dates

Created: 20/Sep/22 08:57

Updated: 12/Jan/24 23:21

Resolved: 12/Oct/22 17:49