Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Version: 3.1.1
Description
ExecutorAllocationListener does not clean up its internal data structures properly. As stale entries accumulate, it becomes progressively slower and eventually fails to process events in time.
There are two problems:
- A bug (likely a typo) in the totalRunningTasksPerResourceProfile() method: getOrElseUpdate() is used instead of getOrElse(). If the spark-dynamic-executor-allocation thread calls schedule() after the SparkListenerTaskEnd event for the last task in a stage, but before the SparkListenerStageCompleted event for that stage, then stageAttemptToNumRunningTask is not cleaned up properly.
- The resourceProfileIdToStageAttempt clean-up is broken. If the SparkListenerTaskEnd event for the last task in a stage is processed before the SparkListenerStageCompleted event for that stage, then resourceProfileIdToStageAttempt is not cleaned up properly.
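The first problem comes down to the semantics of the two map lookups: getOrElseUpdate() inserts the default into the map when the key is absent, while getOrElse() only reads. A minimal sketch of the difference (the map name mirrors stageAttemptToNumRunningTask, but the key type and contents here are illustrative, not the listener's actual state):

```scala
import scala.collection.mutable

object GetOrElseUpdateDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for stageAttemptToNumRunningTask: stage attempt -> running task count.
    val stageAttemptToNumRunningTask = mutable.HashMap[Int, Int]()

    // getOrElseUpdate inserts the default when the key is missing,
    // so a read-only query resurrects an entry that was already cleaned up.
    val leaked = stageAttemptToNumRunningTask.getOrElseUpdate(1, 0)
    assert(leaked == 0)
    assert(stageAttemptToNumRunningTask.contains(1)) // entry now leaked

    stageAttemptToNumRunningTask.clear()

    // getOrElse only reads; the map stays empty after the lookup.
    val clean = stageAttemptToNumRunningTask.getOrElse(1, 0)
    assert(clean == 0)
    assert(!stageAttemptToNumRunningTask.contains(1))

    println("ok")
  }
}
```

So a query made in the window between the last SparkListenerTaskEnd and the SparkListenerStageCompleted event silently re-creates the entry that the stage-completion handler was supposed to be the last one to remove.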
Both bugs were introduced in this commit: https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f .
Steps to reproduce:
- Launch standalone master and worker with 'spark.shuffle.service.enabled=true'
- Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 'spark.dynamicAllocation.enabled=true' and paste this script:
for (_ <- 0 until 10) { Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show() }
- Take a heap dump and examine the ExecutorAllocationListener.totalRunningTasksPerResourceProfile and ExecutorAllocationListener.resourceProfileIdToStageAttempt fields
Expected: totalRunningTasksPerResourceProfile and resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty
Actual: totalRunningTasksPerResourceProfile and resourceProfileIdToStageAttempt(defaultResourceProfileId) retain stale entries for stages that have already completed