SPARK-35391

Memory leak in ExecutorAllocationListener breaks dynamic allocation under high load

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: 3.2.0, 3.1.3
    • Component/s: Spark Core
    • Labels: None

    Description

      ExecutorAllocationListener does not clean up its internal state properly. As stale entries accumulate, the listener becomes progressively slower and eventually fails to process events in time.

      There are two problems:

      • a bug (possibly a typo) in the totalRunningTasksPerResourceProfile() method:
        getOrElseUpdate() is used instead of getOrElse(), so a plain read inserts a default entry for a missing key.
        If the spark-dynamic-executor-allocation thread calls schedule() after the SparkListenerTaskEnd event for the last task in a stage
        but before the SparkListenerStageCompleted event for that stage, stageAttemptToNumRunningTask is not cleaned up properly.
      • resourceProfileIdToStageAttempt clean-up is broken:
        if the SparkListenerTaskEnd event for the last task in a stage is processed before the SparkListenerStageCompleted event for that stage,
        resourceProfileIdToStageAttempt is not cleaned up properly.
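
The first problem can be illustrated with a minimal, self-contained sketch. This is not the Spark source: the Tracker class, its method names, and the assumption that the entry is removed as soon as the last running task ends are all hypothetical simplifications. Only the map name stageAttemptToNumRunningTask and the getOrElseUpdate()/getOrElse() contrast come from the report.

```scala
import scala.collection.mutable

// Simplified model of the running-task bookkeeping described above.
// Names follow the report; the structure is an assumption, not the Spark source.
object RunningTaskLeakSketch {
  final case class StageAttempt(stageId: Int, stageAttemptId: Int)

  class Tracker {
    val stageAttemptToNumRunningTask = mutable.HashMap.empty[StageAttempt, Int]

    def onTaskStart(a: StageAttempt): Unit =
      stageAttemptToNumRunningTask(a) = stageAttemptToNumRunningTask.getOrElse(a, 0) + 1

    // Assumption: the entry is dropped as soon as the last running task ends.
    def onTaskEnd(a: StageAttempt): Unit =
      stageAttemptToNumRunningTask.get(a).foreach { n =>
        if (n <= 1) stageAttemptToNumRunningTask -= a
        else stageAttemptToNumRunningTask(a) = n - 1
      }

    // Buggy read: getOrElseUpdate inserts a 0 entry for a missing key.
    def runningTasksBuggy(a: StageAttempt): Int =
      stageAttemptToNumRunningTask.getOrElseUpdate(a, 0)

    // Side-effect-free read: getOrElse never modifies the map.
    def runningTasksFixed(a: StageAttempt): Int =
      stageAttemptToNumRunningTask.getOrElse(a, 0)
  }

  // Replays: task start, last task end, then one polling read.
  // Returns the number of entries left in the map afterwards.
  def entriesAfterPoll(useBuggyRead: Boolean): Int = {
    val tracker = new Tracker
    val attempt = StageAttempt(stageId = 0, stageAttemptId = 0)
    tracker.onTaskStart(attempt)
    tracker.onTaskEnd(attempt)
    if (useBuggyRead) tracker.runningTasksBuggy(attempt)
    else tracker.runningTasksFixed(attempt)
    tracker.stageAttemptToNumRunningTask.size
  }

  def main(args: Array[String]): Unit = {
    println(s"entries left with getOrElseUpdate: ${entriesAfterPoll(useBuggyRead = true)}")
    println(s"entries left with getOrElse: ${entriesAfterPoll(useBuggyRead = false)}")
  }
}
```

In this model the polling read with getOrElseUpdate() leaves one entry behind after the last task has ended, while getOrElse() leaves the map empty. Repeated over many stages, that leftover entry per stage attempt is the kind of growth the report describes.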

       

      Both bugs were introduced in this commit: https://github.com/apache/spark/commit/496f6ac86001d284cbfb7488a63dd3a168919c0f

      Steps to reproduce:

      1. Launch a standalone master and worker with 'spark.shuffle.service.enabled=true'.
      2. Run spark-shell with --conf 'spark.shuffle.service.enabled=true' --conf 'spark.dynamicAllocation.enabled=true' and paste this script:
        for (_ <- 0 until 10) {
            Seq(1, 2, 3, 4, 5).toDF.repartition(100).agg("value" -> "sum").show()
        }
        
      3. Take a heap dump and examine the ExecutorAllocationListener.totalRunningTasksPerResourceProfile and ExecutorAllocationListener.resourceProfileIdToStageAttempt fields.

      Expected: totalRunningTasksPerResourceProfile and resourceProfileIdToStageAttempt(defaultResourceProfileId) are empty.
      Actual: totalRunningTasksPerResourceProfile and resourceProfileIdToStageAttempt(defaultResourceProfileId) still contain stale entries.

       

          People

            • Assignee: Vasily Kolpakov (vkolpakov)
            • Reporter: Vasily Kolpakov (vkolpakov)