SPARK-17929

Deadlock when the AM restarts and reset sends RemoveExecutor

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.2, 2.1.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      The fix for SPARK-10582 added a reset method to CoarseGrainedSchedulerBackend.scala:

        protected def reset(): Unit = synchronized {
          numPendingExecutors = 0
          executorsPendingToRemove.clear()
      
          // Remove all the lingering executors that should be removed but not yet. The reason might be
          // because (1) disconnected event is not yet received; (2) executors die silently.
          executorDataMap.toMap.foreach { case (eid, _) =>
            driverEndpoint.askWithRetry[Boolean](
              RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
          }
        }
      

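      However, removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized". Because reset holds that lock while it waits for the reply to each RemoveExecutor message, removeExecutor can never acquire it. This causes a deadlock: the RPC fails, and so reset fails.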

          private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
            logDebug(s"Asked to remove executor $executorId with reason $reason")
            executorDataMap.get(executorId) match {
              case Some(executorInfo) =>
                // This must be synchronized because variables mutated
                // in this block are read when requesting executors
                val killed = CoarseGrainedSchedulerBackend.this.synchronized {
                  addressToExecutorId -= executorInfo.executorAddress
                  executorDataMap -= executorId
                  executorsPendingLossReason -= executorId
                  executorsPendingToRemove.remove(executorId).getOrElse(false)
                }
           ...
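
      The lock interaction described above can be reproduced outside Spark. The following is a minimal sketch, not Spark's code and not necessarily the fix that was merged: ToyBackend, askRemove, resetHoldingLock and resetOutsideLock are hypothetical names, a single-threaded executor stands in for the driver endpoint, and a bounded Await stands in for the RPC timeout. It contrasts blocking on the reply while holding the lock with taking a snapshot under the lock and asking after the lock is released.

```scala
import java.util.concurrent.{Executors, TimeoutException}
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for the scheduler backend; not Spark's actual class.
class ToyBackend {
  // A single thread standing in for the driver endpoint's message loop.
  private val pool = Executors.newSingleThreadExecutor()
  private val endpoint = ExecutionContext.fromExecutorService(pool)
  private val executors = mutable.Map("1" -> "host-a", "2" -> "host-b")

  // Runs on the endpoint thread and needs the backend lock, like removeExecutor.
  private def askRemove(eid: String): Future[Boolean] = Future {
    this.synchronized { executors.remove(eid).isDefined }
  }(endpoint)

  // Pattern from the report: block on the reply while holding the lock.
  // The endpoint thread cannot enter its synchronized block, so the wait times out.
  def resetHoldingLock(timeout: FiniteDuration): Boolean = synchronized {
    try {
      executors.keys.toList.foreach(eid => Await.result(askRemove(eid), timeout))
      true
    } catch {
      case _: TimeoutException => false
    }
  }

  // Alternative: snapshot the ids under the lock, then ask after releasing it.
  def resetOutsideLock(timeout: FiniteDuration): Boolean = {
    val ids = synchronized { executors.keys.toList }
    ids.foreach(eid => Await.result(askRemove(eid), timeout))
    true
  }

  def remaining: Int = synchronized { executors.size }

  def shutdown(): Unit = pool.shutdown()
}
```

      With a fresh ToyBackend, resetHoldingLock(200.millis) returns false because the endpoint thread stays blocked on the lock until the caller gives up, whereas resetOutsideLock(5.seconds) completes and leaves remaining at 0. Sending the message without waiting for a reply would be another way to avoid holding the lock across a blocking call.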
      

            People

            • Assignee: scwf Fei Wang
            • Reporter: Sephiroth-Lin Weizhong