Spark / SPARK-17929

Deadlock when the AM restarts and reset() sends RemoveExecutor

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.2, 2.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      The fix for SPARK-10582 added a reset() method to CoarseGrainedSchedulerBackend.scala:

        protected def reset(): Unit = synchronized {
          numPendingExecutors = 0
          executorsPendingToRemove.clear()
      
          // Remove all the lingering executors that should be removed but not yet. The reason might be
          // because (1) disconnected event is not yet received; (2) executors die silently.
          executorDataMap.toMap.foreach { case (eid, _) =>
            driverEndpoint.askWithRetry[Boolean](
              RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
          }
        }
      

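      However, removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized". reset() holds that lock while it blocks waiting for the RemoveExecutor reply, and the handler that would produce the reply cannot enter its synchronized block until reset() returns. This causes a deadlock: the RPC fails, and reset() fails with it.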

          private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
            logDebug(s"Asked to remove executor $executorId with reason $reason")
            executorDataMap.get(executorId) match {
              case Some(executorInfo) =>
                // This must be synchronized because variables mutated
                // in this block are read when requesting executors
                val killed = CoarseGrainedSchedulerBackend.this.synchronized {
                  addressToExecutorId -= executorInfo.executorAddress
                  executorDataMap -= executorId
                  executorsPendingLossReason -= executorId
                  executorsPendingToRemove.remove(executorId).getOrElse(false)
                }
           ...
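
      The lock interaction described above can be reproduced outside Spark. The following is a minimal sketch, not Spark's code: BackendSketch, askRemove, resetHoldingLock and resetOutsideLock are hypothetical names, and a single-threaded executor stands in for the driver endpoint's message loop. It contrasts the reported pattern (blocking on the reply while holding the lock) with one possible way to avoid it, assuming it is acceptable to snapshot the executor ids under the lock and send the requests after releasing it.

```scala
import java.util.concurrent.{Executors, TimeoutException}
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for the scheduler backend; not Spark's actual class.
class BackendSketch {
  // One thread playing the role of the driver endpoint's message loop.
  private val endpointPool = Executors.newSingleThreadExecutor()
  private implicit val endpoint: ExecutionContext =
    ExecutionContext.fromExecutor(endpointPool)

  private val executorDataMap = mutable.HashMap("1" -> "host-a", "2" -> "host-b")

  // Runs on the endpoint thread and needs the backend lock,
  // like removeExecutor in the report.
  private def removeExecutor(eid: String): Boolean = this.synchronized {
    executorDataMap.remove(eid).isDefined
  }

  // Blocking ask: hand the message to the endpoint thread, wait for the reply.
  private def askRemove(eid: String, timeout: FiniteDuration): Boolean =
    Await.result(Future(removeExecutor(eid)), timeout)

  // Reported pattern: wait for the reply while still holding the lock.
  // The endpoint thread cannot enter removeExecutor, so the ask times out.
  def resetHoldingLock(timeout: FiniteDuration): Unit = synchronized {
    executorDataMap.keys.toList.foreach(eid => askRemove(eid, timeout))
  }

  // Possible alternative: snapshot the ids under the lock, ask after releasing it.
  def resetOutsideLock(timeout: FiniteDuration): Unit = {
    val stale = synchronized { executorDataMap.keys.toList }
    stale.foreach(eid => askRemove(eid, timeout))
  }

  def remaining: Int = synchronized { executorDataMap.size }

  def shutdown(): Unit = endpointPool.shutdownNow()
}

object DeadlockSketch {
  def main(args: Array[String]): Unit = {
    val backend = new BackendSketch
    try {
      backend.resetHoldingLock(200.millis)
    } catch {
      case _: TimeoutException => println("ask timed out while the lock was held")
    }
    backend.resetOutsideLock(2.seconds)
    println(s"executors remaining: ${backend.remaining}")
    backend.shutdown()
  }
}
```

      In this sketch the timeout is what eventually breaks the stall; without it the two threads would wait on each other indefinitely. Whether moving the requests outside the lock is safe in the real backend depends on what else reads the shared state between the snapshot and the removal.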
      

          People

            Assignee: Fei Wang (scwf)
            Reporter: Weizhong (Sephiroth-Lin)
            Votes: 1
            Watchers: 5
