  1. Spark
  2. SPARK-40320

When an Executor plugin fails to initialize, the Executor shows as active but never accepts tasks, as if it were hung

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.4.0
    • Component/s: Scheduler
    • Labels: None

    Description

      Steps to reproduce:
      Set `spark.plugins=ErrorSparkPlugin`.
      Define the `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes as below (the code is abbreviated for clarity):

      import java.util

      import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

      // ErrorDriverPlugin is not shown in this report.
      class ErrorSparkPlugin extends SparkPlugin {
        override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()

        override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
      }

      class ErrorExecutorPlugin extends ExecutorPlugin {
        private val checkingInterval: Long = 1

        // Always throws: UnsatisfiedLinkError is treated as a fatal error.
        override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
          if (checkingInterval == 1) {
            throw new UnsatisfiedLinkError("My Exception error")
          }
        }
      }

      The Executor appears active in the Spark UI, but it is broken and never receives any tasks.

      Root Cause:

      I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall` throws the fatal error (`UnsatisfiedLinkError` is a fatal error) in the method `dealWithFatalError`. The `CoarseGrainedExecutorBackend` JVM process is still alive, but its communication thread is no longer working (see `MessageLoop#receiveLoopRunnable`: `receiveLoop()` was broken, so the executor no longer receives any messages).
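The failure mode described above can be illustrated with a small stand-alone sketch. It is not Spark's actual `Inbox` or `MessageLoop` code: the `FatalLoopSketch` object, its `handle` and `run` helpers, and the message names are all hypothetical, and `scala.util.control.NonFatal` is assumed here as the fatal/non-fatal boundary. It only shows how a single loop thread can die on a rethrown fatal error while the process stays alive.

```scala
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicReference
import scala.util.control.NonFatal

object FatalLoopSketch {
  // Non-fatal errors are swallowed; anything NonFatal does not match
  // (e.g. LinkageError subclasses such as UnsatisfiedLinkError) propagates.
  def safelyCall(action: => Unit): Unit =
    try action
    catch { case NonFatal(e) => println(s"recovered from: $e") }

  // Hypothetical handler: "init-plugin" stands in for plugin initialization.
  def handle(msg: String, handled: AtomicReference[Vector[String]]): Unit = msg match {
    case "init-plugin" => throw new UnsatisfiedLinkError("My Exception error")
    case "bad-request" => throw new IllegalStateException("recoverable")
    case other         => handled.updateAndGet(v => v :+ other)
  }

  // Feeds messages to one loop thread and reports (loopStillAlive, handledMessages).
  def run(messages: Seq[String]): (Boolean, Vector[String]) = {
    val inbox = new LinkedBlockingQueue[String]()
    val handled = new AtomicReference(Vector.empty[String])
    val body: Runnable = () => {
      while (true) safelyCall(handle(inbox.take(), handled))
    }
    val loop = new Thread(body, "message-loop")
    loop.setDaemon(true) // do not keep the JVM alive for the sketch
    loop.start()
    messages.foreach(m => inbox.put(m))
    loop.join(500) // returns early if the loop thread has died
    (loop.isAlive, handled.get())
  }
}
```

Calling `FatalLoopSketch.run(Seq("a", "bad-request", "b", "init-plugin", "c"))` leaves the calling thread running, but the loop thread terminates at `"init-plugin"` and the later message `"c"` is never handled, which mirrors an executor that looks alive yet receives nothing.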

      Some ideas:
      It is very hard to know what happened here without reading the code. The Executor is active but cannot do anything, so it is unclear whether the driver or the Executor is at fault. At the very least, the Executor should not be reported as active in this state, or it should call exitExecutor (kill itself).
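One way to picture the "kill itself" idea is the minimal sketch below. It is an assumption for illustration only, not necessarily how the issue was resolved in Spark: `ExitOnInitFailureSketch` and `initOrExit` are hypothetical names, and the exit action is passed in as a function so the sketch can be exercised without terminating the JVM.

```scala
object ExitOnInitFailureSketch {
  // Runs `init`; on any Throwable, including fatal errors such as
  // UnsatisfiedLinkError, logs the failure and calls `exit` with a non-zero code.
  def initOrExit(init: () => Unit, exit: Int => Unit): Unit =
    try init()
    catch {
      case t: Throwable =>
        System.err.println(s"Plugin initialization failed, exiting: $t")
        exit(1)
    }
}
```

In a real process the exit action could be something like `code => System.exit(code)`, so that a failed initialization ends the process instead of leaving a silent, idle one behind.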

       


          People

            choko111 miracle
            miracle Mars