SPARK-29683

Job failed due to executor failures all available nodes are blacklisted


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.1.1
    • Component/s: Spark Core, YARN
    • Labels: None

    Description

      My streaming job fails due to executor failures because all available nodes are blacklisted. This exception is thrown only when every node is blacklisted:

      // The driver gives up the application once this predicate becomes true.
      def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes

      // Union of the user-configured exclusions and the scheduler/allocator blacklists.
      val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ allocatorBlacklist.keySet
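
      To illustrate, here is a minimal, self-contained sketch of how these two pieces interact. The object and variable names are hypothetical and this is not the actual Spark code; it only reproduces the shape of the check to show that an exclude list full of non-existent hosts can already satisfy it:

      object BlacklistCheckSketch {
        // Same shape as the predicate above: blacklist size compared against cluster size.
        def isAllNodeBlacklisted(blacklisted: Set[String], numClusterNodes: Int): Boolean =
          blacklisted.size >= numClusterNodes

        def main(args: Array[String]): Unit = {
          val excludeNodes       = Set("invalid1", "invalid2", "invalid3") // user config, never validated
          val schedulerBlacklist = Set.empty[String]
          val allocatorBlacklist = Set.empty[String]
          val clusterNodes       = Set("valid1", "valid2")                 // nodes the RM actually reports

          val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ allocatorBlacklist
          // Prints "true": 3 >= 2, so the driver would give up even though both real nodes are healthy.
          println(isAllNodeBlacklisted(allBlacklistedNodes, clusterNodes.size))
        }
      }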
      

      After diving into the code, I found some critical conditions that are not handled properly:

      • unchecked `excludeNodes`: this list comes from user configuration and is never validated. If it is set incorrectly, it can on its own make "currentBlacklistedYarnNodes.size >= numClusterNodes" true. For example, we may list nodes that are not in the YARN cluster at all:
        excludeNodes = (invalid1, invalid2, invalid3)
        clusterNodes = (valid1, valid2)
        
      • `numClusterNodes` may equal 0: during an HA YARN failover, it takes some time for all NodeManagers to re-register with the ResourceManager. In that window, `numClusterNodes` may be 0 (or some other transient value), and the Spark driver fails.
      • the condition check is too strict: the Spark driver fails as soon as "currentBlacklistedYarnNodes.size >= numClusterNodes" holds, but this condition does not necessarily indicate an unrecoverable fatal error; some NodeManagers may simply be restarting. We could allow a waiting period before failing the job (see the sketch below).
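
      A rough sketch of how these three points could be handled, assuming hypothetical names such as `DefensiveBlacklistCheck` and `graceMillis` (this is not the actual Spark patch): count only excluded hosts that really exist in the cluster, treat an empty cluster view as unknown rather than fatal, and require the condition to persist for a grace period before failing the job.

      class DefensiveBlacklistCheck(graceMillis: Long, clock: () => Long = () => System.currentTimeMillis()) {
        // Timestamp of the first moment the "all nodes blacklisted" condition was observed.
        private var allBlacklistedSince: Option[Long] = None

        def shouldFailJob(
            excludeNodes: Set[String],
            dynamicBlacklist: Set[String],
            clusterNodes: Set[String]): Boolean = {
          // 1. Ignore user-excluded hosts that are not actually cluster members.
          val blacklisted = excludeNodes.intersect(clusterNodes) ++ dynamicBlacklist

          // 2. An empty cluster view (e.g. during an RM failover) is "unknown", not fatal.
          val allBlacklisted = clusterNodes.nonEmpty && blacklisted.size >= clusterNodes.size

          if (!allBlacklisted) {
            allBlacklistedSince = None
            false
          } else {
            // 3. Only fail once the condition has persisted for the whole grace period.
            val now = clock()
            allBlacklistedSince match {
              case None =>
                allBlacklistedSince = Some(now)
                false
              case Some(since) =>
                now - since >= graceMillis
            }
          }
        }
      }

      With a guard like this, a transient ResourceManager failover or a brief NodeManager restart would no longer kill the job immediately.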

          People

            Assignee: Unassigned
            Reporter: Genmao Yu (uncleGen)
            Votes: 4
            Watchers: 11
