Spark / SPARK-34109

Killing executors excluded on failure results in additional executors being excluded due to fetch failures

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0, 3.0.1
    • Fix Version/s: None
    • Component/s: Kubernetes, Shuffle, Spark Core
    • Labels: None
    • Flags: Important

      Description

      Configuration:

      spark.excludeOnFailure.enabled: true # aka deprecated spark.blacklist.enabled
      spark.excludeOnFailure.application.fetchFailure.enabled: true # aka deprecated spark.blacklist.application.fetchFailure.enabled
      spark.excludeOnFailure.killExcludedExecutors: true # aka deprecated spark.blacklist.killBlacklistedExecutors
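
For anyone trying to reproduce this, the three settings above can be passed to `spark-submit` as `--conf key=value` pairs. The snippet below is only a convenience sketch: the helper name `to_submit_args` and the dictionary name are hypothetical and not part of Spark.

```python
# Settings copied from this report; all names defined here are hypothetical.
EXCLUDE_ON_FAILURE_CONF = {
    "spark.excludeOnFailure.enabled": "true",
    "spark.excludeOnFailure.application.fetchFailure.enabled": "true",
    "spark.excludeOnFailure.killExcludedExecutors": "true",
}


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


# Flat argument list that can be appended to a spark-submit command line.
submit_args = to_submit_args(EXCLUDE_ON_FAILURE_CONF)
```

The resulting list alternates `--conf` with one `key=value` string per property.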

      With this configuration, we have noticed that when a few executors are excluded due to task failures (possibly caused by host issues), those executors are killed after being excluded.

      However, when other executors then try to fetch shuffle blocks from the killed executors, the fetches fail, and these other executors also end up being excluded because of `spark.excludeOnFailure.application.fetchFailure.enabled`.

      Instead, fetch failures that occur while fetching from already excluded executors should not be counted when excluding executors based on `spark.excludeOnFailure.application.fetchFailure.enabled`.
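
The requested behavior can be illustrated with a toy model. This is a minimal sketch and not Spark's actual implementation: the class `ExclusionTracker` and its methods are hypothetical, and the choice to exclude the fetching executor simply follows the behavior described in this report.

```python
from dataclasses import dataclass, field


@dataclass
class ExclusionTracker:
    """Toy model of application-level exclusion (hypothetical, not Spark code)."""

    excluded: set = field(default_factory=set)

    def on_task_failure(self, executor_id):
        # Stand-in for the task-failure path: the executor is excluded
        # (and, with killExcludedExecutors, subsequently killed).
        self.excluded.add(executor_id)

    def on_fetch_failure(self, fetching_executor, source_executor):
        """Return True if the fetch failure led to a new exclusion."""
        # Proposed behavior: a failed fetch from an already excluded
        # executor is expected, so it is not counted.
        if source_executor in self.excluded:
            return False
        # Otherwise behave as described in this report.
        self.excluded.add(fetching_executor)
        return True


tracker = ExclusionTracker()
tracker.on_task_failure("exec-1")                        # excluded, then killed
ignored = tracker.on_fetch_failure("exec-2", "exec-1")   # source already excluded
counted = tracker.on_fetch_failure("exec-3", "exec-4")   # source still healthy
```

In this sketch `exec-2` stays available because its failed fetch targeted the already excluded `exec-1`, while the failure involving the healthy `exec-4` is still counted.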

            People

            • Assignee: Unassigned
            • Reporter: Aaruna Godthi (aaruna)
            • Votes: 0
            • Watchers: 4