SPARK-34109: Killing executors excluded on failure results in additional executors being excluded due to fetch failures


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0, 3.0.1
    • Fix Version/s: None
    • Component/s: Kubernetes, Shuffle, Spark Core
    • Labels: None
    • Flags: Important

    Description

      Configuration:

      spark.excludeOnFailure.enabled: true # aka deprecated spark.blacklist.enabled
      spark.excludeOnFailure.application.fetchFailure.enabled: true # aka deprecated spark.blacklist.application.fetchFailure.enabled
      spark.excludeOnFailure.killExcludedExecutors: true # aka deprecated spark.blacklist.killBlacklistedExecutors

      With this configuration, we have noticed that when a few executors are excluded because of task failures (possibly caused by host issues), those executors are killed as soon as they are excluded.

      However, when other executors then try to fetch shuffle blocks from the killed executors, the fetches fail, and these other executors also end up excluded because of `spark.excludeOnFailure.application.fetchFailure.enabled`.

      Instead, fetch failures that come from fetching from already-excluded executors should be ignored when deciding which executors to exclude under `spark.excludeOnFailure.application.fetchFailure.enabled`.
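
The cascade and the requested behaviour can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's scheduler code: `ToyExclusionTracker`, its fields, and `simulate` are hypothetical names, exclusion happens on the first failure (real Spark applies thresholds), and the fetch failure is charged to the fetching executor because that is how this report describes the symptom.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExclusionTracker:
    """Hypothetical stand-in for application-level exclusion bookkeeping."""

    kill_excluded_executors: bool = True   # mirrors ...killExcludedExecutors
    fetch_failure_exclusion: bool = True   # mirrors ...application.fetchFailure.enabled
    skip_fetches_from_excluded: bool = False  # the behaviour requested in this issue
    excluded: set = field(default_factory=set)
    killed: set = field(default_factory=set)

    def _exclude(self, executor_id: str) -> None:
        self.excluded.add(executor_id)
        if self.kill_excluded_executors:
            self.killed.add(executor_id)

    def on_task_failure(self, executor_id: str) -> None:
        # Assumption: exclude on the first failure to keep the model small.
        self._exclude(executor_id)

    def on_fetch_failure(self, fetcher_id: str, source_id: str) -> None:
        if not self.fetch_failure_exclusion:
            return
        # Requested behaviour: a failed fetch from an executor that was
        # already excluded (and killed) says nothing about the fetcher.
        if self.skip_fetches_from_excluded and source_id in self.excluded:
            return
        # Assumption taken from the report: the fetcher is the one excluded.
        self._exclude(fetcher_id)


def simulate(skip_fetches_from_excluded: bool) -> set:
    """Replay the scenario from the description and return excluded executors."""
    tracker = ToyExclusionTracker(
        skip_fetches_from_excluded=skip_fetches_from_excluded
    )
    tracker.on_task_failure("exec-1")  # excluded, then killed
    for fetcher in ("exec-2", "exec-3"):
        # Shuffle blocks on a killed executor are gone, so the fetch fails.
        if "exec-1" in tracker.killed:
            tracker.on_fetch_failure(fetcher, source_id="exec-1")
    return tracker.excluded


if __name__ == "__main__":
    print(sorted(simulate(False)))  # ['exec-1', 'exec-2', 'exec-3']
    print(sorted(simulate(True)))   # ['exec-1']
```

With the filter disabled, one task-failure exclusion spreads to every executor that reads from the killed one. With it enabled, only the originally failing executor stays excluded.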

      People

      Assignee: Unassigned
      Reporter: Aaruna Godthi (aaruna)
      Votes: 0
      Watchers: 3
