SPARK-3289

Avoid job failures due to rescheduling of failing tasks on buggy machines

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component/s: Spark Core

    Description

      Some users have reported cases where a task fails because of an environment or configuration problem on one machine and is then retried on that same buggy machine until the entire job fails because that single task has failed too many times.

      To guard against this, maybe we should add some randomization in how we reschedule failed tasks.
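
      One way to picture the proposal is the toy scheduler below. It is only a sketch under stated assumptions, not Spark's scheduler: `RetryScheduler`, `pick_host`, `record_failure`, and `run_task` are hypothetical names, and the `max_failures` limit is merely analogous to Spark's `spark.task.maxFailures` setting. Besides choosing a host at random, it also skips hosts where the same task has already failed, which goes a step beyond the plain randomization suggested above.

```python
import random
from collections import defaultdict


class RetryScheduler:
    """Toy scheduler (hypothetical, not Spark code): picks a host per task attempt."""

    def __init__(self, hosts, max_failures=4, rng=None):
        if not hosts:
            raise ValueError("at least one host is required")
        self.hosts = list(hosts)
        self.max_failures = max_failures
        self.rng = rng or random.Random()
        # task_id -> set of hosts on which that task has already failed
        self.failed_hosts = defaultdict(set)
        # task_id -> total number of failed attempts
        self.failure_counts = defaultdict(int)

    def pick_host(self, task_id):
        """Choose a random host, preferring ones where this task has not failed."""
        candidates = [h for h in self.hosts if h not in self.failed_hosts[task_id]]
        if not candidates:
            # Every host has failed this task once; fall back to all hosts.
            candidates = self.hosts
        return self.rng.choice(candidates)

    def record_failure(self, task_id, host):
        """Record a failed attempt. Returns True if the task may be retried."""
        self.failed_hosts[task_id].add(host)
        self.failure_counts[task_id] += 1
        return self.failure_counts[task_id] < self.max_failures


def run_task(scheduler, task_id, attempt):
    """Retry `attempt(host) -> bool` until it succeeds or the limit is hit.

    Returns the host that succeeded, or None if the task (and thus the job)
    would be failed.
    """
    while True:
        host = scheduler.pick_host(task_id)
        if attempt(host):
            return host
        if not scheduler.record_failure(task_id, host):
            return None
```

      With hosts `["a", "b", "c"]` and a task that always fails on `"a"`, this sketch retries at most once before landing on a healthy host, instead of exhausting every attempt on the same machine.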

            People

              Assignee: Unassigned
              Reporter: Josh Rosen (joshrosen)
              Votes: 3
              Watchers: 4
