Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-13369

Number of consecutive fetch failures for a stage before the job is aborted should be configurable

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 2.2.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      The previously hardcoded max 4 retries per stage is not suitable for all cluster configurations. Since spark retries a stage at the sign of the first fetch failure, you can easily end up with many stage retries to discover all the failures. In particular, two scenarios this value should change are (1) if there are more than 4 executors per node; in that case, it may take 4 retries to discover the problem with each executor on the node and (2) during cluster maintenance on large clusters, where multiple machines are serviced at once, but you also cannot afford total cluster downtime. By making this value configurable, cluster managers can tune this value to something more appropriate to their cluster configuration.

        Attachments

          Activity

            People

            • Assignee:
              sitalkedia@gmail.com Sital Kedia
              Reporter:
              sitalkedia@gmail.com Sital Kedia
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: