Spark / SPARK-4238

Perform network-level retry of shuffle file fetches

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Target Version/s:

      Description

      During periods of high network (or GC) load, connection failures commonly surface as IOExceptions while fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs.

      We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance.
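
As a rough illustration of the idea, the sketch below retries a fetch only when it fails with an IOException and gives up after a bounded number of attempts. It is not Spark's implementation: the class `FetchRetry`, the method `fetchWithRetry`, and the parameters `maxRetries` and `waitMs` are hypothetical names chosen for this example, and the fixed wait between attempts is an assumption.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Spark: illustrates network-level retry only.
public class FetchRetry {

    /**
     * Runs the fetch, retrying up to maxRetries times when it throws IOException.
     * Any other exception propagates immediately without a retry.
     */
    public static <T> T fetchWithRetry(Callable<T> fetch, int maxRetries, long waitMs)
            throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return fetch.call();
            } catch (IOException e) {
                // Possibly transient (network or GC pressure): remember it and retry.
                last = e;
                if (attempt < maxRetries) {
                    Thread.sleep(waitMs); // assumed fixed wait between attempts
                }
            }
        }
        // All attempts failed; only now should the caller treat the fetch as failed.
        throw last;
    }
}
```

With a wrapper like this, the caller would only see a fetch failure (and consider the executor lost) after the retries are exhausted, rather than on the first connection error.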

              People

              • Assignee: Aaron Davidson (ilikerps)
              • Reporter: Aaron Davidson (ilikerps)
              • Votes: 0
              • Watchers: 3
