Spark / SPARK-4188

Shuffle fetches should be retried at a lower level


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: Spark Core
    • Labels: None

      Description

      During periods of high network (or GC) load, it is not uncommon for IOExceptions caused by connection failures to crop up while fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files at all, which causes us to mark the executor as lost and recompute all of its shuffle outputs.
      We should allow retrying at the network level when an IOException occurs, so that a transient connection failure does not trigger this recomputation.
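
The proposal above can be illustrated with a minimal sketch of a network-level retry wrapper. This is not Spark's implementation: `fetch_with_retries`, `max_retries`, `wait_seconds`, and `sleep` are hypothetical names chosen for illustration, and Python's `OSError` stands in for the JVM `IOException` named in the description. The idea is only that an I/O failure is retried a bounded number of times, and is reported to the caller (which might then treat the executor as lost) only once the retries are exhausted.

```python
import time


def fetch_with_retries(fetch, max_retries=3, wait_seconds=0.0, sleep=time.sleep):
    """Call fetch(), retrying on I/O errors up to max_retries extra times.

    Hypothetical helper, not a Spark API. OSError plays the role of the
    JVM IOException; any other exception type propagates immediately.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except OSError:
            if attempt >= max_retries:
                # Retries exhausted: only now surface the failure upward.
                raise
            attempt += 1
            # Back off briefly so a transient network or GC pause can clear.
            sleep(wait_seconds)
```

A caller would wrap the low-level fetch, for example `fetch_with_retries(lambda: read_block(block_id), max_retries=3, wait_seconds=5.0)`, where `read_block` and `block_id` are likewise placeholders. Limiting the retry to I/O errors keeps genuine failures, such as a missing block, from being masked.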


            People

            • Assignee: Aaron Davidson (adav)
            • Reporter: Aaron Davidson (ilikerps)
