  1. Spark
  2. SPARK-19753

Remove all shuffle files on a host in case of slave lost or fetch failure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.3.0
    • Component/s: Scheduler
    • Labels:
      None

      Description

      Currently, when we detect a fetch failure, we remove only the shuffle files produced by the failing executor, even though the host itself might be down, in which case none of the shuffle files on it are accessible. When multiple executors run on a host, a host going down therefore results in multiple fetch failures and multiple retries of the stage, which is very inefficient. If we instead remove all the shuffle files on that host on the first fetch failure, we can rerun all the affected tasks in a single stage retry.
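The difference between the two cleanup strategies described above can be sketched with a toy model. This is only an illustration under simplifying assumptions, not Spark's scheduler code: `ToyOutputTracker`, `retries_until_recovered`, `build_tracker`, and the host and executor names are all hypothetical, and each loop iteration stands in for one stage retry triggered by a single fetch failure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MapOutput:
    map_id: int
    host: str
    executor_id: str


@dataclass
class ToyOutputTracker:
    """Hypothetical stand-in for the bookkeeping of registered map outputs."""
    outputs: dict = field(default_factory=dict)  # map_id -> MapOutput

    def register(self, map_id, host, executor_id):
        self.outputs[map_id] = MapOutput(map_id, host, executor_id)

    def _remove(self, predicate):
        # Drop every registered output matching the predicate; return their ids.
        lost = sorted(m for m, o in self.outputs.items() if predicate(o))
        for m in lost:
            del self.outputs[m]
        return lost

    def remove_outputs_on_executor(self, executor_id):
        # Behavior described as current: only the failing executor's outputs go.
        return self._remove(lambda o: o.executor_id == executor_id)

    def remove_outputs_on_host(self, host):
        # Proposed behavior: every output on the failing host goes at once.
        return self._remove(lambda o: o.host == host)


def retries_until_recovered(tracker, dead_host, per_host):
    """Count simulated stage retries until no output points at dead_host."""
    retries = 0
    while True:
        stale = [o for o in tracker.outputs.values() if o.host == dead_host]
        if not stale:
            return retries
        # The reducer hits one fetch failure per attempt (lowest map id here).
        failed = min(stale, key=lambda o: o.map_id)
        if per_host:
            lost = tracker.remove_outputs_on_host(failed.host)
        else:
            lost = tracker.remove_outputs_on_executor(failed.executor_id)
        # Rerun the lost map tasks on a healthy (assumed) host.
        for m in lost:
            tracker.register(m, "host-B", "exec-9")
        retries += 1


def build_tracker():
    # Two executors on host-A, one on host-B, two map outputs each.
    t = ToyOutputTracker()
    layout = [("host-A", "exec-1"), ("host-A", "exec-2"), ("host-B", "exec-3")]
    for i, (host, executor_id) in enumerate(layout):
        t.register(2 * i, host, executor_id)
        t.register(2 * i + 1, host, executor_id)
    return t
```

In this toy layout, `retries_until_recovered(build_tracker(), "host-A", per_host=False)` returns 2 (one retry per executor on the dead host), while the same call with `per_host=True` returns 1. The real scheduler has more moving parts, so treat the numbers as illustrative only.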

              People

              • Assignee:
                sitalkedia@gmail.com Sital Kedia
                Reporter:
                sitalkedia@gmail.com Sital Kedia
              • Votes:
                0
                Watchers:
                13

                Dates

                • Created:
                  Updated:
                  Resolved: