Spark / SPARK-17370

Shuffle service files not invalidated when a slave is lost


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime.

      However, the DAGScheduler also fails to invalidate shuffle files when the shuffle service itself is lost, for example when an entire slave goes down. This can cause long hangs after a slave is lost, because the missing files are not detected until a subsequent stage attempts to read them.
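
The behavior described above can be illustrated with a toy model. This is a minimal sketch and not Spark's actual code: `ToyScheduler`, `MapOutput`, `on_executor_lost`, and the `worker_lost` flag are hypothetical names chosen for illustration, and invalidating every output on the lost host is an assumption about what a fix could look like.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MapOutput:
    """One registered shuffle map output (hypothetical model type)."""
    shuffle_id: int
    map_id: int
    host: str
    executor_id: str


@dataclass
class ToyScheduler:
    """Toy stand-in for the scheduler's view of shuffle outputs."""
    external_shuffle_service: bool
    outputs: set = field(default_factory=set)

    def register(self, output: MapOutput) -> None:
        self.outputs.add(output)

    def on_executor_lost(self, executor_id: str, host: str,
                         worker_lost: bool = False) -> set:
        """Invalidate outputs that can no longer be served; return them."""
        if worker_lost:
            # Assumption: the shuffle service on this host died with the
            # worker, so every file it was serving is unreachable.
            lost = {o for o in self.outputs if o.host == host}
        elif not self.external_shuffle_service:
            # Without the external service, files die with the executor.
            lost = {o for o in self.outputs if o.executor_id == executor_id}
        else:
            # Service still running: files outlive the executor.
            lost = set()
        self.outputs -= lost
        return lost


sched = ToyScheduler(external_shuffle_service=True)
a = MapOutput(0, 0, "host-a", "exec-1")
b = MapOutput(0, 1, "host-a", "exec-2")
c = MapOutput(0, 2, "host-b", "exec-3")
for out in (a, b, c):
    sched.register(out)

# Executor loss alone invalidates nothing while the service is up.
only_executor = sched.on_executor_lost("exec-1", "host-a")
# Losing the whole worker invalidates everything stored on that host.
whole_worker = sched.on_executor_lost("exec-1", "host-a", worker_lost=True)
```

In this model the bug corresponds to a scheduler that has no `worker_lost` branch: with the external shuffle service enabled it would keep `a` and `b` registered even though nothing can serve them, and the loss would surface only when a later stage fails to fetch them.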

              People

              • Assignee: Eric Liang (ekhliang)
              • Reporter: Eric Liang (ekhliang)
              • Votes: 0
              • Watchers: 3
