Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17370

Shuffle service files not invalidated when a slave is lost

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.0.1, 2.1.0
    • Spark Core
    • None

    Description

      DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime.

      However, it doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files.

      Attachments

        Issue Links

          Activity

            People

              ekhliang Eric Liang
              ekhliang Eric Liang
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: