  1. Spark
  2. SPARK-20624 SPIP: Add better handling for node shutdown
  3. SPARK-32199

Clear shuffle state when decommissioned nodes/executors are finally lost


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      While every effort is made to migrate cached and shuffle blocks off a decommissioned node, some blocks on that node may still be referenced. Such lingering references result in fetch failures, which not only take time to detect but can also cause job failure.

      Deciding when to clear the shuffle state is tricky. Ideally it would be cleared the millisecond before the shuffle service on the node dies (or before the executor dies, when there is no external shuffle service). Clearing too soon could lead to some waste, and clearing too late would lead to fetch failures.

      There are very few cases where we know precisely when the shuffle data will become unavailable, perhaps a cloud spot kill that gives some advance warning. The next best thing is to clear this state lazily at the first sign of trouble, i.e., when the first fetch failure is observed on a decommissioned entity (node or executor). We take that as a hint that the entity has finally gone away.
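
      The lazy approach described above could be sketched roughly as follows. This is a toy model and not Spark's scheduler code: the `ShuffleStateTracker` class, its methods, and the host names are all hypothetical, and the handling of failures on healthy hosts is deliberately simplified. A real implementation would also need to resubmit the stages whose outputs were dropped.

```python
from collections import defaultdict


class ShuffleStateTracker:
    """Toy model of driver-side shuffle bookkeeping (all names hypothetical)."""

    def __init__(self):
        # host -> set of (shuffle_id, map_id) outputs registered on that host
        self.outputs_by_host = defaultdict(set)
        self.decommissioned_hosts = set()

    def register_output(self, host, shuffle_id, map_id):
        self.outputs_by_host[host].add((shuffle_id, map_id))

    def mark_decommissioned(self, host):
        # Decommissioning alone clears nothing: the data may stay fetchable
        # for a while, so clearing at this point could be wasteful.
        self.decommissioned_hosts.add(host)

    def on_fetch_failure(self, host, shuffle_id, map_id):
        """Unregister outputs after a fetch failure and return what was dropped."""
        if host in self.decommissioned_hosts:
            # First failure on a decommissioned host: take it as a hint that
            # the host is finally gone and drop everything it held.
            return self.outputs_by_host.pop(host, set())
        # Simplification for healthy hosts: forget only the failed output.
        failed = {(shuffle_id, map_id)} & self.outputs_by_host[host]
        self.outputs_by_host[host] -= failed
        return failed


tracker = ShuffleStateTracker()
for map_id in range(3):
    tracker.register_output("host-a", 0, map_id)
    tracker.register_output("host-b", 0, map_id + 3)

tracker.mark_decommissioned("host-a")
cleared = tracker.on_fetch_failure("host-a", 0, 1)   # drops all of host-a
single = tracker.on_fetch_failure("host-b", 0, 4)    # drops one output only
```

      In this sketch a single fetch failure on the decommissioned host avoids the later failures that would otherwise be discovered one at a time, while the healthy host loses only the output that actually failed.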

      What matters here is whether the shuffle data is going away, i.e., whether an (external) shuffle service is resident on the node being decommissioned, or whether the shuffle service is embedded inside an executor that is being decommissioned.

      This clearing is unnecessary when the shuffle data is truly remote, as in certain disaggregated environments.
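
      The conditions above can be summarized as a small decision function. This is only an illustrative sketch: the function name and its parameters are hypothetical and do not correspond to Spark configuration keys or APIs.

```python
def shuffle_data_going_away(node_decommissioned,
                            executor_decommissioned,
                            external_shuffle_service,
                            shuffle_data_remote=False):
    """Return True if shuffle state for the entity should be cleared.

    All parameters are hypothetical booleans used for illustration.
    """
    if shuffle_data_remote:
        # Disaggregated storage: the data outlives both node and executor.
        return False
    if external_shuffle_service:
        # Shuffle files are served by a node-level service, so only the
        # loss of the node itself makes the data unavailable.
        return node_decommissioned
    # The shuffle service is embedded in the executor, so the data goes
    # away together with the executor.
    return executor_decommissioned


# Executor decommissioned, but an external shuffle service keeps serving.
keep = shuffle_data_going_away(False, True, external_shuffle_service=True)
# Same executor without an external shuffle service: its data is lost.
lose = shuffle_data_going_away(False, True, external_shuffle_service=False)
```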


          People

            dagrawal3409 Devesh Agrawal
            dagrawal3409 Devesh Agrawal
            Holden Karau Holden Karau
            Votes: 0
            Watchers: 2
