  1. Spark
  2. SPARK-20624 SPIP: Add better handling for node shutdown
  3. SPARK-32217

Track whether the worker is also being decommissioned along with an executor


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      When an executor is decommissioned, we would like to know whether its shuffle data is truly going to be lost. With the external shuffle service, this means knowing whether the worker (or the node that the executor is on) is also going to be lost.

       

      (I don't think we need to worry about disaggregated remote shuffle storage at present, since it is only used at a couple of web companies. When remote shuffle storage is in use, though, the shuffle data won't be lost with a decommissioned executor.)
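
      The reasoning above can be sketched as a small predicate. This is only an illustration, not Spark's implementation: the names `DecommissionInfo`, `worker_also_decommissioned`, and `shuffle_data_will_be_lost` are hypothetical, and the sketch assumes that without an external shuffle service the shuffle files leave with the executor (it ignores any block migration).

      ```python
      from dataclasses import dataclass


      @dataclass(frozen=True)
      class DecommissionInfo:
          """Hypothetical record attached to an executor decommission event."""
          message: str
          # True when the worker/node hosting the executor is going away too.
          worker_also_decommissioned: bool


      def shuffle_data_will_be_lost(
          info: DecommissionInfo,
          external_shuffle_service: bool,
          remote_shuffle_storage: bool = False,
      ) -> bool:
          """Return True if the executor's shuffle output should be treated as lost."""
          if remote_shuffle_storage:
              # Shuffle data lives off the node, so it survives the executor.
              return False
          if external_shuffle_service:
              # The service runs on the worker and keeps serving shuffle files
              # unless the worker itself is also being decommissioned.
              return info.worker_also_decommissioned
          # Assumption: with no external service the executor serves its own
          # shuffle files, so they go away with it.
          return True
      ```

      For example, `shuffle_data_will_be_lost(DecommissionInfo("executor only", False), external_shuffle_service=True)` returns `False`, because the worker keeps serving the shuffle files.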

       

      We know for sure that a worker is being decommissioned when the Master is asked to decommission it. For other schedulers:

      • YARN support for decommissioning isn't implemented yet. The idea would be that YARN preemption does not mark the worker as being lost, whereas machine-level decommissioning (such as for kernel upgrades) does.
      • K8s doesn't quite work with the external shuffle service yet, so when an executor is lost, the worker isn't lost with it.
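
      One way to picture the per-scheduler rules above is a lookup from the cause of the decommission to whether the worker goes away too. This is a hedged sketch: the enum `DecommissionCause` and its members are invented for illustration, and the YARN entries describe the proposed behaviour, not anything that exists yet.

      ```python
      from enum import Enum, auto


      class DecommissionCause(Enum):
          """Hypothetical causes, one per case discussed in this issue."""
          MASTER_WORKER_DECOMMISSION = auto()  # Master asked to decommission a worker
          YARN_PREEMPTION = auto()             # proposed: container preempted, node stays
          YARN_NODE_DECOMMISSION = auto()      # proposed: machine-level, e.g. kernel upgrade
          K8S_EXECUTOR_LOSS = auto()           # no external shuffle service involved yet


      # Causes for which the worker (and its shuffle files) is lost with the executor.
      _WORKER_LOST_CAUSES = frozenset({
          DecommissionCause.MASTER_WORKER_DECOMMISSION,
          DecommissionCause.YARN_NODE_DECOMMISSION,
      })


      def worker_is_also_lost(cause: DecommissionCause) -> bool:
          """Return True when the decommission takes the worker down as well."""
          return cause in _WORKER_LOST_CAUSES
      ```

      Keeping the rule in a single set makes it easy to revisit once YARN or K8s support changes.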

       


          People

            Assignee: Devesh Agrawal (dagrawal3409)
            Reporter: Devesh Agrawal (dagrawal3409)
            Votes: 0
            Watchers: 2
