  Spark / SPARK-20624

SPIP: Add better handling for node shutdown


    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

      Description

      While SPARK-7955 added better handling for the case where Spark itself chooses to decommission nodes, it also makes sense to react in environments where nodes are preempted outside of Spark's control (e.g. YARN over-commit, EC2 spot instances, GCE preemptible instances, etc.): migrate the data on the node, or at least stop scheduling new tasks on it.
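
      The sub-tasks tracked here eventually shipped as a family of decommissioning settings. A minimal sketch of enabling graceful decommissioning via spark-defaults.conf, assuming the property names as they appeared around Spark 3.1 (verify against the configuration docs for your release, since sub-task 23 below proposes renaming them under "spark.decommission.*"):

      ```properties
      # Let executors react to a decommission trigger instead of dying abruptly
      spark.decommission.enabled                        true
      # Migrate block-manager data off the node before it goes away
      spark.storage.decommission.enabled                true
      # Copy shuffle blocks to surviving executors (sub-task 2)
      spark.storage.decommission.shuffleBlocks.enabled  true
      # Copy cached RDD blocks to surviving executors (sub-task 3)
      spark.storage.decommission.rddBlocks.enabled      true
      ```

      With these set, a decommissioned node stops receiving new tasks and tries to migrate its shuffle and cache data, rather than forcing recomputation when the node disappears.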


        Issue Links

        1. Keep track of nodes which are going to be shut down & avoid scheduling new tasks (Sub-task, Resolved, Holden Karau)
        2. Copy shuffle data when nodes are being shut down (Sub-task, Resolved, Holden Karau)
        3. Copy cache data when node is being shut down (Sub-task, Resolved, Prakhar Jain)
        4. On executor/worker decommission consider speculatively re-launching current tasks (Sub-task, Resolved, Prakhar Jain)
        5. Add support for YARN decommissioning & pre-emption (Sub-task, Open, Unassigned)
        6. Improve the decommissioning K8s integration tests (Sub-task, Resolved, Holden Karau)
        7. Exit the executor once all tasks & migrations are finished (Sub-task, Resolved, Holden Karau)
        8. Use graceful decommissioning as part of dynamic scaling (Sub-task, Resolved, Holden Karau)
        9. Improve cache block migration (Sub-task, Open, Unassigned)
        10. Failed to register SIGPWR handler on MacOS (Sub-task, Resolved, wuyi)
        11. Don't fail running jobs when decommissioned executors finally go away (Sub-task, Resolved, Devesh Agrawal)
        12. Clear shuffle state when decommissioned nodes/executors are finally lost (Sub-task, Resolved, Devesh Agrawal)
        13. Expose end point on Master so that it can be informed about decommissioned workers out of band (Sub-task, Resolved, Devesh Agrawal)
        14. Track whether the worker is also being decommissioned along with an executor (Sub-task, Resolved, Devesh Agrawal)
        15. DecommissionWorkerSuite has started failing sporadically again (Sub-task, Resolved, Devesh Agrawal)
        16. [Cleanup] Consolidate state kept in ExecutorDecommissionInfo with TaskSetManager.tidToExecutorKillTimeMapping (Sub-task, Resolved, Devesh Agrawal)
        17. Decommission switch configuration should have the highest hierarchy (Sub-task, Resolved, wuyi)
        18. Decommissioned host/executor should be considered as inactive in TaskSchedulerImpl (Sub-task, Resolved, wuyi)
        19. Add an option to reject block migrations when under disk pressure (Sub-task, Open, Unassigned)
        20. Simplify the RPC message flow of decommission (Sub-task, Resolved, wuyi)
        21. Improve ExecutorDecommissionInfo and ExecutorDecommissionState for different use cases (Sub-task, In Progress, Unassigned)
        22. BlockManagerDecommissioner cleanup (Sub-task, Resolved, wuyi)
        23. Rename all decommission configurations to use the same namespace "spark.decommission.*" (Sub-task, In Progress, Unassigned)
        24. Do not drop cached RDD blocks to accommodate blocks from decommissioned block manager if enough memory is not available (Sub-task, In Progress, Unassigned)
        25. Decommission executors in batches to avoid overloading network by block migrations (Sub-task, In Progress, Unassigned)
        26. Put blocks only on disk while migrating RDD cached data (Sub-task, In Progress, Unassigned)

          Activity

            People

            • Assignee: Unassigned
            • Reporter: Holden Karau (holden)

              Dates

              • Created:
                Updated:
