Spark / SPARK-20624

SPIP: Add better handling for node shutdown


    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

      Description

      While we've done some good work on better handling when Spark chooses to decommission nodes (SPARK-7955), in environments where we get preempted through no choice of our own (e.g. YARN over-commit, EC2 spot instances, GCE preemptible instances, etc.) it makes sense to do something for the data on the node, or at least to stop scheduling new tasks there.
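      The decommissioning behavior described here is configurable once the work lands. A minimal spark-defaults.conf sketch, assuming Spark 3.1+ (property names as of that release; a later sub-task proposes renaming them under a single namespace):

      ```
      # Enable graceful executor decommissioning (Spark 3.1+).
      spark.decommission.enabled                        true
      # Migrate blocks off a decommissioning executor before it exits.
      spark.storage.decommission.enabled                true
      spark.storage.decommission.shuffleBlocks.enabled  true
      # Also migrate cached RDD blocks, not just shuffle data.
      spark.storage.decommission.rddBlocks.enabled      true
      ```

      With these set, an executor that receives a decommission signal stops accepting new tasks and attempts to migrate its shuffle and cache blocks to surviving peers before shutting down.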

        Attachments

          Issue Links

          1. Keep track of nodes which are going to be shut down & avoid scheduling new tasks (Sub-task, Resolved, Holden Karau)
          2. Copy shuffle data when nodes are being shut down (Sub-task, Resolved, Holden Karau)
          3. Copy cache data when node is being shut down (Sub-task, Resolved, Prakhar Jain)
          4. On executor/worker decommission consider speculatively re-launching current tasks (Sub-task, Resolved, Prakhar Jain)
          5. Add support for YARN decommissioning & pre-emption (Sub-task, Open, Unassigned)
          6. Improve the decommissioning K8s integration tests (Sub-task, Resolved, Holden Karau)
          7. Exit the executor once all tasks & migrations are finished (Sub-task, Resolved, Holden Karau)
          8. Use graceful decommissioning as part of dynamic scaling (Sub-task, Resolved, Holden Karau)
          9. Improve cache block migration (Sub-task, Open, Unassigned)
          10. Failed to register SIGPWR handler on MacOS (Sub-task, Resolved, wuyi)
          11. Don't fail running jobs when decommissioned executors finally go away (Sub-task, Resolved, Devesh Agrawal)
          12. Clear shuffle state when decommissioned nodes/executors are finally lost (Sub-task, Resolved, Devesh Agrawal)
          13. Expose end point on Master so that it can be informed about decommissioned workers out of band (Sub-task, Resolved, Devesh Agrawal)
          14. Track whether the worker is also being decommissioned along with an executor (Sub-task, Resolved, Devesh Agrawal)
          15. DecommissionWorkerSuite has started failing sporadically again (Sub-task, Resolved, Devesh Agrawal)
          16. [Cleanup] Consolidate state kept in ExecutorDecommissionInfo with TaskSetManager.tidToExecutorKillTimeMapping (Sub-task, Resolved, Devesh Agrawal)
          17. Decommission switch configuration should have the highest hierarchy (Sub-task, Resolved, wuyi)
          18. Decommissioned host/executor should be considered as inactive in TaskSchedulerImpl (Sub-task, Resolved, wuyi)
          19. Add an option to reject block migrations when under disk pressure (Sub-task, Open, Unassigned)
          20. Simplify the RPC message flow of decommission (Sub-task, Resolved, wuyi)
          21. Improve ExecutorDecommissionInfo and ExecutorDecommissionState for different use cases (Sub-task, In Progress, Unassigned)
          22. BlockManagerDecommissioner cleanup (Sub-task, In Progress, Unassigned)
          23. Rename all decommission configurations to use the same namespace "spark.decommission.*" (Sub-task, In Progress, Unassigned)

            Activity

              People

              • Assignee: Unassigned
              • Reporter: Holden Karau (holden)
              • Votes: 0
              • Watchers: 26

                Dates

                • Created:
                • Updated: