Spark / SPARK-20624

SPIP: Add better handling for node shutdown

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      While we've done some good work on handling the case where Spark itself chooses to decommission nodes (SPARK-7955), it might make sense, in environments where nodes are preempted outside of our control (e.g. YARN over-commit, EC2 spot instances, GCE preemptible instances), to do something for the data on the node (or at least to stop scheduling new tasks on it).
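
      As a rough illustration of the two behaviours proposed above (stop scheduling onto a node that is about to go away, and move its data elsewhere), here is a minimal toy sketch. It is not Spark code: `ToyScheduler`, `notify_shutdown`, `schedulable_hosts`, and `plan_migration` are hypothetical names invented for this example, and the round-robin placement is an assumption rather than the strategy Spark uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy model only; hypothetical, not Spark's TaskSchedulerImpl."""

    hosts: set[str]
    # Hosts that received a shutdown / preemption notice.
    draining: set[str] = field(default_factory=set)
    # Block ids (cache or shuffle) currently stored on each host.
    blocks: dict[str, list[str]] = field(default_factory=dict)

    def notify_shutdown(self, host: str) -> None:
        """Record that `host` is going away soon."""
        if host in self.hosts:
            self.draining.add(host)

    def schedulable_hosts(self) -> list[str]:
        """Hosts that may still receive new tasks."""
        return sorted(self.hosts - self.draining)

    def plan_migration(self, host: str) -> dict[str, str]:
        """Map each block on a draining host to a surviving host.

        Round-robin placement is an assumption made for this sketch.
        """
        targets = self.schedulable_hosts()
        if host not in self.draining or not targets:
            return {}
        stored = self.blocks.get(host, [])
        return {b: targets[i % len(targets)] for i, b in enumerate(stored)}


sched = ToyScheduler(
    hosts={"a", "b", "c"},
    blocks={"b": ["rdd_0_1", "shuffle_3_0"]},
)
sched.notify_shutdown("b")
print(sched.schedulable_hosts())
print(sched.plan_migration("b"))
```

      In this sketch a host that has been notified of shutdown is simply excluded from future scheduling decisions, and its blocks are spread over the remaining hosts; a real implementation would also have to deal with running tasks, failed transfers, and storage limits on the receiving side.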

        Issue Links

          1. Keep track of nodes which are going to be shut down & avoid scheduling new tasks | Sub-task | Resolved | Holden Karau
          2. Copy shuffle data when nodes are being shut down | Sub-task | Resolved | Holden Karau
          3. Copy cache data when node is being shut down | Sub-task | Resolved | Prakhar Jain
          4. On executor/worker decommission consider speculatively re-launching current tasks | Sub-task | Resolved | Prakhar Jain
          5. Add support for YARN decommissioning & pre-emption | Sub-task | Resolved | Abhishek Dixit
          6. Improve the decommissioning K8s integration tests | Sub-task | Resolved | Holden Karau
          7. Exit the executor once all tasks & migrations are finished | Sub-task | Resolved | Holden Karau
          8. Use graceful decommissioning as part of dynamic scaling | Sub-task | Resolved | Holden Karau
          9. Improve cache block migration | Sub-task | Open | Unassigned
          10. Failed to register SIGPWR handler on MacOS | Sub-task | Resolved | wuyi
          11. Don't fail running jobs when decommissioned executors finally go away | Sub-task | Resolved | Devesh Agrawal
          12. Clear shuffle state when decommissioned nodes/executors are finally lost | Sub-task | Resolved | Devesh Agrawal
          13. Expose end point on Master so that it can be informed about decommissioned workers out of band | Sub-task | Resolved | Devesh Agrawal
          14. Track whether the worker is also being decommissioned along with an executor | Sub-task | Resolved | Devesh Agrawal
          15. DecommissionWorkerSuite has started failing sporadically again | Sub-task | Resolved | Devesh Agrawal
          16. [Cleanup] Consolidate state kept in ExecutorDecommissionInfo with TaskSetManager.tidToExecutorKillTimeMapping | Sub-task | Resolved | Devesh Agrawal
          17. decommission switch configuration should have the highest hierarchy | Sub-task | Resolved | wuyi
          18. Decommissioned host/executor should be considered as inactive in TaskSchedulerImpl | Sub-task | Resolved | wuyi
          19. Add an option to reject block migrations when under disk pressure | Sub-task | Open | Unassigned
          20. Simplify the RPC message flow of decommission | Sub-task | Resolved | wuyi
          21. Improve ExecutorDecommissionInfo and ExecutorDecommissionState for different use cases | Sub-task | In Progress | Unassigned
          22. BlockManagerDecommissioner cleanup | Sub-task | Resolved | wuyi
          23. Rename all decommission configurations to use the same namespace "spark.decommission.*" | Sub-task | In Progress | Unassigned
          24. Do not drop cached RDD blocks to accommodate blocks from decommissioned block manager if enough memory is not available | Sub-task | In Progress | Unassigned
          25. Decommission executors in batches to avoid overloading network by block migrations. | Sub-task | In Progress | Unassigned
          26. Put blocks only on disk while migrating RDD cached data | Sub-task | In Progress | Unassigned
          27. Decommission logs too frequent when waiting migration to finish | Sub-task | In Progress | Apache Spark
          28. Executor loss reason shows "worker lost" rather than "Executor decommission" | Sub-task | Resolved | wuyi
          29. Add support for YARN decommissioning when ESS is Disabled | Sub-task | Resolved | Unassigned
          30. Add support for YARN decommissioning when ESS is Enabled | Sub-task | In Progress | Unassigned
          31. Stream is corrupted Exception while fetching the blocks from fallback storage system | Sub-task | Resolved | Frank Yin

            People

              Assignee: Unassigned
              Reporter: Holden Karau (holden)
              Votes: 2
              Watchers: 38