[SPARK-20624] SPIP: Add better handling for node shutdown - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

While we've done some good work on handling the case where Spark itself chooses to decommission nodes (~~SPARK-7955~~), it might also make sense, in environments where nodes are preempted without our choosing (e.g. YARN over-commit, EC2 spot instances, GCE preemptible instances), to do something about the data on the node, or at least to stop scheduling new tasks on it.

Issue Links

is related to

SPARK-46957 Migrated shuffle data files from the decommissioned node should be removed when job completed

Open

SPARK-7955 Dynamic allocation: longer timeout for executors with cached blocks

Closed

SPARK-3174 Provide elastic scaling within a Spark application

Closed

SPARK-33005 Kubernetes GA Preparation

Resolved

SPARK-41550 Dynamic Allocation on K8S GA

Resolved

links to

[Github] Pull Request #35094 (sungpeo)

Sub-Tasks

1.	Keep track of nodes which are going to be shut down & avoid scheduling new tasks	Resolved	Holden Karau
2.	Copy shuffle data when nodes are being shut down	Resolved	Holden Karau
3.	Copy cache data when node is being shut down	Resolved	Prakhar Jain
4.	On executor/worker decommission consider speculatively re-launching current tasks	Resolved	Prakhar Jain
5.	Add support for YARN decommissioning & pre-emption	Resolved	Abhishek Dixit
6.	Improve the decommissioning K8s integration tests	Resolved	Holden Karau
7.	Exit the executor once all tasks & migrations are finished	Resolved	Holden Karau
8.	Use graceful decommissioning as part of dynamic scaling	Resolved	Holden Karau
9.	Improve cache block migration	Open	Unassigned
10.	Failed to register SIGPWR handler on MacOS	Resolved	wuyi
11.	Don't fail running jobs when decommissioned executors finally go away	Resolved	Devesh Agrawal
12.	Clear shuffle state when decommissioned nodes/executors are finally lost	Resolved	Devesh Agrawal
13.	Expose end point on Master so that it can be informed about decommissioned workers out of band	Resolved	Devesh Agrawal
14.	Track whether the worker is also being decommissioned along with an executor	Resolved	Devesh Agrawal
15.	DecommissionWorkerSuite has started failing sporadically again	Resolved	Devesh Agrawal
16.	[Cleanup] Consolidate state kept in ExecutorDecommissionInfo with TaskSetManager.tidToExecutorKillTimeMapping	Resolved	Devesh Agrawal
17.	decommission switch configuration should have the highest hierarchy	Resolved	wuyi
18.	Decommissioned host/executor should be considered as inactive in TaskSchedulerImpl	Resolved	wuyi
19.	Add an option to reject block migrations when under disk pressure	Open	Unassigned
20.	Simply the RPC message flow of decommission	Resolved	wuyi
21.	Improve ExecutorDecommissionInfo and ExecutorDecommissionState for different use cases	In Progress	Unassigned
22.	BlockManagerDecommissioner cleanup	Resolved	wuyi
23.	Rename all decommission configurations to use the same namespace "spark.decommission.*"	In Progress	Unassigned
24.	Do not drop cached RDD blocks to accommodate blocks from decommissioned block manager if enough memory is not available	In Progress	Unassigned
25.	Decommission executors in batches to avoid overloading network by block migrations.	In Progress	Unassigned
26.	Put blocks only on disk while migrating RDD cached data	In Progress	Unassigned
27.	Decommission logs too frequent when waiting migration to finish	In Progress	Apache Spark
28.	Executor loss reason shows "worker lost" rather "Executor decommission"	Resolved	wuyi
29.	Add support for YARN decommissioning when ESS is Disabled	Resolved	Unassigned
30.	Add support for YARN decommissioning when ESS is Enabled	In Progress	Unassigned
31.	Stream is corrupted Exception while fetching the blocks from fallback storage system	Resolved	Frank Yin
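
The idea behind sub-tasks 1 to 3 (stop scheduling on a node that is about to go away, then move its shuffle and cache data elsewhere) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyScheduler` and its methods are hypothetical names invented for illustration, not Spark classes, and the "fewest blocks" peer selection is an arbitrary choice rather than the policy Spark uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical toy model of decommission handling; not Spark code."""

    # executor id -> ids of the blocks (shuffle or cached RDD) it holds
    blocks: dict[str, set[str]] = field(default_factory=dict)
    # executors that have been told they are about to be shut down
    decommissioning: set[str] = field(default_factory=set)

    def add_executor(self, executor_id: str) -> None:
        self.blocks.setdefault(executor_id, set())

    def schedulable(self) -> list[str]:
        # Executors that are draining never receive new tasks.
        return sorted(e for e in self.blocks if e not in self.decommissioning)

    def decommission(self, executor_id: str) -> str | None:
        """Mark an executor as draining and migrate its blocks to a peer.

        Returns the peer that received the blocks, or None when no healthy
        peer exists (the blocks then stay where they are).
        """
        if executor_id not in self.blocks:
            raise KeyError(f"unknown executor: {executor_id}")
        self.decommissioning.add(executor_id)
        peers = self.schedulable()
        if not peers:
            return None
        # Assumption for illustration: pick the peer holding the fewest blocks.
        target = min(peers, key=lambda e: len(self.blocks[e]))
        self.blocks[target] |= self.blocks[executor_id]
        self.blocks[executor_id] = set()
        return target


if __name__ == "__main__":
    sched = ToyScheduler()
    for name in ("exec-1", "exec-2", "exec-3"):
        sched.add_executor(name)
    sched.blocks["exec-1"] = {"shuffle_0_0_0", "rdd_1_0"}
    sched.blocks["exec-2"] = {"rdd_1_1"}

    print(sched.decommission("exec-1"))  # peer that received the blocks
    print(sched.schedulable())           # executors still eligible for tasks
```

In this model a preemption notice only needs to call `decommission` once: the executor drops out of `schedulable()` immediately, and its data is moved before the node disappears.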

People

Assignee: Unassigned

Reporter: Holden Karau

Votes: 2

Watchers: 38

Dates

Created: 06/May/17 22:24

Updated: 02/Feb/24 09:34