Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.4.0
-
None
Description
Removed executor due to decommission should be kept in a separate set. To avoid OOM, set size will be limited to 1K or 10K.
FetchFailed caused by decom executor could be divided into 2 categories:
- When FetchFailed reached DAGScheduler, the executor is still alive or is lost but the lost info hasn't reached TaskSchedulerImpl. This is already handled inĀ
SPARK-40979 - FetchFailed is caused by decom executor loss, so the decom info is already removed in TaskSchedulerImpl. If we keep such info in a short period, it is good enough. Even we limit the size of removed executors to 10K, it could be only at most 10MB memory usage. In real case, it's rare to have cluster size of over 10K and the chance that all these executors decomed and lost at the same time would be small.