This is a step along the way to SPARK-8425; see the design doc on that JIRA for a complete discussion of blacklisting.
To enable incremental review, the first step proposed here is to expand blacklisting within tasksets. Specifically, this will enable blacklisting for:
- (task, executor) pairs (this already exists via an undocumented config)
- (task, node)
- (taskset, executor)
- (taskset, node)
In particular, adding (task, node) is critical to making Spark tolerant of one bad disk in a cluster without requiring careful tuning of "spark.task.maxFailures". The other additions are also important for avoiding many misleading task failures and long scheduling delays when there is one bad node in a large cluster.
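
To make the four levels concrete, here is a minimal sketch of how failures within a single taskset could be tracked. It is written in Python for brevity and is not the code in this change: the class name `TaskSetBlacklist`, the method names, and all four threshold parameters and their defaults are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class TaskSetBlacklist:
    """Hypothetical per-taskset failure tracker (illustration only)."""

    def __init__(self, task_exec_limit=1, task_node_limit=2,
                 taskset_exec_limit=2, taskset_node_limit=4):
        # All limits are assumed values, not Spark defaults.
        self.task_exec_limit = task_exec_limit        # (task, executor)
        self.task_node_limit = task_node_limit        # (task, node)
        self.taskset_exec_limit = taskset_exec_limit  # (taskset, executor)
        self.taskset_node_limit = taskset_node_limit  # (taskset, node)
        self.task_exec_failures = defaultdict(int)    # (task, executor) -> count
        self.task_node_failures = defaultdict(int)    # (task, node) -> count
        self.exec_failed_tasks = defaultdict(set)     # executor -> distinct failed tasks
        self.node_failed_tasks = defaultdict(set)     # node -> distinct failed tasks

    def record_failure(self, task, executor, node):
        """Record one failed attempt of `task` on `executor`, which runs on `node`."""
        self.task_exec_failures[(task, executor)] += 1
        self.task_node_failures[(task, node)] += 1
        self.exec_failed_tasks[executor].add(task)
        self.node_failed_tasks[node].add(task)

    def can_schedule(self, task, executor, node):
        """Return False if any of the four blacklist levels excludes this placement."""
        if len(self.node_failed_tasks.get(node, ())) >= self.taskset_node_limit:
            return False  # (taskset, node)
        if len(self.exec_failed_tasks.get(executor, ())) >= self.taskset_exec_limit:
            return False  # (taskset, executor)
        if self.task_node_failures.get((task, node), 0) >= self.task_node_limit:
            return False  # (task, node)
        if self.task_exec_failures.get((task, executor), 0) >= self.task_exec_limit:
            return False  # (task, executor)
        return True


# One bad disk on node "n1", which hosts executors "e1" and "e2".
blacklist = TaskSetBlacklist()
blacklist.record_failure(task=0, executor="e1", node="n1")
blacklist.record_failure(task=0, executor="e2", node="n1")
# Task 0 is now kept off every executor on "n1", so its next attempt
# must land on a different node instead of burning another failure.
print(blacklist.can_schedule(0, "e1", "n1"), blacklist.can_schedule(0, "e3", "n2"))
```

With only (task, executor) blacklisting, a task could keep moving between executors on the same bad node until it exhausted "spark.task.maxFailures"; the (task, node) check in this sketch is what stops that.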