There is a frequent need or desire in Spark to cancel already running Tasks. This has been recognized for a very long time (see, e.g., the ancient TODO comment in the DAGScheduler: "Cancel running tasks in the stage"), but we've never had more than an incomplete solution. Killing running Tasks at the Executor level has been implemented by interrupting the threads running the Tasks (taskThread.interrupt in o.a.s.scheduler.Task#kill.) Since https://github.com/apache/spark/commit/432201c7ee9e1ea1d70a6418cbad1c5ad2653ed3 addressing https://issues.apache.org/jira/browse/SPARK-1582, interrupting Task threads in this way has only been possible if interruptThread is true, and that typically comes from the setting of the interruptOnCancel property in the JobGroup, which in turn typically comes from the setting of spark.job.interruptOnCancel. Because of concerns over https://issues.apache.org/jira/browse/HDFS-1208 and the possibility of nodes being marked dead when a Task thread is interrupted, the default value of the boolean has been "false" – i.e. by default we do not interrupt Tasks already running on Executor even when the Task has been canceled in the DAGScheduler, or the Stage has been abort, or the Job has been killed, etc.
There are several issues resulting from this current state of affairs, and they each probably need to spawn their own JIRA issue and PR once we decide on an overall strategy here. Among those issues:
- Is HDFS-1208 still an issue, or has it been resolved adequately in the HDFS versions that Spark now supports so that we can set the default value of spark.job.interruptOnCancel to "true" or eliminate this boolean flag entirely?
- Even if interrupting Task threads is no longer an issue for HDFS, is it still enough of an issue for non-HDFS usage (e.g. Cassandra) so that we still need protection similar to what the current default value of spark.job.interruptOnCancel provides?
- If interrupting Task threads isn't safe enough, what should we do instead?
- Once we have a safe mechanism to stop and clean up after already executing Tasks, there is still the question of whether we should end executing Tasks. While that is likely a good thing to do in cases where individual Tasks are lightweight in terms of resource usage, at least in some cases not all running Tasks should be ended: https://github.com/apache/spark/pull/12436 That means that we probably need to continue to make allowing Task interruption configurable at the Job or JobGroup level (and we need better documentation explaining how and when to allow interruption or not.)
- There is one place in the current code (TaskSetManager#handleSuccessfulTask) that hard codes interruptThread to "true". This should be fixed, and similar misuses of killTask be denied in pull requests until this issue is adequately resolved.