Details
- New Feature
- Status: Resolved
- Major
- Resolution: Fixed
- 4.0.0
Description
Today it is difficult to determine why a job, stage, or job group was canceled. We should leverage existing Spark functionality to provide a reason string explaining the cancellation cause, and should add new APIs to let us provide this reason when canceling job groups.
Details:
- Since SPARK-19549 (“Allow providing reasons for stage/job cancelling”, Spark 2.2.0), Spark’s cancelJob and cancelStage methods accept an optional reason: String that is added to logging output and user-facing error messages when jobs or stages are canceled. In our internal calls to these methods, we should always supply a reason. For example, we should set an appropriate reason when the “kill” links are clicked in the Spark UI (see code).
- Other APIs currently lack a reason field. For example, cancelJobGroup and cancelJobsWithTag don’t provide any way to specify a reason, so we only see generic logs like “asked to cancel job group <group name>”. We should add the ability to pass in a group cancellation reason and thread it through to the scheduler’s logging and job failure reasons.
This feature can be implemented in two PRs:
1. Modify the SparkContext APIs that currently lack a reason string, such as cancelJobGroup and cancelJobsWithTag, along with their downstream APIs, so that they accept one
2. Add reasons for all internal calls to these methods
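
The two steps can be illustrated with a small, self-contained toy model. This is only a sketch of the idea, not Spark code: ToyScheduler, cancel_job_group, kill_from_ui, and the message formats are all hypothetical names and strings invented for illustration, and the real scheduler is considerably more involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for a scheduler; this is NOT Spark's DAGScheduler."""

    # Maps a job group name to the IDs of its active jobs.
    active_groups: Dict[str, List[int]] = field(default_factory=dict)
    # Failure message recorded for each cancelled job ID.
    failures: Dict[int, str] = field(default_factory=dict)
    # Log lines emitted by the scheduler.
    log: List[str] = field(default_factory=list)

    def cancel_job_group(self, group_id: str, reason: Optional[str] = None) -> None:
        """Step 1: accept an optional reason and thread it into logs and failures."""
        suffix = f": {reason}" if reason else ""
        self.log.append(f"Asked to cancel job group {group_id}{suffix}")
        # Remove the group and record a failure message for each of its jobs.
        for job_id in self.active_groups.pop(group_id, []):
            self.failures[job_id] = (
                f"Job {job_id} cancelled as part of group {group_id}{suffix}"
            )


def kill_from_ui(scheduler: ToyScheduler, group_id: str) -> None:
    """Step 2: an internal caller always supplies a reason."""
    scheduler.cancel_job_group(group_id, reason="killed via the Web UI")
```

Keeping the reason optional in this sketch means existing callers continue to work unchanged, while internal callers such as the hypothetical kill_from_ui can be updated one at a time to pass an explicit reason.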