Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.1.0
- Fix Version/s: None
Description
Looking through the code handling for https://github.com/apache/spark/pull/21577, I wanted to see how we kill task attempts. I don't see anywhere that we actually kill task attempts belonging to stage attempts other than the one that completed successfully.
For instance:
- stage 0.0 (stage id 0, attempt 0)
  - task 1.0 (task 1, attempt 0)
- stage 0.1 (stage id 0, attempt 1), started due to a fetch failure, for instance
  - task 1.0 (task 1, attempt 0): the equivalent of task 1.0 in stage 0.0, launched because task 1.0 in stage 0.0 had neither finished nor failed
Now if task 1.0 in stage 0.0 succeeds, it gets committed and marked as successful. We will mark the corresponding task in stage 0.1 as completed, but nowhere in the code do I see it actually kill task 1.0 in stage 0.1.
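To make the gap concrete, here is a hypothetical sketch of what the missing cross-stage-attempt kill step could look like. None of these types or helpers exist in Spark as written; killTaskAttempt loosely mirrors the real TaskScheduler.killTaskAttempt API, and everything else is a stand-in for illustration only.
```scala
// Hypothetical sketch (NOT actual Spark code): when a task for a partition
// succeeds in one stage attempt, also kill the equivalent running task for
// the same partition in every other (zombie) attempt of that stage.
object ZombieTaskKiller {
  // Minimal stand-ins for Spark's internal scheduler types.
  trait RunningTask { def taskId: Long; def partitionId: Int }
  trait StageAttempt {
    def stageAttemptId: Int
    def runningTasks: Seq[RunningTask]
  }
  trait Scheduler {
    // Assumed accessor: all live TaskSetManagers for a stage id.
    def attemptsForStage(stageId: Int): Seq[StageAttempt]
    // Loosely mirrors TaskScheduler.killTaskAttempt(taskId, interruptThread, reason).
    def killTaskAttempt(taskId: Long, interruptThread: Boolean, reason: String): Boolean
  }

  def killEquivalentTasks(sched: Scheduler, stageId: Int,
                          succeededAttempt: Int, partitionId: Int): Unit = {
    for {
      attempt <- sched.attemptsForStage(stageId)
      if attempt.stageAttemptId != succeededAttempt // zombie attempts only
      task <- attempt.runningTasks
      if task.partitionId == partitionId            // same partition as the winner
    } sched.killTaskAttempt(task.taskId, interruptThread = false,
        reason = s"partition $partitionId completed by another stage attempt")
  }
}
```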
Note that the scheduler does handle the case of two attempts of the same task (speculation) within a single stage attempt: when one attempt succeeds, it kills the other. See TaskSetManager.handleSuccessfulTask.
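For contrast, here is a rough, self-contained paraphrase of that within-stage-attempt speculation path. It is not the actual Spark source; the types and parameter names are stand-ins, and only the (taskId, executorId, interruptThread, reason) shape of the kill callback is modeled on SchedulerBackend.killTask in recent Spark versions.
```scala
// Simplified paraphrase of the speculation-kill logic in
// TaskSetManager.handleSuccessfulTask: once one attempt of a task index
// succeeds, any other still-running attempts of that SAME task set are killed.
object SpeculationKill {
  final case class TaskAttempt(taskId: Long, executorId: String, running: Boolean)

  def handleSuccessfulTask(
      attemptsForIndex: Seq[TaskAttempt],   // all attempts of one task index
      succeededTaskId: Long,
      killTask: (Long, String, Boolean, String) => Unit): Unit = {
    for (attempt <- attemptsForIndex
         if attempt.running && attempt.taskId != succeededTaskId) {
      // Kill the losing speculative attempt on its executor.
      killTask(attempt.taskId, attempt.executorId, false,
        "another attempt succeeded")
    }
  }
}
```
The point of the bug report is that this kill only happens across attempts within one TaskSetManager; nothing analogous runs across different stage attempts of the same stage.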
Issue Links
- is related to
  - SPARK-2666 Always try to cancel running tasks when a stage is marked as zombie (Resolved)
- relates to
  - SPARK-25250 Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times (Resolved)
  - SPARK-25773 Cancel zombie tasks in a result stage when the job finishes (Resolved)