Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.18.0, 1.17.1
Description
First of all, thanks to mapohl for helping double-check in advance that this was indeed a bug .
Displaying exception history in WebUI is supported in FLINK-6042.
What's the concurrentExceptions?
When an execution fails due to an exception, other executions in the same region will also restart, and the first Exception is rootException. If other restarted executions also report Exception at this time, we hope to collect these exceptions and Displayed to the user as concurrentExceptions.
What's this bug?
The concurrentExceptions is always empty in production, even if other executions report exception at very close times.
Why doesn't it work?
If one job has all-to-all shuffle, this job only has one region, and this region has a lot of executions. If one execution throw exception:
- JobMaster will mark the state as FAILED for this execution.
- The rest of executions of this region will be marked to CANCELING.
- This call stack can be found at FLIP-364 part-4.2.3
When these executions throw exception as well, it JobMaster will mark the state from CANCELING to CANCELED instead of FAILED.
The CANCELED execution won't call FAILED logic, so their exceptions are ignored.
Note: all reports are executed inside of JobMaster RPC thread, it's single thread. So these reports are executed serially. So only one execution is marked to FAILED, and the rest of executions will be marked to CANCELED later.
How to fix it?
Offline discuss with mapohl , we need to discuss with community should we keep the concurrentExceptions first.
- If no, we can remove related logic directly
- If yew, we discuss how to fix it later.
Attachments
Attachments
Issue Links
- relates to
-
FLINK-33121 Failed precondition in JobExceptionsHandler due to concurrent global failures
- Closed
- links to