Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.18.0, 1.17.1
Fix Version/s: 1.19.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

First of all, thanks to mapohl for helping double-check in advance that this was indeed a bug.

Displaying the exception history in the WebUI is supported as of ~~FLINK-6042~~.

What are concurrentExceptions?

When an execution fails with an exception, the other executions in the same region are restarted as well, and the first exception is the rootException. If other restarted executions also report exceptions around that time, we want to collect them and display them to the user as concurrentExceptions.

What's this bug?

concurrentExceptions is always empty in production, even if other executions report exceptions at almost the same time.

Why doesn't it work?

If a job has an all-to-all shuffle, it has only one region, and that region contains many executions. If one execution throws an exception:

- JobMaster marks that execution as FAILED.
- The remaining executions of the region are marked as CANCELING.
  - The call stack can be found at FLIP-364 part-4.2.3

When these executions throw exceptions as well, JobMaster moves them from CANCELING to CANCELED instead of FAILED.

A CANCELED execution does not go through the FAILED logic, so its exception is ignored.

Note: all reports are handled in the JobMaster RPC thread, which is single-threaded, so they are processed serially. As a result, only one execution is marked as FAILED, and the remaining executions are later marked as CANCELED.
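
The sequence described above can be reproduced with a small, self-contained model. This is only a sketch of the reported behavior, not Flink's implementation: the class `ConcurrentExceptionsSketch`, its `State` enum, and the `reportFailure` method are hypothetical names chosen for illustration, and exceptions are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of one region; all names are hypothetical, not Flink classes.
final class ConcurrentExceptionsSketch {
    enum State { RUNNING, FAILED, CANCELING, CANCELED }

    final Map<String, State> states = new LinkedHashMap<>();
    String rootException; // first failure reported in the region
    final List<String> concurrentExceptions = new ArrayList<>();

    ConcurrentExceptionsSketch(String... executionIds) {
        for (String id : executionIds) {
            states.put(id, State.RUNNING);
        }
    }

    // Reports are handled one at a time, mirroring the single JobMaster RPC thread.
    void reportFailure(String executionId, String exception) {
        State current = states.get(executionId);
        if (current == State.RUNNING) {
            // Only this branch runs the "FAILED logic" that records exceptions.
            states.put(executionId, State.FAILED);
            if (rootException == null) {
                rootException = exception;
            } else {
                concurrentExceptions.add(exception);
            }
            // The rest of the region is asked to cancel.
            for (Map.Entry<String, State> entry : states.entrySet()) {
                if (entry.getValue() == State.RUNNING) {
                    entry.setValue(State.CANCELING);
                }
            }
        } else if (current == State.CANCELING) {
            // The execution is already being cancelled, so its exception is dropped.
            states.put(executionId, State.CANCELED);
        }
    }
}
```

In this model, reporting a failure for `e1` and then for `e2` in a region of three executions leaves `e1` as FAILED, `e2` as CANCELED and `e3` as CANCELING, and `concurrentExceptions` stays empty because the second report never reaches the branch that records exceptions.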

How to fix it?

After an offline discussion with mapohl, we first need to discuss with the community whether concurrentExceptions should be kept.

- If no, we can remove the related logic directly.
- If yes, we discuss how to fix it later.

Attachments

screenshot-1.png
18/Dec/23 04:17
364 kB
Rui Fan

Issue Links

relates to

FLINK-33121 Failed precondition in JobExceptionsHandler due to concurrent global failures

Closed

links to

GitHub Pull Request #24003

People

Assignee: Rui Fan

Reporter: Rui Fan

Votes: 0

Watchers: 2

Dates

Created: 16/Nov/23 02:23

Updated: 23/Jan/24 10:10

Resolved: 23/Jan/24 10:10