Details
- Bug
- Status: Closed
- Major
- Resolution: Duplicate
- 1.18.0
- None
Description
We assume that Global Failures (those with a null Task name) can only be RootExceptions, while Local/Task exceptions may be part of the concurrent exceptions List (see JobExceptionsHandler#createRootExceptionInfo).
However, when the Adaptive scheduler is in a Restarting phase due to an existing failure (which is now the new Root), we can still, on rare occasions, capture new Global failures. This violates the assumption, and an assertion is thrown as part of assertLocalExceptionInfo, with a message like:
The taskName must not be null for a non-global failure.
We want to ignore Global failures while in a Restarting phase on the Adaptive scheduler until we properly support multiple Global failures in the Exception History as part of https://issues.apache.org/jira/browse/FLINK-34922.
Note: DefaultScheduler does not suffer from this issue, as it treats failures directly as HistoryEntries (no conversion step).
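The proposed behavior can be sketched roughly as follows. This is a simplified, self-contained model and not the actual Flink code: ReportedFailure and RestartingFailureCollector are hypothetical names introduced only for illustration, and the only detail taken from the description is that a null task name marks a Global failure. While restarting, Local/Task failures are recorded as concurrent exceptions and Global failures are dropped, so the concurrent list never contains an entry without a task name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a reported failure; not a Flink class. */
final class ReportedFailure {
    final String taskName; // null marks a Global failure (as in the description)
    final Throwable cause;

    ReportedFailure(String taskName, Throwable cause) {
        this.taskName = taskName;
        this.cause = cause;
    }

    boolean isGlobal() {
        return taskName == null;
    }
}

/** Hypothetical model of failure collection during a Restarting phase. */
final class RestartingFailureCollector {
    private final ReportedFailure root;
    private final List<ReportedFailure> concurrent = new ArrayList<>();

    RestartingFailureCollector(ReportedFailure root) {
        this.root = root; // the failure that triggered the restart
    }

    /** Returns true if the failure was recorded as concurrent, false if it was ignored. */
    boolean report(ReportedFailure failure) {
        if (failure.isGlobal()) {
            // Assumption: a second Global failure is dropped rather than stored.
            return false;
        }
        concurrent.add(failure);
        return true;
    }

    ReportedFailure root() {
        return root;
    }

    List<ReportedFailure> concurrentFailures() {
        return Collections.unmodifiableList(concurrent);
    }
}
```

With this filter in place, a later conversion step that requires a non-null task name for every concurrent entry would not trip on a late Global failure. The cost is that the ignored failure is lost from the history, which is why the description treats it as a stopgap until FLINK-34922.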
Issue Links
- is fixed by
  - FLINK-34922 Exception History should support multiple Global failures (Closed)
- is related to
  - FLINK-33565 The concurrentExceptions doesn't work (Resolved)
  - FLINK-34922 Exception History should support multiple Global failures (Closed)