[FLINK-13452] Pipelined region failover strategy does not recover Job if checkpoint cannot be read - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.9.0, 1.10.0
Fix Version/s: 1.9.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

The job does not recover if a checkpoint cannot be read and jobmanager.execution.failover-strategy is set to "region".

Analysis

The RestartCallback created by AdaptedRestartPipelinedRegionStrategyNG throws a RuntimeException if the checkpoint cannot be read. Because the restart is invoked on a separate thread pool, the exception is swallowed and the job is not restarted. See:

https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/AdaptedRestartPipelinedRegionStrategyNG.java#L117-L119

https://github.com/apache/flink/blob/21621fbcde534969b748f21e9f8983e3f4e0fb1d/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/restart/FixedDelayRestartStrategy.java#L65
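The general mechanism behind a swallowed exception can be reproduced with plain java.util.concurrent. The following is a minimal sketch, not Flink code: the class name, triggerRestart, and scheduleAndCollectFailure are hypothetical stand-ins, and it assumes the restart is handed to a scheduled executor whose returned future nobody inspects. An exception thrown inside a scheduled task is stored in the returned future and only surfaces if a caller checks it.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class SwallowedRestartFailure {

    /** Hypothetical stand-in for a restart callback that fails when the checkpoint cannot be read. */
    static void triggerRestart() {
        throw new RuntimeException("Failed to restore from checkpoint");
    }

    /**
     * Schedules the callback on a separate thread and returns the failure
     * captured by the executor, or null if the callback completed normally.
     */
    static Throwable scheduleAndCollectFailure() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            Runnable restart = SwallowedRestartFailure::triggerRestart;
            // The exception thrown by the task is not propagated to this thread;
            // it is stored inside the returned future.
            ScheduledFuture<?> future = executor.schedule(restart, 10, TimeUnit.MILLISECONDS);
            try {
                // Without this call the failure would go unnoticed.
                future.get();
                return null;
            } catch (ExecutionException e) {
                return e.getCause();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for the restart", e);
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Throwable failure = scheduleAndCollectFailure();
        System.out.println("Captured failure: " + failure);
    }
}
```

If the call to future.get() is dropped, the program finishes without any sign that the callback failed, which matches the symptom described above.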

Expected behavior

The job should restart.

Attachments

jobmanager.log
28/Jul/19 18:03
2.21 MB
Gary Yao

Issue Links

is caused by

FLINK-13060 FailoverStrategies should respect restart constraints

Closed

links to

GitHub Pull Request #9268

GitHub Pull Request #9376

People

Assignee: Yun Tang

Reporter: Gary Yao

Votes: 0

Watchers: 5

Dates

Created: 28/Jul/19 18:13

Updated: 02/Oct/19 17:50

Resolved: 07/Aug/19 07:56
