[FLINK-12858] Potential distributed deadlock in case of synchronous savepoint failure

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.9.0
Fix Version/s: 1.9.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Description

The current implementation of stop-with-savepoint (FLINK-11458) blocks the thread that runs StreamTask.performCheckpoint() on syncSavepointLatch. For non-source tasks, this thread is implied to be the task's main thread (stop-with-savepoint deliberately stops all activity in the task's main thread).

The thread is unblocked either when the task is cancelled or when the corresponding checkpoint is acknowledged.

It is possible that other downstream tasks of the same Flink job "soft" fail the checkpoint/savepoint for various reasons (for example, because the max buffered bytes limit is exceeded in BarrierBuffer.checkSizeLimit()). In that case, the checkpoint abort is reported to the JM. However, it looks like the checkpoint coordinator handles such an abort as usual and assumes that the Flink job continues running. The task thread blocked on syncSavepointLatch would then never be unblocked, hence the potential distributed deadlock.
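The blocking behaviour described above can be illustrated with a small toy model. This is a sketch only and not Flink code: the class `SyncSavepointLatchSketch`, the `SavepointLatch` helper, and the `runScenario` method are hypothetical names, and the latch is modelled with a plain `java.util.concurrent.CountDownLatch`. A timeout stands in for "blocked forever" so that the example terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Toy model of the scenario in this issue; all names are hypothetical, not Flink classes. */
public class SyncSavepointLatchSketch {

    /** Stand-in for the latch that the task thread waits on during a synchronous savepoint. */
    public static final class SavepointLatch {
        private final CountDownLatch latch = new CountDownLatch(1);

        /** Release path 1: the corresponding checkpoint is acknowledged. */
        public void acknowledge() {
            latch.countDown();
        }

        /** Release path 2: the task is cancelled. */
        public void cancel() {
            latch.countDown();
        }

        /** Returns true if the latch was released before the timeout elapsed. */
        public boolean awaitRelease(long timeoutMillis) throws InterruptedException {
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

    /**
     * Simulates an upstream task whose thread blocks on the latch while a downstream
     * task declines the savepoint. If the abort leads to cancellation, the thread is
     * released; if the abort is handled "as usual", nothing ever releases it.
     */
    public static boolean runScenario(boolean abortCancelsTask) {
        SavepointLatch latch = new SavepointLatch();
        boolean[] released = new boolean[1];

        Thread taskThread = new Thread(() -> {
            try {
                // In the real issue this wait has no timeout; 200 ms is only for the demo.
                released[0] = latch.awaitRelease(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        taskThread.start();

        // A downstream task "soft" fails the savepoint at this point.
        if (abortCancelsTask) {
            latch.cancel();
        }

        try {
            taskThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the task thread", e);
        }
        return released[0];
    }

    public static void main(String[] args) {
        System.out.println("abort handled as usual -> released: " + runScenario(false));
        System.out.println("abort cancels the task -> released: " + runScenario(true));
    }
}
```

In this model the first scenario times out with the thread still waiting, which corresponds to the hang described in this issue, while the second scenario releases the thread through the cancellation path. Whether the actual fix takes exactly this route is not stated in the description.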

Issue Links

links to

GitHub Pull Request #9131

Sub-Tasks

Add test that fails job when sync savepoint is discarded.
Status: Closed
Assignee: Kostas Kloudas
Progress: 100%

People

Assignee: Alex

Reporter: Alex

Votes: 0

Watchers: 9

Dates

Created: 15/Jun/19 12:02

Updated: 31/Jul/19 09:12

Resolved: 31/Jul/19 08:48

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

0.5h
