[FLINK-17869] Fix the race condition of aborting unaligned checkpoint - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.11.0, 1.12.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Description

On the ChannelStateWriter side, the lifecycle of a checkpoint should be as follows:

start -> in progress/abort -> stop.

The ChannelStateWriteResult is created during #start and removed by either #abort or #stop. This leaves room for the following race condition:

1. #start is called by the netty thread when it receives the first barrier, and the checkpoint is scheduled for execution.
2. The task thread might process a checkpoint cancellation and call #abort before performing that checkpoint.
3. The checkpoint can still be executed by the task thread afterwards, even though the abort happened earlier, because the checkpoint action cannot be removed from the mailbox during the abort.
4. While executing, the checkpoint calls `ChannelStateWriter#getWriteResult`, which throws `IllegalStateException` because the respective result was already removed while handling #abort.
5. As a result, the task fails unnecessarily while performing the checkpoint.
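
The sequence above can be replayed deterministically on a single thread. The sketch below is only a toy model: `ToyWriter`, `AbortRaceDemo`, and the `Queue<Runnable>` standing in for the mailbox are hypothetical names for illustration, not Flink classes, and the method signatures are simplified assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/** Toy stand-in for ChannelStateWriter (hypothetical, NOT Flink's implementation). */
final class ToyWriter {
    private final Map<Long, CompletableFuture<Void>> results = new HashMap<>();

    /** Creates the write result for the checkpoint. */
    void start(long checkpointId) {
        results.put(checkpointId, new CompletableFuture<>());
    }

    /** Fails and removes the write result, as described for #abort. */
    void abort(long checkpointId, Throwable cause) {
        CompletableFuture<Void> result = results.remove(checkpointId);
        if (result != null) {
            result.completeExceptionally(cause);
        }
    }

    /** Throws if the result is missing, mirroring the illegal state check. */
    CompletableFuture<Void> getWriteResult(long checkpointId) {
        CompletableFuture<Void> result = results.get(checkpointId);
        if (result == null) {
            throw new IllegalStateException(
                    "no write result for checkpoint " + checkpointId);
        }
        return result;
    }
}

final class AbortRaceDemo {
    /** Replays the interleaving; returns the exception hit by the queued action, or null. */
    static RuntimeException replay() {
        ToyWriter writer = new ToyWriter();
        Queue<Runnable> mailbox = new ArrayDeque<>(); // stand-in for the task mailbox
        long checkpointId = 1L;

        // "Netty thread": first barrier arrives, writer is started, checkpoint is queued.
        writer.start(checkpointId);
        mailbox.add(() -> writer.getWriteResult(checkpointId));

        // "Task thread": the cancellation is processed first. The result is removed,
        // but the queued checkpoint action stays in the mailbox.
        writer.abort(checkpointId, new RuntimeException("checkpoint cancelled"));

        // "Task thread": the queued checkpoint action finally runs.
        try {
            Runnable action;
            while ((action = mailbox.poll()) != null) {
                action.run();
            }
            return null;
        } catch (RuntimeException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

In this model `replay()` returns the `IllegalStateException` thrown by `getWriteResult`, which corresponds to the unnecessary task failure described above.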

I guess we do not want to fail the task when a checkpoint is aborted by design. The illegal state check in ChannelStateWriter#getWriteResult was, I guess, mainly intended to validate the normal process.

If we do not remove the ChannelStateWriteResult while handling #abort and instead rely on #stop to remove it, another scenario might arise: the checkpoint is never performed after #start (there is another mechanism that exits the triggering of a checkpoint early if the abort is sent by the CheckpointCoordinator), so the stale ChannelStateWriteResult is retained inside ChannelStateWriter for a long time.
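
To make the retention concern concrete, here is a minimal sketch of the alternative in which only #stop removes the result. `StopOnlyWriter` and `RetainedResultDemo` are hypothetical names, and the model assumes that #stop is reached only when the checkpoint is actually performed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical writer variant: abort fails the result but leaves it registered. */
final class StopOnlyWriter {
    private final Map<Long, CompletableFuture<Void>> results = new HashMap<>();

    void start(long checkpointId) {
        results.put(checkpointId, new CompletableFuture<>());
    }

    /** Marks the result as failed without removing it. */
    void abort(long checkpointId, Throwable cause) {
        CompletableFuture<Void> result = results.get(checkpointId);
        if (result != null) {
            result.completeExceptionally(cause);
        }
    }

    /** The only place where a result is removed in this variant. */
    void stop(long checkpointId) {
        results.remove(checkpointId);
    }

    int retainedResults() {
        return results.size();
    }
}

final class RetainedResultDemo {
    /** Starts and aborts n checkpoints that are never performed, so stop is never reached. */
    static int retainedAfterSkippedCheckpoints(int n) {
        StopOnlyWriter writer = new StopOnlyWriter();
        for (long id = 1; id <= n; id++) {
            writer.start(id);
            writer.abort(id, new RuntimeException("aborted by coordinator"));
            // The checkpoint is skipped here, so writer.stop(id) is never called.
        }
        return writer.retainedResults();
    }
}
```

Every skipped checkpoint leaves one entry behind in this model, so the map grows with the number of aborted checkpoints unless some other cleanup is added.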

One possible fix is to let SubtaskCheckpointCoordinatorImpl handle the exception from ChannelStateWriter#getWriteResult properly, so that the task does not fail in the aborted case.
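
One way this option could look is sketched below. It is an illustration under stated assumptions, not the change from the linked pull requests: `RemovingWriter` and `TolerantCoordinator` are hypothetical names, and the coordinator is assumed to remember aborted checkpoint ids and to run only on the task thread.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/** Hypothetical writer whose abort removes the result, as in the description. */
final class RemovingWriter {
    private final Map<Long, CompletableFuture<Void>> results = new HashMap<>();

    void start(long checkpointId) {
        results.put(checkpointId, new CompletableFuture<>());
    }

    void abort(long checkpointId, Throwable cause) {
        CompletableFuture<Void> result = results.remove(checkpointId);
        if (result != null) {
            result.completeExceptionally(cause);
        }
    }

    CompletableFuture<Void> getWriteResult(long checkpointId) {
        CompletableFuture<Void> result = results.get(checkpointId);
        if (result == null) {
            throw new IllegalStateException(
                    "no write result for checkpoint " + checkpointId);
        }
        return result;
    }
}

/** Hypothetical coordinator that tolerates a missing result for aborted checkpoints. */
final class TolerantCoordinator {
    private final RemovingWriter writer;
    private final Set<Long> abortedIds = new HashSet<>(); // assumption: task thread only

    TolerantCoordinator(RemovingWriter writer) {
        this.writer = writer;
    }

    void notifyAborted(long checkpointId, Throwable cause) {
        abortedIds.add(checkpointId);
        writer.abort(checkpointId, cause);
    }

    /** Returns true if the checkpoint ran, false if it was skipped because it was aborted. */
    boolean performCheckpoint(long checkpointId) {
        try {
            writer.getWriteResult(checkpointId);
            return true;
        } catch (IllegalStateException e) {
            if (abortedIds.remove(checkpointId)) {
                return false; // aborted by design: skip instead of failing the task
            }
            throw e; // not a known abort: keep the validation failure
        }
    }
}
```

The illegal state check still fires for checkpoints that were never aborted, so the validation of the normal process is preserved. A real implementation would also need to bound the set of remembered ids.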

Issue Links

causes

FLINK-17768 UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnLocalAndRemoteChannel is instable

Closed

links to

GitHub Pull Request #12478

GitHub Pull Request #12550

People

Assignee: Roman Khachatryan

Reporter: Zhijiang

Votes: 0

Watchers: 6

Dates

Created: 21/May/20 15:36

Updated: 11/Jun/20 03:18

Resolved: 10/Jun/20 20:34