[FLINK-36733] Don't transition task to RUNNING until the inputs are recovered (UC) - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.20.0, 1.19.1
Fix Version/s: 2.0.0, 1.19.3, 1.20.2
Component/s: Runtime / Task
Labels: None

Description

When recovering from an Unaligned Checkpoint, a task transitions to RUNNING after restoring:

Output channel state
Operator state
Input channel state

However, the upstream task(s) might not yet have sent all the recovered buffers; therefore, in case of rescaling, the downstream task must keep the virtual channel infrastructure (RescalingStreamTaskNetworkInput) up.

In particular, this means that checkpoints might be triggered by the `CheckpointCoordinator` but declined by the downstream task (because RescalingStreamTaskNetworkInput doesn't support checkpointing).

During a long recovery, many declined checkpoints might exhaust some resources, e.g. transaction ID pools in our case.

It's confusing (for humans and observability tools) to see tasks switched to RUNNING but still not able to checkpoint due to recovery.

The proposal is to transition a task to RUNNING only after all of its inputs are recovered.
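
As a rough illustration of the proposed ordering (not the actual Flink change), the state switch can be gated on one completion future per input. Everything in the sketch below is hypothetical: `RecoveryGateSketch`, `SketchTask`, the `State` enum, and the per-input futures are stand-ins made up for this example and do not correspond to Flink classes or APIs.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class RecoveryGateSketch {

    /** Simplified task states; only the two relevant to this issue. */
    enum State { INITIALIZING, RUNNING }

    /** Hypothetical stand-in for a task; not a Flink class. */
    static final class SketchTask {
        private volatile State state = State.INITIALIZING;

        /**
         * @param inputsRecovered one future per input (assumption), completed
         *                        once that input has received all recovered buffers
         */
        SketchTask(List<CompletableFuture<Void>> inputsRecovered) {
            // Switch to RUNNING only after every input has reported recovery complete.
            CompletableFuture
                    .allOf(inputsRecovered.toArray(new CompletableFuture[0]))
                    .thenRun(() -> state = State.RUNNING);
        }

        State state() {
            return state;
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Void> input0 = new CompletableFuture<>();
        CompletableFuture<Void> input1 = new CompletableFuture<>();
        SketchTask task = new SketchTask(List.of(input0, input1));

        input0.complete(null);            // only one of two inputs recovered
        System.out.println(task.state()); // prints INITIALIZING

        input1.complete(null);            // last input recovered
        System.out.println(task.state()); // prints RUNNING
    }
}
```

With this ordering, an observer that only looks at the task state would no longer see RUNNING while input recovery is still in progress.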

People

Assignee: Roman Khachatryan

Reporter: Roman Khachatryan

Votes: 0

Watchers: 4

Dates

Created: 15/Nov/24 13:40

Updated: 07/Dec/24 09:17