Details
Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
1.20.0, 1.19.1
None
Description
When recovering from an Unaligned Checkpoint, a task transitions to RUNNING after restoring:
- Output channel state
- Operator state
- Input channel state
However, the upstream task(s) might not yet have sent all the recovered buffers; therefore, in case of rescaling, the downstream task must keep the virtual channel infrastructure (`RescalingStreamTaskNetworkInput`) up.
In particular, this means that checkpoints might be triggered by the `CheckpointCoordinator` but declined by the downstream task (because `RescalingStreamTaskNetworkInput` doesn't support checkpointing).
If recovery takes a long time, the many declined checkpoints might exhaust some resources, e.g. transaction ID pools in our case.
It is also confusing (for humans and observability tools) to see tasks switched to RUNNING while they are still unable to checkpoint because recovery is ongoing.
The proposal is to transition a task to RUNNING only after all of its inputs are recovered.
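
To make the proposed ordering concrete, here is a minimal sketch of a toy model, not Flink code. `RecoveringTask`, `TaskState`, `onStateRestored`, `onInputRecovered` and `tryTriggerCheckpoint` are all hypothetical names introduced for illustration. The sketch assumes that each input channel can signal when its last recovered buffer has arrived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical toy model; none of these types are Flink classes.
enum TaskState { INITIALIZING, RUNNING }

final class RecoveringTask {
    private volatile TaskState state = TaskState.INITIALIZING;

    // One future per input channel, completed when that channel is fully recovered.
    private final List<CompletableFuture<Void>> inputsRecovered = new ArrayList<>();

    RecoveringTask(int numInputChannels) {
        for (int i = 0; i < numInputChannels; i++) {
            inputsRecovered.add(new CompletableFuture<>());
        }
    }

    /** Called once output channel, operator and input channel state are restored. */
    CompletableFuture<Void> onStateRestored() {
        // Proposed behaviour: do not switch to RUNNING here,
        // but only once every input has reported full recovery.
        return CompletableFuture
                .allOf(inputsRecovered.toArray(new CompletableFuture[0]))
                .thenRun(() -> state = TaskState.RUNNING);
    }

    /** Called when upstream has sent its last recovered buffer on the given channel. */
    void onInputRecovered(int channel) {
        inputsRecovered.get(channel).complete(null);
    }

    TaskState state() {
        return state;
    }

    /** In this model a checkpoint trigger is accepted only in RUNNING. */
    boolean tryTriggerCheckpoint() {
        return state == TaskState.RUNNING;
    }
}
```

In this model the task is never in RUNNING while it would still decline a checkpoint, so an observer can tell "still recovering" from "running" by the task state alone.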