Flink / FLINK-36733

Don't transition task to RUNNING until the inputs are recovered (UC)

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.20.0, 1.19.1
    • Fix Version/s: 1.19.2, 1.20.1
    • Component/s: Runtime / Task
    • Labels: None

    Description

      When recovering from an Unaligned Checkpoint, a task transitions to RUNNING after restoring:

      1. Output channel state
      2. Operator state
      3. Input channel state 

      However, the upstream task(s) might not yet have sent all the recovered buffers; therefore, in case of rescaling, the downstream task must keep the virtual channel infrastructure up (RescalingStreamTaskNetworkInput).

      In particular, this means that checkpoints might be triggered by the `CheckpointCoordinator` but declined by the downstream task (because RescalingStreamTaskNetworkInput doesn't support checkpointing).

      During a long recovery, many declined checkpoints might exhaust some resources, e.g. transaction ID pools in our case.

      It is also confusing (for humans and observability tools) to see tasks switched to RUNNING while they are still unable to checkpoint because recovery is in progress.

      The proposal is to transition a task to RUNNING only after all of its inputs are recovered.
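
      The proposed ordering can be illustrated with a minimal sketch. This is not Flink code: the class RecoveringTaskSketch, its methods, and the two-value State enum are hypothetical names introduced only for illustration, and the sketch assumes that a checkpoint is only attempted on a task that reports RUNNING.

```java
/**
 * Hypothetical model of a task recovering from an unaligned checkpoint.
 * None of these names exist in Flink; they only illustrate the proposal.
 */
class RecoveringTaskSketch {

    /** Simplified lifecycle; the real task lifecycle has more states. */
    enum State { INITIALIZING, RUNNING }

    private State state = State.INITIALIZING;

    /** True once output channel, operator and input channel state are restored. */
    private boolean localStateRestored = false;

    /** Upstream channels that have not yet delivered all recovered buffers. */
    private int pendingUpstreamChannels;

    RecoveringTaskSketch(int upstreamChannels) {
        if (upstreamChannels < 0) {
            throw new IllegalArgumentException("upstreamChannels must be >= 0");
        }
        this.pendingUpstreamChannels = upstreamChannels;
    }

    /** Models the three restore steps listed in the description. */
    void restoreLocalState() {
        localStateRestored = true;
        maybeSwitchToRunning();
    }

    /** Called when one upstream channel has delivered all its recovered buffers. */
    void onUpstreamChannelRecovered() {
        if (pendingUpstreamChannels > 0) {
            pendingUpstreamChannels--;
        }
        maybeSwitchToRunning();
    }

    /** Proposed rule: RUNNING requires local restore AND fully recovered inputs. */
    private void maybeSwitchToRunning() {
        if (localStateRestored && pendingUpstreamChannels == 0) {
            state = State.RUNNING;
        }
    }

    boolean isRunning() {
        return state == State.RUNNING;
    }

    /** Assumption of this sketch: only a RUNNING task can take a checkpoint. */
    boolean canCheckpoint() {
        return isRunning();
    }
}
```

      With two upstream channels, the sketched task stays in INITIALIZING after restoreLocalState() and after the first onUpstreamChannelRecovered() call, and switches to RUNNING only after the second. Under the stated assumption, no checkpoint would be attempted and then declined during that window.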

          People

            Assignee: Roman Khachatryan
            Reporter: Roman Khachatryan