Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-26029 Generalize the checkpoint protocol of OperatorCoordinator.
  3. FLINK-28639

Preserve distributed consistency of OperatorEvents from subtasks to OperatorCoordinator

    XMLWordPrintableJSON

Details

    Description

      This is the second step to solving the consistency issue of OC communications. In this step, we would also guarantee the consistency of operator events sent from subtasks to OCs. Combined with the other subtask to preserve the consistency of communications in the reverse direction, all communications between OC and subtasks would be consistent across checkpoints and global failovers.

      To achieve the goal of this step, we need to add closing/reopening functions to the subtasks' gateways and make the subtasks aware of a checkpoint before they receive the checkpoint barriers. The general process would be as follows.

      1. When the OC starts checkpoint, it notifies all subtasks about this information.
      2. After being notified about the ongoing checkpoint in OC, a subtask sends a special operator event to its OC, which is the last operator event the OC could receive from the subtask before the subtask completes the checkpoint. Then the subtask closes its gateway.
      3. After receiving this special event from all subtasks, the OC finishes its checkpoint and closes its gateway. Then the checkpoint coordinator sends checkpoint barriers to the sources.
      4. If the subtask or the OC generate any event to send to each other, they buffer the events locally.
      5. When a subtask starts checkpointing, it also stores the buffered events in the checkpoint.
      6. After the subtask completes the checkpoint, communications in both directions are recovered and the buffered events are sent out.

      Attachments

        Activity

          People

            yunfengzhou Yunfeng Zhou
            yunfengzhou Yunfeng Zhou
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: