FLINK-10751

Checkpoints should be retained when job reaches suspended state

Details

    Description

      CheckpointProperties define in which terminal job statuses a checkpoint should be disposed.

      I've noticed that the properties for CHECKPOINT_NEVER_RETAINED and CHECKPOINT_RETAINED_ON_FAILURE prescribe checkpoint disposal in the (locally) terminal job status SUSPENDED.

      Since a job reaches the SUSPENDED state when its JobMaster loses leadership, this would result in the checkpoint being cleaned up and not being available for recovery by the new leader. Therefore, we should instead retain checkpoints when the job reaches status SUSPENDED.

      BUT: Because we special-case this terminal state in the only highly available CompletedCheckpointStore implementation (see ZooKeeperCompletedCheckpointStore) and don't use regular checkpoint disposal there, this issue has not surfaced yet.

      I think we should proactively fix the properties so that they indicate that checkpoints are retained in the SUSPENDED state. We might even remove this case completely, since with this change all properties will indicate retention on suspension.
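
      To illustrate the difference, here is a minimal, self-contained sketch. It is not Flink's actual code: `RetentionPolicy`, the reduced `JobStatus` enum, and `survivesLeaderChange` are hypothetical stand-ins for CheckpointProperties and its disposal flags. The SUSPENDED entries follow the description above; the flags for the other statuses are assumptions inferred from the property names.

```java
import java.util.EnumSet;

public class CheckpointRetentionSketch {

    /** Reduced set of job statuses, only those relevant to this issue. */
    public enum JobStatus { FINISHED, CANCELED, FAILED, SUSPENDED }

    /** Hypothetical stand-in for CheckpointProperties: statuses that trigger disposal. */
    public static final class RetentionPolicy {
        private final EnumSet<JobStatus> discardOn;

        public RetentionPolicy(EnumSet<JobStatus> discardOn) {
            this.discardOn = EnumSet.copyOf(discardOn); // defensive copy
        }

        public boolean shouldDiscard(JobStatus status) {
            return discardOn.contains(status);
        }
    }

    // Behavior described in this issue: both properties dispose on SUSPENDED.
    // (Flags for the other statuses are assumptions for illustration.)
    public static final RetentionPolicy NEVER_RETAINED_CURRENT =
            new RetentionPolicy(EnumSet.allOf(JobStatus.class));
    public static final RetentionPolicy RETAINED_ON_FAILURE_CURRENT =
            new RetentionPolicy(EnumSet.of(
                    JobStatus.FINISHED, JobStatus.CANCELED, JobStatus.SUSPENDED));

    // Proposed behavior: SUSPENDED never triggers disposal.
    public static final RetentionPolicy NEVER_RETAINED_PROPOSED =
            new RetentionPolicy(EnumSet.of(
                    JobStatus.FINISHED, JobStatus.CANCELED, JobStatus.FAILED));
    public static final RetentionPolicy RETAINED_ON_FAILURE_PROPOSED =
            new RetentionPolicy(EnumSet.of(JobStatus.FINISHED, JobStatus.CANCELED));

    /** A checkpoint is available to a new leader only if suspension does not dispose it. */
    public static boolean survivesLeaderChange(RetentionPolicy policy) {
        return !policy.shouldDiscard(JobStatus.SUSPENDED);
    }

    public static void main(String[] args) {
        System.out.println("current NEVER_RETAINED survives: "
                + survivesLeaderChange(NEVER_RETAINED_CURRENT));
        System.out.println("proposed NEVER_RETAINED survives: "
                + survivesLeaderChange(NEVER_RETAINED_PROPOSED));
    }
}
```

      With the proposed policies, no policy disposes on SUSPENDED, which is why the suspended case could be removed from the properties altogether.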

            People

              uce Ufuk Celebi