Flink / FLINK-34131

Checkpoint check window should take into account the job's checkpoint configuration

Details

    Description

      When the checkpoint progress check (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) is enabled to determine cluster health, the operator relies on detecting whether a checkpoint has been performed within kubernetes.operator.cluster.health-check.checkpoint-progress.window.

      As indicated in the documentation, this window must be larger than the checkpointing interval.

      But this is a manual configuration, which can lead to misconfiguration and unwanted restarts of the Flink cluster if the checkpointing interval is larger than the window.

      The operator must verify that the configuration is valid before relying on this check. If it is not set correctly, the operator should skip the check (return true from evaluateCheckpoints) and log a WARN message.
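
      A minimal sketch of the guard described above, assuming the window and the checkpointing interval are already available as Duration values. The class name CheckpointProgressGuard, the method signatures, and the checkpointSeenInWindow flag are hypothetical stand-ins, not the operator's actual code, and java.util.logging is used only to keep the example self-contained.

```java
import java.time.Duration;
import java.util.logging.Logger;

/** Hypothetical illustration; the real operator classes and signatures may differ. */
public class CheckpointProgressGuard {
    private static final Logger LOG =
            Logger.getLogger(CheckpointProgressGuard.class.getName());

    /** The window is usable only if it is at least as long as the checkpointing interval. */
    public static boolean isWindowUsable(Duration window, Duration checkpointInterval) {
        return window.compareTo(checkpointInterval) >= 0;
    }

    /**
     * Stand-in for evaluateCheckpoints. When the window is misconfigured, the check is
     * skipped: a warning is logged and the cluster is reported as healthy (true).
     * Otherwise the result depends on whether a checkpoint was seen in the window.
     */
    public static boolean evaluateCheckpoints(
            Duration window, Duration checkpointInterval, boolean checkpointSeenInWindow) {
        if (!isWindowUsable(window, checkpointInterval)) {
            LOG.warning("Checkpoint progress window " + window
                    + " is shorter than the checkpointing interval " + checkpointInterval
                    + "; skipping the checkpoint progress check.");
            return true;
        }
        return checkpointSeenInWindow;
    }
}
```

      With a 5 minute window and a 10 minute interval, this sketch returns true even if no checkpoint was seen, so a misconfigured window no longer triggers a restart.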

      Flink jobs also have other checkpointing parameters that should be taken into account for this window configuration: execution.checkpointing.timeout and execution.checkpointing.tolerable-failed-checkpoints.

      The idea would be to check that kubernetes.operator.cluster.health-check.checkpoint-progress.window >= max(execution.checkpointing.interval * execution.checkpointing.tolerable-failed-checkpoints, execution.checkpointing.timeout * execution.checkpointing.tolerable-failed-checkpoints)
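
      The proposed bound can be sketched as follows, taking the formula literally. The class name CheckpointWindowFormula and both method names are hypothetical, and the three job settings are assumed to have been parsed into Duration and int values already.

```java
import java.time.Duration;

/** Hypothetical illustration of the proposed bound; not the operator's actual code. */
public class CheckpointWindowFormula {

    /**
     * Computes max(interval * tolerableFailed, timeout * tolerableFailed),
     * exactly as written in the proposal.
     */
    public static Duration minimumWindow(
            Duration interval, Duration timeout, int tolerableFailedCheckpoints) {
        Duration byInterval = interval.multipliedBy(tolerableFailedCheckpoints);
        Duration byTimeout = timeout.multipliedBy(tolerableFailedCheckpoints);
        return byInterval.compareTo(byTimeout) >= 0 ? byInterval : byTimeout;
    }

    /** The window is valid when it is greater than or equal to the computed minimum. */
    public static boolean isWindowValid(
            Duration window, Duration interval, Duration timeout, int tolerableFailedCheckpoints) {
        Duration minimum = minimumWindow(interval, timeout, tolerableFailedCheckpoints);
        return window.compareTo(minimum) >= 0;
    }
}
```

      For example, with an interval of 1 minute, a timeout of 10 minutes, and 3 tolerable failed checkpoints, the bound is max(3 min, 30 min) = 30 minutes, so a 20 minute window would be rejected. Note that, taken literally, the formula yields a bound of zero when execution.checkpointing.tolerable-failed-checkpoints is 0.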

            People

              nfraison.datadog Nicolas Fraison