Flink / FLINK-32890

Flink app rolled back to an old savepoint (3 hours back in time) although checkpoints had been taken in between


Details

    Description

      Here are all details about the issue:

      • Deployed a new release of a Flink app that adds a new operator
      • The Flink operator marked the deployment as stable
      • After some time the app failed and stayed in a failed state (due to an issue with our Kafka clusters)
      • We finally decided to roll back the new release, in case it was the root cause of the Kafka issue
      • The operator detected "Job is not running but HA metadata is available for last state restore, ready for upgrade" and logged "Deleting JobManager deployment while preserving HA metadata." It therefore relied on last-state (we do not disable the fallback), and no savepoint was taken
      • Flink started the JobManager and deployed the app. It did find the 3 checkpoints:
      • Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as Zookeeper namespace.
      • Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).
      • Recovering checkpoints from ZooKeeperStateHandleStore{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.
      • Found 3 checkpoints in ZooKeeperStateHandleStore{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.
      • Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19.
      • The job failed because of the missing operator (the checkpoint holds state for the operator added by the new release, which the rolled-back release does not contain):
      Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED.
      org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
      Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state f298e8715b4d85e6f965b60e1c848cbe
      • Job 6b24a364c1905e924a69f3dbff0d26a6 has been registered for cleanup in the JobResultStore after reaching a terminal state.
      • Clean up the high availability data for job 6b24a364c1905e924a69f3dbff0d26a6.
      • Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from ZooKeeperStateHandleStore{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.
      • The JobManager restarted and tried to resubmit the job, but the job had already reached a terminal state, so the application finished:
      • Received JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).
      • Ignoring JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.
      • Application completed SUCCESSFULLY
      • Finally the operator rolled back the deployment and still reported "Job is not running but HA metadata is available for last state restore, ready for upgrade"
      • But the job metadata is no longer there (it was cleaned up previously):
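
      To compare the restored checkpoint with the savepoint that was eventually used, the epoch-millisecond timestamp in the "Restoring job" log line can be converted to UTC. A minimal Python sketch, assuming the log line keeps the format shown above; the regular expression and the helper name parse_restore_line are illustrative and not part of Flink:

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern for the JobManager line "Restoring job <id> from Checkpoint <n> @ <epoch-ms> ..."
# (assumed from the log excerpt in this report, not a format guaranteed by Flink).
RESTORE_RE = re.compile(
    r"Restoring job (?P<job>[0-9a-f]{32}) from Checkpoint (?P<id>\d+) @ (?P<ts>\d+)"
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_restore_line(line):
    """Return (job_id, checkpoint_id, utc_datetime), or None if the line does not match."""
    match = RESTORE_RE.search(line)
    if match is None:
        return None
    # Adding integer milliseconds to the epoch avoids float rounding.
    when = EPOCH + timedelta(milliseconds=int(match.group("ts")))
    return match.group("job"), int(match.group("id")), when


line = (
    "Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 "
    "@ 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6"
)
job_id, checkpoint_id, taken_at = parse_restore_line(line)
print(checkpoint_id, taken_at.isoformat())
```

      For the line above this reports checkpoint 19, taken at 2023-08-17T10:26:43.920 UTC.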

       

      (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
      Path /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints doesn't exist
      (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs
      (CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico
      jobgraphs
      jobs
      leader
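
      The listing above shows that the HA namespace still exists while the job's checkpoint entries are gone. A check that tells these two cases apart could look like the following Python sketch. It is only an illustration and not the operator's actual logic: has_checkpoint_metadata and the list_children callback are hypothetical names, and the callback stands in for whatever ZooKeeper client is used.

```python
def has_checkpoint_metadata(list_children, namespace, job_id):
    """True only if at least one checkpoint entry exists for the job.

    list_children(path) is a caller-supplied function (hypothetical here) that
    returns the child names of a znode, or None when the path does not exist.
    """
    path = f"{namespace}/jobs/{job_id}/checkpoints"
    children = list_children(path)
    # Both "path missing" (None) and "no children" ([]) count as no metadata.
    return bool(children)


# In-memory stand-in for the tree shown in the listing above:
# the namespace has jobgraphs, jobs and leader, but jobs is empty.
NAMESPACE = "/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico"
JOB_ID = "6b24a364c1905e924a69f3dbff0d26a6"
tree = {
    NAMESPACE: ["jobgraphs", "jobs", "leader"],
    f"{NAMESPACE}/jobs": [],
}

namespace_exists = tree.get(NAMESPACE) is not None
checkpoints_exist = has_checkpoint_metadata(tree.get, NAMESPACE, JOB_ID)
print(namespace_exists, checkpoints_exist)
```

      With the state from the listing, the namespace is present but no checkpoint entry is found, which is the situation in which a last-state restore has nothing to restore from.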
      

       

      The app rolled back by the Flink operator finally restored from the last provided savepoint, as no metadata/checkpoints were available. But this last savepoint is an old one because, during the upgrade, the operator decided to rely on last-state and took no new savepoint (the old savepoint is a scheduled one).
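
      The effect can be summarised with a small Python model. It is a simplified sketch of the behaviour described in this report, not the operator's code: pick_restore_point is a hypothetical helper, and the savepoint time is an assumed value placed 3 hours before checkpoint 19 to match the issue title.

```python
from datetime import datetime, timedelta, timezone


def pick_restore_point(checkpoints, savepoints):
    """Return ("checkpoint" | "savepoint", time) for the newest usable state, or None.

    Simplified model: checkpoints recorded in HA metadata win; the last
    savepoint is used only when no checkpoint is left.
    """
    if checkpoints:
        return "checkpoint", max(checkpoints)
    if savepoints:
        return "savepoint", max(savepoints)
    return None


# Checkpoint 19 was taken at 1692268003920 ms since the epoch (from the log above).
chk_19 = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=1692268003920)
# Hypothetical scheduled savepoint, assumed to be 3 hours older.
scheduled_savepoint = chk_19 - timedelta(hours=3)

before_cleanup = pick_restore_point([chk_19], [scheduled_savepoint])
after_cleanup = pick_restore_point([], [scheduled_savepoint])
rewind = before_cleanup[1] - after_cleanup[1]
print(before_cleanup[0], after_cleanup[0], rewind)
```

      Once the HA cleanup has removed the checkpoint entries, the model falls back to the savepoint and the job rewinds by the full gap between the two.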


          People

            Assignee: Unassigned
            Reporter: Nicolas Fraison (nfraison.datadog)