Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Here are all the details about the issue:
- We deployed a new release of a Flink app that adds a new operator
- The Flink Operator marked the app as stable
- After some time the app failed and stayed in FAILED state (due to an issue with our Kafka clusters)
- We finally decided to roll back the new release, in case it was the root cause of the Kafka issue
- The operator detected "Job is not running but HA metadata is available for last state restore, ready for upgrade" and "Deleting JobManager deployment while preserving HA metadata." -> it relied on last-state (as we do not disable the fallback), so no savepoint was taken
- Flink started the JobManager and deployed the app. It correctly found the 3 checkpoints:
- Using '/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico' as Zookeeper namespace.
- Initializing job 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).
- Recovering checkpoints from ZooKeeperStateHandleStore{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.
- Found 3 checkpoints in ZooKeeperStateHandleStore{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints'}.
- Restoring job 6b24a364c1905e924a69f3dbff0d26a6 from Checkpoint 19 @ 1692268003920 for 6b24a364c1905e924a69f3dbff0d26a6 located at s3p://.../flink-kafka-job-apache-nico/checkpoints/6b24a364c1905e924a69f3dbff0d26a6/chk-19.
- The job then failed because of the missing operator: the restored checkpoint contains state for an operator that is not part of the rolled-back job graph (see the sketch after this description)
- Job 6b24a364c1905e924a69f3dbff0d26a6 reached terminal state FAILED. org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster. Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state f298e8715b4d85e6f965b60e1c848cbe
- Job 6b24a364c1905e924a69f3dbff0d26a6 has been registered for cleanup in the JobResultStore after reaching a terminal state.
- Clean up the high availability data for job 6b24a364c1905e924a69f3dbff0d26a6.
- Removed job graph 6b24a364c1905e924a69f3dbff0d26a6 from ZooKeeperStateHandleStore{namespace='flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobgraphs'}.
- The JobManager restarted and tried to resubmit the job, but the job had already reached a terminal state, so the submission was ignored and the application finished:
- Received JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6).
- Ignoring JobGraph submission 'flink-kafka-job' (6b24a364c1905e924a69f3dbff0d26a6) because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.
- Application completed SUCCESSFULLY
- Finally the operator rolled back the deployment and still reported "Job is not running but HA metadata is available for last state restore, ready for upgrade"
- But the job's HA metadata is no longer there (it was cleaned up previously), as shown by the ZooKeeper CLI (a programmatic check is sketched at the end of this description):
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints
Path /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints doesn't exist
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico/jobs
(CONNECTED [zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181]) /> ls /flink-kafka-job-apache-nico/flink-kafka-job-apache-nico
jobgraphs jobs leader
The rolled-back app managed by the Flink Operator finally starts from the last provided savepoint, since no HA metadata/checkpoints are available anymore. But that savepoint is an old one, because during the upgrade the operator decided to rely on last-state rather than take a new savepoint (the savepoint it falls back to is an older, scheduled one).
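To illustrate the restore failure above: Flink keys checkpointed state to each operator's uid hash, so restoring a checkpoint that contains state for an operator absent from the submitted job graph fails exactly as logged, unless non-restored state is explicitly allowed (for example via --allowNonRestoredState or execution.savepoint.ignore-unrestored-state). The following is only a minimal hypothetical sketch, not our actual job; operator names, uids, and the job structure are assumptions:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RollbackRestoreSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        // Rolled-back job graph: source -> map -> sink only.
        // The new release additionally contained a stateful operator, e.g.:
        //
        //   .map(new EnrichmentFunction()).uid("enrichment")   // hypothetical operator added by the new release
        //
        // Checkpoint chk-19 was taken by that new graph, so it carries state for the
        // "enrichment" uid hash. Restoring chk-19 into the graph below then fails with
        // "IllegalStateException: There is no operator for the state <hash>" unless
        // non-restored state is allowed.
        env.fromElements("a", "b", "c")
           .uid("source")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           .uid("pass-through")
           .print();

        env.execute("flink-kafka-job");
    }
}
```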
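The ZooKeeper check shown above can also be done programmatically, for example with the Apache Curator client; the connection string, namespace, and job id below are taken from the logs in this report, while the helper itself is only an illustrative sketch and not part of the operator:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CheckHaCheckpoints {
    public static void main(String[] args) throws Exception {
        String checkpointsPath =
            "/flink-kafka-job-apache-nico/flink-kafka-job-apache-nico"
            + "/jobs/6b24a364c1905e924a69f3dbff0d26a6/checkpoints";

        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zookeeper-data-eng-multi-cloud.zookeeper-flink.svc:2181",
            new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // A null Stat means the znode does not exist, i.e. the HA checkpoint
            // metadata the operator expects for a last-state restore is gone.
            if (client.checkExists().forPath(checkpointsPath) == null) {
                System.out.println("No checkpoint metadata under " + checkpointsPath);
            } else {
                System.out.println("Checkpoints: "
                        + client.getChildren().forPath(checkpointsPath));
            }
        } finally {
            client.close();
        }
    }
}
```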