Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- 1.17.2, 1.19.0, 1.18.1, 1.20.0
Description
If a JobManager failover occurs while the job is in a Restarting state, the HA metadata is deleted (as if the job had reached a globally terminal state) and the job restarts from an empty state after the JM comes back up. The logs below show the sequence.
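For context, HA metadata cleanup is only supposed to happen for globally terminal job states (FINISHED, CANCELED, FAILED); RESTARTING is not one of them. A minimal sketch of that distinction using the public JobStatus API (the shouldCleanUpHaData helper is hypothetical, for illustration only, and not Flink's actual cleanup path):

import org.apache.flink.api.common.JobStatus;

// Illustrative sketch only: HA metadata (checkpoint store / job graph ConfigMaps)
// should only be removed once the job reaches a globally terminal state.
// RESTARTING is not globally terminal, so a JM failover in this state must keep
// the HA data.
public class HaCleanupSketch {

    // Hypothetical helper, not part of Flink's API.
    static boolean shouldCleanUpHaData(JobStatus status) {
        return status.isGloballyTerminalState();
    }

    public static void main(String[] args) {
        System.out.println(shouldCleanUpHaData(JobStatus.RESTARTING)); // false -> keep HA metadata
        System.out.println(shouldCleanUpHaData(JobStatus.FINISHED));   // true  -> cleanup expected
    }
}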
JobManager killed after killing a TaskManager (job in restarting phase):
2024-02-26 10:10:12,147 DEBUG org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Ignore TaskManager pod that is already added: autoscaling-example-taskmanager-3-2
2024-02-26 10:10:13,799 DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Trigger heartbeat request.
2024-02-26 10:10:13,799 DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Trigger heartbeat request.
2024-02-26 10:10:13,799 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat request from 9b7e17b75812ab60ecf028e02368d0c2.
2024-02-26 10:10:13,799 DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Received heartbeat from 251c25cf794e3c9396fc02306613507b.
2024-02-26 10:10:14,091 DEBUG org.apache.pekko.remote.transport.netty.NettyTransport [] - Remote connection to [/10.244.0.120:55647] was disconnected because of [id: 0x4a61a791, /10.244.0.120:55647 :> /10.244.0.118:6123] DISCONNECTED
2024-02-26 10:10:14,091 DEBUG org.apache.pekko.remote.transport.ProtocolStateActor [] - Association between local [tcp://flink@10.244.0.118:6123] and remote [tcp://flink@10.244.0.120:55647] was disassociated because the ProtocolStateActor failed: Unknown
2024-02-26 10:10:14,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2024-02-26 10:10:14,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting KubernetesApplicationClusterEntrypoint down with application status UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
2024-02-26 10:10:14,095 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
2024-02-26 10:10:14,095 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting down rest endpoint.
2024-02-26 10:10:14,315 DEBUG org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.Watcher [] - Watcher closed
2024-02-26 10:10:14,511 DEBUG org.apache.pekko.actor.CoordinatedShutdown [] - Performing task [terminate-system] in CoordinatedShutdown phase [actor-system-terminate]
2024-02-26 10:10:14,595 INFO org.apache.pekko.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2024-02-26 10:10:14,596 INFO org.apache.pekko.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
When the new JM comes back up, it doesn't find any checkpoints because the HA metadata was deleted (we couldn't see the deletion in the logs of the shutting-down JM):
2024-02-26 10:10:30,294 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Recovering checkpoints from KubernetesStateHandleStore{configMapName='autoscaling-example-5ddd0b1ba346d3bfd5ef53a63772e43c-config-map'}.
2024-02-26 10:10:30,394 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Found 0 checkpoints in KubernetesStateHandleStore{configMapName='autoscaling-example-5ddd0b1ba346d3bfd5ef53a63772e43c-config-map'}.
Even the main method is re-run and the JobGraph is regenerated (which is expected, given that the HA metadata was removed incorrectly).
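One way to confirm this on a live cluster is to dump the keys of the job's HA ConfigMap before and after the failover and check whether the checkpoint entries are still there. A minimal diagnostic sketch using the fabric8 client (assumptions: fabric8 kubernetes-client on the classpath, namespace "default", ConfigMap name taken from the logs above):

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Diagnostic sketch: prints the keys stored in the job's HA ConfigMap so one can
// see whether checkpoint entries survived the JobManager failover.
public class DumpHaConfigMap {
    public static void main(String[] args) {
        String namespace = "default"; // assumption, adjust to the actual deployment
        String name = "autoscaling-example-5ddd0b1ba346d3bfd5ef53a63772e43c-config-map";
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            ConfigMap cm = client.configMaps().inNamespace(namespace).withName(name).get();
            if (cm == null || cm.getData() == null) {
                System.out.println("ConfigMap not found or has no data: " + name);
                return;
            }
            // An empty or near-empty data section right after the failover matches the bug.
            cm.getData().keySet().forEach(System.out::println);
        }
    }
}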
Attachments
Issue Links
- blocks
  - FLINK-34451 [Kubernetes Operator] Job with restarting TaskManagers uses wrong/misleading fallback approach (Open)
- links to