Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- 1.17.2, 1.19.0, 1.18.1, 1.20.0
Description
If a JobManager failover occurs while the job is in a Restarting state, the HA metadata is deleted (as if the job had reached a globally terminal state) and the job restarts from an empty state after the JM comes back up. The logs below show the sequence.
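For context, HA metadata cleanup is only supposed to happen for globally terminal job states (FINISHED, CANCELED, FAILED); RESTARTING is not one of them. A minimal sketch of that distinction using the public JobStatus API (the shouldCleanUpHaData helper is hypothetical, for illustration only, and not Flink's actual cleanup path):

import org.apache.flink.api.common.JobStatus;

// Illustrative sketch only: HA metadata (checkpoint store / job graph ConfigMaps)
// should only be removed once the job reaches a globally terminal state.
// RESTARTING is not globally terminal, so a JM failover in this state must keep
// the HA data.
public class HaCleanupSketch {

    // Hypothetical helper, not part of Flink's API.
    static boolean shouldCleanUpHaData(JobStatus status) {
        return status.isGloballyTerminalState();
    }

    public static void main(String[] args) {
        System.out.println(shouldCleanUpHaData(JobStatus.RESTARTING)); // false -> keep HA metadata
        System.out.println(shouldCleanUpHaData(JobStatus.FINISHED));   // true  -> cleanup expected
    }
}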
JobManager killed after killing a TaskManager (job in restarting phase):
2024-02-26 10:10:12,147 DEBUG org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Ignore TaskManager pod that is already added: autoscaling-example-taskmanager-3-2
2024-02-26 10:10:13,799 DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Trigger heartbeat request.
2024-02-26 10:10:13,799 DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Trigger heartbeat request.
2024-02-26 10:10:13,799 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat request from 9b7e17b75812ab60ecf028e02368d0c2.
2024-02-26 10:10:13,799 DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Received heartbeat from 251c25cf794e3c9396fc02306613507b.
2024-02-26 10:10:14,091 DEBUG org.apache.pekko.remote.transport.netty.NettyTransport [] - Remote connection to [/10.244.0.120:55647] was disconnected because of [id: 0x4a61a791, /10.244.0.120:55647 :> /10.244.0.118:6123] DISCONNECTED
2024-02-26 10:10:14,091 DEBUG org.apache.pekko.remote.transport.ProtocolStateActor [] - Association between local [tcp://flink@10.244.0.118:6123] and remote [tcp://flink@10.244.0.120:55647] was disassociated because the ProtocolStateActor failed: Unknown
2024-02-26 10:10:14,092 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2024-02-26 10:10:14,094 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting KubernetesApplicationClusterEntrypoint down with application status UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
2024-02-26 10:10:14,095 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
2024-02-26 10:10:14,095 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting down rest endpoint.
2024-02-26 10:10:14,315 DEBUG org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.Watcher [] - Watcher closed
2024-02-26 10:10:14,511 DEBUG org.apache.pekko.actor.CoordinatedShutdown [] - Performing task [terminate-system] in CoordinatedShutdown phase [actor-system-terminate]
2024-02-26 10:10:14,595 INFO org.apache.pekko.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
2024-02-26 10:10:14,596 INFO org.apache.pekko.remote.RemoteActorRefProvider$RemotingTerminator [] - Remote daemon shut down; proceeding with flushing remote transports.
When the new JM comes back up, it doesn't find any checkpoints because the HA metadata was deleted (we couldn't see the deletion in the logs of the shutting-down JM):
2024-02-26 10:10:30,294 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Recovering checkpoints from KubernetesStateHandleStore{configMapName='autoscaling-example-5ddd0b1ba346d3bfd5ef53a63772e43c-config-map'}.
2024-02-26 10:10:30,394 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils [] - Found 0 checkpoints in KubernetesStateHandleStore{configMapName='autoscaling-example-5ddd0b1ba346d3bfd5ef53a63772e43c-config-map'}.
Even the main method is re-run and the JobGraph is regenerated (which is expected, given that the HA metadata was removed incorrectly).
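One way to confirm this on a live cluster is to dump the keys of the job's HA ConfigMap before and after the failover and check whether the checkpoint entries are still there. A minimal diagnostic sketch using the fabric8 client (assumptions: fabric8 kubernetes-client on the classpath, namespace "default", ConfigMap name taken from the logs above):

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Diagnostic sketch: prints the keys stored in the job's HA ConfigMap so one can
// see whether checkpoint entries survived the JobManager failover.
public class DumpHaConfigMap {
    public static void main(String[] args) {
        String namespace = "default"; // assumption, adjust to the actual deployment
        String name = "autoscaling-example-5ddd0b1ba346d3bfd5ef53a63772e43c-config-map";
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            ConfigMap cm = client.configMaps().inNamespace(namespace).withName(name).get();
            if (cm == null || cm.getData() == null) {
                System.out.println("ConfigMap not found or has no data: " + name);
                return;
            }
            // An empty or near-empty data section right after the failover matches the bug.
            cm.getData().keySet().forEach(System.out::println);
        }
    }
}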
Attachments
Issue Links
- blocks
  - FLINK-34451 [Kubernetes Operator] Job with restarting TaskManagers uses wrong/misleading fallback approach (Open)
- links to