[FLINK-21008] Residual HA related Kubernetes ConfigMaps and ZooKeeper nodes when cluster entrypoint received SIGTERM in shutdown - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.11.3, 1.12.1, 1.13.0
Fix Version/s: 1.11.4, 1.13.0, 1.12.3
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

Recently, in our internal use of the native K8s integration with K8s HA enabled, we found that the leader-related ConfigMaps could be left behind in some corner cases.

After some investigation, I think this is possibly caused by an inappropriate shutdown process.

In ClusterEntrypoint#shutDownAsync, we first call closeClusterComponent, which also deregisters the Flink application from the cluster manager (e.g. Yarn, K8s). Then we call stopClusterServices and cleanupDirectories. If the cluster manager processes the deregistration very quickly, the JobManager process can receive signal 15 (SIGTERM) before or while stopClusterServices and cleanupDirectories are executing. The JVM process then exits immediately, so these two methods may not be executed.
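
To illustrate the ordering problem described above, here is a minimal, deterministic sketch. It is not Flink code: the class ShutdownOrderSketch and its methods runShutdown and skippedSteps are hypothetical, the three step names are only borrowed from the description, and the real methods are asynchronous. The sketch models a process that exits after a given number of steps have completed, as if SIGTERM arrived right after deregistration.

```java
import java.util.ArrayList;
import java.util.List;

public class ShutdownOrderSketch {

    // Step names mirror the methods named in the description.
    // This is an illustration only, not the actual Flink implementation.
    static final List<String> STEPS = List.of(
            "closeClusterComponent", // also deregisters the application
            "stopClusterServices",
            "cleanupDirectories");

    /**
     * Runs the steps in order and stops once stepsBeforeExit steps have
     * completed, modelling a JVM that exits on SIGTERM at that point.
     */
    static List<String> runShutdown(int stepsBeforeExit) {
        List<String> executed = new ArrayList<>();
        for (String step : STEPS) {
            if (executed.size() >= stepsBeforeExit) {
                break; // simulated process exit: remaining steps never run
            }
            executed.add(step);
        }
        return executed;
    }

    /** Returns the steps that never ran because the process exited early. */
    static List<String> skippedSteps(int stepsBeforeExit) {
        List<String> executed = runShutdown(stepsBeforeExit);
        return STEPS.subList(executed.size(), STEPS.size());
    }

    public static void main(String[] args) {
        // SIGTERM arrives right after deregistration (one step completed).
        System.out.println("executed: " + runShutdown(1));
        System.out.println("skipped:  " + skippedSteps(1));
    }
}
```

With one completed step, the skipped list contains stopClusterServices and cleanupDirectories, which matches the situation where the HA ConfigMaps or ZooKeeper nodes are left behind.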

Issue Links

relates to

FLINK-26772 Application and Job Mode does not wait for job cleanup during shutdown (Open)

links to

GitHub Pull Request #15396

People

Assignee: Yang Wang

Reporter: Yang Wang

Votes: 0

Watchers: 2

Dates

Created: 18/Jan/21 09:49

Updated: 20/Apr/22 10:21

Resolved: 09/Apr/21 10:15