[FLINK-21942] KubernetesLeaderRetrievalDriver not closed after terminated which lead to connection leak - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.12.2, 1.13.0
Fix Version/s: 1.13.0, 1.12.3
Component/s: Runtime / Coordination
Labels:
- k8s-ha
- pull-request-available

Description

It looks like KubernetesLeaderRetrievalDriver is not closed even after the KubernetesLeaderElectionDriver is closed and the job reaches a globally terminated state.
As a result, many ConfigMap watches stay active and keep their connections to K8s open.

When the number of open connections exceeds the maximum number of concurrent requests, new ConfigMap watches cannot be started. Eventually, every newly submitted job times out.
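
To illustrate the failure mode, here is a minimal sketch, not the Flink implementation. `BoundedWatchClient` and `RetrievalDriverSketch` are hypothetical names, and the connection limit is modeled with a plain `Semaphore` instead of a real Kubernetes client. It only shows that a watch which is never closed keeps its slot, and that closing the driver frees it again.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a client that allows only a bounded number of concurrent watches.
final class BoundedWatchClient {
    private final Semaphore slots;

    BoundedWatchClient(int maxConcurrentWatches) {
        this.slots = new Semaphore(maxConcurrentWatches);
    }

    // Returns a watch handle, or null if every connection slot is already in use.
    Watch tryWatch(String configMapName) {
        return slots.tryAcquire() ? new Watch(slots) : null;
    }

    int freeSlots() {
        return slots.availablePermits();
    }

    static final class Watch implements AutoCloseable {
        private final Semaphore slots;
        private final AtomicBoolean closed = new AtomicBoolean(false);

        Watch(Semaphore slots) {
            this.slots = slots;
        }

        @Override
        public void close() {
            // Release the slot exactly once, even if close() is called repeatedly.
            if (closed.compareAndSet(false, true)) {
                slots.release();
            }
        }
    }
}

// Hypothetical driver that owns one watch; forgetting to call close() leaks the slot.
final class RetrievalDriverSketch implements AutoCloseable {
    private final BoundedWatchClient.Watch watch;

    RetrievalDriverSketch(BoundedWatchClient client, String configMapName) {
        BoundedWatchClient.Watch w = client.tryWatch(configMapName);
        if (w == null) {
            throw new IllegalStateException("No free watch connection for " + configMapName);
        }
        this.watch = w;
    }

    @Override
    public void close() {
        watch.close();
    }
}
```

In this model, a terminated job whose driver is never closed holds its slot indefinitely, so later drivers fail to start once the limit is reached. That matches the symptom described above.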

fly_in_gis trohrmann This may be related to ~~FLINK-20695~~; could you confirm this issue?
However, when many jobs are running in the same session cluster, their ConfigMap watches do need to stay active. Maybe we should merge all ConfigMap watches into one?
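
The idea of merging the watches could look roughly like the sketch below. It is only an illustration under assumptions: `SharedConfigMapWatcherSketch`, `register`, and `onEvent` are hypothetical names, and events are fed in by hand instead of coming from a single real watch connection. Each job registers a listener for its own ConfigMap name, and one dispatcher routes events to the matching listeners.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical dispatcher: one shared event source, many per-ConfigMap listeners.
final class SharedConfigMapWatcherSketch {
    private final Map<String, List<Consumer<String>>> listeners = new ConcurrentHashMap<>();

    // Registers a listener and returns a handle that removes it again.
    Runnable register(String configMapName, Consumer<String> listener) {
        listeners
                .computeIfAbsent(configMapName, k -> new CopyOnWriteArrayList<>())
                .add(listener);
        return () -> {
            List<Consumer<String>> registered = listeners.get(configMapName);
            if (registered != null) {
                registered.remove(listener);
            }
        };
    }

    // Would be called by the single underlying watch; here it is invoked manually.
    void onEvent(String configMapName, String leaderAddress) {
        List<Consumer<String>> registered =
                listeners.getOrDefault(configMapName, Collections.emptyList());
        for (Consumer<String> listener : registered) {
            listener.accept(leaderAddress);
        }
    }
}
```

With this shape, a terminated job only has to run its unregister handle. The number of connections to K8s no longer grows with the number of jobs.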

Attachments

- image-2021-03-24-18-08-30-196.png (24/Mar/21 10:08, 360 kB, Yi Tang)
- image-2021-03-24-18-08-42-116.png (24/Mar/21 10:08, 363 kB, Yi Tang)
- jstack.l (25/Mar/21 02:27, 303 kB, Yi Tang)

Issue Links

is related to

- FLINK-22006 Could not run more than 20 jobs in a native K8s session when K8s HA enabled (Closed)
- FLINK-22054 Using a shared watcher for ConfigMap watching (Closed)

relates to

- FLINK-20695 Zookeeper node under leader and leaderlatch is not deleted after job finished (Closed)

links to

GitHub Pull Request #15407

People

Assignee: Yang Wang
Reporter: Yi Tang
Votes: 0
Watchers: 5

Dates

Created: 24/Mar/21 03:57
Updated: 28/Aug/21 11:13
Resolved: 01/Apr/21 17:25