Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Problem
- Affects Version/s: 1.13.6
- Fix Version/s: None
- Component/s: None
- Environment: Kubernetes 1.24, Flink Operator 1.4, Flink 1.13.6
Description
We are running Flink on AWS EKS and experienced Job Manager restarts when the EKS control plane scaled out/in.
I can reproduce this issue in my local environment as well.
Since I have no control over the EKS kube-apiserver, I built my own Kubernetes cluster with the following setup:
- Two kube-apiserver instances, with only one running at a time;
- Multiple Flink clusters deployed (with Flink Operator 1.4 and Flink 1.13);
- Flink Job Manager HA enabled;
- Job Manager leader-election timeouts configured:
high-availability.kubernetes.leader-election.lease-duration: "60s"
high-availability.kubernetes.leader-election.renew-deadline: "60s"
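As a rough illustration of how these two settings interact (a simplified sketch, not Flink's actual implementation; the `Lease` class and holder names are made up for this example), a leader that resumes renewing within the lease duration keeps leadership, while one that stays silent for the full lease duration is revoked:

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Toy model of a leader-election lease with a 60s lease duration."""
    holder: str
    last_renew: float            # seconds since test start
    lease_duration: float = 60.0

    def renew(self, holder: str, now: float) -> bool:
        """Renewal succeeds only while the caller still holds a live lease."""
        if self.holder == holder and now - self.last_renew < self.lease_duration:
            self.last_renew = now
            return True
        return False

    def expired(self, now: float) -> bool:
        return now - self.last_renew >= self.lease_duration


# A JM whose connection drops briefly and then retries at t=10s is still leader.
lease = Lease(holder="jm-1", last_renew=0.0)
assert lease.renew("jm-1", now=10.0)

# A JM that sends no renewals for the full 60s lease duration is revoked,
# matching the restart observed one lease duration after the switch-over.
stale = Lease(holder="jm-2", last_renew=0.0)
assert stale.expired(now=60.0)
assert not stale.renew("jm-2", now=60.0)
```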
For testing, I switch the running kube-apiserver from one instance to the other each time. While the kube-apiserver is switching over, I can see that some Job Managers restart while others keep running normally.
Here is an example. When the kube-apiserver switched over at 05:53:08, both JMs lost their connection to it. But there were no further connection errors after a few seconds, so I assume the connections recovered on retry.
However, one of the JMs (the second one in the attached screenshot) reported a "DefaultDispatcherRunner was revoked the leadership" error after the leader-election timeout (at 05:54:08, i.e. one 60s lease duration after the switch-over) and then restarted itself, while the other JM kept running normally.
From the kube-apiserver audit logs, the healthy JM was able to renew its leader lease after the interruption, but there were no lease renewal requests at all from the failed JM until it restarted.
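One way to check this in the audit logs (a sketch only: the records follow the Kubernetes audit format, but the ConfigMap and service-account names below are illustrative, and the assumption that Flink 1.13's Kubernetes HA renews leadership via leader-ConfigMap updates is mine) is to count renewal-style update events per Job Manager identity:

```python
import json
from collections import Counter

# Illustrative audit-log excerpt: each lease renewal shows up as an "update"
# on the leader ConfigMap, attributed to the JM's service account.
audit_lines = [
    '{"verb":"update","objectRef":{"resource":"configmaps","name":"flink-a-restserver-leader"},"user":{"username":"system:serviceaccount:flink:jm-a"}}',
    '{"verb":"update","objectRef":{"resource":"configmaps","name":"flink-a-restserver-leader"},"user":{"username":"system:serviceaccount:flink:jm-a"}}',
    '{"verb":"update","objectRef":{"resource":"configmaps","name":"flink-b-restserver-leader"},"user":{"username":"system:serviceaccount:flink:jm-b"}}',
]

renews = Counter()
for line in audit_lines:
    event = json.loads(line)
    if event["verb"] == "update" and event["objectRef"]["resource"] == "configmaps":
        renews[event["user"]["username"]] += 1

# A healthy JM keeps accumulating renewals after the interruption; a JM whose
# count stops growing is the one that will be revoked and restart.
print(renews)
```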