FLINK-33998

Flink Job Manager restarted after intermittent kube-apiserver connection


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.13.6
    • Fix Version/s: None
    • Component/s: None
    • Environment: Kubernetes 1.24

      Flink Operator 1.4

      Flink 1.13.6

    Description

      We are running Flink on AWS EKS and experienced a Job Manager restart issue when the EKS control plane scaled out/in.

      I can reproduce this issue in my local environment too.

      Since I have no control over the EKS kube-apiserver, I built a Kubernetes cluster on my own with the setup below:

      • Two kube-apiserver instances, only one running at a time;
      • Deploy multiple Flink clusters (with Flink Operator 1.4 and Flink 1.13);
      • Enable Flink Job Manager HA;
      • Configure the Job Manager leader-election timeouts:
      high-availability.kubernetes.leader-election.lease-duration: "60s"
      high-availability.kubernetes.leader-election.renew-deadline: "60s"
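      For context, these two settings sit in flink-conf.yaml alongside the other native-Kubernetes HA options. A minimal sketch, assuming Flink 1.13's HA factory class; the storage dir and cluster-id values here are hypothetical placeholders:

      ```yaml
      # Flink 1.13 native Kubernetes HA (values other than the two timeouts are hypothetical)
      high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
      high-availability.storageDir: s3://my-bucket/flink-ha   # hypothetical path
      kubernetes.cluster-id: my-flink-cluster                 # hypothetical
      high-availability.kubernetes.leader-election.lease-duration: "60s"
      high-availability.kubernetes.leader-election.renew-deadline: "60s"
      ```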

      For testing, I switched the running kube-apiserver from one instance to the other each time. While the kube-apiserver was switching over, I could see that some Job Managers restarted, while others kept running normally.

      Here is an example. When the kube-apiserver switched over at 05:53:08, both JMs lost their connection to the kube-apiserver. But after a few seconds there were no more connection errors; I guess the connections recovered via retries.

      However, one of the JMs (the 2nd one in the attached screenshot) reported a "DefaultDispatcherRunner was revoked the leadership" error after the leader-election timeout (at 05:54:08) and then restarted itself, while the other JM kept running normally.
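      The timing above lines up with the configured lease-duration: the leadership is revoked exactly 60s after the last successful renew. A minimal sketch of that arithmetic (the date is hypothetical; only the clock times from the logs matter):

      ```python
      from datetime import datetime, timedelta

      # lease-duration from the HA config above
      LEASE_DURATION = timedelta(seconds=60)

      # last successful renew before the kube-apiserver switch (time from the logs)
      last_renew = datetime(2024, 1, 1, 5, 53, 8)

      # if no renew arrives within the lease duration, leadership is revoked
      revoke_at = last_renew + LEASE_DURATION
      print(revoke_at.time())  # 05:54:08, matching the "revoked the leadership" error
      ```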

      From the kube-apiserver audit logs, the healthy JM was able to renew its leader lease after the interruption. But there were no lease-renew requests from the failed JM until it restarted.

       

      Attachments

        1. audit-log-no-restart.txt
          6 kB
          Xiangyan
        2. audit-log-restart.txt
          8 kB
          Xiangyan
        3. connection timeout.png
          207 kB
          Xiangyan
        4. jm-no-restart4.log
          204 kB
          Xiangyan
        5. jm-restart4.log
          161 kB
          Xiangyan

          People

            Assignee: Unassigned
            Reporter: Xiangyan
