Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Problem
- Affects Version/s: 1.13.6
- Fix Version/s: None
- Component/s: None
- Environment: Kubernetes 1.24, Flink Operator 1.4, Flink 1.13.6
Description
We are running Flink on AWS EKS and experienced Job Manager restarts when the EKS control plane scaled out/in.
I can reproduce this issue in my local environment as well.
Since I have no control over the EKS kube-apiserver, I built my own Kubernetes cluster with the following setup:
- Two kube-apiserver instances, with only one running at a time;
- Multiple Flink clusters deployed (with Flink Operator 1.4 and Flink 1.13);
- Flink Job Manager HA enabled;
- Job Manager leader-election timeouts configured:
high-availability.kubernetes.leader-election.lease-duration: "60s"
high-availability.kubernetes.leader-election.renew-deadline: "60s"
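As a rough illustration of how these two settings interact (a simplified sketch, not Flink's actual implementation; the `Lease` class and holder names are made up for this example), a leader that resumes renewing within the lease duration keeps leadership, while one that stays silent for the full lease duration is revoked:

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Toy model of a leader-election lease with a 60s lease duration."""
    holder: str
    last_renew: float            # seconds since test start
    lease_duration: float = 60.0

    def renew(self, holder: str, now: float) -> bool:
        """Renewal succeeds only while the caller still holds a live lease."""
        if self.holder == holder and now - self.last_renew < self.lease_duration:
            self.last_renew = now
            return True
        return False

    def expired(self, now: float) -> bool:
        return now - self.last_renew >= self.lease_duration


# A JM whose connection drops briefly and then retries at t=10s is still leader.
lease = Lease(holder="jm-1", last_renew=0.0)
assert lease.renew("jm-1", now=10.0)

# A JM that sends no renewals for the full 60s lease duration is revoked,
# matching the restart observed one lease duration after the switch-over.
stale = Lease(holder="jm-2", last_renew=0.0)
assert stale.expired(now=60.0)
assert not stale.renew("jm-2", now=60.0)
```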
For testing, I switch the running kube-apiserver from one instance to the other each time. While the kube-apiserver is switching over, I can see that some Job Managers restart while others keep running normally.
Here is an example. When the kube-apiserver switched over at 05:53:08, both JMs lost their connection to it. But there were no further connection errors after a few seconds, so I assume the connections recovered on retry.
However, one of the JMs (the second one in the attached screenshot) reported a "DefaultDispatcherRunner was revoked the leadership" error after the leader-election timeout (at 05:54:08, i.e. one 60s lease duration after the switch-over) and then restarted itself, while the other JM kept running normally.
From the kube-apiserver audit logs, the healthy JM was able to renew its leader lease after the interruption, but there were no lease renewal requests at all from the failed JM until it restarted.
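One way to check this in the audit logs (a sketch only: the records follow the Kubernetes audit format, but the ConfigMap and service-account names below are illustrative, and the assumption that Flink 1.13's Kubernetes HA renews leadership via leader-ConfigMap updates is mine) is to count renewal-style update events per Job Manager identity:

```python
import json
from collections import Counter

# Illustrative audit-log excerpt: each lease renewal shows up as an "update"
# on the leader ConfigMap, attributed to the JM's service account.
audit_lines = [
    '{"verb":"update","objectRef":{"resource":"configmaps","name":"flink-a-restserver-leader"},"user":{"username":"system:serviceaccount:flink:jm-a"}}',
    '{"verb":"update","objectRef":{"resource":"configmaps","name":"flink-a-restserver-leader"},"user":{"username":"system:serviceaccount:flink:jm-a"}}',
    '{"verb":"update","objectRef":{"resource":"configmaps","name":"flink-b-restserver-leader"},"user":{"username":"system:serviceaccount:flink:jm-b"}}',
]

renews = Counter()
for line in audit_lines:
    event = json.loads(line)
    if event["verb"] == "update" and event["objectRef"]["resource"] == "configmaps":
        renews[event["user"]["username"]] += 1

# A healthy JM keeps accumulating renewals after the interruption; a JM whose
# count stops growing is the one that will be revoked and restart.
print(renews)
```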