[FLINK-34333] Fix FLINK-34007 LeaderElector bug in 1.18 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.18.1
Fix Version/s: 1.18.2
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Release Note:
Fixes a bug where the leader election was unable to reacquire leadership after a leadership loss that was caused by renewing the lease token. The fix required upgrading fabric8io:kubernetes-client from v6.6.2 to v6.9.0.

Description

FLINK-34007 revealed a bug in the k8s client v6.6.2, which we have been using since Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19, which required an update of the k8s client to v6.9.0.

This Jira issue is about finding a solution in Flink 1.18 for the very same problem that FLINK-34007 covered. It is a dedicated Jira issue because we want to unblock the release of 1.19 by resolving FLINK-34007.

To summarize why the upgrade to v6.9.0 is desired: there is a bug in v6.6.2 that might prevent the leadership-lost event from being forwarded to the client (#5463). An initial proposal, in which the release call was handled in Flink's KubernetesLeaderElector, did not work because the leadership-lost event was triggered twice (see the FLINK-34007 PR comment).
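
To make the "triggered twice" problem concrete, the following is a minimal, hypothetical Java sketch of why a duplicated leadership-lost callback is harmful to a non-idempotent handler, and how a compare-and-set guard would collapse duplicates. The class `LeadershipLossGuard` and all of its methods are invented for illustration; they are not part of Flink's KubernetesLeaderElector or of the fabric8 client API, and this is not the fix that was applied here (the fix was the client upgrade to v6.9.0).

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical illustration only: neither Flink's KubernetesLeaderElector
 * nor any fabric8 kubernetes-client class.
 */
final class LeadershipLossGuard {
    // True while this contender believes it holds leadership.
    private final AtomicBoolean hasLeadership = new AtomicBoolean(false);
    // Counts how often the revocation logic actually ran.
    private final AtomicInteger handledLosses = new AtomicInteger();

    /** Called when leadership is acquired. */
    void onLeadershipGranted() {
        hasLeadership.set(true);
    }

    /**
     * Called when leadership is lost. Only the first call after a grant
     * runs the revocation logic; a duplicate event is ignored because
     * the compare-and-set fails.
     */
    void onLeadershipLost() {
        if (hasLeadership.compareAndSet(true, false)) {
            handledLosses.incrementAndGet();
        }
    }

    int getHandledLosses() {
        return handledLosses.get();
    }

    boolean hasLeadership() {
        return hasLeadership.get();
    }
}
```

With this guard, granting leadership once and then delivering the lost event twice runs the revocation logic a single time. Without such a guard, each duplicate event would run the revocation logic again.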

Issue Links

is caused by

FLINK-31997 Update to Fabric8 6.5.1+ in flink-kubernetes

Closed

split from

FLINK-34007 Flink Job stuck in suspend state after losing leadership in HA Mode

Resolved

links to

GitHub Pull Request #24245

People

Assignee: Matthias Pohl

Reporter: Matthias Pohl

Votes: 0

Watchers: 4

Dates

Created: 01/Feb/24 09:42

Updated: 24/May/24 16:01

Resolved: 13/Feb/24 16:20