We have noticed an issue with leaders being retrieved after reconnecting to zookeeper. The steps to reproduce this issue are to break the connection between a job manager that is not the leader and zookeeper. Wait for the session to be lost between the two. At this point, flink notifies for a loss of leader. After the loss of leader has occured, reconnect the job manager to zookeeper. At this point, the leader will still be the same as it was before, but when trying to access the rest API, you will see this
I have been using `stress -t 60 -m 2048` (which spins up 2048 threads continuously alloc and freeing 256MB, to swap out the job manager and cause the connection loss.
I have done some amount of digging on this. The ZooKeeperLeaderRetrievalService has this code block for handling state changes
It calls notifyLeaderLoss() when the connection is lost, but it doesn't do anything when the connection is reconnected. It appears that curator's NodeCache will retrieve the value of the leader znode after reconnect, but it won't notify the listeners if the value is the same as before the connection loss. So, unless a leader election happens after a zookeeper connection loss, the job managers that are not the leader will never know that there is a leader.
The method that is called for NodeCache when a new value is retrieved
seems to be preventing the job managers from getting the leader after a zookeeper connection loss.