[KAFKA-12252] Distributed herder tick thread loops rapidly when worker loses leadership - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.6.3, 2.7.2, 2.8.1, 3.0.0
Component/s: connect
Labels: None

Description

When a new session key is read from the config topic, if the worker is the leader, it schedules a new key rotation. The time between key rotations is configurable but defaults to an hour.

The herder then continues its tick loop, which usually ends with a long poll for rebalance activity. However, when a key rotation is scheduled, the herder limits the time spent polling at the end of the tick loop so that it can perform the rotation on time.

Once woken up, the worker checks whether a key rotation is necessary and, if so, sets the expected key rotation time to Long.MAX_VALUE and then writes a new session key to the config topic. The problem is that the worker only ever decides a key rotation is necessary if it is still the leader. If the worker is no longer the leader at the time of the key rotation (likely due to falling out of the cluster after losing contact with the group coordinator), its key expiration time won’t be reset. The long poll for rebalance activity at the end of the tick loop is then given a timeout of 0 ms, so the tick loop restarts immediately. Even if the worker reads a new session key from the config topic, it’ll continue looping like this: only a leader schedules a new rotation when a key is read, so the stale scheduled rotation time is never updated. At this point, the only thing that would return the worker to a healthy state is being made the leader of the cluster again.
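The behavior described above can be modeled roughly as follows. This is a simplified sketch, not the actual Connect herder code: the class name, method names, and parameters are all hypothetical, and only the Long.MAX_VALUE sentinel and the 0 ms timeout come from the description.

```java
// Hypothetical, simplified model of the tick loop's timeout logic as described
// in this issue. None of these names are taken from the Kafka code base.
final class TickTimeoutSketch {

    // Sentinel meaning "no key rotation is scheduled".
    public static final long NO_ROTATION = Long.MAX_VALUE;

    // Rotation check performed on each tick. Only a leader ever decides that a
    // rotation is necessary, so only a leader resets the expiration time.
    public static long maybeRotate(long nowMs, long keyExpirationMs, boolean isLeader) {
        if (isLeader && keyExpirationMs <= nowMs) {
            // A real worker would write a new session key to the config topic here.
            return NO_ROTATION;
        }
        // A non-leader keeps the (possibly already expired) rotation time.
        return keyExpirationMs;
    }

    // Timeout for the rebalance poll at the end of the tick. The poll is cut
    // short so that a scheduled rotation can happen on time, with no check of
    // whether this worker is still the leader.
    // Assumes nowMs is non-negative, so the subtraction cannot overflow.
    public static long pollTimeoutMs(long nowMs, long keyExpirationMs, long defaultTimeoutMs) {
        long untilRotationMs = Math.max(keyExpirationMs - nowMs, 0L);
        return Math.min(defaultTimeoutMs, untilRotationMs);
    }
}
```

In this model, a worker that has lost leadership with a rotation time in the past keeps that stale rotation time after `maybeRotate`, so `pollTimeoutMs` returns 0 on every tick and the loop spins. A leader in the same situation resets the rotation time to the sentinel and goes back to the full poll timeout.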

One possible fix could be to add a conditional check in the tick thread to only limit the time spent on rebalance polling if the worker is currently the leader.
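A minimal sketch of that proposed check, using hypothetical names and a simplified model of the timeout calculation. It illustrates the idea in the sentence above and is not necessarily how the linked pull request implements the fix.

```java
// Hypothetical sketch of the proposed fix: the rebalance poll is only cut
// short for a scheduled key rotation when this worker is currently the leader.
final class LeaderAwareTimeoutSketch {

    // Assumes nowMs is non-negative, so the subtraction cannot overflow.
    public static long pollTimeoutMs(long nowMs, long keyExpirationMs,
                                     long defaultTimeoutMs, boolean isLeader) {
        if (!isLeader) {
            // A non-leader never rotates keys, so a stale rotation time must
            // not shorten its poll.
            return defaultTimeoutMs;
        }
        long untilRotationMs = Math.max(keyExpirationMs - nowMs, 0L);
        return Math.min(defaultTimeoutMs, untilRotationMs);
    }
}
```

With this check, a non-leader whose rotation time has already passed polls for the full default timeout instead of 0 ms, while a leader is still woken in time to rotate the key.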

Issue Links

links to

GitHub Pull Request #10014

People

Assignee: Chris Egerton

Reporter: Chris Egerton

Votes: 0

Watchers: 3

Dates

Created: 29/Jan/21 15:25

Updated: 18/Jun/21 17:09

Resolved: 06/May/21 21:00