[KAFKA-9307] Transaction coordinator could be left in unknown state after ZK session timeout - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.0, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1
Fix Version/s: 2.2.3, 2.3.2, 2.4.1
Component/s: core
Labels: None

Description

We observed a case where the transaction coordinator could not load transaction state from a __transaction_state topic partition. Clients kept receiving COORDINATOR_LOAD_IN_PROGRESS errors until the broker hosting the coordinator was restarted.
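
To illustrate the client-side symptom: COORDINATOR_LOAD_IN_PROGRESS is a retriable error, so a caller that keeps retrying against a coordinator stuck in the loading state will only ever stop by timing out. The following is a minimal sketch of that effect, not the Kafka client's implementation. The function `await_coordinator`, its parameters, and the `is_loading` callback are hypothetical names introduced for this example.

```python
import time


def await_coordinator(is_loading, timeout_s=60.0, backoff_s=0.1,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll a (hypothetical) coordinator until it stops reporting 'loading'.

    Returns the number of attempts made. Raises TimeoutError if the
    coordinator is still loading when the deadline would be exceeded.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if not is_loading():
            # The coordinator finished loading; the request can proceed.
            return attempts
        if clock() + backoff_s > deadline:
            # A coordinator stuck in 'loading' always ends up here.
            raise TimeoutError(
                f"coordinator still loading after {attempts} attempts")
        sleep(backoff_s)
```

With a coordinator that never leaves the loading state, `await_coordinator(lambda: True, timeout_s=1.0)` raises `TimeoutError`, which mirrors the kind of timeout reported in the linked KAFKA-8803.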

This is the sequence of events that leads to the issue:

The broker is the leader of one (or more) transaction state topic partitions.
The broker loses its ZK session due to a network issue.
Broker reestablishes session with ZK, though there are still transient network issues.
Broker is made follower of the transaction state topic partition it was leading earlier.
During the become-follower transition, the broker loses its ZK session again.
The become-follower transition for this broker fails partway through, leaving the transaction topic partition in a partial leader / partial follower state. As a result, the transaction metadata could not be unloaded. However, the broker successfully caches the leader epoch associated with the LeaderAndIsrRequest.
Later, when the ZK session is finally established successfully, the broker ignores the become-follower transition because the leader epoch is the same as the one it had cached. This prevented the transaction metadata from being unloaded.
Because this partition was a partial follower, replica fetchers had been set up for it. The partition continued to fetch from the leader until it was made part of the ISR.
Once it was part of the ISR, preferred leader election kicked in and elected this broker as the leader.
When processing the become-leader transition, the transaction state load operation failed as we already had transaction metadata loaded at a previous epoch.
This meant that this partition was left in the "loading" state and we thus returned COORDINATOR_LOAD_IN_PROGRESS errors.
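
The sequence above can be replayed with a toy state machine. This is a sketch under stated assumptions, not Kafka's broker code: the class `ToyBroker`, its fields, and its return values are hypothetical, and the model only captures the three facts described in this issue (the leader epoch is cached before the transition completes, a request with an already-cached epoch is ignored, and a load fails when metadata from a previous epoch is still present).

```python
class ToyBroker:
    """Toy model of one broker's view of a single transaction state partition.

    All names here are hypothetical; this is not Kafka's actual code.
    """

    def __init__(self):
        self.cached_leader_epoch = 0
        self.role = "leader"          # the broker starts out as leader
        self.txn_metadata_epoch = 0   # epoch of loaded metadata, None if unloaded
        self.loading = False          # True while the coordinator is "loading"

    def become_follower(self, leader_epoch, zk_session_ok):
        # Requests whose epoch is not newer than the cached one are ignored.
        if leader_epoch <= self.cached_leader_epoch:
            return "ignored"
        # The epoch is cached before the rest of the transition runs.
        self.cached_leader_epoch = leader_epoch
        self.role = "follower"        # replica fetchers are set up
        if not zk_session_ok:
            # The transition aborts before transaction metadata is unloaded.
            return "failed"
        self.txn_metadata_epoch = None
        return "ok"

    def become_leader(self, leader_epoch):
        if leader_epoch <= self.cached_leader_epoch:
            return "ignored"
        self.cached_leader_epoch = leader_epoch
        self.role = "leader"
        self.loading = True
        if self.txn_metadata_epoch is not None:
            # Metadata from a previous epoch is still loaded, so the load
            # fails and the "loading" flag is never cleared.
            return "load_failed"
        self.txn_metadata_epoch = leader_epoch
        self.loading = False
        return "ok"

    def handle_client_request(self):
        return "COORDINATOR_LOAD_IN_PROGRESS" if self.loading else "NONE"


broker = ToyBroker()
first = broker.become_follower(1, zk_session_ok=False)  # ZK session lost mid-transition
retry = broker.become_follower(1, zk_session_ok=True)   # same epoch as cached
elected = broker.become_leader(2)                       # preferred leader election
print(first, retry, elected, broker.handle_client_request())
```

In this model the retried become-follower request is dropped because its epoch matches the cached one, so the stale metadata survives until the later become-leader transition, at which point the partition is stuck in the loading state.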

Restarting the broker that hosts the transaction state coordinator is the only way to recover from this situation.

Issue Links

causes: KAFKA-8803 Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId (Resolved)
relates to: KAFKA-8374 KafkaApis.handleLeaderAndIsrRequest not robust to ZooKeeper exceptions (Open)
links to: GitHub Pull Request #7840

People

Assignee: Dhruvil Shah
Reporter: Dhruvil Shah
Votes: 0
Watchers: 3

Dates

Created: 17/Dec/19 04:21
Updated: 26/Jan/21 20:26
Resolved: 23/Dec/19 23:51