[KAFKA-9144] Early expiration of producer state can cause coordinator epoch to regress - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.1, 2.1.1, 2.2.2, 2.3.1, 2.4.0
Fix Version/s: 2.2.3, 2.3.2, 2.4.1
Component/s: None
Labels: None

Description

Transaction markers are written by the transaction coordinator. To fence zombie coordinators, we use the leader epoch of the coordinator partition as the coordinator epoch. Partition leaders verify the epoch in each WriteTxnMarkers request and ensure that it can only increase. However, when producer state expires, the leader stops tracking the epoch, so monotonicity can be violated. We generally expect expiration to take on the order of days, so this should be unlikely to cause a problem.
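The fencing check described above can be sketched as a toy model. This is a simplified illustration, not the broker's actual code: the names `PartitionLeader`, `write_txn_marker`, `expire`, and `FencedCoordinatorError` are hypothetical, and the real check lives in the broker's producer state handling.

```python
from dataclasses import dataclass, field
from typing import Dict


class FencedCoordinatorError(Exception):
    """Raised when a marker carries an older coordinator epoch than the last one seen."""


@dataclass
class PartitionLeader:
    # Hypothetical model: producer id -> last coordinator epoch seen in a marker.
    coordinator_epochs: Dict[int, int] = field(default_factory=dict)

    def write_txn_marker(self, producer_id: int, coordinator_epoch: int) -> None:
        # With no tracked state, any epoch is accepted (-1 stands for "unknown").
        last_epoch = self.coordinator_epochs.get(producer_id, -1)
        if coordinator_epoch < last_epoch:
            raise FencedCoordinatorError(
                f"epoch {coordinator_epoch} is older than {last_epoch}"
            )
        self.coordinator_epochs[producer_id] = coordinator_epoch

    def expire(self, producer_id: int) -> None:
        # Expiring producer state also forgets the coordinator epoch.
        self.coordinator_epochs.pop(producer_id, None)


def zombie_is_fenced(leader: PartitionLeader, producer_id: int, epoch: int) -> bool:
    """Return True if a marker write with the given epoch is rejected."""
    try:
        leader.write_txn_marker(producer_id, epoch)
        return False
    except FencedCoordinatorError:
        return True
```

In this model, a leader that has seen epoch 5 rejects a marker with epoch 4, but accepts it again once the state for that producer has been expired. That gap is what allows the epoch to regress.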

At least that is the theory. We observed a case where the coordinator epoch decreased between two nearly consecutive writes, within a couple of minutes of each other. Upon investigation, we found that producer state had been incorrectly expired. We believe the sequence of events is the following:

1. The producer writes transactional data and fails before committing.
2. The coordinator times out the transaction and writes ABORT markers.
3. Upon seeing the ABORT and the bumped epoch, the partition leader deletes state from the last epoch, which effectively resets the last timestamp for the producer to -1.
4. The coordinator becomes a zombie before getting a successful response and continues trying to send.
5. The new coordinator notices the incomplete transaction and also sends markers.
6. The partition leader accepts the write from the new coordinator.
7. The producer state is expired because the last timestamp was -1.
8. The partition leader accepts the write from the old coordinator.

Basically, it takes an alignment of planets to hit this bug, but it is possible. If you hit it, the broker may be unable to start, because we validate epoch monotonicity during log recovery. The problem is in step 3, when the timestamp gets reset. We should use the timestamp from the marker instead.
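The sequence of events and the proposed fix can be replayed in a small simulation. This is a rough sketch under stated assumptions, not the actual broker logic: the `Leader` class, the `use_marker_timestamp` flag, the 7-day `EXPIRATION_MS` value, and the concrete epochs and timestamps are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict

NO_TIMESTAMP = -1
EXPIRATION_MS = 7 * 24 * 60 * 60 * 1000  # assumed expiration window (7 days)


@dataclass
class ProducerEntry:
    producer_epoch: int
    coordinator_epoch: int
    last_timestamp: int


class Leader:
    """Toy partition leader tracking per-producer state (hypothetical model)."""

    def __init__(self, use_marker_timestamp: bool) -> None:
        self.use_marker_timestamp = use_marker_timestamp
        self.entries: Dict[int, ProducerEntry] = {}

    def append_data(self, pid: int, producer_epoch: int, timestamp: int) -> None:
        self.entries[pid] = ProducerEntry(producer_epoch, -1, timestamp)

    def write_marker(self, pid: int, producer_epoch: int,
                     coordinator_epoch: int, marker_timestamp: int) -> bool:
        entry = self.entries.get(pid)
        if entry is not None and coordinator_epoch < entry.coordinator_epoch:
            return False  # zombie coordinator is fenced
        if self.use_marker_timestamp:
            timestamp = marker_timestamp  # proposed fix: take it from the marker
        elif entry is None or producer_epoch > entry.producer_epoch:
            timestamp = NO_TIMESTAMP  # bug: state from the old epoch is dropped
        else:
            timestamp = entry.last_timestamp
        self.entries[pid] = ProducerEntry(producer_epoch, coordinator_epoch, timestamp)
        return True

    def expire(self, now_ms: int) -> None:
        # Drop every producer whose last timestamp is older than the window.
        self.entries = {
            pid: e for pid, e in self.entries.items()
            if now_ms - e.last_timestamp < EXPIRATION_MS
        }


def replay(use_marker_timestamp: bool) -> bool:
    """Replay the steps; return True if the old coordinator's write is accepted."""
    now = 1_573_000_000_000  # arbitrary wall-clock time in milliseconds
    leader = Leader(use_marker_timestamp)
    leader.append_data(pid=1, producer_epoch=0, timestamp=now)   # step 1
    leader.write_marker(1, 1, 5, now + 60_000)                   # steps 2-3
    leader.write_marker(1, 1, 6, now + 90_000)                   # steps 5-6
    leader.expire(now + 100_000)                                 # step 7
    return leader.write_marker(1, 1, 5, now + 120_000)           # step 8 (zombie retry)
```

In this model, `replay(False)` lets the write with the older coordinator epoch through because the entry was expired with a timestamp of -1, while `replay(True)` keeps the entry alive and fences the zombie.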

Issue Links

causes

KAFKA-7698 Kafka Broker fail to start: ProducerFencedException thrown from producerstatemanager.scala!checkProducerEpoch

Resolved

KAFKA-8803 Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

Resolved

fixes

KAFKA-8242 Exception in ReplicaFetcher blocks replication of all other partitions

Resolved

is related to

KAFKA-13375 Kafka streams apps w/EOS unable to start at InitProducerId

Open

links to

GitHub Pull Request #7687

GitHub Pull Request #8960

People

Assignee: Jason Gustafson

Reporter: Jason Gustafson

Votes: 0

Watchers: 5

Dates

Created: 05/Nov/19 17:17

Updated: 26/Aug/23 11:54

Resolved: 09/Jan/20 21:59