[KAFKA-9803] Allow producers to recover gracefully from transaction timeouts - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: producer, streams
Labels:
- needs-kip

Description

Transaction timeouts are detected by the transaction coordinator. When the coordinator detects a timeout, it bumps the producer epoch and aborts the transaction. The epoch bump is necessary to prevent the current producer from writing to a new transaction that was not started through the coordinator.

Transactions may also be aborted if a new producer with the same `transactional.id` starts up. Similarly, this results in an epoch bump. Currently, the coordinator does not distinguish these two cases: both surface to the producer as a `ProducerFencedException`, which means the producer needs to shut itself down.

We can improve this with the new APIs from KIP-360. When the coordinator times out a transaction, it can remember that fact and allow the existing producer to claim the bumped epoch and continue. Roughly the logic would work like this:

1. When a transaction times out, set lastProducerEpoch to the current epoch and do the normal bump.
2. Any transactional requests from the old epoch result in a new TRANSACTION_TIMED_OUT error code, which is propagated to the application.
3. The producer recovers by sending InitProducerId with the current epoch. The coordinator returns the bumped epoch.
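The three steps above can be sketched as a toy, in-memory model of the coordinator's per-`transactional.id` state. This is only an illustration under stated assumptions, not Kafka's implementation: the class `TxnTimeoutSketch`, the `TxnState` type, its methods, and the `-1` sentinel are all hypothetical. Only `lastProducerEpoch`, `InitProducerId`, and the error names come from the description.

```java
// Toy model of the proposed coordinator logic; not Kafka's actual code.
// Every name here is hypothetical except lastProducerEpoch and the error names.
class TxnTimeoutSketch {

    enum ErrorCode { NONE, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

    /** State the coordinator would keep for one transactional.id. */
    static final class TxnState {
        int producerEpoch;           // real epochs are 16-bit; int keeps the sketch simple
        int lastProducerEpoch = -1;  // assumed sentinel: -1 means "no timeout remembered"

        TxnState(int initialEpoch) {
            this.producerEpoch = initialEpoch;
        }

        /** Step 1: remember the timed-out epoch, then do the normal bump. */
        void onTransactionTimeout() {
            lastProducerEpoch = producerEpoch;
            producerEpoch++;
        }

        /** A new producer instance starts up: bump without remembering a timeout. */
        void onNewProducerInstance() {
            lastProducerEpoch = -1;
            producerEpoch++;
        }

        /** Step 2: classify a transactional request that carries requestEpoch. */
        ErrorCode validate(int requestEpoch) {
            if (requestEpoch == producerEpoch) {
                return ErrorCode.NONE;
            }
            if (lastProducerEpoch >= 0 && requestEpoch == lastProducerEpoch) {
                return ErrorCode.TRANSACTION_TIMED_OUT;
            }
            return ErrorCode.PRODUCER_FENCED;
        }

        /**
         * Step 3: InitProducerId sent with the producer's current epoch.
         * Returns the bumped epoch if that epoch was timed out, otherwise -1.
         */
        int initProducerId(int currentEpoch) {
            if (lastProducerEpoch >= 0 && currentEpoch == lastProducerEpoch) {
                lastProducerEpoch = -1; // the bumped epoch has been claimed
                return producerEpoch;
            }
            return -1;
        }
    }

    public static void main(String[] args) {
        TxnState state = new TxnState(5);
        state.onTransactionTimeout();
        System.out.println(state.validate(5));          // TRANSACTION_TIMED_OUT
        int recovered = state.initProducerId(5);
        System.out.println(recovered);                  // 6
        System.out.println(state.validate(recovered));  // NONE

        state.onNewProducerInstance();
        System.out.println(state.validate(recovered));  // PRODUCER_FENCED
    }
}
```

In this model the only difference between a timeout and a fencing is whether the coordinator remembers the previous epoch, which is what lets it answer the old producer with a recoverable error instead of a fatal one.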

One issue that needs to be addressed is how to handle INVALID_PRODUCER_EPOCH from Produce requests. Partition leaders will not generally know whether a bumped epoch was the result of a timed-out transaction or a fenced producer. Possibly the producer can treat these errors as abortable when they come from Produce responses. In that case, the user would try to abort the transaction, and the coordinator's response to the abort would then show whether the bump was due to a timeout or to fencing.
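One way to picture the producer-side handling is as a small classification table keyed on where the error came from. The sketch below is hypothetical throughout: `ProduceEpochErrorSketch`, `Source`, `Code`, `Action`, and `classify` are illustrative names, and treating INVALID_PRODUCER_EPOCH from the coordinator as fatal is an assumption that mirrors the current fencing behavior.

```java
// Hypothetical sketch of how a producer might classify epoch errors.
// None of these types exist in the Kafka client; they only illustrate the idea.
class ProduceEpochErrorSketch {

    enum Source { PRODUCE_RESPONSE, COORDINATOR_RESPONSE }

    enum Code { INVALID_PRODUCER_EPOCH, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

    enum Action { ABORT_TRANSACTION, REINITIALIZE_EPOCH, SHUT_DOWN }

    static Action classify(Source source, Code code) {
        switch (code) {
            case INVALID_PRODUCER_EPOCH:
                // A partition leader cannot tell a timeout from a fencing, so the
                // error is treated as abortable and the coordinator decides later.
                return source == Source.PRODUCE_RESPONSE
                        ? Action.ABORT_TRANSACTION
                        : Action.SHUT_DOWN; // assumption: still fatal from the coordinator
            case TRANSACTION_TIMED_OUT:
                // Recoverable: send InitProducerId with the current epoch.
                return Action.REINITIALIZE_EPOCH;
            case PRODUCER_FENCED:
            default:
                // Another producer owns the transactional.id.
                return Action.SHUT_DOWN;
        }
    }

    public static void main(String[] args) {
        // Produce fails with a stale epoch, so the application aborts the transaction.
        System.out.println(classify(Source.PRODUCE_RESPONSE, Code.INVALID_PRODUCER_EPOCH));
        // The coordinator's answer to the abort then reveals the real cause.
        System.out.println(classify(Source.COORDINATOR_RESPONSE, Code.TRANSACTION_TIMED_OUT));
        System.out.println(classify(Source.COORDINATOR_RESPONSE, Code.PRODUCER_FENCED));
    }
}
```

The design choice illustrated here is to defer the decision: the Produce path never declares the producer dead, and only a coordinator response can trigger shutdown.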

Issue Links

requires:
- KAFKA-8436 Replace AddOffsetsToTxn request/response with automated protocol (Resolved)
- KAFKA-8639 Replace AddPartitionsToTxn request/response with automated protocol (Resolved)

Sub-Tasks

1.	Implement new transaction timed out error	Open	HaiyuanZhao
2.	Implement new producer fenced error	Resolved	Boyang Chen
3.	Make INVALID_PRODUCER_EPOCH abortable from Produce response	Resolved	Boyang Chen
4.	Save unnecessary end txn call when the transaction is confirmed to be done	Open	Unassigned

People

Assignee: Unassigned

Reporter: Jason Gustafson

Votes: 0

Watchers: 15

Dates

Created: 02/Apr/20 00:27

Updated: 23/Feb/24 10:17