Transaction timeouts are detected by the transaction coordinator. When the coordinator detects a timeout, it bumps the producer epoch and aborts the transaction. The epoch bump is necessary to prevent the current producer from writing to a new transaction that was not started through the coordinator.
Transactions may also be aborted if a new producer with the same `transactional.id` starts up. This also results in an epoch bump. Currently, the coordinator does not distinguish these two cases: both surface as a `ProducerFencedException`, which means the producer must shut itself down.
We can improve this with the new APIs from KIP-360. When the coordinator times out a transaction, it can remember that fact and allow the existing producer to claim the bumped epoch and continue. Roughly the logic would work like this:
1. When a transaction times out, set lastProducerEpoch to the current epoch and do the normal bump.
2. Any transactional requests from the old epoch result in a new TRANSACTION_TIMED_OUT error code, which is propagated to the application.
3. The producer recovers by sending InitProducerId with the current epoch. The coordinator returns the bumped epoch.
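
The three steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the coordinator's actual code: the class `TimeoutCoordinatorSketch`, its methods, the `timedOut` flag, and the `NO_EPOCH` sentinel are hypothetical names, and `PRODUCER_FENCED` stands in for whatever error a genuinely fenced producer would receive. Aborting the open transaction and persisting state are omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the proposed coordinator logic.
final class TimeoutCoordinatorSketch {
    enum ErrorCode { NONE, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

    // Sentinel meaning "no epoch" (an assumption of this sketch).
    static final short NO_EPOCH = -1;

    // Per-transactional.id state; only the fields needed for the sketch.
    static final class TxnState {
        short producerEpoch;
        short lastProducerEpoch = NO_EPOCH;
        boolean timedOut = false;

        TxnState(short producerEpoch) {
            this.producerEpoch = producerEpoch;
        }
    }

    private final Map<String, TxnState> states = new HashMap<>();

    void register(String transactionalId, short epoch) {
        states.put(transactionalId, new TxnState(epoch));
    }

    // Step 1: remember the timed-out epoch, then do the normal bump.
    void onTransactionTimeout(String transactionalId) {
        TxnState s = states.get(transactionalId);
        s.lastProducerEpoch = s.producerEpoch;
        s.producerEpoch = (short) (s.producerEpoch + 1);
        s.timedOut = true;
    }

    // A new producer instance also bumps the epoch, but the old one is truly fenced.
    void onNewProducerInstance(String transactionalId) {
        TxnState s = states.get(transactionalId);
        s.lastProducerEpoch = NO_EPOCH;
        s.producerEpoch = (short) (s.producerEpoch + 1);
        s.timedOut = false;
    }

    // Step 2: requests from the timed-out epoch get the new error code.
    ErrorCode checkTransactionalRequest(String transactionalId, short requestEpoch) {
        TxnState s = states.get(transactionalId);
        if (requestEpoch == s.producerEpoch) {
            return ErrorCode.NONE;
        }
        if (s.timedOut && requestEpoch == s.lastProducerEpoch) {
            return ErrorCode.TRANSACTION_TIMED_OUT;
        }
        return ErrorCode.PRODUCER_FENCED;
    }

    // Step 3: the producer sends its current epoch and receives the bumped one.
    // Returns NO_EPOCH when the caller is not entitled to the bumped epoch.
    short initProducerId(String transactionalId, short currentEpoch) {
        TxnState s = states.get(transactionalId);
        boolean claimsBump = s.timedOut && currentEpoch == s.lastProducerEpoch;
        boolean alreadyCurrent = currentEpoch == s.producerEpoch;
        return (claimsBump || alreadyCurrent) ? s.producerEpoch : NO_EPOCH;
    }
}
```

In this model the only difference between the two epoch-bump paths is whether the previous epoch is remembered, which is what lets the coordinator tell a timed-out producer apart from a fenced one.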
One issue that needs to be addressed is how to handle INVALID_PRODUCER_EPOCH from Produce requests. Partition leaders will not generally know whether a bumped epoch was the result of a timed-out transaction or a fenced producer. Possibly the producer can treat these errors as abortable when they come from Produce responses. In that case, the user would try to abort the transaction, and the coordinator's response would show whether the epoch bump was due to a timeout or to fencing.
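
The two-stage handling described here could look roughly like the following on the producer side. This is a sketch under assumptions rather than the client's real error handling: `ProduceErrorHandlingSketch`, its enums, and both method names are hypothetical, and `PRODUCER_FENCED` is an assumed name for the error returned to a genuinely fenced producer.

```java
// Hypothetical producer-side classification; not the actual client code.
final class ProduceErrorHandlingSketch {
    enum ErrorCode { NONE, INVALID_PRODUCER_EPOCH, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }
    enum ProduceOutcome { OK, ABORTABLE }
    enum AbortOutcome { ABORTED, RECOVER_VIA_INIT_PRODUCER_ID, SHUT_DOWN }

    // Stage 1: a partition leader cannot tell timeout from fencing, so
    // INVALID_PRODUCER_EPOCH from a Produce response is only abortable.
    static ProduceOutcome classifyProduceError(ErrorCode error) {
        switch (error) {
            case NONE:
                return ProduceOutcome.OK;
            case INVALID_PRODUCER_EPOCH:
                return ProduceOutcome.ABORTABLE;
            default:
                // Other codes are outside the scope of this sketch.
                throw new IllegalArgumentException("not modeled: " + error);
        }
    }

    // Stage 2: the abort goes to the coordinator, which knows the real cause.
    static AbortOutcome resolveOnAbort(ErrorCode coordinatorError) {
        switch (coordinatorError) {
            case NONE:
                return AbortOutcome.ABORTED;
            case TRANSACTION_TIMED_OUT:
                // Timed out: the same producer may claim the bumped epoch.
                return AbortOutcome.RECOVER_VIA_INIT_PRODUCER_ID;
            case PRODUCER_FENCED:
                // Another instance took over: the producer must shut down.
                return AbortOutcome.SHUT_DOWN;
            default:
                throw new IllegalArgumentException("not modeled: " + coordinatorError);
        }
    }
}
```

The point of splitting the decision is that the fatal-versus-recoverable verdict is deferred until a component that actually knows the cause, the coordinator, has been consulted.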
| Task | Status | Assignee |
|---|---|---|
| Implement new transaction timed out error | Open | |
| Implement new producer fenced error | Resolved | |
| Make INVALID_PRODUCER_EPOCH abortable from Produce response | Resolved | |
| Save unnecessary end txn call when the transaction is confirmed to be done | Open | Unassigned |