[KAFKA-5477] TransactionalProducer sleeps unnecessarily long during back to back transactions - ASF JIRA

I am running some perf tests for EOS and there is a severe perf impact with our default configs.

Here is the issue.

When we do a commit transaction, the producer sends an `EndTxn` request to the coordinator. The coordinator writes the `PrepareCommit` message to the transaction log and then returns the response to the client. It writes the transaction markers and the final `CompleteCommit` message asynchronously.
In the meantime, if the client starts another transaction, it will send an `AddPartitions` request on the next `Sender.run` loop. If the markers haven't been written yet, then the coordinator will return a retriable `CONCURRENT_TRANSACTIONS` error to the client.
The current behavior in the producer is to sleep for `retryBackoffMs` before retrying the request. The current default for this is 100ms. So the producer will sleep for 100ms before sending the `AddPartitions` again. This puts a floor on the latency for back to back transactions.
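
To make the floor concrete, here is a rough virtual-time sketch of the retry loop described above. It is not producer code: the class and method names are made up for illustration, the 5 ms marker-write time is an assumed figure rather than a measurement, and network round trips are ignored.

```java
public class BackoffFloorSketch {
    /**
     * Returns the simulated time (ms after the first AddPartitions attempt) at which
     * AddPartitions is finally accepted. Assumes the coordinator finishes writing the
     * markers at markerWriteMs and that every CONCURRENT_TRANSACTIONS response costs
     * the producer one full backoff before the next attempt.
     */
    static long addPartitionsDelayMs(long markerWriteMs, long retryBackoffMs) {
        if (retryBackoffMs <= 0) {
            throw new IllegalArgumentException("retryBackoffMs must be positive");
        }
        long now = 0;
        // Each attempt made before the markers are written is rejected and retried.
        while (now < markerWriteMs) {
            now += retryBackoffMs;
        }
        return now;
    }

    public static void main(String[] args) {
        // Markers assumed to take 5 ms; default backoff of 100 ms.
        System.out.println(addPartitionsDelayMs(5, 100)); // 100
        // Same assumed marker time with a 10 ms backoff.
        System.out.println(addPartitionsDelayMs(5, 10));  // 10
    }
}
```

Under these assumptions the delay is set by the backoff, not by how long the markers take to write, as long as the markers finish within one backoff interval.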

The impact: Back to back transactions (the typical use case for streams) would have a latency floor of 100ms.

Ideally, we don't want to sleep the full 100ms in this particular case, because the retry is 'expected'.

The options are:

1. Do nothing, let streams override the `retry.backoff.ms` in their producer to 10 when EOS is enabled (since they have a HOTFIX patch out anyway).
2. Introduce a special `transactionRetryBackoffMs` non-configurable variable and hard code that to a low value which applies to all transactional requests.
3. Do nothing and fix it properly in 0.11.0.1.
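
For reference, the first option amounts to a producer config override along these lines. This is only a sketch of the idea, not what streams actually ships: the class and method names are hypothetical, the bootstrap address is a placeholder, and serializer settings are left out for brevity.

```java
import java.util.Properties;

public class EosProducerConfigSketch {
    /** Builds producer properties with the lowered backoff suggested in the first option. */
    static Properties eosProducerProps(String transactionalId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("transactional.id", transactionalId);   // enables transactions
        props.put("enable.idempotence", "true");          // required for transactions
        props.put("retry.backoff.ms", "10");              // default is 100
        return props;
    }

    public static void main(String[] args) {
        Properties props = eosProducerProps("my-txn-id"); // hypothetical id
        System.out.println(props.getProperty("retry.backoff.ms"));
    }
}
```

Note that this lowers the backoff for every retriable error on that producer, not just `CONCURRENT_TRANSACTIONS`.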

Option 2 as stated is a one-line fix. If we want to lower the retry backoff just for this particular error, it would be a slightly bigger change (10-15 lines).
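
The difference between the two variants could look roughly like the sketch below. Everything here is illustrative: the enum, the method names, and the 20 ms value are assumptions and are not taken from the producer code or from the linked pull request.

```java
public class TxnBackoffSketch {
    // Hypothetical error classification, for illustration only.
    enum TxnError { NONE, CONCURRENT_TRANSACTIONS, OTHER_RETRIABLE }

    static final long RETRY_BACKOFF_MS = 100;            // current default
    static final long TRANSACTION_RETRY_BACKOFF_MS = 20; // assumed "low value"

    /** Variant A: one hard-coded backoff for all transactional requests. */
    static long backoffForRequest(boolean isTransactionalRequest) {
        return isTransactionalRequest ? TRANSACTION_RETRY_BACKOFF_MS : RETRY_BACKOFF_MS;
    }

    /** Variant B: the shorter backoff only when the retry is the expected one. */
    static long backoffForError(TxnError error) {
        return error == TxnError.CONCURRENT_TRANSACTIONS
                ? TRANSACTION_RETRY_BACKOFF_MS
                : RETRY_BACKOFF_MS;
    }

    public static void main(String[] args) {
        System.out.println(backoffForRequest(true));                          // 20
        System.out.println(backoffForError(TxnError.CONCURRENT_TRANSACTIONS)); // 20
        System.out.println(backoffForError(TxnError.OTHER_RETRIABLE));         // 100
    }
}
```

Variant B keeps the normal backoff for unexpected failures, at the cost of threading the error type through to wherever the retry delay is chosen.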

links to

GitHub Pull Request #3377