Affects Version/s: 0.11.0.0
Fix Version/s: 0.11.0.0
I am running some perf tests for EOS and there is a severe perf impact with our default configs.
Here is the issue.
- When we do a commit transaction, the producer sends an `EndTxn` request to the coordinator. The coordinator writes the `PrepareCommit` message to the transaction log and then returns the response the client. It writes the transaction markers and the final 'CompleteCommit' message asynchronously.
- In the mean time, if the client starts another transaction, it will send an `AddPartitions` request on the next `Sender.run` loop. If the markers haven't been written yet, then the coordinator will return a retriable `CONCURRENT_TRANSACTIONS` error to the client.
- The current behavior in the producer is to sleep for `retryBackoffMs` before retrying the request. The current default for this is 100ms. So the producer will sleep for 100ms before sending the `AddPartitions` again. This puts a floor on the latency for back to back transactions.
The impact: Back to back transactions (the typical usecase for streams) would have a latency floor of 100ms.
Ideally, we don't want to sleep the full 100ms in this particular case, because the retry is 'expected'.
The options are:
- do nothing, let streams override the retry.backoff.ms in their producer to 10 when EOS is enabled (since they have a HOTFIX patch out anyway).
- Introduce a special 'transactionRetryBackoffMs' non-configurable variable and hard code that to a low value which applies to all transactional requests.
- do nothing and fix it properly in 0.11.0.1
Option 2 as stated is a 1 line fix. If we want to lower the retry just for this particular error, it would be a slightly bigger change (10-15 lines).