Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.6.0
- Fix Version/s: None
Description
I am facing an issue on the remote data path for uncommitted reads.
As mentioned in the original PR, if a transaction spans a long sequence of segments, the time taken to retrieve the producer snapshots from remote storage can, in the worst case, become prohibitive and block reads if it consistently exceeds the fetch request deadline (fetch.max.wait.ms).
Essentially, the method used to compute the uncommitted records to return in a fetch response has asymptotic complexity proportional to the number of segments in the log. This is not a problem with local storage, since the constant factor for traversing the producer snapshot files is small enough, but that is not the case with remote storage, which exhibits much higher read latency.
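To make the cost concrete, here is a minimal sketch of the per-segment walk described above. All names and types are hypothetical stand-ins, not Kafka's actual API; the point is only that each iteration pays one remote round trip, so total latency scales linearly with the number of segments a transaction spans:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not actual Kafka code: illustrates why collecting
// aborted transactions for a fetch is O(#segments) on the remote path.
public class AbortedTxnScanSketch {

    record AbortedTxn(long producerId, long firstOffset, long lastOffset) {}
    record SegmentMetadata(long baseOffset, long endOffset) {}

    interface RemoteStorage {
        // One remote round trip per call: the dominant cost on this path.
        List<AbortedTxn> readAbortedTxns(SegmentMetadata segment);
    }

    interface RemoteLogMetadata {
        SegmentMetadata segmentContaining(long offset);   // null if none
        SegmentMetadata nextSegment(SegmentMetadata seg); // null at log end
    }

    static List<AbortedTxn> collectAbortedTxns(RemoteLogMetadata metadata,
                                               RemoteStorage storage,
                                               long fetchOffset,
                                               long upperBoundOffset) {
        List<AbortedTxn> result = new ArrayList<>();
        SegmentMetadata segment = metadata.segmentContaining(fetchOffset);
        // Walk segment by segment up to the fetch upper bound. With remote
        // storage, each iteration pays a full remote read latency, so a
        // transaction spanning many segments can push the total past the
        // fetch.max.wait.ms deadline.
        while (segment != null && segment.baseOffset() < upperBoundOffset) {
            result.addAll(storage.readAbortedTxns(segment));
            segment = metadata.nextSegment(segment);
        }
        return result;
    }
}
{code}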
An aggravating factor was lock contention in the remote index cache, which has since been mitigated by KAFKA-15084. Unfortunately, despite the improvements observed once that contention was removed, the algorithmic complexity of the current method used to compute uncommitted records can always defeat any optimisation made on the remote read path.
Maybe we could start thinking (if not already) about a different construct which would reduce that complexity to O(1) - i.e. make the computation independent of the number of segments and of the spans of transactions.
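Purely as an illustration of the direction (not a proposal from this ticket): one possible shape for such a construct is a consolidated, log-level index of aborted transactions kept sorted by offset, so a fetch answers with a single bounded range lookup whose cost depends on the number of matching transactions rather than on the number of segments. Everything below is hypothetical:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of a log-level aborted-transaction index. Entries are
// keyed by the transaction's last offset so a range query can skip every
// transaction that ended before the fetch position.
public class LogLevelAbortedTxnIndex {

    public record AbortedTxn(long producerId, long firstOffset, long lastOffset) {}

    // lastOffset -> transactions ending at that offset. Assumed to be
    // maintained as transactions abort (e.g. when segments are uploaded).
    private final NavigableMap<Long, List<AbortedTxn>> byLastOffset = new TreeMap<>();

    public void add(AbortedTxn txn) {
        byLastOffset.computeIfAbsent(txn.lastOffset(), k -> new ArrayList<>()).add(txn);
    }

    // Returns aborted transactions overlapping [fetchOffset, upperBoundOffset):
    // any txn with lastOffset >= fetchOffset and firstOffset < upperBoundOffset.
    // Cost is O(log n + matches), independent of the segment count.
    public List<AbortedTxn> lookup(long fetchOffset, long upperBoundOffset) {
        List<AbortedTxn> result = new ArrayList<>();
        for (List<AbortedTxn> bucket : byLastOffset.tailMap(fetchOffset, true).values()) {
            for (AbortedTxn txn : bucket) {
                if (txn.firstOffset() < upperBoundOffset) {
                    result.add(txn);
                }
            }
        }
        return result;
    }
}
{code}

The trade-off is moving cost to the write path: such an index would have to be built and persisted as segments are uploaded (and replicated alongside the remote log metadata), in exchange for a read path that no longer walks per-segment files.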
Issue Links
- is a child of
  - KAFKA-16947 Kafka Tiered Storage V2 (Open)
- is related to
  - KAFKA-16780 Txn consumer exerts pressure on remote storage when collecting aborted transactions (In Progress)