Details
-
Sub-task
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
3.6.0
-
None
-
None
Description
KAFKA-14844 showed the destructive nature of a timeout on the first produce request for a topic partition (i.e., one that has no state in the producer state manager, psm).
Since we currently don't validate the first sequence (we will in part 2 of KIP-890), any transient error on the first produce can lead to out-of-order sequences that never recover.
Originally, KAFKA-14561 relied on the producer's retry mechanism for these transient issues, but until that is fixed, we may need to retry from within the AddPartitionsManager instead. We addressed the concurrent transactions error, but other errors, such as coordinator loading, could still occur and cause more out-of-order sequence issues.
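A rough sketch of what a broker-side retry could look like, for discussion only. Everything here is hypothetical: `VerificationRetrySketch`, `Outcome`, and `verifyWithRetry` are illustrative names, not the actual AddPartitionsManager code, and the set of errors treated as transient is an assumption. The outcome names are loosely modeled on the errors mentioned above. A real change would presumably also add backoff and a timeout, which are left out here.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names are Kafka's actual classes.
final class VerificationRetrySketch {

    // Illustrative subset of verification outcomes (local enum, not Kafka's Errors).
    enum Outcome { NONE, CONCURRENT_TRANSACTIONS, COORDINATOR_LOAD_IN_PROGRESS, INVALID_TXN_STATE }

    // Assumption: these are the outcomes worth retrying on the broker side.
    static final Set<Outcome> TRANSIENT =
        EnumSet.of(Outcome.CONCURRENT_TRANSACTIONS, Outcome.COORDINATOR_LOAD_IN_PROGRESS);

    // Runs the verification attempt up to maxAttempts times, retrying only while
    // the outcome is transient. Returns the last outcome observed.
    static Outcome verifyWithRetry(Supplier<Outcome> attempt, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Outcome last = attempt.get();
        for (int tries = 1; tries < maxAttempts && TRANSIENT.contains(last); tries++) {
            last = attempt.get();
        }
        return last;
    }
}
```

The point of retrying here rather than returning the error is that the first produce for a partition never surfaces a transient failure to the producer, so the unvalidated first sequence is not exposed to the out-of-order case described above. Non-transient outcomes are returned immediately, and the attempt bound keeps a stuck coordinator from blocking the request indefinitely.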