Description
I was looking at the logs of a ReassignPartitionsTest system test failure.
The producer log shows the produce error ReplicaNotAvailableException for two records, but the data logs of all the brokers contain those records. The offsets of these two records were then returned as successful produce responses for two subsequent records, which do not appear in the logs, and hence the test failed.
Broker logs of the leader at the time of the reassignment and leader change show:
{{[2019-11-11 07:23:17,727] ERROR [ReplicaManager broker=3] Error processing append operation on partition test_topic-17 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.ReplicaNotAvailableException: Partition test_topic-5 is not available}}
This fails the append operation on `test_topic-17` because a different partition, `test_topic-5`, was unavailable for fetch. I think it is a fetch, since a produce would have thrown NotLeaderForPartitionException rather than ReplicaNotAvailableException.
We don't expect DelayedFetch to throw exceptions and it looks like we are not handling `ReplicaNotAvailableException`.
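The suspected failure mode can be illustrated with a minimal, self-contained sketch. This is not Kafka's code: `DelayedFetchSketch`, `PendingFetch`, `ReplicaUnavailable`, `tryCompleteUnsafe`, `tryCompleteGuarded` and `append` are hypothetical names, and the sketch assumes that an append to one partition checks parked fetches that also watch other partitions. Completing the fetch when the error occurs is one possible way to handle it, not necessarily the right fix.

```java
import java.util.List;
import java.util.Set;

final class DelayedFetchSketch {
    /** Hypothetical stand-in for ReplicaNotAvailableException. */
    static final class ReplicaUnavailable extends RuntimeException {
        ReplicaUnavailable(String partition) {
            super("Partition " + partition + " is not available");
        }
    }

    /** A parked fetch that watches several partitions (hypothetical model). */
    static final class PendingFetch {
        final List<String> partitions;
        boolean completed = false;

        PendingFetch(List<String> partitions) {
            this.partitions = partitions;
        }

        // Unguarded check: an unavailable partition throws to whoever triggered the check.
        boolean tryCompleteUnsafe(Set<String> unavailable) {
            for (String p : partitions) {
                if (unavailable.contains(p)) {
                    throw new ReplicaUnavailable(p);
                }
            }
            return false; // nothing forces completion, so the fetch stays parked
        }

        // Guarded check: the error completes the fetch instead of escaping to the caller.
        boolean tryCompleteGuarded(Set<String> unavailable) {
            try {
                return tryCompleteUnsafe(unavailable);
            } catch (ReplicaUnavailable e) {
                completed = true;
                return true;
            }
        }
    }

    /** Appends to one partition, then checks the parked fetches that watch it. */
    static String append(String partition, List<PendingFetch> parked,
                         Set<String> unavailable, boolean guarded) {
        try {
            // The record would be written to the local log at this point.
            for (PendingFetch fetch : parked) {
                if (!fetch.partitions.contains(partition)) {
                    continue;
                }
                if (guarded) {
                    fetch.tryCompleteGuarded(unavailable);
                } else {
                    fetch.tryCompleteUnsafe(unavailable);
                }
            }
            return "OK";
        } catch (ReplicaUnavailable e) {
            // The append is reported as failed although only the fetch check failed.
            return "append to " + partition + " failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        List<PendingFetch> parked =
            List.of(new PendingFetch(List.of("test_topic-17", "test_topic-5")));
        Set<String> unavailable = Set.of("test_topic-5");
        System.out.println(append("test_topic-17", parked, unavailable, false));
        System.out.println(append("test_topic-17", parked, unavailable, true));
    }
}
```

In the unguarded call, the append to `test_topic-17` is reported as failed with an error that names `test_topic-5`, which mirrors the shape of the broker log above. In the guarded call the append succeeds and the parked fetch is completed instead.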
I am not sure if this fixes the issues with ReassignPartitionsTest, but this seems to be a scenario that we should fix.