Description
When migrating producer ID blocks from ZK to KRaft, we take the current producer ID block from ZK and write its "firstProducerId" into the producer IDs KRaft record. However, in KRaft we store the next producer ID block in the log rather than the current block, as ZK does. As a result, the first block given to a caller of AllocateProducerIds is a duplicate of the last block allocated in ZK mode.
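The overlap can be illustrated with a minimal sketch. This is a toy model, not the controller code: the function names (`migrate_buggy`, `migrate_next_block`, `kraft_allocate`) are hypothetical, and the block size of 1000 is an assumption chosen to match the producer ID in the example further down.

```python
BLOCK_SIZE = 1000  # assumed block length, for illustration only


def migrate_buggy(zk_current_block_start):
    """Model of the behavior described above: the start of the *current*
    ZK block is written as if it were KRaft's *next* block start."""
    return zk_current_block_start


def migrate_next_block(zk_current_block_start, block_size=BLOCK_SIZE):
    """One possible way to avoid the overlap (hypothetical): record the
    start of the block that follows the current ZK block."""
    return zk_current_block_start + block_size


def kraft_allocate(next_block_start, block_size=BLOCK_SIZE):
    """Toy AllocateProducerIds: hand out the stored next block and
    advance the stored value by one block."""
    block = range(next_block_start, next_block_start + block_size)
    return block, next_block_start + block_size


# Last block handed out in ZK mode (hypothetical starting point).
zk_block = range(376000, 376000 + BLOCK_SIZE)

buggy_block, _ = kraft_allocate(migrate_buggy(zk_block.start))
safe_block, _ = kraft_allocate(migrate_next_block(zk_block.start))

buggy_overlap = set(zk_block) & set(buggy_block)
safe_overlap = set(zk_block) & set(safe_block)
```

In this model `buggy_overlap` contains every ID in the ZK block, while `safe_overlap` is empty, which matches the duplicate-block behavior described above.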
This can result in duplicate producer IDs being given to transactional or idempotent producers. In the case of transactional producers, this can cause long-term problems since the producer IDs are persisted and reused for a long time.
This bug is possible in the window between the last producer ID block being allocated by the ZK controller and all the brokers being restarted following the metadata migration.
Symptoms of this bug include OutOfOrderSequenceException errors from ReplicaManager and possibly some producer epoch validation errors. To see if a cluster is affected by this bug, search the logs for the offending producer ID and check whether it is being used by more than one producer.
For example, the following error was observed:
Out of order sequence number for producer 376000 at offset 381338 in partition REDACTED: 0 (incoming seq. number), 21 (current end sequence number)
Searching the org.apache.kafka.clients.producer.internals.TransactionManager logs for "376000" then showed the same producer ID being provisioned on two brokers:
Broker 0 [Producer clientId=REDACTED-0] ProducerId set to 376000 with epoch 1
Broker 5 [Producer clientId=REDACTED-1] ProducerId set to 376000 with epoch 1
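The log search described above could be automated along these lines. This is a rough sketch that assumes the TransactionManager lines follow the "ProducerId set to ... with epoch ..." shape shown in this ticket; the helper name `find_duplicate_producer_ids` and the regular expression are illustrative, not part of Kafka.

```python
import re
from collections import defaultdict

# Assumed log shape, taken from the lines quoted in this ticket:
#   [Producer clientId=<id>] ProducerId set to <pid> with epoch <epoch>
# Extra fields after clientId inside the brackets are tolerated.
PATTERN = re.compile(
    r"\[Producer clientId=([^\],]+)[^\]]*\] ProducerId set to (\d+) with epoch (\d+)"
)


def find_duplicate_producer_ids(lines):
    """Return {producer_id: sorted client IDs} for every producer ID
    that was provisioned to more than one client."""
    clients_by_pid = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            client_id, pid = match.group(1), int(match.group(2))
            clients_by_pid[pid].add(client_id)
    return {
        pid: sorted(clients)
        for pid, clients in clients_by_pid.items()
        if len(clients) > 1
    }


sample = [
    "[Producer clientId=REDACTED-0] ProducerId set to 376000 with epoch 1",
    "[Producer clientId=REDACTED-1] ProducerId set to 376000 with epoch 1",
]
duplicates = find_duplicate_producer_ids(sample)
```

Any producer ID present in `duplicates` was handed to more than one client and is worth checking against the OutOfOrderSequenceException errors.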