[KAFKA-10706] Liveness bug in truncation protocol can lead to indefinite URP - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.4.2, 2.5.2, 2.6.1, 2.7.1
Component/s: None
Labels: None

Description

We hit an interesting liveness condition in the truncation protocol. Broker A was leader in epoch 7, broker B was leader in epoch 8, and then broker A was leader in epoch 9 again.

On broker A, we had the following state in the epoch cache:

epoch 4, start offset 3953
epoch 7, start offset 3983
epoch 9, start offset 3988

On broker B, we had the following:

epoch 4, start offset 3953
epoch 8, start offset 3983

After A was elected in epoch 9, broker B sent epoch 8 in its OffsetsForLeaderEpoch request. Broker A correctly responded with epoch 7 ending at offset 3988. The log end offset on broker B was in fact 3983, so this truncation had no effect. Broker B then retried with epoch 8 again, and replication was stuck.
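
The loop can be reproduced with a small model of the two epoch caches. This is only a sketch under simplifying assumptions, not Kafka's actual code: the function names are made up for illustration, broker A's log end offset is assumed to be 3988, and the follower is assumed to truncate to the smaller of the leader's answer and its own log end offset, leaving its epoch cache untouched when that truncation is a no-op.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Each cache is a list of (epoch, start_offset) pairs sorted by epoch.
EpochCache = List[Tuple[int, int]]


def end_offset_for_epoch(
    cache: EpochCache, log_end_offset: int, requested_epoch: int
) -> Optional[Tuple[int, int]]:
    """Leader side (hypothetical helper): find the largest cached epoch that is
    <= requested_epoch and return (epoch, end_offset). The end offset is the
    start offset of the next cached epoch, or the log end offset if none."""
    epochs = [epoch for epoch, _ in cache]
    i = bisect_right(epochs, requested_epoch) - 1
    if i < 0:
        return None
    end = cache[i + 1][1] if i + 1 < len(cache) else log_end_offset
    return cache[i][0], end


def follower_round(
    follower_cache: EpochCache,
    follower_leo: int,
    leader_cache: EpochCache,
    leader_leo: int,
):
    """Follower side (simplified): ask about the latest cached epoch, then
    truncate to min(leader's end offset, own log end offset)."""
    requested = follower_cache[-1][0]
    answer = end_offset_for_epoch(leader_cache, leader_leo, requested)
    if answer is None:
        return requested, None, follower_cache, follower_leo
    target = min(answer[1], follower_leo)
    if target < follower_leo:
        # A real truncation also drops epoch entries at or beyond the target.
        follower_cache = [(e, s) for e, s in follower_cache if s < target]
        follower_leo = target
    # Otherwise the truncation is a no-op and the cache keeps its latest epoch.
    return requested, answer, follower_cache, follower_leo


broker_a = [(4, 3953), (7, 3983), (9, 3988)]
broker_b = [(4, 3953), (8, 3983)]
a_leo = 3988  # assumption: not stated in the description
b_leo = 3983

requested_epochs = []
for _ in range(3):
    requested, answer, broker_b, b_leo = follower_round(
        broker_b, b_leo, broker_a, a_leo
    )
    requested_epochs.append(requested)
    print(requested, answer, broker_b, b_leo)
```

Every round in this model requests epoch 8, receives epoch 7 ending at 3988, and leaves broker B's state unchanged, which matches the stuck replication described above.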

When a replica becomes leader, it first inserts an entry into the epoch cache with the current log end offset. This ensures that it has a larger epoch in the cache than any epoch that could be requested by a valid replica. However, I think it is incorrect to turn around and use this epoch when becoming a follower. It seems like we need symmetric logic after becoming a follower to remove this epoch entry.
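
One way to picture the symmetric cleanup is sketched below. This is an illustration of the idea only, not the change that was merged: `drop_empty_latest_epoch` is a hypothetical name, and the rule it applies (remove the newest entry when its start offset is at or beyond the log end offset, meaning no records were written in that epoch) is an assumption about how "this epoch entry" would be identified.

```python
from typing import List, Tuple

# Each cache is a list of (epoch, start_offset) pairs sorted by epoch.
EpochCache = List[Tuple[int, int]]


def drop_empty_latest_epoch(cache: EpochCache, log_end_offset: int) -> EpochCache:
    """Hypothetical follower-side cleanup: if the newest epoch entry starts at
    or beyond the log end offset, no records exist for it, so drop it."""
    if cache and cache[-1][1] >= log_end_offset:
        return cache[:-1]
    return list(cache)


broker_b = [(4, 3953), (8, 3983)]
b_leo = 3983

cleaned = drop_empty_latest_epoch(broker_b, b_leo)
next_requested_epoch = cleaned[-1][0]
print(cleaned, next_requested_epoch)
```

In this model broker B would next ask about epoch 4 rather than epoch 8. Broker A's cache places the end of epoch 4 at 3983 (the start offset of epoch 7), which equals B's log end offset, so no further truncation would be needed and fetching could resume.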

Issue Links

links to

GitHub Pull Request #9633

People

Assignee: Jason Gustafson

Reporter: Jason Gustafson

Votes: 0

Watchers: 2

Dates

Created: 11/Nov/20 05:54

Updated: 07/Dec/23 11:47

Resolved: 21/Nov/20 17:56