  1. Kafka
  2. KAFKA-10706

Liveness bug in truncation protocol can lead to indefinite URP


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4.2, 2.5.2, 2.6.1, 2.7.1
    • Component/s: None
    • Labels: None

    Description

      We hit an interesting liveness condition in the truncation protocol. Broker A was leader in epoch 7, broker B was leader in epoch 8, and then broker A was leader in epoch 9 again.

      On broker A, we had the following state in the epoch cache:

      epoch 4, start offset 3953
      epoch 7, start offset 3983
      epoch 9, start offset 3988
      

      On broker B, we had the following:

      epoch 4, start offset 3953
      epoch 8, start offset 3983
      

      After A was elected, broker B sent epoch 8 in OffsetsForLeaderEpoch. Broker A correctly responded with epoch 7 ending at offset 3988. The end offset on broker B was in fact 3983, so this truncation had no effect. Broker B then retried with epoch 8 again and replication was stuck.
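
      The stuck exchange can be replayed with a small model. This is a sketch under stated assumptions, not Kafka's fetcher code: the class and method names are hypothetical, the follower logic is simplified to "truncate to the smaller of the leader's end offset and the local log end", and broker A's log end offset (3988) is assumed because the report does not state it.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the exchange in this report; not Kafka's implementation.
class TruncationLoopSketch {

    /** Leader side: {epoch, endOffset} for the largest cached epoch <= requestedEpoch. */
    static long[] endOffsetFor(TreeMap<Integer, Long> leaderCache, long leaderLogEnd, int requestedEpoch) {
        Map.Entry<Integer, Long> floor = leaderCache.floorEntry(requestedEpoch);
        if (floor == null) {
            return new long[] {-1L, -1L}; // no epoch at or below the request
        }
        // An epoch ends where the next epoch starts, or at the log end if it is the latest.
        Map.Entry<Integer, Long> next = leaderCache.higherEntry(floor.getKey());
        long endOffset = (next != null) ? next.getValue() : leaderLogEnd;
        return new long[] {floor.getKey(), endOffset};
    }

    /** Follower side (simplified): a target at or beyond the log end is a no-op. */
    static long truncateTo(TreeMap<Integer, Long> followerCache, long followerLogEnd, long target) {
        if (target >= followerLogEnd) {
            return followerLogEnd; // nothing to truncate, so the epoch cache is left untouched
        }
        followerCache.values().removeIf(startOffset -> startOffset >= target);
        return target;
    }

    /** Returns the epoch the follower requests in each round. */
    static int[] requestedEpochs(TreeMap<Integer, Long> leaderCache, long leaderLogEnd,
                                 TreeMap<Integer, Long> followerCache, long followerLogEnd, int rounds) {
        int[] requested = new int[rounds];
        long logEnd = followerLogEnd;
        for (int i = 0; i < rounds; i++) {
            requested[i] = followerCache.lastKey();
            long[] reply = endOffsetFor(leaderCache, leaderLogEnd, requested[i]);
            logEnd = truncateTo(followerCache, logEnd, Math.min(reply[1], logEnd));
        }
        return requested;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Long> a = new TreeMap<>(); // broker A: epoch -> start offset
        a.put(4, 3953L);
        a.put(7, 3983L);
        a.put(9, 3988L);
        TreeMap<Integer, Long> b = new TreeMap<>(); // broker B: epoch -> start offset
        b.put(4, 3953L);
        b.put(8, 3983L);

        long[] reply = endOffsetFor(a, 3988L, 8);
        System.out.println(reply[0] + " ends at " + reply[1]); // 7 ends at 3988
        System.out.println(Arrays.toString(requestedEpochs(a, 3988L, b, 3983L, 3))); // [8, 8, 8]
    }
}
```

      In this model every round requests epoch 8, because the no-op truncation never removes the epoch 8 entry from broker B's cache.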

      When a replica becomes leader, it first inserts an entry into the epoch cache with the current log end offset. This ensures that it has a larger epoch in the cache than any epoch that could be requested by a valid replica. However, I think it is incorrect to turn around and use this epoch when becoming a follower. It seems like we need symmetric logic after becoming a follower to remove this epoch entry.
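
      One way to picture the symmetric step proposed above is the following sketch. It is illustrative only: `EpochCacheSketch` and its methods are hypothetical names, and the removal condition (drop any entry whose start offset is at or beyond the log end, meaning the epoch never received a record) is an assumption about what "remove this epoch entry" could mean, not the change that was committed to Kafka.

```java
import java.util.TreeMap;

// Hypothetical epoch cache with symmetric leader/follower transitions.
class EpochCacheSketch {
    // leader epoch -> start offset of that epoch
    final TreeMap<Integer, Long> entries = new TreeMap<>();

    /** On election: record the new epoch as starting at the current log end offset. */
    void onBecomeLeader(int epoch, long logEndOffset) {
        entries.put(epoch, logEndOffset);
    }

    /** Assumed symmetric step: drop entries for epochs that never received a record. */
    void onBecomeFollower(long logEndOffset) {
        entries.values().removeIf(startOffset -> startOffset >= logEndOffset);
    }

    /** Epoch the follower would send in OffsetsForLeaderEpoch, or -1 if the cache is empty. */
    int latestEpoch() {
        return entries.isEmpty() ? -1 : entries.lastKey();
    }

    public static void main(String[] args) {
        EpochCacheSketch b = new EpochCacheSketch();
        b.entries.put(4, 3953L);
        b.onBecomeLeader(8, 3983L);          // broker B elected in epoch 8, no records written
        System.out.println(b.latestEpoch()); // 8
        b.onBecomeFollower(3983L);           // broker B demoted with log end still 3983
        System.out.println(b.latestEpoch()); // 4
    }
}
```

      With the empty epoch 8 entry gone, broker B would request epoch 4, for which broker A's cache gives an end offset of 3983 (the start of epoch 7). That matches broker B's log end, so there is nothing left to disagree about.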

            People

              Assignee: Jason Gustafson (hachikuji)
              Reporter: Jason Gustafson (hachikuji)