KAFKA-7128: Lagging high watermark can lead to committed data loss after ISR expansion


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.2, 2.0.1, 2.1.0
    • Component/s: None
    • Labels: None

    Description

      Some model checking exposed a weakness in the ISR expansion logic. We know that the high watermark can go backwards after a leader failover, but we may not have known that this can lead to the loss of committed data.

      Say we have three replicas: r1, r2, and r3. Initially, the ISR consists of (r1, r2) and the leader is r1. r3 is a new replica which has not begun fetching. The data up to offset 10 has been committed to the ISR. Here is the initial state:

      State 1
      ISR: (r1, r2)
      Leader: r1
      r1: [hw=10, leo=10]
      r2: [hw=5, leo=10]
      r3: [hw=0, leo=0]
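
      To make the states easier to follow, here is a minimal, self-contained Java sketch (illustrative only, not Kafka code; all class and field names are made up) that models a replica as its high watermark and log end offset and prints State 1:

      import java.util.LinkedHashSet;
      import java.util.List;
      import java.util.Set;

      // Toy model of a replica: just the two offsets used in the states above.
      class ReplicaState {
          final String id;
          long highWatermark;   // hw
          long logEndOffset;    // leo

          ReplicaState(String id, long hw, long leo) {
              this.id = id;
              this.highWatermark = hw;
              this.logEndOffset = leo;
          }

          @Override
          public String toString() {
              return id + ": [hw=" + highWatermark + ", leo=" + logEndOffset + "]";
          }
      }

      class IsrExpansionModel {
          public static void main(String[] args) {
              ReplicaState r1 = new ReplicaState("r1", 10, 10);
              ReplicaState r2 = new ReplicaState("r2", 5, 10);
              ReplicaState r3 = new ReplicaState("r3", 0, 0);
              Set<String> isr = new LinkedHashSet<>(List.of("r1", "r2"));
              String leader = "r1";

              System.out.println("ISR: " + isr + ", leader: " + leader);
              for (ReplicaState r : List.of(r1, r2, r3)) {
                  System.out.println(r);
              }
          }
      }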

      Replica 1 then initiates shutdown (or fails) and leaves the ISR, which makes r2 the new leader. Note that r2's high watermark still lags the one r1 had.

      State 2
      ISR: (r2)
      Leader: r2
      r1 (offline): [hw=10, leo=10]
      r2: [hw=5, leo=10]
      r3: [hw=0, leo=0]

      Replica 3 then catches up to the high watermark on r2 and joins the ISR. Perhaps its high watermark is lagging behind r2's, but this is unimportant.

      State 3
      ISR: (r2, r3)
      Leader: r2
      r1 (offline): [hw=10, leo=10]
      r2: [hw=5, leo=10]
      r3: [hw=0, leo=5]
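
      For clarity, the following Java sketch (a fragment mirroring the logic described here, not the actual Partition/ReplicaManager code) shows the shape of the expansion check that permits State 3: the leader admits a follower once the follower's LEO reaches the leader's local high watermark, which after the failover is the lagging value 5 rather than the true committed offset 10.

      class IsrJoinCheckBuggy {
          // Hypothetical leader-side check, mirroring the scenario above (not actual Kafka code).
          // followerLeo   : the follower's log end offset, as reported in its fetch requests.
          // leaderLocalHw : the leader's *local* high watermark, which can lag the previous
          //                 leader's high watermark after a failover (5 here rather than 10).
          static boolean canJoinIsr(long followerLeo, long leaderLocalHw) {
              return followerLeo >= leaderLocalHw;
          }

          public static void main(String[] args) {
              // State 3: r3 has leo=5 and the new leader r2 has a lagging hw=5, so r3 is admitted
              // even though offsets 5 to 10 were committed under the previous leader r1.
              System.out.println("r3 admitted to ISR? " + canJoinIsr(5, 5));  // true
          }
      }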

      Now r2 fails, and r3 is elected leader as the sole member of the ISR. The committed data from offsets 5 to 10 has been lost.

      State 4
      ISR: (r3)
      Leader: r3
      r1 (offline): [hw=10, leo=10]
      r2 (offline): [hw=5, leo=10]
      r3: [hw=0, leo=5]
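
      To make the damage concrete, a tiny arithmetic check using the values from the states above:

      class CommittedDataLoss {
          public static void main(String[] args) {
              long trueCommittedOffset = 10;  // high watermark on the old leader r1
              long newLeaderLeo = 5;          // r3's log end offset when it is elected
              // Offsets 5..9 were committed under r1 but are missing from the new leader r3.
              System.out.println("Committed records lost: " + (trueCommittedOffset - newLeaderLeo));  // 5
          }
      }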

      The bug is that we allowed r3 into the ISR once it had caught up to the leader's (lagging) local high watermark. Since the follower does not know the true high watermark for the previous leader's epoch, the leader should not allow a replica to join the ISR until the replica has caught up to an offset within the leader's own epoch.
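
      A sketch of the stricter condition suggested above (again illustrative; the parameter names are made up and this is not the committed patch): besides reaching the local high watermark, the follower must also have caught up to an offset inside the current leader epoch, i.e. at least the epoch's start offset, which is an offset the new leader does know precisely.

      class IsrJoinCheckFixed {
          // Hypothetical fixed check: only count progress the current leader can vouch for.
          // currentEpochStartOffset: the leader's log end offset at the moment its epoch began
          // (10 for r2 in State 3), so r3 with leo=5 is no longer admitted to the ISR.
          static boolean canJoinIsr(long followerLeo, long leaderLocalHw, long currentEpochStartOffset) {
              return followerLeo >= leaderLocalHw
                      && followerLeo >= currentEpochStartOffset;
          }

          public static void main(String[] args) {
              // State 3: r2's epoch starts at offset 10 (its LEO at election), so r3 with leo=5
              // is kept out of the ISR until it has replicated the committed data up to offset 10.
              System.out.println("r3 admitted to ISR? " + canJoinIsr(5, 5, 10));  // false
          }
      }

      With a rule of this shape, r3 would not become leader-eligible in State 3 before replicating up to offset 10, so the failure of r2 could not silently discard committed records.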

      Note this is related to https://cwiki.apache.org/confluence/display/KAFKA/KIP-207%3A+Offsets+returned+by+ListOffsetsResponse+should+be+monotonically+increasing+even+during+a+partition+leader+change


            People

              Assignee: Anna Povzner (apovzner)
              Reporter: Jason Gustafson (hachikuji)
              Votes: 0
              Watchers: 6
