KAFKA-7128: Lagging high watermark can lead to committed data loss after ISR expansion


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.2, 2.0.1, 2.1.0
    • Component/s: None
    • Labels: None

    Description

      Some model checking exposed a weakness in the ISR expansion logic. We know that the high watermark can go backwards after a leader failover, but we may not have known that this can lead to the loss of committed data.

      Say we have three replicas: r1, r2, and r3. Initially, the ISR consists of (r1, r2) and the leader is r1. r3 is a new replica which has not begun fetching. The data up to offset 10 has been committed to the ISR. Here is the initial state:

      State 1
      ISR: (r1, r2)
      Leader: r1
      r1: [hw=10, leo=10]
      r2: [hw=5, leo=10]
      r3: [hw=0, leo=0]
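
      To make the states easier to follow, here is a minimal, self-contained Java sketch (illustrative only, not Kafka code; all class and field names are made up) that models a replica as its high watermark and log end offset and prints State 1:

      import java.util.LinkedHashSet;
      import java.util.List;
      import java.util.Set;

      // Toy model of a replica: just the two offsets used in the states above.
      class ReplicaState {
          final String id;
          long highWatermark;   // hw
          long logEndOffset;    // leo

          ReplicaState(String id, long hw, long leo) {
              this.id = id;
              this.highWatermark = hw;
              this.logEndOffset = leo;
          }

          @Override
          public String toString() {
              return id + ": [hw=" + highWatermark + ", leo=" + logEndOffset + "]";
          }
      }

      class IsrExpansionModel {
          public static void main(String[] args) {
              ReplicaState r1 = new ReplicaState("r1", 10, 10);
              ReplicaState r2 = new ReplicaState("r2", 5, 10);
              ReplicaState r3 = new ReplicaState("r3", 0, 0);
              Set<String> isr = new LinkedHashSet<>(List.of("r1", "r2"));
              String leader = "r1";

              System.out.println("ISR: " + isr + ", leader: " + leader);
              for (ReplicaState r : List.of(r1, r2, r3)) {
                  System.out.println(r);
              }
          }
      }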

      Replica 1 then initiates shutdown (or fails) and leaves the ISR, which makes r2 the new leader. Note that r2's high watermark still lags the one r1 had.

      State 2
      ISR: (r2)
      Leader: r2
      r1 (offline): [hw=10, leo=10]
      r2: [hw=5, leo=10]
      r3: [hw=0, leo=0]

      Replica 3 then catches up to the high watermark on r2 and joins the ISR. Perhaps its high watermark is lagging behind r2's, but this is unimportant.

      State 3
      ISR: (r2, r3)
      Leader: r2
      r1 (offline): [hw=10, leo=10]
      r2: [hw=5, leo=10]
      r3: [hw=0, leo=5]
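
      For clarity, the following Java sketch (a fragment mirroring the logic described here, not the actual Partition/ReplicaManager code) shows the shape of the expansion check that permits State 3: the leader admits a follower once the follower's LEO reaches the leader's local high watermark, which after the failover is the lagging value 5 rather than the true committed offset 10.

      class IsrJoinCheckBuggy {
          // Hypothetical leader-side check, mirroring the scenario above (not actual Kafka code).
          // followerLeo   : the follower's log end offset, as reported in its fetch requests.
          // leaderLocalHw : the leader's *local* high watermark, which can lag the previous
          //                 leader's high watermark after a failover (5 here rather than 10).
          static boolean canJoinIsr(long followerLeo, long leaderLocalHw) {
              return followerLeo >= leaderLocalHw;
          }

          public static void main(String[] args) {
              // State 3: r3 has leo=5 and the new leader r2 has a lagging hw=5, so r3 is admitted
              // even though offsets 5 to 10 were committed under the previous leader r1.
              System.out.println("r3 admitted to ISR? " + canJoinIsr(5, 5));  // true
          }
      }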

      Now r2 fails, and r3 is elected leader as the sole member of the ISR. The committed data from offsets 5 to 10 has been lost.

      State 4
      ISR: (r3)
      Leader: r3
      r1 (offline): [hw=10, leo=10]
      r2 (offline): [hw=5, leo=10]
      r3: [hw=0, leo=5]
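
      To make the damage concrete, a tiny arithmetic check using the values from the states above:

      class CommittedDataLoss {
          public static void main(String[] args) {
              long trueCommittedOffset = 10;  // high watermark on the old leader r1
              long newLeaderLeo = 5;          // r3's log end offset when it is elected
              // Offsets 5..9 were committed under r1 but are missing from the new leader r3.
              System.out.println("Committed records lost: " + (trueCommittedOffset - newLeaderLeo));  // 5
          }
      }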

      The bug is that we allowed r3 into the ISR once it had caught up to the leader's (lagging) local high watermark. Since the follower does not know the true high watermark for the previous leader's epoch, the leader should not allow a replica to join the ISR until the replica has caught up to an offset within the leader's own epoch.
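
      A sketch of the stricter condition suggested above (again illustrative; the parameter names are made up and this is not the committed patch): besides reaching the local high watermark, the follower must also have caught up to an offset inside the current leader epoch, i.e. at least the epoch's start offset, which is an offset the new leader does know precisely.

      class IsrJoinCheckFixed {
          // Hypothetical fixed check: only count progress the current leader can vouch for.
          // currentEpochStartOffset: the leader's log end offset at the moment its epoch began
          // (10 for r2 in State 3), so r3 with leo=5 is no longer admitted to the ISR.
          static boolean canJoinIsr(long followerLeo, long leaderLocalHw, long currentEpochStartOffset) {
              return followerLeo >= leaderLocalHw
                      && followerLeo >= currentEpochStartOffset;
          }

          public static void main(String[] args) {
              // State 3: r2's epoch starts at offset 10 (its LEO at election), so r3 with leo=5
              // is kept out of the ISR until it has replicated the committed data up to offset 10.
              System.out.println("r3 admitted to ISR? " + canJoinIsr(5, 5, 10));  // false
          }
      }

      With a rule of this shape, r3 would not become leader-eligible in State 3 before replicating up to offset 10, so the failure of r2 could not silently discard committed records.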

      Note this is related to https://cwiki.apache.org/confluence/display/KAFKA/KIP-207%3A+Offsets+returned+by+ListOffsetsResponse+should+be+monotonically+increasing+even+during+a+partition+leader+change


            People

              Assignee: Anna Povzner (apovzner)
              Reporter: Jason Gustafson (hachikuji)
              Votes: 0
              Watchers: 6
