We have observed an unusual case in which a new replica became leader before it had received an initial high watermark from the previous leader. This resulted in an OffsetOutOfRangeException being raised while looking up the segment position of the uninitialized high watermark, since it was lower than the log start offset. The error was raised while handle the fetch request from one of the followers and prevented it from making progress.
Here is what we have observed from the logs. The initial state of the partition for the relevant sequence of events is the following:
Initial state: replicas=[4,1,2,3], leader=1, isr=[1,2,3], adding=, removing=, epoch=5, logStartOffset=20, logEndOffset=20
We see the following events:
t0: Replica 4 becomes follower and initializes log with hw=0, logStartOffset=0
t1: Replica 4 begins fetching from offset 0 and receives an out of range error
t2: After a ListOffset request to the leader, replica 4 initializes logStartOffset to 20.
t3: Replica 4 sends fetch request to the leader at start offset 20
t4: Upon receiving the fetch request, the leader adds 4 to the ISR (i.e. isr=[1,2,3,4])
t5: The controller notices the ISR addition and makes 4 the leader since 1 is to be removed and 4 is the new preferred leader
t6: Replica 4 stops fetchers and becomes leader
t7: We begin seeing the out of range errors as the other replicas begin fetching from 4.
We know from analysis of a heap dump from broker 4, that the high watermark was still set to 0 some time after it had become leader. We also know that broker 1 was under significant load. The time between events t4 and t6 was less than 10ms. We don't know when the fetch response sent at t3 returned to broker 4, but we speculate that it happened after t6 due to the heavy load on the leader, which is why broker 4 had an uninitialized high watermark.
A more mundane possibility is that there is a bug in the fetch session logic and the partition was simply not included in the fetch response. However, the code appears to anticipate this case. When a partition has an error, we set the cached high watermark to -1 to ensure that it gets updated as soon as the error clears.
Regardless how we got there, the fix should be straightforward. When a broker becomes leader, it should ensure its high watermark is at least as large as the log start offset.