[KAFKA-8733] Offline partitions occur when leader's disk is slow in reads while responding to follower fetch requests. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 1.1.2, 2.4.0, 2.8.0
Fix Version/s: None
Component/s: core
Labels: None

Description

We have seen partitions go offline multiple times on some of the hosts in our clusters. After going through the broker logs and the hosts' disk stats, it looks like this happens whenever read/write operations on the affected disk take longer than usual. In particular, when a read takes longer than replica.lag.time.max.ms, follower replicas fall out of sync: their earlier fetch requests are stuck reading the local log, so their fetch state is not updated in time, as shown in the `ReplicaManager` code below. If reads from the log stall for longer than replica.lag.time.max.ms, all follower replicas drop out of the ISR, and the partition goes offline if min.insync.replicas > 1 and unclean.leader.election.enable is false.

def readFromLog(): Seq[(TopicPartition, LogReadResult)] = {
  val result = readFromLocalLog( // this call took longer than `replica.lag.time.max.ms`
    replicaId = replicaId,
    fetchOnlyFromLeader = fetchOnlyFromLeader,
    readOnlyCommitted = fetchOnlyCommitted,
    fetchMaxBytes = fetchMaxBytes,
    hardMaxBytesLimit = hardMaxBytesLimit,
    readPartitionInfo = fetchInfos,
    quota = quota,
    isolationLevel = isolationLevel)
  // The fetch time is updated here, but by then maybeShrinkIsr has already run
  // and removed the replica from the ISR.
  if (isFromFollower) updateFollowerLogReadResults(replicaId, result)
  else result
}

val logReadResults = readFromLog()
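
The timing problem above can be reproduced in a small toy model. This is only a sketch under simplifying assumptions, not Kafka code: `SlowLeaderReadModel`, `FollowerState`, `laggingReplicas`, `stateAt` and `isrAtCheck` are hypothetical names, the clock is simulated, and the model assumes a follower's caught-up time is recorded only once the leader's local read has returned.

```scala
// Toy model of the race described above. Not Kafka code: every name here is
// hypothetical and time is simulated with plain millisecond values.
object SlowLeaderReadModel {
  final case class FollowerState(replicaId: Int, lastCaughtUpTimeMs: Long)

  // Simplified ISR shrink rule: a follower is lagging when it has not been
  // recorded as caught up within the last `maxLagMs` milliseconds.
  def laggingReplicas(followers: Seq[FollowerState], nowMs: Long, maxLagMs: Long): Set[Int] =
    followers.filter(f => nowMs - f.lastCaughtUpTimeMs > maxLagMs).map(_.replicaId).toSet

  // Assumption: a fetch arriving at `fetchArrivalMs` becomes visible in the
  // follower's state only after the leader's read finishes, `readDurationMs` later.
  def stateAt(f: FollowerState, fetchArrivalMs: Long, readDurationMs: Long, nowMs: Long): FollowerState =
    if (nowMs >= fetchArrivalMs + readDurationMs) f.copy(lastCaughtUpTimeMs = fetchArrivalMs)
    else f

  // ISR as seen by a shrink check that runs at `checkAtMs`.
  def isrAtCheck(leaderId: Int, followers: Seq[FollowerState], fetchArrivalMs: Long,
                 readDurationMs: Long, checkAtMs: Long, maxLagMs: Long): Set[Int] = {
    val visible = followers.map(stateAt(_, fetchArrivalMs, readDurationMs, checkAtMs))
    val lagging = laggingReplicas(visible, checkAtMs, maxLagMs)
    (leaderId +: followers.map(_.replicaId)).toSet -- lagging
  }

  def main(args: Array[String]): Unit = {
    // Both followers were last caught up at t = 0 and fetch again at t = 500 ms.
    val followers = Seq(FollowerState(2, 0L), FollowerState(3, 0L))
    val fast = isrAtCheck(1, followers, 500L, 5L, 10400L, 10000L)     // 5 ms read
    val slow = isrAtCheck(1, followers, 500L, 12000L, 10400L, 10000L) // 12 s read
    println(s"fast read ISR: ${fast.toSeq.sorted.mkString(",")}")
    println(s"slow read ISR: ${slow.toSeq.sorted.mkString(",")}")
  }
}
```

With a 10 s lag limit, the run prints `fast read ISR: 1,2,3` and `slow read ISR: 1`. In the slow case the followers did send their fetches on time, but the leader was still blocked in the read when the check ran, so only the leader remains in the ISR and the ISR size falls below a min.insync.replicas of 2.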

The attached graphs show the disks' weighted I/O time stats when this issue occurred.

I will raise KIP-501 describing options on how to handle this scenario.

Attachments

weighted-io-time-2.png, 30/Jul/19 17:47, 36 kB, Satish Duggana
wio-time.png, 30/Jul/19 17:47, 37 kB, Satish Duggana

Issue Links

links to

GitHub Pull Request #7802

People

Assignee: Satish Duggana

Reporter: Satish Duggana

Votes: 4

Watchers: 16

Dates

Created: 30/Jul/19 17:03

Updated: 22/Jun/21 11:03