[KAFKA-10371] Partition reassignments can result in crashed ReplicaFetcherThreads. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: core
Labels: None

Description

A Kafka cluster performing partition reassignments got stuck with the reassignment only partially complete, a non-zero number of under-replicated partitions (URPs), and a steadily increasing max lag.

Looking in the logs, we see:

[ERROR] 2020-07-31 21:22:23,984 [ReplicaFetcherThread-0-3] kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, fetcherId=0] Error due to
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for foo
[INFO] 2020-07-31 21:22:23,986 [ReplicaFetcherThread-0-3] kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, fetcherId=0] Stopped

Investigating further, and after some helpful changes to the exception (which was not generating a stack trace because it is a client-side exception), we see the following on a test run:

[2020-08-06 19:58:21,592] ERROR [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for topic-test-topic-85
        at org.apache.kafka.common.protocol.Errors.exception(Errors.java:415)
        at kafka.server.ReplicaManager.getPartitionOrException(ReplicaManager.scala:645)
        at kafka.server.ReplicaManager.localLogOrException(ReplicaManager.scala:672)
        at kafka.server.ReplicaFetcherThread.logStartOffset(ReplicaFetcherThread.scala:133)
        at kafka.server.ReplicaFetcherThread.$anonfun$buildFetch$1(ReplicaFetcherThread.scala:316)
        at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
        at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
        at kafka.server.ReplicaFetcherThread.buildFetch(ReplicaFetcherThread.scala:309)

It appears that the fetcher is attempting to fetch a partition that is being reassigned away from this broker. From further investigation, it seems that KAFKA-10002 changed the order of operations in the StopReplica code from:
1. Remove partition from fetcher
2. Remove partition from partition map
to the reverse order. As a result, the fetcher can now race with StopReplica and attempt to build a fetch for a partition that is no longer in the partition map. In particular, localLogOrException is called from logStartOffset, which is protected only against KafkaStorageException and not against NotLeaderOrFollowerException, so the exception is not caught and propagates all the way out, killing the replica fetcher thread.
We need to switch the order back.
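
To make the ordering problem concrete, here is a minimal single-threaded sketch of the interleaving. It is not the broker code: the class `StopReplicaOrderSketch`, its fields and methods, and the local `NotLeaderOrFollowerException` stand-in are all hypothetical names chosen for illustration, and the "fetcher runs between the two StopReplica steps" interleaving is simulated explicitly rather than with real threads.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the race described above; none of these names are Kafka's real classes.
class StopReplicaOrderSketch {

    // Local stand-in for org.apache.kafka.common.errors.NotLeaderOrFollowerException.
    static class NotLeaderOrFollowerException extends RuntimeException {
        NotLeaderOrFollowerException(String message) {
            super(message);
        }
    }

    // Partition -> log start offset (stands in for the broker's partition map).
    private final Map<String, Long> partitionMap = new HashMap<>();
    // Partitions the fetcher thread currently believes it should fetch.
    private final Set<String> fetcherPartitions = new LinkedHashSet<>();

    void addPartition(String partition, long logStartOffset) {
        partitionMap.put(partition, logStartOffset);
        fetcherPartitions.add(partition);
    }

    void removeFromFetcher(String partition) {
        fetcherPartitions.remove(partition);
    }

    void removeFromPartitionMap(String partition) {
        partitionMap.remove(partition);
    }

    // Looks up the log start offset of every fetched partition; an unmapped partition throws.
    List<String> buildFetch() {
        List<String> requests = new ArrayList<>();
        for (String partition : fetcherPartitions) {
            Long logStartOffset = partitionMap.get(partition);
            if (logStartOffset == null) {
                throw new NotLeaderOrFollowerException(
                        "Error while fetching partition state for " + partition);
            }
            requests.add(partition + "@" + logStartOffset);
        }
        return requests;
    }

    // Runs only the first StopReplica step, then lets the fetcher build a fetch.
    static boolean fetcherSurvives(boolean removeFromFetcherFirst) {
        StopReplicaOrderSketch broker = new StopReplicaOrderSketch();
        String partition = "topic-test-topic-85";
        broker.addPartition(partition, 0L);
        if (removeFromFetcherFirst) {
            broker.removeFromFetcher(partition);
        } else {
            broker.removeFromPartitionMap(partition);
        }
        try {
            broker.buildFetch();
            return true;
        } catch (NotLeaderOrFollowerException e) {
            // In the real fetcher this exception is uncaught and stops the thread.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("fetcher first: " + fetcherSurvives(true));
        System.out.println("partition map first: " + fetcherSurvives(false));
    }
}
```

In this model, removing the partition from the fetcher first means `buildFetch` never sees it, while removing it from the partition map first leaves a window in which `buildFetch` throws.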

Issue Links

links to

GitHub Pull Request #9140

People

Assignee: David Jacot

Reporter: Steve Rodrigues

Votes: 0

Watchers: 2

Dates

Created: 07/Aug/20 02:10

Updated: 07/Aug/20 22:30

Resolved: 07/Aug/20 22:30