[KAFKA-10301] Partition#remoteReplicasMap can be empty in certain race conditions - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

In `Partition#updateAssignmentAndIsr`, we previously updated `Partition#remoteReplicasMap` by adding the new replicas to the map and then removing the old ones ([source](https://github.com/apache/kafka/blob/7f9187fe399f3f6b041ca302bede2b3e780491e7/core/src/main/scala/kafka/cluster/Partition.scala#L657)).

During a recent refactoring, we changed it to first clear the map and then add all the replicas to it ([source](https://github.com/apache/kafka/blob/2.6/core/src/main/scala/kafka/cluster/Partition.scala#L663)). Between the clear and the re-add, a concurrent reader can observe an empty map.

While the update runs under a write lock (`inWriteLock(leaderIsrUpdateLock)`), not all callers that read the map take a lock. Some examples:

- `Partition#updateFollowerFetchState`
- `DelayedDeleteRecords#tryComplete`
- `Partition#getReplicaOrException`: called in `checkEnoughReplicasReachOffset` without a lock, which itself is called by `DelayedProduce`. I think this can fail a `ReplicaManager#appendRecords` call.
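
To make the difference between the two update orders concrete, here is a minimal Java sketch. It is not the Kafka code: the class `ReplicaMapUpdateSketch`, its methods, the `String` replica values, and the use of a plain `ConcurrentHashMap` as a stand-in for `remoteReplicasMap` are all assumptions made for illustration. A `Runnable` probe is invoked in the middle of each update to simulate a reader that does not take the lock and happens to run at the worst moment.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReplicaMapUpdateSketch {
    // Hypothetical stand-in for Partition#remoteReplicasMap: broker id -> replica.
    static final Map<Integer, String> replicas = new ConcurrentHashMap<>();

    // Previous behavior: add the new replicas first, then remove the old ones.
    static void addThenRemove(List<Integer> assignment, Runnable midUpdateProbe) {
        for (int id : assignment) {
            replicas.putIfAbsent(id, "replica-" + id);
        }
        midUpdateProbe.run(); // simulated lock-free reader, mid-update
        replicas.keySet().removeIf(id -> !assignment.contains(id));
    }

    // Refactored behavior: clear the map, then add all replicas back.
    static void clearThenAdd(List<Integer> assignment, Runnable midUpdateProbe) {
        replicas.clear();
        midUpdateProbe.run(); // simulated lock-free reader, mid-update
        for (int id : assignment) {
            replicas.put(id, "replica-" + id);
        }
    }

    // Starts from replicas {1, 2}, reassigns to {2, 3}, and reports whether
    // the given replica was visible to the mid-update reader.
    static boolean visibleMidUpdate(boolean clearFirst, int probeId) {
        replicas.clear();
        replicas.put(1, "replica-1");
        replicas.put(2, "replica-2");
        boolean[] seen = new boolean[1];
        Runnable probe = () -> seen[0] = replicas.containsKey(probeId);
        List<Integer> newAssignment = List.of(2, 3);
        if (clearFirst) {
            clearThenAdd(newAssignment, probe);
        } else {
            addThenRemove(newAssignment, probe);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        // Replica 2 is in both the old and the new assignment.
        System.out.println("add-then-remove, replica 2 visible: " + visibleMidUpdate(false, 2));
        System.out.println("clear-then-add, replica 2 visible: " + visibleMidUpdate(true, 2));
    }
}
```

Replica 2 belongs to both the old and the new assignment, so a reader should always find it. With add-then-remove the mid-update reader does; with clear-then-add it does not, which is the window that an unlocked lookup such as `getReplicaOrException` could hit.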

While we want to polish the code so that this sort of race condition becomes harder (or impossible) to introduce, it sounds safest to revert to the previous behavior given the timelines for the 2.6 release. Jira https://issues.apache.org/jira/browse/KAFKA-10302 tracks further modifications to the code.

Issue Links

links to

GitHub Pull Request #9065

People

Assignee: Stanislav Kozlovski

Reporter: Stanislav Kozlovski

Votes: 0

Watchers: 2

Dates

Created: 23/Jul/20 15:04

Updated: 27/Jul/20 18:04

Resolved: 27/Jul/20 07:09