[KAFKA-10002] Improve performances of StopReplicaRequest with large number of partitions to be deleted - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.7.0
Component/s: None
Labels: None

Description

I have noticed that StopReplicaRequests with partitions to be deleted are extremely slow when there are more than 2000 partitions, which leads to hitting the request timeout in the controller. A request with 2000 partitions to be deleted still works, but performance degrades significantly as the number increases. For example, a request with 3000 partitions to be deleted takes approx. 60 seconds to be processed.

A CPU profile shows that most of the time, almost 90%, is spent checkpointing log start offsets and recovery offsets. See attached. When a partition is deleted, the replica manager calls `ReplicaManager#asyncDelete`, which checkpoints recovery offsets and log start offsets. As the checkpoints are per data directory, the checkpointing is done for all the partitions in the directory of the partition to be deleted. In our case, where we have only one data directory, if you delete 1000 partitions, we end up checkpointing the same things 1000 times, which is not efficient.
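
To make the cost concrete, here is a minimal sketch of the pattern described above. It is a toy model, not Kafka code: the class `CheckpointSketch` and all of its methods and fields are hypothetical names, and a "checkpoint" is reduced to counting how many partition entries would be rewritten. It compares checkpointing after every deletion with one possible alternative, removing the whole batch first and checkpointing the directory once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of per-directory checkpointing; not Kafka's actual classes. */
final class CheckpointSketch {
    // Partitions currently hosted in each data directory.
    private final Map<String, Set<String>> partitionsByDir = new HashMap<>();
    // Number of times a checkpoint file was rewritten.
    long checkpointCalls = 0;
    // Total partition entries written across all checkpoint rewrites.
    long entriesWritten = 0;

    private Set<String> hosted(String dir) {
        return partitionsByDir.computeIfAbsent(dir, d -> new HashSet<>());
    }

    // Rewriting a directory's checkpoint costs one entry per partition still hosted in it.
    private void checkpoint(String dir) {
        checkpointCalls++;
        entriesWritten += hosted(dir).size();
    }

    // Pattern described in the issue: one checkpoint per deleted partition.
    void deleteOneByOne(String dir, List<String> toDelete) {
        for (String partition : toDelete) {
            hosted(dir).remove(partition);
            checkpoint(dir);
        }
    }

    // Assumed alternative: remove the whole batch, then checkpoint the directory once.
    void deleteBatched(String dir, List<String> toDelete) {
        hosted(dir).removeAll(toDelete);
        checkpoint(dir);
    }

    // Generates the partition names "p-0" .. "p-(n-1)".
    static List<String> names(int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            result.add("p-" + i);
        }
        return result;
    }

    // Builds a model with `count` partitions in a single data directory.
    static CheckpointSketch withPartitions(String dir, int count) {
        CheckpointSketch sketch = new CheckpointSketch();
        sketch.hosted(dir).addAll(names(count));
        return sketch;
    }

    public static void main(String[] args) {
        CheckpointSketch oneByOne = withPartitions("/data", 3000);
        oneByOne.deleteOneByOne("/data", names(1000));
        // prints "one-by-one: 1000 checkpoints, 2499500 entries"
        System.out.println("one-by-one: " + oneByOne.checkpointCalls
                + " checkpoints, " + oneByOne.entriesWritten + " entries");

        CheckpointSketch batched = withPartitions("/data", 3000);
        batched.deleteBatched("/data", names(1000));
        // prints "batched: 1 checkpoints, 2000 entries"
        System.out.println("batched: " + batched.checkpointCalls
                + " checkpoints, " + batched.entriesWritten + " entries");
    }
}
```

In this model, deleting 1000 of 3000 partitions one by one rewrites the checkpoint 1000 times, whereas the batched variant rewrites it once. The real cost per checkpoint in the broker also includes file I/O, which the sketch does not model.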

Attachments

Issue Links

links to

GitHub Pull Request #8672

People

Assignee: David Jacot

Reporter: David Jacot

Votes: 0

Watchers: 2

Dates

Created: 15/May/20 07:12

Updated: 14/Jul/20 00:50

Resolved: 14/Jul/20 00:50