[KAFKA-17061] KafkaController takes long time to connect to newly added broker after registration on large cluster - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.9.0
Component/s: None
Labels: None

Description

Environment

Kafka version: 3.3.2
Cluster: ~200 brokers
Total number of partitions: 40k
ZK-based cluster

Phenomenon

When a broker left the cluster due to a long stop-the-world (STW) pause and rejoined after a while, the controller took about 6 seconds to connect to the broker after its znode registration, which caused a significant message delivery delay.

[2024-06-22 23:59:38,202] INFO [Controller id=1] Newly added brokers: 2, deleted brokers: , bounced brokers: , all live brokers: 1,... (kafka.controller.KafkaController)
[2024-06-22 23:59:38,203] DEBUG [Channel manager on controller 1]: Controller 1 trying to connect to broker 2 (kafka.controller.ControllerChannelManager)
[2024-06-22 23:59:38,205] INFO [RequestSendThread controllerId=1] Starting (kafka.controller.RequestSendThread)
[2024-06-22 23:59:38,205] INFO [Controller id=1] New broker startup callback for 2 (kafka.controller.KafkaController)
[2024-06-22 23:59:44,524] INFO [RequestSendThread controllerId=1] Controller 1 connected to broker-2:9092 (id: 2 rack: rack-2) for sending state change requests (kafka.controller.RequestSendThread)
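
The connection delay can be read directly off the timestamps in the log excerpt above. The following is a minimal sketch of that arithmetic; the helper name `parse_ts` is hypothetical and introduced only for illustration.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a Kafka log timestamp such as '2024-06-22 23:59:38,203'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


# Timestamps copied from the log excerpt above.
attempt = parse_ts("2024-06-22 23:59:38,203")    # "trying to connect to broker 2"
connected = parse_ts("2024-06-22 23:59:44,524")  # "connected to broker-2:9092"

# Time between the connection attempt and the established connection.
delay_seconds = (connected - attempt).total_seconds()
print(f"{delay_seconds:.3f} s")
```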

Analysis

The flame graph captured at that time shows that `liveBrokerIds`, called from `isReplicaOnline`, accounts for a significant share of the time spent in the `addUpdateMetadataRequestForBrokers` invocation on broker startup.
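
The cost pattern described above can be sketched with a toy model. This is not Kafka's code: the class `ControllerContextModel`, its method names, the replication factor of 3, and the assumption that the live-broker set is rebuilt on every call are all illustrative. The broker and partition counts are taken from the Environment section.

```python
import time


class ControllerContextModel:
    """Toy stand-in for controller state; NOT Kafka's actual class."""

    def __init__(self, live_broker_epochs, shutting_down_broker_ids):
        self.live_broker_epochs = dict(live_broker_epochs)
        self.shutting_down_broker_ids = set(shutting_down_broker_ids)

    def live_broker_ids(self):
        # Assumption: the set is derived afresh on every call,
        # which costs O(number of brokers) each time.
        return set(self.live_broker_epochs) - self.shutting_down_broker_ids

    def is_replica_online_via_set(self, broker_id):
        # Rebuilds the whole live-broker set just to test one id.
        return broker_id in self.live_broker_ids()

    def is_replica_online_direct(self, broker_id):
        # Same answer with two O(1) membership checks and no allocation.
        return (broker_id in self.live_broker_epochs
                and broker_id not in self.shutting_down_broker_ids)


NUM_BROKERS = 200         # from the Environment section
NUM_PARTITIONS = 40_000   # from the Environment section
REPLICATION_FACTOR = 3    # assumption, not stated in the issue

ctx = ControllerContextModel(
    live_broker_epochs={b: 1 for b in range(NUM_BROKERS)},
    shutting_down_broker_ids={5},  # hypothetical broker being shut down
)

# One online check per replica of every partition.
replicas = [(p + i) % NUM_BROKERS
            for p in range(NUM_PARTITIONS)
            for i in range(REPLICATION_FACTOR)]

start = time.perf_counter()
via_set = [ctx.is_replica_online_via_set(b) for b in replicas]
t_via_set = time.perf_counter() - start

start = time.perf_counter()
direct = [ctx.is_replica_online_direct(b) for b in replicas]
t_direct = time.perf_counter() - start

print(f"via rebuilt set: {t_via_set:.3f} s, direct checks: {t_direct:.3f} s")
```

Both variants return identical results; only the per-call work differs. The absolute timings depend on the machine and say nothing about the JVM, but the ratio shows why a per-replica set rebuild becomes noticeable when there are many partitions.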

Attachments

- flame.html (138 kB, 11/Jul/24 14:50, Haruki Okada)
- flame-patched.html (156 kB, 11/Jul/24 14:50, Haruki Okada)
- image-2024-07-02-17-22-06-100.png (1.51 MB, 02/Jul/24 08:22, Haruki Okada)
- image-2024-07-02-17-24-11-861.png (1.50 MB, 02/Jul/24 08:24, Haruki Okada)
- screenshot-flame.png (1.49 MB, 11/Jul/24 14:51, Haruki Okada)
- screenshot-flame-patched.png (931 kB, 11/Jul/24 14:53, Haruki Okada)

Issue Links

links to

GitHub Pull Request #16529

People

Assignee: Haruki Okada

Reporter: Haruki Okada

Votes: 0

Watchers: 2

Dates

Created: 02/Jul/24 08:28

Updated: 12/Jul/24 04:53

Resolved: 12/Jul/24 04:53