Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Environment
- Kafka version: 3.3.2
- Cluster: ~200 brokers
- Total number of partitions: 40k
- ZK-based cluster
Phenomenon
When a broker left the cluster because of a long stop-the-world (STW) pause and rejoined shortly afterwards, the controller took about 6 seconds to connect to the broker after its znode was registered. This caused a significant message delivery delay.
[2024-06-22 23:59:38,202] INFO [Controller id=1] Newly added brokers: 2, deleted brokers: , bounced brokers: , all live brokers: 1,... (kafka.controller.KafkaController)
[2024-06-22 23:59:38,203] DEBUG [Channel manager on controller 1]: Controller 1 trying to connect to broker 2 (kafka.controller.ControllerChannelManager)
[2024-06-22 23:59:38,205] INFO [RequestSendThread controllerId=1] Starting (kafka.controller.RequestSendThread)
[2024-06-22 23:59:38,205] INFO [Controller id=1] New broker startup callback for 2 (kafka.controller.KafkaController)
[2024-06-22 23:59:44,524] INFO [RequestSendThread controllerId=1] Controller 1 connected to broker-2:9092 (id: 2 rack: rack-2) for sending state change requests (kafka.controller.RequestSendThread)
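As a rough check of the reported delay, the gap between the "trying to connect" entry and the "connected" entry can be computed from the log timestamps. The following is a minimal sketch; the helper names `parse_log_ts` and `gap_seconds` are hypothetical and not part of Kafka.

```python
from datetime import datetime

# Timestamp layout used by the controller log lines above,
# e.g. "2024-06-22 23:59:38,203" (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_ts(ts: str) -> datetime:
    """Parse a log timestamp string into a datetime (hypothetical helper)."""
    return datetime.strptime(ts, LOG_TS_FORMAT)


def gap_seconds(start: str, end: str) -> float:
    """Return the elapsed time in seconds between two log timestamps."""
    return (parse_log_ts(end) - parse_log_ts(start)).total_seconds()


if __name__ == "__main__":
    # "trying to connect to broker 2" -> "connected to broker-2:9092"
    print(gap_seconds("2024-06-22 23:59:38,203", "2024-06-22 23:59:44,524"))
```

For the two entries used here the gap is 6.321 seconds, which matches the roughly 6-second delay described in the report.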
Analysis
The flame graph captured at that time shows that `liveBrokerIds`, called from `isReplicaOnline`, accounts for a significant share of the time spent in the `addUpdateMetadataRequestForBrokers` invocation on broker startup.
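To illustrate why such a call pattern can become expensive, here is a minimal Python sketch. It is not the controller's actual code. It assumes that the live-broker set is rebuilt as a set difference on every call, and the class, field, and method names below are hypothetical stand-ins. Under that assumption, each online check costs time proportional to the number of brokers, and the check runs once per replica across tens of thousands of partitions.

```python
class ControllerContextSketch:
    """Hypothetical stand-in for the controller's broker bookkeeping."""

    def __init__(self, live_broker_epochs, shutting_down_broker_ids):
        self.live_broker_epochs = dict(live_broker_epochs)  # broker id -> epoch
        self.shutting_down_broker_ids = set(shutting_down_broker_ids)
        self.derivations = 0  # counts how often the full set is rebuilt

    def live_broker_ids(self):
        # Assumption: the set is derived on every call, O(number of brokers).
        self.derivations += 1
        return set(self.live_broker_epochs) - self.shutting_down_broker_ids

    def is_replica_online_derived(self, broker_id):
        # Rebuilds the whole live-broker set just to test one membership.
        return broker_id in self.live_broker_ids()

    def is_replica_online_direct(self, broker_id):
        # Same answer using two O(1) lookups and no temporary set.
        return (broker_id in self.live_broker_epochs
                and broker_id not in self.shutting_down_broker_ids)


if __name__ == "__main__":
    brokers = {broker_id: 1 for broker_id in range(200)}
    ctx = ControllerContextSketch(brokers, shutting_down_broker_ids={7})
    # One check per replica; 40k partitions x 3 replicas is an assumed factor.
    for i in range(40_000 * 3):
        ctx.is_replica_online_derived(i % 200)
    print("full set rebuilt", ctx.derivations, "times")
```

Both methods return the same result for every broker id. The direct variant avoids allocating a temporary set per call, which is one way the per-replica cost could be reduced if the assumption about how the set is derived holds.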