Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.11.0.1
Fix Version/s: None
Component/s: None
Description
- environment
Apache Kafka 0.11.0.1
5 nodes
replication factor: 3
average message rate: 4k messages/sec
monitoring: Prometheus, JMX, and Grafana
consumers: Spring Boot and Doris Routine Load
producers: Spring Boot and Log
- problem encountered
Broker 4, which is also the controller (epoch 90), suddenly accumulated a large number of connections in the CLOSE_WAIT state.
controller.log
Controller 4 epoch 90 fails to send request (type: UpdateMetadataRequest ... java.io.IOException: Connection to 4 was disconnected before the response was read
The request is retried many times, but the same WARN keeps being logged.
At the same time, broker 6, which fetches messages from broker 4, hit the same error:
[2021-04-13 16:35:06,942] WARN [ReplicaFetcherThread-0-4]: Error in fetch to broker 4, request (type=FetchRequest, replicaId=6, maxWait=500, minBytes=1, maxBytes=10485760, java.io.IOException: Connection to 4 was disconnected before the response was read
Doris Routine Load (which consumes from Kafka) timed out:
2021-04-13 16:35:11,397 WARN (Routine load scheduler|42) [KafkaUtil.getAllKafkaPartitions():91] failed to get partitions. org.apache.doris.common.UserException: errCode = 2, detailMessage = failed to get kafka partition info: [failed to get partition meta: Local: Timed out]
broker 4 (controller, epoch 90) fs.file
Most of the CLOSE_WAIT connections come from the consumer applications.
At 16:49 the broker was restarted and returned to normal.
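To back up the observation that most CLOSE_WAIT connections belong to the consumers, the sockets on the broker can be grouped by remote host. The following is only a rough sketch: it assumes the text comes from `netstat -tan` on Linux (six whitespace-separated columns, state last), and the function name `close_wait_by_peer` and the sample addresses are made up for illustration.

```python
from collections import Counter

# Hypothetical sample in `netstat -tan` layout; all addresses are invented.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.4:9092           10.0.0.21:51234         CLOSE_WAIT
tcp        1      0 10.0.0.4:9092           10.0.0.21:51240         CLOSE_WAIT
tcp        0      0 10.0.0.4:9092           10.0.0.22:40112         ESTABLISHED
tcp        1      0 10.0.0.4:9092           10.0.0.23:38800         CLOSE_WAIT
"""


def close_wait_by_peer(netstat_text):
    """Count CLOSE_WAIT sockets per remote host in `netstat -tan` output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, foreign, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skips the header and any non-TCP lines
        if fields[5] == "CLOSE_WAIT":
            # Drop the port so all sockets from one host are grouped together
            peer_host = fields[4].rsplit(":", 1)[0]
            counts[peer_host] += 1
    return counts


if __name__ == "__main__":
    for host, n in close_wait_by_peer(SAMPLE).most_common():
        print(host, n)
```

Replacing `SAMPLE` with the output captured on broker 4 while the problem is occurring would show which client hosts hold the stuck connections.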
- speculation
The TCP connections were closed by the remote side (a passive close from the broker's point of view), but the process on the controller broker did not respond, so the sockets stayed in CLOSE_WAIT.
Is this a known bug in this version?
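
For readers unfamiliar with the state, the mechanism behind the speculation above can be reproduced on loopback. This is a generic illustration of TCP behaviour, not a reproduction of the broker problem; the function name `demo_passive_close` is made up. The peer closes first, and the accepting side sits in CLOSE_WAIT until its own process calls `close()`.

```python
import socket


def demo_passive_close():
    """Return (data, still_open) after the peer closes but we have not yet."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # ephemeral port on loopback
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    accepted, _ = listener.accept()

    client.close()  # active close: the client side sends FIN
    accepted.settimeout(5)
    data = accepted.recv(1024)  # b"" signals that the peer has closed

    # Until accepted.close() runs, the kernel keeps this socket in CLOSE_WAIT.
    # A process that never reaches its close() call accumulates such sockets.
    still_open = accepted.fileno() != -1

    accepted.close()  # leaving CLOSE_WAIT requires the local close
    listener.close()
    return data, still_open


if __name__ == "__main__":
    print(demo_passive_close())
```

If the broker's network threads were blocked or otherwise unable to process the disconnects, the same pattern would appear at scale, which would be consistent with the restart clearing it.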