I previously described a mysterious problem (https://issues.apache.org/jira/browse/KAFKA-9211). In that issue I found that the Kafka server triggers TCP congestion control under some conditions. We have finally found the root cause.
When a Kafka server restarts for any reason and a preferred replica leader election is then executed, many partition leaderships are handed back to it, which triggers a cluster metadata update. All clients then establish connections to this server. At that moment many TCP connection requests are waiting in the TCP SYN queue, and then move to the accept queue.
Kafka creates its ServerSocket in SocketServer.scala.
The bind call used there accepts an optional second parameter, "backlog"; min(backlog, tcp_max_syn_backlog) decides the queue length. Because Kafka does not set it, the Java default value of 50 is used.
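To make the backlog parameter concrete, here is a minimal Java sketch of binding a server channel with an explicit backlog. It is not Kafka's actual code: the class name BacklogBindSketch, the method open, and the constant BACKLOG are made up for illustration, and 512 is simply the value mentioned in this post. The kernel may still cap whatever value is requested.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

// Hypothetical sketch, not Kafka's SocketServer code.
class BacklogBindSketch {
    // Assumed value for illustration; the JDK default is 50 when none is given.
    static final int BACKLOG = 512;

    // Opens a non-blocking server channel and binds it with an explicit backlog.
    static ServerSocketChannel open(int port, int backlog) throws IOException {
        ServerSocketChannel channel = ServerSocketChannel.open();
        channel.configureBlocking(false);
        // Two-argument bind: the second argument is the requested backlog.
        // The operating system may silently reduce it to its own limits.
        channel.socket().bind(new InetSocketAddress(port), backlog);
        return channel;
    }

    public static void main(String[] args) throws IOException {
        // Port 0 asks the OS for any free port, so the sketch runs anywhere.
        try (ServerSocketChannel ch = open(0, BACKLOG)) {
            System.out.println("listening on " + ch.getLocalAddress());
        }
    }
}
```

The one-argument bind(address) is the form that falls back to the default of 50.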
If this queue is full and tcp_syncookies = 0, new connection requests are rejected. If tcp_syncookies = 1, the TCP SYN cookie mechanism is triggered. This mechanism lets Linux handle more TCP SYN requests, but it loses many TCP options, including "wscale" (window scaling), the option that lets a TCP connection advertise a window larger than 64 KB and so keep more data in flight. Because SYN cookies were triggered, wscale is lost, and the connection stays very slow for its whole lifetime, until it is closed and a new TCP connection is established.
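To see which of these settings are in effect on a broker host, the relevant Linux sysctls can be read from /proc/sys. The following is a rough sketch, assuming a Linux host; the class name ListenQueueSettings and its methods are made up for illustration. Besides the two settings named above, it also prints net.core.somaxconn (which caps the listen backlog as well) and net.ipv4.tcp_window_scaling.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper for inspecting kernel settings; Linux only.
class ListenQueueSettings {
    // Maps a dotted sysctl name to its file under /proc/sys.
    static Path sysctlPath(String name) {
        return Paths.get("/proc/sys", name.split("\\."));
    }

    // Returns the current value, or empty if the file cannot be read
    // (for example on a non-Linux system).
    static Optional<String> readSysctl(String name) {
        try {
            byte[] raw = Files.readAllBytes(sysctlPath(name));
            return Optional.of(new String(raw, StandardCharsets.US_ASCII).trim());
        } catch (IOException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "net.ipv4.tcp_syncookies",
            "net.ipv4.tcp_max_syn_backlog",
            "net.core.somaxconn",
            "net.ipv4.tcp_window_scaling"
        };
        for (String name : names) {
            System.out.println(name + " = " + readSysctl(name).orElse("(unavailable)"));
        }
    }
}
```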
So after a preferred replica election, many new TCP connections are established without wscale, and much of the network traffic to this server is very slow.
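A rough calculation shows why the missing wscale hurts so much. Without window scaling, the TCP window field is 16 bits, so at most 65,535 bytes can be unacknowledged at once, and throughput is bounded by roughly window / round-trip time. The sketch below only does that arithmetic; the class name WindowBound and the round-trip times are assumptions chosen for illustration, not measurements from our cluster.

```java
// Hypothetical back-of-the-envelope calculation, not a measurement.
class WindowBound {
    // Largest window that fits in the 16-bit TCP window field without scaling.
    static final int MAX_UNSCALED_WINDOW_BYTES = 65_535;

    // Upper bound on throughput: at most one full window per round trip.
    static double maxBytesPerSecond(long windowBytes, double rttSeconds) {
        return windowBytes / rttSeconds;
    }

    public static void main(String[] args) {
        // Assumed round-trip times: 0.5 ms, 1 ms, and 10 ms.
        double[] rttsSeconds = {0.0005, 0.001, 0.010};
        for (double rtt : rttsSeconds) {
            double mbPerSecond = maxBytesPerSecond(MAX_UNSCALED_WINDOW_BYTES, rtt) / 1_000_000.0;
            System.out.printf("RTT %.1f ms -> at most %.1f MB/s per connection%n",
                    rtt * 1000.0, mbPerSecond);
        }
    }
}
```

For example, at a 1 ms round-trip time the bound is about 65.5 MB/s per connection, and at 10 ms it drops to about 6.5 MB/s, regardless of link capacity.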
I'm not sure whether newer Linux versions have resolved this problem, but Kafka should also set the backlog to a larger value. We have now changed it to 512, and everything seems OK.