Details
- Improvement
- Status: Closed
- Major
- Resolution: Duplicate
- 1.14.3
- None
- None
Description
In our production environment, we encountered an abnormal case:
The upstream task is backpressured but all of its downstream tasks are idle, and the job stays in this state until the checkpoint times out (we use aligned checkpoints). After analysing the case, we found that it is caused by a half-open socket (see https://en.wikipedia.org/wiki/TCP_half-open ), i.e. a connection that is already closed on the server side but still established on the client side. It arises as follows:
1. The NettyServer encounters a ReadTimeoutException when reading data from the channel. It then releases the NetworkSequenceViewReader (which is responsible for sending data to the PartitionRequestClient) and writes an ErrorResponse to the PartitionRequestClient. After the ErrorResponse has been written successfully, the server closes the channel (the socket moves to the FIN_WAIT1 state).
2. The PartitionRequestClient receives neither the ErrorResponse nor the server's FIN, so the client keeps the socket in the ESTABLISHED state and keeps waiting for a BufferResponse from the server (possibly a kernel bug on our machines caused the ErrorResponse and FIN to be lost).
3. The server machine releases the socket after it has stayed in FIN_WAIT1 for too long, but the socket on the client machine is still in the ESTABLISHED state, which results in a half-open socket.
To avoid this case, I think there are two methods:
1. Enable TCP keepalive on the client (Flink already enables it). This also requires adjusting the machine's TCP keepalive time (the default is 7200 seconds, which is too long).
2. Use Netty's IdleStateHandler on the client to detect whether the channel is idle (for reads or writes). If the channel is idle, the client writes a ping message to the server to check whether the channel is really OK.
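To give a sense of scale for method 1: with keepalive, a silent peer is only declared dead after the idle time plus one probe interval per unanswered probe. The sketch below assumes the usual Linux defaults (tcp_keepalive_time = 7200 s, tcp_keepalive_intvl = 75 s, tcp_keepalive_probes = 9); the class and method names are hypothetical and only illustrate the arithmetic.

```java
import java.time.Duration;

public class KeepAliveEstimate {

    /**
     * Worst-case time for TCP keepalive to give up on a silent peer:
     * idle time before the first probe, plus one interval per unanswered probe.
     */
    public static Duration worstCaseDetection(Duration idle, Duration interval, int probes) {
        return idle.plus(interval.multipliedBy(probes));
    }

    public static void main(String[] args) {
        // Assumed Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
        Duration d = worstCaseDetection(Duration.ofSeconds(7200), Duration.ofSeconds(75), 9);
        System.out.println(d.getSeconds() + " s"); // prints "7875 s"
    }
}
```

Under these assumed defaults the client would need more than two hours to notice the half-open socket, which is why the keepalive time would have to be lowered for method 1 to help.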
Of the two methods, I recommend method 2, because adjusting the machine's TCP keepalive time would affect other services running on the same machine.
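The decision logic behind method 2 can be sketched roughly as follows. This is not Flink or Netty code: in a real client, Netty's IdleStateHandler would raise the idle event and the handler would write the ping. The class below only models, in plain JDK, when to ping and when to give up on the channel. The class name, the Action values and the thresholds are all hypothetical.

```java
import java.time.Duration;

/** Models the client-side choice: do nothing, send a ping, or close the channel. */
public class IdleProbe {

    public enum Action { NONE, SEND_PING, CLOSE }

    private final long idleNanos;        // read-idle time after which a ping is sent
    private final long pingTimeoutNanos; // how long to wait for any data after the ping
    private long lastReadNanos;
    private long pingSentNanos;
    private boolean pingOutstanding;

    public IdleProbe(Duration idle, Duration pingTimeout, long nowNanos) {
        this.idleNanos = idle.toNanos();
        this.pingTimeoutNanos = pingTimeout.toNanos();
        this.lastReadNanos = nowNanos;
    }

    /** Call whenever any data arrives from the server; it proves the channel is alive. */
    public void onRead(long nowNanos) {
        lastReadNanos = nowNanos;
        pingOutstanding = false;
    }

    /** Call periodically from a timer; returns what the client should do now. */
    public Action onTick(long nowNanos) {
        if (pingOutstanding) {
            // A ping is in flight: close only once the reply deadline has passed.
            return nowNanos - pingSentNanos >= pingTimeoutNanos ? Action.CLOSE : Action.NONE;
        }
        if (nowNanos - lastReadNanos >= idleNanos) {
            pingOutstanding = true;
            pingSentNanos = nowNanos;
            return Action.SEND_PING;
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        long sec = 1_000_000_000L;
        IdleProbe probe = new IdleProbe(Duration.ofSeconds(30), Duration.ofSeconds(10), 0L);
        System.out.println(probe.onTick(30 * sec)); // idle long enough, so a ping is sent
        System.out.println(probe.onTick(40 * sec)); // no reply within the timeout, so close
    }
}
```

In the half-open case described above, the server side has already released the socket, so the ping would normally either be answered with a reset or go unanswered; both outcomes let the client fail the channel instead of waiting for a BufferResponse indefinitely.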
Issue Links
- duplicates: FLINK-19249 Detect broken connections in case TCP Timeout takes too long. (Open)
- is related to: FLINK-19249 Detect broken connections in case TCP Timeout takes too long. (Open)