FLINK-26080: PartitionRequest client should use Netty's IdleStateHandler to monitor channel status


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.14.3
    • Fix Version/s: None
    • Component/s: Runtime / Network
    • Labels: None

    Description

      In our production environment, we encountered an abnormal case:

          The upstream task is backpressured while all of its downstream tasks are idle, and the job stays in this state until the checkpoint times out (we use aligned checkpoints). After analyzing this case, we found a half-open socket (see https://en.wikipedia.org/wiki/TCP_half-open), already closed on the server side but still established on the client side, which led to this state:

          1. The NettyServer encounters a ReadTimeoutException when reading data from the channel; it then releases the NetworkSequenceViewReader (which is responsible for sending data to the PartitionRequestClient) and writes an ErrorResponse to the PartitionRequestClient. After the ErrorResponse is written successfully, the server closes the channel (the socket transitions to the FIN_WAIT1 state).

          2. The PartitionRequestClient never receives the ErrorResponse or the server's FIN, so the client keeps the socket in the ESTABLISHED state and waits for a BufferResponse from the server (possibly a kernel bug on our machines caused the ErrorResponse and FIN to be lost).

          3. The server machine releases a socket that stays in the FIN_WAIT1 state for too long, while the socket on the client machine remains ESTABLISHED, which produces the half-open socket.

      To avoid this case, I think there are two methods:

          1. Enable TCP keep-alive on the client (Flink already enables it): this also requires adjusting the machine's TCP keep-alive time, because the default (tcp_keepalive_time) is 7200 seconds, which is too long. A sketch of per-socket tuning is shown after this list.

          2. Use Netty's IdleStateHandler on the client to detect whether the channel is idle (no reads or writes); if it is, the client writes a ping message to the server to check whether the channel is really healthy (see the second sketch after this list).

      For the two methods, I recommend method 2, because adjusting the machine's TCP keep-alive time would affect other services running on the same machine.
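
      A minimal sketch of method 1's tuning, assuming Netty's native epoll transport is available: its EpollChannelOption constants allow overriding the kernel's keep-alive timing per socket rather than machine-wide. The class name and the timing values below are illustrative, not Flink's actual configuration.

{code:java}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.epoll.EpollChannelOption;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollSocketChannel;

public class KeepAliveClientConfig {
    public static Bootstrap configure() {
        return new Bootstrap()
                .group(new EpollEventLoopGroup())
                .channel(EpollSocketChannel.class)
                // Flink already enables SO_KEEPALIVE; the probe timing, however,
                // normally comes from the kernel (net.ipv4.tcp_keepalive_time,
                // 7200 seconds by default).
                .option(ChannelOption.SO_KEEPALIVE, true)
                // With the epoll transport the timing can be tightened per
                // socket, leaving other services on the machine untouched
                // (values are illustrative):
                .option(EpollChannelOption.TCP_KEEPIDLE, 120)  // idle seconds before the first probe
                .option(EpollChannelOption.TCP_KEEPINTVL, 30)  // seconds between probes
                .option(EpollChannelOption.TCP_KEEPCNT, 4);    // failed probes before the socket is closed
    }
}
{code}

      Under these illustrative values a dead peer is detected after roughly 120 + 4 × 30 = 240 seconds instead of the kernel default of two hours; with the plain NIO transport only the machine-wide sysctl applies, which is exactly the concern raised above.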
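
      And a minimal sketch of method 2, the recommended one. The PingMessage type, the handler names, and the 60-second threshold are all hypothetical; Flink's actual client pipeline wiring differs.

{code:java}
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPipeline;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

public class ChannelHealthChecker extends ChannelDuplexHandler {

    /** Install in the client's pipeline (handler names and threshold illustrative). */
    public static void install(ChannelPipeline pipeline) {
        // Fires an ALL_IDLE event when the channel saw neither a read nor a write for 60s.
        pipeline.addLast("idleStateHandler", new IdleStateHandler(0, 0, 60));
        pipeline.addLast("channelHealthChecker", new ChannelHealthChecker());
    }

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent
                && ((IdleStateEvent) evt).state() == IdleState.ALL_IDLE) {
            // Probe the server. On a half-open socket the peer answers the probe
            // with an RST (or nothing at all), so the connection fails and the
            // error surfaces instead of the client silently waiting for a
            // BufferResponse until the checkpoint times out. The listener also
            // closes the channel if the write itself fails.
            ctx.writeAndFlush(new PingMessage())
                    .addListener((ChannelFutureListener) future -> {
                        if (!future.isSuccess()) {
                            future.channel().close();
                        }
                    });
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }
}

// Placeholder for a real protocol-level ping; an actual implementation would
// add a ping message to the network protocol so the server understands it.
final class PingMessage {}
{code}

      If the server is healthy, the probe is accepted and normal traffic resets the idle timer; only a genuinely dead channel is torn down, so the task fails fast instead of hanging until the checkpoint timeout.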

       


            People

              Assignee: Unassigned
              Reporter: Cai Liuyang
              Votes: 0
              Watchers: 7
