FLINK-26080: PartitionRequest client should use Netty's IdleStateHandler to monitor channel status


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.14.3
    • Fix Version/s: None
    • Component/s: Runtime / Network
    • Labels: None

    Description

      In our production environment, we encountered an abnormal case:

          The upstream task is backpressured while all of its downstream tasks are idle, and the job stays in this state until the checkpoint times out (we use aligned checkpoints). After analyzing this case, we found a half-open socket (see https://en.wikipedia.org/wiki/TCP_half-open), already closed on the server side but still established on the client side, which led to this state:

          1. The NettyServer encounters a ReadTimeoutException when reading data from the channel; it then releases the NetworkSequenceViewReader (which is responsible for sending data to the PartitionRequestClient) and writes an ErrorResponse to the PartitionRequestClient. After the ErrorResponse is written successfully, the server closes the channel (the socket transitions to the FIN_WAIT1 state).

          2. The PartitionRequestClient never receives the ErrorResponse or the server's FIN, so the client keeps the socket in the ESTABLISHED state and waits for a BufferResponse from the server (possibly a kernel bug on our machines caused the ErrorResponse and FIN to be lost).

          3. The server machine releases a socket that stays in the FIN_WAIT1 state for too long, while the socket on the client machine remains ESTABLISHED, which produces the half-open socket.

      To avoid this case, I think there are two methods:

          1. Enable TCP keep-alive on the client (Flink already enables it): this also requires adjusting the machine's TCP keep-alive time, because the default (tcp_keepalive_time) is 7200 seconds, which is too long. A sketch of per-socket tuning is shown after this list.

          2. Use Netty's IdleStateHandler on the client to detect whether the channel is idle (no reads or writes); if it is, the client writes a ping message to the server to check whether the channel is really healthy (see the second sketch after this list).

      For the two methods, I recommend method 2, because adjusting the machine's TCP keep-alive time would affect other services running on the same machine.
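
      A minimal sketch of method 1's tuning, assuming Netty's native epoll transport is available: its EpollChannelOption constants allow overriding the kernel's keep-alive timing per socket rather than machine-wide. The class name and the timing values below are illustrative, not Flink's actual configuration.

{code:java}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.epoll.EpollChannelOption;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollSocketChannel;

public class KeepAliveClientConfig {
    public static Bootstrap configure() {
        return new Bootstrap()
                .group(new EpollEventLoopGroup())
                .channel(EpollSocketChannel.class)
                // Flink already enables SO_KEEPALIVE; the probe timing, however,
                // normally comes from the kernel (net.ipv4.tcp_keepalive_time,
                // 7200 seconds by default).
                .option(ChannelOption.SO_KEEPALIVE, true)
                // With the epoll transport the timing can be tightened per
                // socket, leaving other services on the machine untouched
                // (values are illustrative):
                .option(EpollChannelOption.TCP_KEEPIDLE, 120)  // idle seconds before the first probe
                .option(EpollChannelOption.TCP_KEEPINTVL, 30)  // seconds between probes
                .option(EpollChannelOption.TCP_KEEPCNT, 4);    // failed probes before the socket is closed
    }
}
{code}

      Under these illustrative values a dead peer is detected after roughly 120 + 4 × 30 = 240 seconds instead of the kernel default of two hours; with the plain NIO transport only the machine-wide sysctl applies, which is exactly the concern raised above.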
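
      And a minimal sketch of method 2, the recommended one. The PingMessage type, the handler names, and the 60-second threshold are all hypothetical; Flink's actual client pipeline wiring differs.

{code:java}
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPipeline;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

public class ChannelHealthChecker extends ChannelDuplexHandler {

    /** Install in the client's pipeline (handler names and threshold illustrative). */
    public static void install(ChannelPipeline pipeline) {
        // Fires an ALL_IDLE event when the channel saw neither a read nor a write for 60s.
        pipeline.addLast("idleStateHandler", new IdleStateHandler(0, 0, 60));
        pipeline.addLast("channelHealthChecker", new ChannelHealthChecker());
    }

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent
                && ((IdleStateEvent) evt).state() == IdleState.ALL_IDLE) {
            // Probe the server. On a half-open socket the peer answers the probe
            // with an RST (or nothing at all), so the connection fails and the
            // error surfaces instead of the client silently waiting for a
            // BufferResponse until the checkpoint times out. The listener also
            // closes the channel if the write itself fails.
            ctx.writeAndFlush(new PingMessage())
                    .addListener((ChannelFutureListener) future -> {
                        if (!future.isSuccess()) {
                            future.channel().close();
                        }
                    });
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }
}

// Placeholder for a real protocol-level ping; an actual implementation would
// add a ping message to the network protocol so the server understands it.
final class PingMessage {}
{code}

      If the server is healthy, the probe is accepted and normal traffic resets the idle timer; only a genuinely dead channel is torn down, so the task fails fast instead of hanging until the checkpoint timeout.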

       


            People

              Assignee: Unassigned
              Reporter: Cai Liuyang
              Votes: 0
              Watchers: 7
