Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 5.2.0
    • 5.2.1, 5.3.0
    • Core
    • None

    Description

      The patch in TS-3196 seems to result in fd leak in our prod. There are a bunch of hung sockets (in close-wait state), stuck forever (remain leaked for days after stopping the traffic). Debugging further, it seems that the InactivityCop is broken by this patch. [NOTE: We have spdy enabled in prod, but, I am not entirely sure, if this bug only affects spdy connections]

      Some info below for the leaked sockets (in close_wait state):

      $ ss -s ; sudo traffic_line -r
      proxy.process.net.connections_currently_open; sudo traffic_line -r
      proxy.process.http.current_client_connections; sudo traffic_line -r
      proxy.process.http.current_server_connections; sudo ls -l /proc/$(pidof
      traffic_server)/fd/ 3>/dev/null| wc -l
      Total: 29367 (kernel 29437)
      TCP:   78235 (estab 5064, closed 46593, orphaned 0, synrecv 0, timewait 15/0),
      ports 918
      
      Transport Total     IP        IPv6
      *      29437     -         -        
      RAW      0         0         0        
      UDP      16        13        3        
      TCP      31642     31637     5        
      INET      31658     31650     8        
      FRAG      0         0         0        
      
      Password: 
      27689
      1
      1
      27939
      
      A snippet from "lsof -p $(pidof traffic_server)"
      
      [ET_NET 10024 nobody  240u  IPv4         2138575429        0t0        TCP 67.195.33.62:443->66.87.139.114:29381 (CLOSE_WAIT)
      [ET_NET 10024 nobody  241u  IPv4         2137093945        0t0        TCP 67.195.33.62:443->201.122.209.180:49274 (CLOSE_WAIT)
      [ET_NET 10024 nobody  243u  IPv4         2136018789        0t0        TCP 67.195.33.62:443->173.225.111.9:38133 (CLOSE_WAIT)
      [ET_NET 10024 nobody  245u  IPv4         2135996293        0t0        TCP 67.195.33.62:443->172.243.79.180:52701 (CLOSE_WAIT)
      [ET_NET 10024 nobody  248u  IPv4         2136468896        0t0        TCP 67.195.33.62:443->173.225.111.82:42273 (CLOSE_WAIT)
      [ET_NET 10024 nobody  253u  IPv4         2140213864        0t0        TCP 67.195.33.62:443->174.138.185.120:34936 (CLOSE_WAIT)
      [ET_NET 10024 nobody  259u  IPv4         2137861176        0t0        TCP 67.195.33.62:443->76.199.250.133:60631 (CLOSE_WAIT)
      [ET_NET 10024 nobody  260u  IPv4         2139081493        0t0        TCP 67.195.33.62:443->187.74.154.214:58800 (CLOSE_WAIT)
      [ET_NET 10024 nobody  261u  IPv4         2134948565        0t0        TCP 67.195.33.62:443->23.242.49.117:4127 (CLOSE_WAIT)
      [ET_NET 10024 nobody  262u  IPv4         2135708046        0t0        TCP 67.195.33.62:443->66.241.71.243:50318 (CLOSE_WAIT)
      [ET_NET 10024 nobody  263u  IPv4         2138896897        0t0        TCP 67.195.33.62:443->73.35.151.106:52414 (CLOSE_WAIT)
      [ET_NET 10024 nobody  264u  IPv4         2135589029        0t0        TCP 67.195.33.62:443->96.251.12.27:62426 (CLOSE_WAIT)
      [ET_NET 10024 nobody  265u  IPv4         2134930235        0t0        TCP 67.195.33.62:443->207.118.3.196:50690 (CLOSE_WAIT)
      [ET_NET 10024 nobody  267u  IPv4         2137837515        0t0        TCP 67.195.33.62:443->98.112.195.98:52028 (CLOSE_WAIT)
      [ET_NET 10024 nobody  269u  IPv4         2135272855        0t0        TCP 67.195.33.62:443->24.1.230.25:57265 (CLOSE_WAIT)
      [ET_NET 10024 nobody  270u  IPv4         2135820802        0t0        TCP 67.195.33.62:443->24.75.122.66:14345 (CLOSE_WAIT)
      [ET_NET 10024 nobody  271u  IPv4         2135475042        0t0        TCP 67.195.33.62:443->65.102.35.112:49188 (CLOSE_WAIT)
      [ET_NET 10024 nobody  272u  IPv4         2135328974        0t0        TCP 67.195.33.62:443->209.242.195.252:54890 (CLOSE_WAIT)
      [ET_NET 10024 nobody  273u  IPv4         2137542791        0t0        TCP 67.195.33.62:443->76.79.183.188:47048 (CLOSE_WAIT)
      [ET_NET 10024 nobody  274u  IPv4         2134806135        0t0        TCP 67.195.33.62:443->189.251.149.36:58106 (CLOSE_WAIT)
      [ET_NET 10024 nobody  275u  IPv4         2140126017        0t0        TCP 67.195.33.62:443->68.19.173.44:1397 (CLOSE_WAIT)
      [ET_NET 10024 nobody  276u  IPv4         2134636089        0t0        TCP 67.195.33.62:443->67.44.192.72:22112 (CLOSE_WAIT)
      [ET_NET 10024 nobody  278u  IPv4         2134708339        0t0        TCP 67.195.33.62:443->107.220.216.155:51242 (CLOSE_WAIT)
      [ET_NET 10024 nobody  279u  IPv4         2134580888        0t0        TCP 67.195.33.62:443->50.126.116.209:59432 (CLOSE_WAIT)
      [ET_NET 10024 nobody  281u  IPv4         2134868131        0t0        TCP 67.195.33.62:443->108.38.255.44:4612 (CLOSE_WAIT)
      [ET_NET 10024 nobody  282u  IPv4         2139275601        0t0        TCP 67.195.33.62:443->168.99.78.5:63345 (CLOSE_WAIT)
      [ET_NET 10024 nobody  283u  IPv4         2138019644        0t0        TCP 67.195.33.62:443->209.242.204.18:47547 (CLOSE_WAIT)
      [ET_NET 10024 nobody  284u  IPv4         2134873211        0t0        TCP 67.195.33.62:443->99.47.247.0:58672 (CLOSE_WAIT)
      [ET_NET 10024 nobody  285u  IPv4         2137679393        0t0        TCP 67.195.33.62:443->174.138.184.227:43908 (CLOSE_WAIT)
      [ET_NET 10024 nobody  287u  IPv4         2137417894        0t0        TCP 67.195.33.62:443->50.46.200.8:4880 (CLOSE_WAIT)
      [ET_NET 10024 nobody  288u  IPv4         2135541215        0t0        TCP 67.195.33.62:443->216.201.190.60:41016 (CLOSE_WAIT)
      [ET_NET 10024 nobody  289u  IPv4         2137315400        0t0        TCP 67.195.33.62:443->98.180.195.225:53754 (CLOSE_WAIT)
      [ET_NET 10024 nobody  291u  IPv4         2136931822        0t0        TCP 67.195.33.62:443->12.199.206.210:7271 (CLOSE_WAIT)
      [ET_NET 10024 nobody  292u  IPv4         2137892940        0t0        TCP 67.195.33.62:443->166.173.60.100:16766 (CLOSE_WAIT)
      [ET_NET 10024 nobody  293u  IPv4         2136627603        0t0        TCP 67.195.33.62:443->71.131.5.198:51552 (CLOSE_WAIT)
      [ET_NET 10024 nobody  294u  IPv4         2135059554        0t0        TCP 67.195.33.62:443->209.95.183.181:39707 (CLOSE_WAIT)
      

      Below are the traces showing how the inactivity cop is stuck in a loop, basically resetting the timer every time it expires, and not closing the vc.

      [Jan 13 23:08:32.120] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose)
      vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
      [Jan 13 23:08:33.164] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose)
      vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
      .....
      [Jan 13 23:13:46.833] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
      [Jan 13 23:13:47.845] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
      [Jan 13 23:08:47.639] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop) vc:
      0x2ae574f13730 inactivity timeout not set, setting a default of 300
      [Jan 13 23:08:48.666] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose)
      vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
      ........
      

      Attached gdb to track who's resetting the inactivity timer - It seems that, when the inactivity timer expires for these connections, the audit correctly generates a signal to the vc's continuation. However, since, the patch in TS-3196 reset the read continuation to NULL, this event simply is ignored (in read_signal_and_update()), while at the same time, the inactivity timer is reset. This keeps looping forever resulting in never releasing these inactive connections. Below's the debug output from gdb.

      (gdb) cont
      Continuing.
      Hardware watchpoint 22: *(0x2ae564011a28)
      
      Old value = 0
      New value = -794113713
      UnixNetVConnection::set_inactivity_timeout (this=0x2ae564011830,
      timeout=300000000000) at P_UnixNetVConnection.h:324
      324    P_UnixNetVConnection.h: No such file or directory.
          in P_UnixNetVConnection.h
      (gdb) cont
      Continuing.
      Hardware watchpoint 22: *(0x2ae564011a28)
      
      Old value = -794113713
      New value = 0
      UnixNetVConnection::mainEvent (this=0x2ae564011830, event=1, e=0x1f37c90) at
      UnixNetVConnection.cc:1060
      1060    UnixNetVConnection.cc: No such file or directory.
          in UnixNetVConnection.cc
      (gdb) break 1068
      Breakpoint 23 at 0x738358: file UnixNetVConnection.cc, line 1068.
      
      Breakpoint 23, UnixNetVConnection::mainEvent (this=0x2ae575679c90, event=1,
      e=0x1f34a50) at UnixNetVConnection.cc:1068
      1068    in UnixNetVConnection.cc
      (gdb) p closed
      $24 = 0
      (gdb) p read.vio
      $25 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry
      = 0x0}, vc_server = 0x2ae575679c90, mutex = {m_ptr = 0x2ae6241a3720}}
      
      Breakpoint 24, read_signal_and_update (event=105, vc=0x2ae5758ffab0) at
      UnixNetVConnection.cc:137
      137    in UnixNetVConnection.cc
      (gdb) p vc->read.vio
      $33 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry
      = 0x0}, vc_server = 0x2ae5758ffab0, mutex = {m_ptr = 0x2ae6100a3a20}}
      (gdb) cont
      Continuing.
      [Switching to Thread 0x2ae55e928700 (LWP 10088)]
      

      Attachments

        Issue Links

          Activity

            sudheerv Sudheer Vinukonda added a comment - - edited

            Below's a possible fix (yet to be tested):

            % diff --git a/iocore/net/UnixNetVConnection.cc
            b/iocore/net/UnixNetVConnection.cc
            index 92cb7c4..104f0ea 100644
            --- a/iocore/net/UnixNetVConnection.cc
            +++ b/iocore/net/UnixNetVConnection.cc
            @@ -137,6 +137,9 @@ read_signal_and_update(int event, UnixNetVConnection *vc)
               vc->recursion++;
               if(vc->read.vio._cont) {
                 vc->read.vio._cont->handleEvent(event, &vc->read.vio);
            +  } else {
            +    ink_assert (event == VC_EVENT_INACTIVITY_TIMEOUT || event == VC_EVENT_EOS || event == VC_EVENT_ERROR || event == VC_EVENT_ACTIVE_TIMEOUT);
            +    vc->closed = 1
               }
               if (!--vc->recursion && vc->closed) {
                 /* BZ  31932 */
            
            sudheerv Sudheer Vinukonda added a comment - - edited Below's a possible fix (yet to be tested): % diff --git a/iocore/net/UnixNetVConnection.cc b/iocore/net/UnixNetVConnection.cc index 92cb7c4..104f0ea 100644 --- a/iocore/net/UnixNetVConnection.cc +++ b/iocore/net/UnixNetVConnection.cc @@ -137,6 +137,9 @@ read_signal_and_update( int event, UnixNetVConnection *vc) vc->recursion++; if (vc->read.vio._cont) { vc->read.vio._cont->handleEvent(event, &vc->read.vio); + } else { + ink_assert (event == VC_EVENT_INACTIVITY_TIMEOUT || event == VC_EVENT_EOS || event == VC_EVENT_ERROR || event == VC_EVENT_ACTIVE_TIMEOUT); + vc->closed = 1 } if (!--vc->recursion && vc->closed) { /* BZ 31932 */

            Commit bb79d6334e445aee4200f0d5f58953f42c107c6a in trafficserver's branch refs/heads/master from sudheerv
            [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=bb79d63 ]

            TS-3299 Fix leaking vc's during inactivitycop, due to null continuation

            jira-bot ASF subversion and git services added a comment - Commit bb79d6334e445aee4200f0d5f58953f42c107c6a in trafficserver's branch refs/heads/master from sudheerv [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=bb79d63 ] TS-3299 Fix leaking vc's during inactivitycop, due to null continuation

            The fd leak issue was a side-effect of TS-3196, which went in 5.2.0, so, marking this as a back port to 5.2.0

            sudheerv Sudheer Vinukonda added a comment - The fd leak issue was a side-effect of TS-3196 , which went in 5.2.0, so, marking this as a back port to 5.2.0

            Commit 394e73b4ffe52629b4bea88d1115e22ec2145b6e in trafficserver's branch refs/heads/5.2.x from sudheerv
            [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=394e73b ]

            TS-3299 Fix leaking vc's during inactivitycop, due to null continuation

            jira-bot ASF subversion and git services added a comment - Commit 394e73b4ffe52629b4bea88d1115e22ec2145b6e in trafficserver's branch refs/heads/5.2.x from sudheerv [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=394e73b ] TS-3299 Fix leaking vc's during inactivitycop, due to null continuation

            Commit 802bfb08f9be3de0e1553c28bb99b0e5c9dcb00e in trafficserver's branch refs/heads/5.2.x from zwoop
            [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=802bfb0 ]

            Added TS-3299

            jira-bot ASF subversion and git services added a comment - Commit 802bfb08f9be3de0e1553c28bb99b0e5c9dcb00e in trafficserver's branch refs/heads/5.2.x from zwoop [ https://git-wip-us.apache.org/repos/asf?p=trafficserver.git;h=802bfb0 ] Added TS-3299

            People

              sudheerv Sudheer Vinukonda
              sudheerv Sudheer Vinukonda
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: