Description
The patch in TS-3196 appears to cause an fd leak in our production environment. A large number of sockets hang in CLOSE_WAIT state and are stuck forever (they remain leaked for days after traffic is stopped). Debugging further, it appears the InactivityCop is broken by this patch. [NOTE: We have SPDY enabled in production, but I am not entirely sure whether this bug affects only SPDY connections.]
Some info below on the leaked sockets (in CLOSE_WAIT state):
$ ss -s ; sudo traffic_line -r proxy.process.net.connections_currently_open; sudo traffic_line -r proxy.process.http.current_client_connections; sudo traffic_line -r proxy.process.http.current_server_connections; sudo ls -l /proc/$(pidof traffic_server)/fd/ 3>/dev/null| wc -l
Total: 29367 (kernel 29437)
TCP:   78235 (estab 5064, closed 46593, orphaned 0, synrecv 0, timewait 15/0), ports 918

Transport Total     IP        IPv6
*         29437     -         -
RAW       0         0         0
UDP       16        13        3
TCP       31642     31637     5
INET      31658     31650     8
FRAG      0         0         0

Password:
27689
1
1
27939

A snippet from "lsof -p $(pidof traffic_server)":

[ET_NET 10024 nobody 240u IPv4 2138575429 0t0 TCP 67.195.33.62:443->66.87.139.114:29381 (CLOSE_WAIT)
[ET_NET 10024 nobody 241u IPv4 2137093945 0t0 TCP 67.195.33.62:443->201.122.209.180:49274 (CLOSE_WAIT)
[ET_NET 10024 nobody 243u IPv4 2136018789 0t0 TCP 67.195.33.62:443->173.225.111.9:38133 (CLOSE_WAIT)
[ET_NET 10024 nobody 245u IPv4 2135996293 0t0 TCP 67.195.33.62:443->172.243.79.180:52701 (CLOSE_WAIT)
[ET_NET 10024 nobody 248u IPv4 2136468896 0t0 TCP 67.195.33.62:443->173.225.111.82:42273 (CLOSE_WAIT)
[ET_NET 10024 nobody 253u IPv4 2140213864 0t0 TCP 67.195.33.62:443->174.138.185.120:34936 (CLOSE_WAIT)
[ET_NET 10024 nobody 259u IPv4 2137861176 0t0 TCP 67.195.33.62:443->76.199.250.133:60631 (CLOSE_WAIT)
[ET_NET 10024 nobody 260u IPv4 2139081493 0t0 TCP 67.195.33.62:443->187.74.154.214:58800 (CLOSE_WAIT)
[ET_NET 10024 nobody 261u IPv4 2134948565 0t0 TCP 67.195.33.62:443->23.242.49.117:4127 (CLOSE_WAIT)
[ET_NET 10024 nobody 262u IPv4 2135708046 0t0 TCP 67.195.33.62:443->66.241.71.243:50318 (CLOSE_WAIT)
[ET_NET 10024 nobody 263u IPv4 2138896897 0t0 TCP 67.195.33.62:443->73.35.151.106:52414 (CLOSE_WAIT)
[ET_NET 10024 nobody 264u IPv4 2135589029 0t0 TCP 67.195.33.62:443->96.251.12.27:62426 (CLOSE_WAIT)
[ET_NET 10024 nobody 265u IPv4 2134930235 0t0 TCP 67.195.33.62:443->207.118.3.196:50690 (CLOSE_WAIT)
[ET_NET 10024 nobody 267u IPv4 2137837515 0t0 TCP 67.195.33.62:443->98.112.195.98:52028 (CLOSE_WAIT)
[ET_NET 10024 nobody 269u IPv4 2135272855 0t0 TCP 67.195.33.62:443->24.1.230.25:57265 (CLOSE_WAIT)
[ET_NET 10024 nobody 270u IPv4 2135820802 0t0 TCP 67.195.33.62:443->24.75.122.66:14345 (CLOSE_WAIT)
[ET_NET 10024 nobody 271u IPv4 2135475042 0t0 TCP 67.195.33.62:443->65.102.35.112:49188 (CLOSE_WAIT)
[ET_NET 10024 nobody 272u IPv4 2135328974 0t0 TCP 67.195.33.62:443->209.242.195.252:54890 (CLOSE_WAIT)
[ET_NET 10024 nobody 273u IPv4 2137542791 0t0 TCP 67.195.33.62:443->76.79.183.188:47048 (CLOSE_WAIT)
[ET_NET 10024 nobody 274u IPv4 2134806135 0t0 TCP 67.195.33.62:443->189.251.149.36:58106 (CLOSE_WAIT)
[ET_NET 10024 nobody 275u IPv4 2140126017 0t0 TCP 67.195.33.62:443->68.19.173.44:1397 (CLOSE_WAIT)
[ET_NET 10024 nobody 276u IPv4 2134636089 0t0 TCP 67.195.33.62:443->67.44.192.72:22112 (CLOSE_WAIT)
[ET_NET 10024 nobody 278u IPv4 2134708339 0t0 TCP 67.195.33.62:443->107.220.216.155:51242 (CLOSE_WAIT)
[ET_NET 10024 nobody 279u IPv4 2134580888 0t0 TCP 67.195.33.62:443->50.126.116.209:59432 (CLOSE_WAIT)
[ET_NET 10024 nobody 281u IPv4 2134868131 0t0 TCP 67.195.33.62:443->108.38.255.44:4612 (CLOSE_WAIT)
[ET_NET 10024 nobody 282u IPv4 2139275601 0t0 TCP 67.195.33.62:443->168.99.78.5:63345 (CLOSE_WAIT)
[ET_NET 10024 nobody 283u IPv4 2138019644 0t0 TCP 67.195.33.62:443->209.242.204.18:47547 (CLOSE_WAIT)
[ET_NET 10024 nobody 284u IPv4 2134873211 0t0 TCP 67.195.33.62:443->99.47.247.0:58672 (CLOSE_WAIT)
[ET_NET 10024 nobody 285u IPv4 2137679393 0t0 TCP 67.195.33.62:443->174.138.184.227:43908 (CLOSE_WAIT)
[ET_NET 10024 nobody 287u IPv4 2137417894 0t0 TCP 67.195.33.62:443->50.46.200.8:4880 (CLOSE_WAIT)
[ET_NET 10024 nobody 288u IPv4 2135541215 0t0 TCP 67.195.33.62:443->216.201.190.60:41016 (CLOSE_WAIT)
[ET_NET 10024 nobody 289u IPv4 2137315400 0t0 TCP 67.195.33.62:443->98.180.195.225:53754 (CLOSE_WAIT)
[ET_NET 10024 nobody 291u IPv4 2136931822 0t0 TCP 67.195.33.62:443->12.199.206.210:7271 (CLOSE_WAIT)
[ET_NET 10024 nobody 292u IPv4 2137892940 0t0 TCP 67.195.33.62:443->166.173.60.100:16766 (CLOSE_WAIT)
[ET_NET 10024 nobody 293u IPv4 2136627603 0t0 TCP 67.195.33.62:443->71.131.5.198:51552 (CLOSE_WAIT)
[ET_NET 10024 nobody 294u IPv4 2135059554 0t0 TCP 67.195.33.62:443->209.95.183.181:39707 (CLOSE_WAIT)
Below are traces showing how the InactivityCop is stuck in a loop, resetting the timer every time it expires but never closing the vc.
[Jan 13 23:08:32.120] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
[Jan 13 23:08:33.164] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
.....
[Jan 13 23:08:47.639] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop) vc: 0x2ae574f13730 inactivity timeout not set, setting a default of 300
[Jan 13 23:08:48.666] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
........
[Jan 13 23:13:46.833] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
[Jan 13 23:13:47.845] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
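For orientation, the check that produces these two debug lines can be sketched as the following minimal, self-contained model (a paraphrase reconstructed from the log text, not the actual UnixNet.cc code; the field name next_inactivity_timeout_at and the 300-second default come from the traces and gdb session, everything else is assumed):

#include <cstdio>

// Hypothetical minimal vc, carrying only the deadline field from the traces.
struct NetVC {
  long next_inactivity_timeout_at = 0; // absolute deadline, in seconds
};

// One InactivityCop pass over a vc, paraphrased from the two debug tags.
void inactivity_cop_check(NetVC *vc, long now)
{
  if (vc->next_inactivity_timeout_at == 0) {
    // "(inactivity_cop) vc: ... inactivity timeout not set, setting a default of 300"
    vc->next_inactivity_timeout_at = now + 300;
  }
  // "(inactivity_cop_verbose) vc: ... timeout at: ... timeout in: ..."
  std::printf("vc: %p timeout at: %ld timeout in: %ld\n",
              (void *)vc, vc->next_inactivity_timeout_at,
              vc->next_inactivity_timeout_at - now);
}

The verbose lines above show this check visiting the same vc every second with the deadline unchanged, which is what a healthy cop does while the timer is ticking; the problem is what happens once the deadline passes.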
I attached gdb to track down who is resetting the inactivity timer. It seems that when the inactivity timer expires for these connections, the InactivityCop correctly generates a signal to the vc's continuation. However, since the patch in TS-3196 resets the read continuation to NULL, the event is simply ignored (in read_signal_and_update()), while at the same time the inactivity timer is reset. This loops forever, so these inactive connections are never released. Below is the debug output from gdb.
(gdb) cont
Continuing.

Hardware watchpoint 22: *(0x2ae564011a28)

Old value = 0
New value = -794113713
UnixNetVConnection::set_inactivity_timeout (this=0x2ae564011830, timeout=300000000000) at P_UnixNetVConnection.h:324
324     P_UnixNetVConnection.h: No such file or directory.
        in P_UnixNetVConnection.h
(gdb) cont
Continuing.

Hardware watchpoint 22: *(0x2ae564011a28)

Old value = -794113713
New value = 0
UnixNetVConnection::mainEvent (this=0x2ae564011830, event=1, e=0x1f37c90) at UnixNetVConnection.cc:1060
1060    UnixNetVConnection.cc: No such file or directory.
        in UnixNetVConnection.cc
(gdb) break 1068
Breakpoint 23 at 0x738358: file UnixNetVConnection.cc, line 1068.

Breakpoint 23, UnixNetVConnection::mainEvent (this=0x2ae575679c90, event=1, e=0x1f34a50) at UnixNetVConnection.cc:1068
1068    in UnixNetVConnection.cc
(gdb) p closed
$24 = 0
(gdb) p read.vio
$25 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry = 0x0}, vc_server = 0x2ae575679c90, mutex = {m_ptr = 0x2ae6241a3720}}

Breakpoint 24, read_signal_and_update (event=105, vc=0x2ae5758ffab0) at UnixNetVConnection.cc:137
137     in UnixNetVConnection.cc
(gdb) p vc->read.vio
$33 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry = 0x0}, vc_server = 0x2ae5758ffab0, mutex = {m_ptr = 0x2ae6100a3a20}}
(gdb) cont
Continuing.
[Switching to Thread 0x2ae55e928700 (LWP 10088)]
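Putting the gdb observations together, the loop can be modeled with the following self-contained sketch (hypothetical stand-in types; this is a paraphrase of the interaction rather than the real UnixNetVConnection code; the NULL check mirrors the _cont = 0x0 printed above, and 105 is the inactivity-timeout event seen at the read_signal_and_update breakpoint):

#include <cstdio>

// Hypothetical stand-ins for the ATS types involved; only the fields that
// matter for this loop are modeled, with names echoing the gdb session.
struct Continuation { /* HTTP/SPDY state machine */ };
struct VIO { Continuation *_cont = nullptr; };  // left NULL by TS-3196
struct ReadState { VIO vio; };
struct NetVC {
  ReadState read;
  long next_inactivity_timeout_at = 0;          // absolute deadline, seconds
};

// Paraphrase of read_signal_and_update(): with read.vio._cont == NULL there
// is no state machine to deliver the inactivity-timeout event to, so the
// event is silently dropped and the vc is never closed.
void read_signal_and_update(int event, NetVC *vc)
{
  if (vc->read.vio._cont == nullptr) {
    std::printf("event %d dropped: read.vio._cont is NULL\n", event);
    return; // <-- the leak: nobody ever closes the CLOSE_WAIT socket
  }
  // ... would normally hand the event to vc->read.vio._cont ...
}

// Paraphrase of UnixNetVConnection::mainEvent() handling the timeout: clear
// the deadline (the watchpoint's "New value = 0") and signal the read side.
void mainEvent(NetVC *vc)
{
  vc->next_inactivity_timeout_at = 0;
  read_signal_and_update(105 /* inactivity timeout, per the breakpoint */, vc);
}

// Paraphrase of the InactivityCop pass: a vc with no deadline gets the
// 300-second default back ("setting a default of 300"), so the cycle
// repeats forever.
void inactivity_cop_check(NetVC *vc, long now)
{
  if (vc->next_inactivity_timeout_at == 0)
    vc->next_inactivity_timeout_at = now + 300; // via set_inactivity_timeout()
  else if (vc->next_inactivity_timeout_at <= now)
    mainEvent(vc);
}

int main()
{
  NetVC vc; // read.vio._cont is 0x0, exactly as printed by gdb above
  for (long now = 0; now <= 1200; now += 300)
    inactivity_cop_check(&vc, now); // arm, fire, drop, re-arm, ...
}

Each pass of the driver loop arms the default timeout, fires it, drops the signal for lack of a read continuation, and re-arms it, matching the endless (inactivity_cop_verbose) churn in the traces above.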
Issue Links
- is depended upon by TS-3313 "New World order for connection management and timeouts" (Closed)