Description
Background: We recently tried making use of inactivity-cop by setting it to 300s instead of the default one-day setting. This was to address an issue where, under heavy load, ATS would become un-responsive to client requests, and the condition would persist after traffic was stopped with the active queue saying 0 connections but 'netstat -na' showing a bunch of established connections (up to the throttle limit approximately).
Inactivity cop seemed to help ATS handle this situation, but we have since experienced a couple of core dumps over the last four day period. It seems occasionally the Log Collation Client State Machine will have event value 105 or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update() it tries to call the continuation handler which down the line does not know about this event thus causing core dump !"unexpcted state" [sic].
Here is the back-trace –
(gdb) bt
#0 0x00002b67cd5405f7 in raise () from /lib64/libc.so.6
#1 0x00002b67cd541e28 in abort () from /lib64/libc.so.6
#2 0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43
#3 0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65
#4 0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d: failed assert `%s`") at ink_error.cc:73
#5 0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted state\"", file=0x7fb35b "LogCollationClientSM.cc",
line=445) at ink_assert.cc:37
#6 0x000000000069c86b in LogCollationClientSM::client_idle (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445
#7 0x000000000069b427 in LogCollationClientSM::client_handler (this=0x2b681400bb00, event=105, data=0x2b680c017020)
at LogCollationClientSM.cc:119
#8 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00, event=105, data=0x2b680c017020)
at ../iocore/eventsystem/I_Continuation.h:153
#9 0x0000000000783d40 in read_signal_and_update (event=105, vc=0x2b680c016f00) at UnixNetVConnection.cc:150
#10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00, event=1, e=0x127ad60) at UnixNetVConnection.cc:1188
#11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00, event=1, data=0x127ad60)
at ../iocore/eventsystem/I_Continuation.h:153
#12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00, event=2, e=0x127ad60) at UnixNet.cc:102
#13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2, data=0x127ad60)
at ../iocore/eventsystem/I_Continuation.h:153
#14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010, e=0x127ad60, calling_code=2) at UnixEThread.cc:128
#15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at UnixEThread.cc:207
#16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918
I believe it takes a wrong turn here –
#9 0x0000000000783d40 in read_signal_and_update (event=105, vc=0x2b680c016f00) at UnixNetVConnection.cc:150
150 vc->read.vio._cont->handleEvent(event, &vc->read.vio);
(gdb) list
145 static inline int
146 read_signal_and_update(int event, UnixNetVConnection *vc)
147 {
148 vc->recursion++;
149 if (vc->read.vio._cont)
else {
152 switch (event)
Note: I understand that there were several issues related to TS-3196 concerning inactivity_cop and this section of code.