Traffic Server
  1. Traffic Server
  2. TS-880

Major performance problem with second request on same keep-alive connection

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 2.1.4
    • Fix Version/s: 3.1.0, 3.0.4
    • Component/s: HTTP
    • Labels:
      None
    • Environment:

      32-bit TS on linux, using epoll. With proxy.config.http.keep_alive_enabled_out = 1

      Description

      If I load the same URL through TS twice in a row to a server with keep-alives turned on I get really slow performance on the second request.
      E.g. (loading 212M over loopback with uncachable content, but results are similar with cachable content):
      % Total % Received % Xferd Average Speed Time Time Time Current
      Dload Upload Total Spent Left Speed
      100 212M 100 212M 0 0 140M 0 0:00:01 0:00:01 -::- 142M*
      % Total % Received % Xferd Average Speed Time Time Time Current
      Dload Upload Total Spent Left Speed
      100 212M 100 212M 0 0 6397k 0 0:00:34 0:00:34 -::- 6400k*

      If I turn off proxy.config.http.keep_alive_enabled_out the problem goes away, it can also be partially solved by making proxy.config.io.max_buffer_size really big (but it needs to be bigger for larger content, and that supports the comment below, and proves that it isn't the origin server's fault.)

      When I turn on debug messages it looks like the IO loop for the second request only wakes up every 10ms (when an epoll_wait times out) and then it reads and writes, for the first request it goes much faster without those pauses. My theory is that the issue is related to the the fact that the outgoing connection fd gets added to the epoll fd for the thread handling the first request and the first user agent fd is added to the same epoll fd, and it stays on that epoll fd. But the new user agent request is on a different epoll fd which I assume means is tied to a different thread.

      1. performance_2nd_req_another_version.patch
        2 kB
        mohan_zl
      2. performance.try3.diff
        0.5 kB
        weijin
      3. performance_2nd_req_try2.diff
        3 kB
        William Bardwell
      4. performance_2nd_req.diff
        4 kB
        William Bardwell

        Activity

        Hide
        Igor Galić added a comment -

        backported in fcf1f824ea16c50d72d24f0e9b169e5201eba713

        Show
        Igor Galić added a comment - backported in fcf1f824ea16c50d72d24f0e9b169e5201eba713
        Hide
        Brian Geffon added a comment -

        try3 cleanly applies on 3.0.x

        Show
        Brian Geffon added a comment - try3 cleanly applies on 3.0.x
        Hide
        Dzmitry Markovich added a comment -

        Is anyone going to backport it to 3.0.x branch?

        Show
        Dzmitry Markovich added a comment - Is anyone going to backport it to 3.0.x branch?
        Hide
        Leif Hedstrom added a comment -

        Perhaps try

        CONFIG proxy.config.http.share_server_sessions INT 2

        This is available with 3.1.1 (and soon, 3.1.2). I don't think it was backported to the 3.0.x branch.

        Show
        Leif Hedstrom added a comment - Perhaps try CONFIG proxy.config.http.share_server_sessions INT 2 This is available with 3.1.1 (and soon, 3.1.2). I don't think it was backported to the 3.0.x branch.
        Hide
        Dzmitry Markovich added a comment -

        We have the same issues described above with 3.0.2 version. Did you guys fix it in this version?
        Second request on KA connection gets delayed every time, so we have to set proxy.config.http.keep_alive_enabled_out = 0 to achieve the performance expectations.
        Any ideas?

        Show
        Dzmitry Markovich added a comment - We have the same issues described above with 3.0.2 version. Did you guys fix it in this version? Second request on KA connection gets delayed every time, so we have to set proxy.config.http.keep_alive_enabled_out = 0 to achieve the performance expectations. Any ideas?
        Hide
        weijin added a comment -

        The may be never NULL if the thread is an ET_NET thread, after the netHandler is calls back startNetEvent.

        Show
        weijin added a comment - The may be never NULL if the thread is an ET_NET thread, after the netHandler is calls back startNetEvent.
        Hide
        mohan_zl added a comment -

        @William
        1. I think function initialize_thread_for_net will assign value for trigger_event and signal_hook.
        2. When i test your first patch, i found it will crash, and from the debug info, i found when the second time a thread use the connection which the last thread created, it will cause two thread switch, so i just add a "thread = t" in your do_reown_to_this_thread function, and change some function name. In my test environment, it do solve the problem. It seems odd, what is your test version? I use the latest trunk.
        3. Agree with you, try3 is the perfect patch so far. weijin always solve one problem with only a few lines, i admire him...

        Show
        mohan_zl added a comment - @William 1. I think function initialize_thread_for_net will assign value for trigger_event and signal_hook. 2. When i test your first patch, i found it will crash, and from the debug info, i found when the second time a thread use the connection which the last thread created, it will cause two thread switch, so i just add a "thread = t" in your do_reown_to_this_thread function, and change some function name. In my test environment, it do solve the problem. It seems odd, what is your test version? I use the latest trunk. 3. Agree with you, try3 is the perfect patch so far. weijin always solve one problem with only a few lines, i admire him...
        Hide
        Leif Hedstrom added a comment -

        After talking to William, I think we should commit "try3", and then file a new bug for a possibly improvement based on the thread reallocation ideas (which don't seem to be stable). If I hear no objections, I'll commit try3 tomorrow, and create a new bug for later.

        Show
        Leif Hedstrom added a comment - After talking to William, I think we should commit "try3", and then file a new bug for a possibly improvement based on the thread reallocation ideas (which don't seem to be stable). If I hear no objections, I'll commit try3 tomorrow, and create a new bug for later.
        Hide
        William Bardwell added a comment -

        With 'performance_2nd_req_another_version.patch' I get similar crashes to with my original patches, so I think 'try3' is the best option so far.

        [Aug 17 15:32:31.704] Server

        {4134497168}

        NOTE: cache enabled
        FATAL: UnixNetVConnection.cc:968: failed assert `thread == this_ethread()`
        /usr/local/akamai/ts7/proxy/traffic_server - STACK TRACE:
        FATAL: UnixNetVConnection.cc:968: failed assert `thread == this_ethread()`
        /usr/local/akamai/ts7/proxy/traffic_server - STACK TRACE:
        /usr/local/akamai/ts7/proxy/traffic_server(ink_fatal+0x1a)[0x844621a]
        /usr/local/akamai/ts7/proxy/traffic_server(ink_fatal+0x1a)[0x844621a]
        /usr/local/akamai/ts7/proxy/traffic_server(_ink_assert+0x6a)[0x844593a]
        /usr/local/akamai/ts7/proxy/traffic_server(_ink_assert+0x6a)[0x844593a]
        /usr/local/akamai/ts7/proxy/traffic_server(_ZN18UnixNetVConnection9mainEventEiP5Event+0x5b)[0x835c97b]
        /usr/local/akamai/ts7/proxy/traffic_server(_ZN18UnixNetVConnection9mainEventEiP5Event+0x5b)[0x835c97b]
        /usr/local/akamai/ts7/proxy/traffic_server(_ZN13InactivityCop16check_inactivityEiP5Event+0xd1)[0x83518b1]
        /usr/local/akamai/ts7/proxy/traffic_server(_ZN13InactivityCop16check_inactivityEiP5Event+0xd1)[0x83518b1]
        /usr/local/akamai/ts7/proxy/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x3c3)[0x8385233]
        /usr/local/akamai/ts7/proxy/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x3c3)[0x8385233]
        /usr/local/akamai/ts7/proxy/traffic_server(_ZN7EThread7executeEv+0x4cf)[0x8385d4f]
        /usr/local/akamai/ts7/proxy/traffic_server(_ZN7EThread7executeEv+0x4cf)[0x8385d4f]
        /usr/local/akamai/ts7/proxy/traffic_server[0x8384c53]
        /usr/local/akamai/ts7/proxy/traffic_server[0x8384c53]
        /lib/tls/i686/cmov/libpthread.so.0[0xf77264fb]
        /lib/tls/i686/cmov/libpthread.so.0[0xf77264fb]
        /lib/tls/i686/cmov/libc.so.6(clone+0x5e)[0xf741de5e]
        /lib/tls/i686/cmov/libc.so.6(clone+0x5e)[0xf741de5e]

        Program received signal SIGABRT, Aborted.
        [Switching to Thread 0xf66f6b90 (LWP 9967)]
        0xffffe425 in __kernel_vsyscall ()
        (gdb) where
        #0 0xffffe425 in __kernel_vsyscall ()
        #1 0xf7372085 in raise () from /lib/tls/i686/cmov/libc.so.6
        #2 0xf7373a01 in abort () from /lib/tls/i686/cmov/libc.so.6
        #3 0x084461da in ink_fatal_va (return_code=1,
        message_format=0xf66f5f54 "UnixNetVConnection.cc:968: failed assert `thread == this_ethread()`",
        ap=0xf66f5f48 "") at ink_error.cc:43
        #4 0x0844621a in ink_fatal (return_code=1,
        message_format=0xf66f5f54 "UnixNetVConnection.cc:968: failed assert `thread == this_ethread()`")
        at ink_error.cc:73
        #5 0x0844593a in _ink_assert (a=0x84abe39 "thread == this_ethread()", f=0x84abd9d "UnixNetVConnection.cc", l=968)
        at ink_assert.cc:44
        #6 0x0835c97b in UnixNetVConnection::mainEvent (this=0x8a9edb0, event=1, e=0x8a829a0)
        at UnixNetVConnection.cc:968
        #7 0x083518b1 in InactivityCop::check_inactivity (this=0xf5400898, event=2, e=0x8a829a0)
        at ../../iocore/eventsystem/I_Continuation.h:146
        #8 0x08385233 in EThread::process_event (this=0xf67f8008, e=0x8a829a0, calling_code=2) at I_Continuation.h:146
        #9 0x08385d4f in EThread::execute (this=0xf67f8008) at UnixEThread.cc:226
        #10 0x08384c53 in spawn_thread_internal (a=0x8a5c320) at Thread.cc:88
        #11 0xf77264fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
        #12 0xf741de5e in clone () from /lib/tls/i686/cmov/libc.so.6

        Show
        William Bardwell added a comment - With 'performance_2nd_req_another_version.patch' I get similar crashes to with my original patches, so I think 'try3' is the best option so far. [Aug 17 15:32:31.704] Server {4134497168} NOTE: cache enabled FATAL: UnixNetVConnection.cc:968: failed assert `thread == this_ethread()` /usr/local/akamai/ts7/proxy/traffic_server - STACK TRACE: FATAL: UnixNetVConnection.cc:968: failed assert `thread == this_ethread()` /usr/local/akamai/ts7/proxy/traffic_server - STACK TRACE: /usr/local/akamai/ts7/proxy/traffic_server(ink_fatal+0x1a) [0x844621a] /usr/local/akamai/ts7/proxy/traffic_server(ink_fatal+0x1a) [0x844621a] /usr/local/akamai/ts7/proxy/traffic_server(_ink_assert+0x6a) [0x844593a] /usr/local/akamai/ts7/proxy/traffic_server(_ink_assert+0x6a) [0x844593a] /usr/local/akamai/ts7/proxy/traffic_server(_ZN18UnixNetVConnection9mainEventEiP5Event+0x5b) [0x835c97b] /usr/local/akamai/ts7/proxy/traffic_server(_ZN18UnixNetVConnection9mainEventEiP5Event+0x5b) [0x835c97b] /usr/local/akamai/ts7/proxy/traffic_server(_ZN13InactivityCop16check_inactivityEiP5Event+0xd1) [0x83518b1] /usr/local/akamai/ts7/proxy/traffic_server(_ZN13InactivityCop16check_inactivityEiP5Event+0xd1) [0x83518b1] /usr/local/akamai/ts7/proxy/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x3c3) [0x8385233] /usr/local/akamai/ts7/proxy/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x3c3) [0x8385233] /usr/local/akamai/ts7/proxy/traffic_server(_ZN7EThread7executeEv+0x4cf) [0x8385d4f] /usr/local/akamai/ts7/proxy/traffic_server(_ZN7EThread7executeEv+0x4cf) [0x8385d4f] /usr/local/akamai/ts7/proxy/traffic_server [0x8384c53] /usr/local/akamai/ts7/proxy/traffic_server [0x8384c53] /lib/tls/i686/cmov/libpthread.so.0 [0xf77264fb] /lib/tls/i686/cmov/libpthread.so.0 [0xf77264fb] /lib/tls/i686/cmov/libc.so.6(clone+0x5e) [0xf741de5e] /lib/tls/i686/cmov/libc.so.6(clone+0x5e) [0xf741de5e] Program received signal SIGABRT, Aborted. [Switching to Thread 0xf66f6b90 (LWP 9967)] 0xffffe425 in __kernel_vsyscall () (gdb) where #0 0xffffe425 in __kernel_vsyscall () #1 0xf7372085 in raise () from /lib/tls/i686/cmov/libc.so.6 #2 0xf7373a01 in abort () from /lib/tls/i686/cmov/libc.so.6 #3 0x084461da in ink_fatal_va (return_code=1, message_format=0xf66f5f54 "UnixNetVConnection.cc:968: failed assert `thread == this_ethread()`", ap=0xf66f5f48 "") at ink_error.cc:43 #4 0x0844621a in ink_fatal (return_code=1, message_format=0xf66f5f54 "UnixNetVConnection.cc:968: failed assert `thread == this_ethread()`") at ink_error.cc:73 #5 0x0844593a in _ink_assert (a=0x84abe39 "thread == this_ethread()", f=0x84abd9d "UnixNetVConnection.cc", l=968) at ink_assert.cc:44 #6 0x0835c97b in UnixNetVConnection::mainEvent (this=0x8a9edb0, event=1, e=0x8a829a0) at UnixNetVConnection.cc:968 #7 0x083518b1 in InactivityCop::check_inactivity (this=0xf5400898, event=2, e=0x8a829a0) at ../../iocore/eventsystem/I_Continuation.h:146 #8 0x08385233 in EThread::process_event (this=0xf67f8008, e=0x8a829a0, calling_code=2) at I_Continuation.h:146 #9 0x08385d4f in EThread::execute (this=0xf67f8008) at UnixEThread.cc:226 #10 0x08384c53 in spawn_thread_internal (a=0x8a5c320) at Thread.cc:88 #11 0xf77264fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0 #12 0xf741de5e in clone () from /lib/tls/i686/cmov/libc.so.6
        Hide
        William Bardwell added a comment -

        Oh, I was testing 'try3', performance_2nd_req_another_version.patch looks like my original patch, so I would expect it to have the same problems...

        Show
        William Bardwell added a comment - Oh, I was testing 'try3', performance_2nd_req_another_version.patch looks like my original patch, so I would expect it to have the same problems...
        Hide
        William Bardwell added a comment -

        I like that new patch. It works for me, and is much smaller and makes a lot of sense. My only concern is, when would trigger_event or signal_hook be NULL? (They were not ever missing in my tests, but maybe it is tied to the build setup or the config files?)

        Show
        William Bardwell added a comment - I like that new patch. It works for me, and is much smaller and makes a lot of sense. My only concern is, when would trigger_event or signal_hook be NULL? (They were not ever missing in my tests, but maybe it is tied to the build setup or the config files?)
        Hide
        mohan_zl added a comment -

        Thank you William for your work, we tested your first diff file, and found ts will crash, so we write another patch.

        Show
        mohan_zl added a comment - Thank you William for your work, we tested your first diff file, and found ts will crash, so we write another patch.
        Hide
        weijin added a comment -

        We can use the Ethread evfd and signal_hook function the achieve it. When thread A can not acquire the lock of a contination, which will be call back on thread B, we can use b->signal_hook(b) to wakeup thread b, if it is blocked on epoll_wait. Reown thread is cool, but is dangerous because of many other things.

        Show
        weijin added a comment - We can use the Ethread evfd and signal_hook function the achieve it. When thread A can not acquire the lock of a contination, which will be call back on thread B, we can use b->signal_hook(b) to wakeup thread b, if it is blocked on epoll_wait. Reown thread is cool, but is dangerous because of many other things.
        Hide
        Zhao Yongming added a comment -

        good work, William
        we have tested this patch in the two environments:
        1, cluster type=1, the hash mode
        it will help TS cut the latency from 24ms to 13.7ms
        2, very heave loopback testing, ie >50K qps
        it will crash TS

        so far, it is cool to get this issue rise up, I'd like to see it get fixed asap, so as same style issue in the other modules

        Show
        Zhao Yongming added a comment - good work, William we have tested this patch in the two environments: 1, cluster type=1, the hash mode it will help TS cut the latency from 24ms to 13.7ms 2, very heave loopback testing, ie >50K qps it will crash TS so far, it is cool to get this issue rise up, I'd like to see it get fixed asap, so as same style issue in the other modules
        Hide
        Leif Hedstrom added a comment -

        I think we should spend some more time on this, it feels like the first approach (moving origin VConn's to the client VConn's ethread) makes the most sense (and clearly has the best performance characteristics.

        Show
        Leif Hedstrom added a comment - I think we should spend some more time on this, it feels like the first approach (moving origin VConn's to the client VConn's ethread) makes the most sense (and clearly has the best performance characteristics.
        Hide
        Manjesh Nilange added a comment -

        William, the new patch worked for me. Thanks!

        Show
        Manjesh Nilange added a comment - William, the new patch worked for me. Thanks!
        Hide
        William Bardwell added a comment -

        New patch based on waking up event loop cross thread. If this is good, it might need to be ported to other event loop styles, as it is epoll only right now. Also on machines that support it an eventfd would supposedly be faster than a pipe.

        Show
        William Bardwell added a comment - New patch based on waking up event loop cross thread. If this is good, it might need to be ported to other event loop styles, as it is epoll only right now. Also on machines that support it an eventfd would supposedly be faster than a pipe.
        Hide
        William Bardwell added a comment -

        I have a new fix based on making UnixNetVConnection::reenable be able to wake up another thread even when it can't get the NetHandler lock. Performance with the fix still drops by half when the connection is on the wrong thread, but things don't crash.

        Show
        William Bardwell added a comment - I have a new fix based on making UnixNetVConnection::reenable be able to wake up another thread even when it can't get the NetHandler lock. Performance with the fix still drops by half when the connection is on the wrong thread, but things don't crash.
        Hide
        William Bardwell added a comment -

        Yeah I can get it past those failures by adding code to remove it from all lists in the old NetHandler and adding it to the new one, but I still get other failures like:
        #0 0xffffe425 in __kernel_vsyscall ()
        #1 0xf7409085 in raise () from /lib/tls/i686/cmov/libc.so.6
        #2 0xf740aa01 in abort () from /lib/tls/i686/cmov/libc.so.6
        #3 0x0844719a in ink_fatal_va (return_code=1,
        message_format=0xf668b994 "UnixNetVConnection.cc:1120: failed assert `!read.ready_link.prev && !read.ready_link.next`", ap=0xf668b988 "") at ink_error.cc:43
        #4 0x084471da in ink_fatal (return_code=1,
        message_format=0xf668b994 "UnixNetVConnection.cc:1120: failed assert `!read.ready_link.prev && !read.ready_link.next`") at ink_error.cc:73
        #5 0x084468fa in _ink_assert (a=0x84acbc8 "!read.ready_link.prev && !read.ready_link.next",
        f=0x84ace20 "UnixNetVConnection.cc", l=1120) at ink_assert.cc:44
        #6 0x08355fcd in UnixNetVConnection::free (this=0x8a9a710, t=0xf678e008) at UnixNetVConnection.cc:1120
        #7 0x08356a6a in close_UnixNetVConnection (vc=0x8a9a710, t=0xf678e008) at UnixNetVConnection.cc:128
        #8 0x0835b410 in write_to_net_io (nh=0xf678f528, vc=0x8a9a710, thread=0xf678e008) at UnixNetVConnection.cc:163
        #9 0x083511b3 in NetHandler::mainNetEvent (this=0xf678f528, event=5, e=0x8a775b0) at UnixNet.cc:405
        #10 0x08386213 in EThread::process_event (this=0xf678e008, e=0x8a775b0, calling_code=5) at I_Continuation.h:146
        #11 0x08386cd0 in EThread::execute (this=0xf678e008) at UnixEThread.cc:271
        #12 0x08385c33 in spawn_thread_internal (a=0x8a4e398) at Thread.cc:88
        #13 0xf77bd4fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
        #14 0xf74b4e5e in clone () from /lib/tls/i686/cmov/libc.so.6

        It looks like moving it between threads needs a bunch more fiddling before it will work. Maybe I need to take a different approach around making cross thread notifying work better, but that seems like a worse approach in the abstract.

        Show
        William Bardwell added a comment - Yeah I can get it past those failures by adding code to remove it from all lists in the old NetHandler and adding it to the new one, but I still get other failures like: #0 0xffffe425 in __kernel_vsyscall () #1 0xf7409085 in raise () from /lib/tls/i686/cmov/libc.so.6 #2 0xf740aa01 in abort () from /lib/tls/i686/cmov/libc.so.6 #3 0x0844719a in ink_fatal_va (return_code=1, message_format=0xf668b994 "UnixNetVConnection.cc:1120: failed assert `!read.ready_link.prev && !read.ready_link.next`", ap=0xf668b988 "") at ink_error.cc:43 #4 0x084471da in ink_fatal (return_code=1, message_format=0xf668b994 "UnixNetVConnection.cc:1120: failed assert `!read.ready_link.prev && !read.ready_link.next`") at ink_error.cc:73 #5 0x084468fa in _ink_assert (a=0x84acbc8 "!read.ready_link.prev && !read.ready_link.next", f=0x84ace20 "UnixNetVConnection.cc", l=1120) at ink_assert.cc:44 #6 0x08355fcd in UnixNetVConnection::free (this=0x8a9a710, t=0xf678e008) at UnixNetVConnection.cc:1120 #7 0x08356a6a in close_UnixNetVConnection (vc=0x8a9a710, t=0xf678e008) at UnixNetVConnection.cc:128 #8 0x0835b410 in write_to_net_io (nh=0xf678f528, vc=0x8a9a710, thread=0xf678e008) at UnixNetVConnection.cc:163 #9 0x083511b3 in NetHandler::mainNetEvent (this=0xf678f528, event=5, e=0x8a775b0) at UnixNet.cc:405 #10 0x08386213 in EThread::process_event (this=0xf678e008, e=0x8a775b0, calling_code=5) at I_Continuation.h:146 #11 0x08386cd0 in EThread::execute (this=0xf678e008) at UnixEThread.cc:271 #12 0x08385c33 in spawn_thread_internal (a=0x8a4e398) at Thread.cc:88 #13 0xf77bd4fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0 #14 0xf74b4e5e in clone () from /lib/tls/i686/cmov/libc.so.6 It looks like moving it between threads needs a bunch more fiddling before it will work. Maybe I need to take a different approach around making cross thread notifying work better, but that seems like a worse approach in the abstract.
        Hide
        Manjesh Nilange added a comment -

        I tried the patch with 3.0.0 today and I am getting this stack trace a couple of times:

        Program received signal SIGSEGV, Segmentation fault.
        [Switching to Thread 0xf629ab70 (LWP 19000)]
        0x082c0934 in close_UnixNetVConnection (vc=0xf1816210, t=0xf6aa3008) at UnixNetVConnection.cc:126
        126 vc->free(t);
        (gdb) bt
        #0 0x082c0934 in close_UnixNetVConnection (vc=0xf1816210, t=0xf6aa3008) at UnixNetVConnection.cc:126
        #1 0x082c3dc8 in read_signal_and_update (nh=0xf66a0528, vc=0xf1816210, thread=<value optimized out>)
        at UnixNetVConnection.cc:140
        #2 read_from_net (nh=0xf66a0528, vc=0xf1816210, thread=<value optimized out>) at UnixNetVConnection.cc:318
        #3 0x082bc594 in NetHandler::mainNetEvent (this=0xf66a0528, event=5, e=0x8e11f40) at UnixNet.cc:397
        #4 0x082e8ff8 in handleEvent (this=0xf669f008, e=0x8e11f40, calling_code=5) at I_Continuation.h:146
        #5 EThread::process_event (this=0xf669f008, e=0x8e11f40, calling_code=5) at UnixEThread.cc:140
        #6 0x082e992d in EThread::execute (this=0xf669f008) at UnixEThread.cc:262
        #7 0x082e7e93 in spawn_thread_internal (a=0x8e08d90) at Thread.cc:88
        #8 0x00d469e9 in start_thread () from /lib/libpthread.so.0
        #9 0x005aacde in clone () from /lib/libc.so.6

        and also this one:

        Program received signal SIGSEGV, Segmentation fault.
        [Switching to Thread 0xf665db70 (LWP 19182)]
        0x082c0934 in close_UnixNetVConnection (vc=0xf1715f20, t=0xf6a62008) at UnixNetVConnection.cc:126
        126 vc->free(t);
        (gdb) bt
        #0 0x082c0934 in close_UnixNetVConnection (vc=0xf1715f20, t=0xf6a62008) at UnixNetVConnection.cc:126
        #1 0x082bd3a7 in InactivityCop::check_inactivity (this=0x98a6730, event=2, e=0x9895080) at UnixNet.cc:60
        #2 0x082e8ff8 in handleEvent (this=0xf6a62008, e=0x9895080, calling_code=2) at I_Continuation.h:146
        #3 EThread::process_event (this=0xf6a62008, e=0x9895080, calling_code=2) at UnixEThread.cc:140
        #4 0x082e9a63 in EThread::execute (this=0xf6a62008) at UnixEThread.cc:217
        #5 0x082e7e93 in spawn_thread_internal (a=0x988bf68) at Thread.cc:88
        #6 0x00d469e9 in start_thread () from /lib/libpthread.so.0
        #7 0x005aacde in clone () from /lib/libc.so.6

        Surprisingly neither variable (vc or t) seem to point to bad addresses.

        Show
        Manjesh Nilange added a comment - I tried the patch with 3.0.0 today and I am getting this stack trace a couple of times: Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0xf629ab70 (LWP 19000)] 0x082c0934 in close_UnixNetVConnection (vc=0xf1816210, t=0xf6aa3008) at UnixNetVConnection.cc:126 126 vc->free(t); (gdb) bt #0 0x082c0934 in close_UnixNetVConnection (vc=0xf1816210, t=0xf6aa3008) at UnixNetVConnection.cc:126 #1 0x082c3dc8 in read_signal_and_update (nh=0xf66a0528, vc=0xf1816210, thread=<value optimized out>) at UnixNetVConnection.cc:140 #2 read_from_net (nh=0xf66a0528, vc=0xf1816210, thread=<value optimized out>) at UnixNetVConnection.cc:318 #3 0x082bc594 in NetHandler::mainNetEvent (this=0xf66a0528, event=5, e=0x8e11f40) at UnixNet.cc:397 #4 0x082e8ff8 in handleEvent (this=0xf669f008, e=0x8e11f40, calling_code=5) at I_Continuation.h:146 #5 EThread::process_event (this=0xf669f008, e=0x8e11f40, calling_code=5) at UnixEThread.cc:140 #6 0x082e992d in EThread::execute (this=0xf669f008) at UnixEThread.cc:262 #7 0x082e7e93 in spawn_thread_internal (a=0x8e08d90) at Thread.cc:88 #8 0x00d469e9 in start_thread () from /lib/libpthread.so.0 #9 0x005aacde in clone () from /lib/libc.so.6 and also this one: Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0xf665db70 (LWP 19182)] 0x082c0934 in close_UnixNetVConnection (vc=0xf1715f20, t=0xf6a62008) at UnixNetVConnection.cc:126 126 vc->free(t); (gdb) bt #0 0x082c0934 in close_UnixNetVConnection (vc=0xf1715f20, t=0xf6a62008) at UnixNetVConnection.cc:126 #1 0x082bd3a7 in InactivityCop::check_inactivity (this=0x98a6730, event=2, e=0x9895080) at UnixNet.cc:60 #2 0x082e8ff8 in handleEvent (this=0xf6a62008, e=0x9895080, calling_code=2) at I_Continuation.h:146 #3 EThread::process_event (this=0xf6a62008, e=0x9895080, calling_code=2) at UnixEThread.cc:140 #4 0x082e9a63 in EThread::execute (this=0xf6a62008) at UnixEThread.cc:217 #5 0x082e7e93 in spawn_thread_internal (a=0x988bf68) at Thread.cc:88 #6 0x00d469e9 in start_thread () from /lib/libpthread.so.0 #7 0x005aacde in clone () from /lib/libc.so.6 Surprisingly neither variable (vc or t) seem to point to bad addresses.
        Hide
        William Bardwell added a comment -

        Hum, maybe missing locks or other things need to be cleaned up when switching threads. I will experiment.

        Show
        William Bardwell added a comment - Hum, maybe missing locks or other things need to be cleaned up when switching threads. I will experiment.
        Hide
        Manjesh Nilange added a comment -

        I forgot to add platform specifics (I built a 32-bit binary on a 64-bit OS):

        [mnilange@mnilange-md-rhel .libs]$ file traffic_server
        traffic_server: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, not stripped
        [mnilange@mnilange-md-rhel .libs]$ uname -a
        Linux mnilange-md-rhel 2.6.32-71.el6.x86_64 #1 SMP Wed Sep 1 01:33:01 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

        Show
        Manjesh Nilange added a comment - I forgot to add platform specifics (I built a 32-bit binary on a 64-bit OS): [mnilange@mnilange-md-rhel .libs] $ file traffic_server traffic_server: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, not stripped [mnilange@mnilange-md-rhel .libs] $ uname -a Linux mnilange-md-rhel 2.6.32-71.el6.x86_64 #1 SMP Wed Sep 1 01:33:01 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
        Hide
        Manjesh Nilange added a comment -

        Hi William,

        Thanks for providing a fix. I'm running into issues however. I applied it on 2.1.7 and 2.1.8 codebases. It just hangs after a few thousand requests. In some cases after the hang, when I disconnect the client and rerun it, I get a crash. However I was not able to get a consistent stack trace.

        Program received signal SIGSEGV, Segmentation fault.
        [Switching to Thread 0xf5de5b70 (LWP 11148)]
        0x082bf9c4 in cancel_action (vc=0xf1a16210, t=0xf63ec008) at ../../iocore/eventsystem/I_Action.h:168
        168 cancelled = true;
        (gdb) bt
        #0 0x082bf9c4 in cancel_action (vc=0xf1a16210, t=0xf63ec008) at ../../iocore/eventsystem/I_Action.h:168
        #1 close_UnixNetVConnection (vc=0xf1a16210, t=0xf63ec008) at UnixNetVConnection.cc:111
        #2 0x082bc4d7 in NetHandler::~NetHandler (this=0x9beb148, __in_chrg=<value optimized out>) at P_UnixNet.h:243
        #3 0x082e80d8 in Mutex_unlock (this=0xf63ec008, e=0x9bd9680, calling_code=2) at I_Lock.h:514
        #4 release (this=0xf63ec008, e=0x9bd9680, calling_code=2) at I_Lock.h:634
        #5 EThread::process_event (this=0xf63ec008, e=0x9bd9680, calling_code=2) at UnixEThread.cc:143
        #6 0x082e8b43 in free (this=0xf63ec008) at ../../lib/ts/Allocator.h:297
        #7 free_event (this=0xf63ec008) at P_UnixEThread.h:171
        #8 EThread::execute (this=0xf63ec008) at UnixEThread.cc:214
        #9 0x082e6f7b in Thread::Thread (this=0x9bcf100) at Thread.cc:45
        #10 0x00d469e9 in start_thread () from /lib/libpthread.so.0
        #11 0x005aacde in clone () from /lib/libc.so.6

        There are no plugins running and cache is disabled.

        Show
        Manjesh Nilange added a comment - Hi William, Thanks for providing a fix. I'm running into issues however. I applied it on 2.1.7 and 2.1.8 codebases. It just hangs after a few thousand requests. In some cases after the hang, when I disconnect the client and rerun it, I get a crash. However I was not able to get a consistent stack trace. Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0xf5de5b70 (LWP 11148)] 0x082bf9c4 in cancel_action (vc=0xf1a16210, t=0xf63ec008) at ../../iocore/eventsystem/I_Action.h:168 168 cancelled = true; (gdb) bt #0 0x082bf9c4 in cancel_action (vc=0xf1a16210, t=0xf63ec008) at ../../iocore/eventsystem/I_Action.h:168 #1 close_UnixNetVConnection (vc=0xf1a16210, t=0xf63ec008) at UnixNetVConnection.cc:111 #2 0x082bc4d7 in NetHandler::~NetHandler (this=0x9beb148, __in_chrg=<value optimized out>) at P_UnixNet.h:243 #3 0x082e80d8 in Mutex_unlock (this=0xf63ec008, e=0x9bd9680, calling_code=2) at I_Lock.h:514 #4 release (this=0xf63ec008, e=0x9bd9680, calling_code=2) at I_Lock.h:634 #5 EThread::process_event (this=0xf63ec008, e=0x9bd9680, calling_code=2) at UnixEThread.cc:143 #6 0x082e8b43 in free (this=0xf63ec008) at ../../lib/ts/Allocator.h:297 #7 free_event (this=0xf63ec008) at P_UnixEThread.h:171 #8 EThread::execute (this=0xf63ec008) at UnixEThread.cc:214 #9 0x082e6f7b in Thread::Thread (this=0x9bcf100) at Thread.cc:45 #10 0x00d469e9 in start_thread () from /lib/libpthread.so.0 #11 0x005aacde in clone () from /lib/libc.so.6 There are no plugins running and cache is disabled.
        Hide
        Leif Hedstrom added a comment -

        This is definitely a strong candidate for 3.0.2, once the patch has been thoroughly reviewed. It does look good to me, but I'd also like for at least one or two more core devs to look at it.

        Show
        Leif Hedstrom added a comment - This is definitely a strong candidate for 3.0.2, once the patch has been thoroughly reviewed. It does look good to me, but I'd also like for at least one or two more core devs to look at it.
        Hide
        Naveen added a comment -

        I have the same problem, the patch fixes it. Thanks William!

        Show
        Naveen added a comment - I have the same problem, the patch fixes it. Thanks William!
        Hide
        William Bardwell added a comment -

        This fixes the problem for me. The name for my 'fix-it' function may not be very helpful. This may only be needed for epoll, I am not sure, and it may be bad for non-epoll, I am not sure.

        Show
        William Bardwell added a comment - This fixes the problem for me. The name for my 'fix-it' function may not be very helpful. This may only be needed for epoll, I am not sure, and it may be bad for non-epoll, I am not sure.
        Hide
        William Bardwell added a comment -

        Oh, I wasn't reseting the NetHandler, when I do that, it seems to fix it fully.

        Show
        William Bardwell added a comment - Oh, I wasn't reseting the NetHandler, when I do that, it seems to fix it fully.
        Hide
        William Bardwell added a comment -

        So when I add a fix to stop and start the EventIO on the UnixNetVConnection that partially solves the problem, but it still breaks sometimes when the tunnel producer handler starts getting called from a different thread, and then the calls to UnixNetVConnection::reenable() don't actually wake up the producer when called by the consumer. (The problem is that now it is a different thread, and it can never get the nh->mutex lock, and the 'add stuff to read_enabled_list' code that runs when it can't get a lock never actually wakes up the other thread, and thus, things only happen when the other thread times out.

        Show
        William Bardwell added a comment - So when I add a fix to stop and start the EventIO on the UnixNetVConnection that partially solves the problem, but it still breaks sometimes when the tunnel producer handler starts getting called from a different thread, and then the calls to UnixNetVConnection::reenable() don't actually wake up the producer when called by the consumer. (The problem is that now it is a different thread, and it can never get the nh->mutex lock, and the 'add stuff to read_enabled_list' code that runs when it can't get a lock never actually wakes up the other thread, and thus, things only happen when the other thread times out.

          People

          • Assignee:
            Igor Galić
            Reporter:
            William Bardwell
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development