Traffic Server
TS-937

EThread::execute still processing cancelled event

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.1, 2.1.9
    • Fix Version/s: 3.1.3
    • Component/s: Core
    • Labels:
      None
    • Environment:

      RHEL6

      Description

      The included GDB log shows that ATS is trying to process an event that has already been cancelled. Examining the code of UnixEThread.cc line 232 shows that EThread::process_event gets called without a check for the event being cancelled.

      Brian

      Program received signal SIGSEGV, Segmentation fault.
      [Switching to Thread 0x7ffff64fa700 (LWP 28518)]
      0x00000000006fc663 in EThread::process_event (this=0x7ffff68ff010, e=0x1db45c0, calling_code=1) at UnixEThread.cc:130
      130  MUTEX_TRY_LOCK_FOR(lock, e->mutex.m_ptr, this, e->continuation);
      Missing separate debuginfos, use: debuginfo-install expat-2.0.1-9.1.el6.x86_64 glibc-2.12-1.25.el6_1.3.x86_64 keyutils-libs-1.4-1.el6.x86_64 krb5-libs-1.9-9.el6_1.1.x86_64 libcom_err-1.41.12-7.el6.x86_64 libgcc-4.4.5-6.el6.x86_64 libselinux-2.0.94-5.el6.x86_64 libstdc++-4.4.5-6.el6.x86_64 openssl-1.0.0-10.el6_1.4.x86_64 pcre-7.8-3.1.el6.x86_64 tcl-8.5.7-6.el6.x86_64 zlib-1.2.3-25.el6.x86_64
      (gdb) bt
      #0  0x00000000006fc663 in EThread::process_event (this=0x7ffff68ff010, e=0x1db45c0, calling_code=1) at UnixEThread.cc:130
      #1  0x00000000006fcbaf in EThread::execute (this=0x7ffff68ff010) at UnixEThread.cc:232
      #2  0x00000000006fb844 in spawn_thread_internal (a=0xfb7e80) at Thread.cc:88
      #3  0x00000036204077e1 in start_thread () from /lib64/libpthread.so.0
      #4  0x000000361f8e577d in clone () from /lib64/libc.so.6
      (gdb) bt full
      #0  0x00000000006fc663 in EThread::process_event (this=0x7ffff68ff010, e=0x1db45c0, calling_code=1) at UnixEThread.cc:130
              lock = {m = {m_ptr = 0x7ffff64f9d20}, lock_acquired = 202}
      #1  0x00000000006fcbaf in EThread::execute (this=0x7ffff68ff010) at UnixEThread.cc:232
              done_one = false
              e = 0x1db45c0
              NegativeQueue = {<DLL<Event, Event::Link_link>> = {head = 0xfc75f0}, tail = 0xfc75f0}
              next_time = 1314647904419648000
      #2  0x00000000006fb844 in spawn_thread_internal (a=0xfb7e80) at Thread.cc:88
              p = 0xfb7e80
      #3  0x00000036204077e1 in start_thread () from /lib64/libpthread.so.0
      No symbol table info available.
      #4  0x000000361f8e577d in clone () from /lib64/libc.so.6
      No symbol table info available.
      (gdb) f 0
      #0  0x00000000006fc663 in EThread::process_event (this=0x7ffff68ff010, e=0x1db45c0, calling_code=1) at UnixEThread.cc:130
      130  MUTEX_TRY_LOCK_FOR(lock, e->mutex.m_ptr, this, e->continuation);
      (gdb) p *e
      $2 = {<Action> = {_vptr.Action = 0x775170, continuation = 0x1f2fc08, mutex = {m_ptr = 0x7fffd40fba40}, cancelled = 1}, ethread = 0x7ffff68ff010, in_the_prot_queue = 0, in_the_priority_queue = 0, 
        immediate = 1, globally_allocated = 1, in_heap = 0, callback_event = 1, timeout_at = 0, period = 0, cookie = 0x0, link = {<SLink<Event>> = {next = 0x0}, prev = 0x0}}
      
      1. ts937.full.patch
        11 kB
        Brian Geffon
      2. UnixEThread.patch
        1 kB
        Brian Geffon

        Issue Links

          Activity

          Brian Geffon created issue -
          Brian Geffon made changes -
          Field Original Value New Value
          Status Open [ 1 ] Patch Available [ 10002 ]
          Brian Geffon made changes -
          Attachment UnixEThread.patch [ 12492142 ]
          Brian Geffon added a comment -

          Patch included.

          Brian Geffon made changes -
          Summary EThread::execute still processing cancelled thread EThread::execute still processing cancelled event
          Leif Hedstrom made changes -
          Fix Version/s 3.1.1 [ 12316496 ]
          weijin added a comment -

          The main reason that EThread processes the cancelled event is that the cancel action is performed in another thread, so I do not think this patch can solve the problem completely.

          Brian Geffon added a comment -

          @weijin, is EventQueueExternal the queue for events coming from other threads? I've been using the patch for a few days now and it has completely resolved the segfaults. Do you have ideas about what else I would need to check for? I can create a new patch.

          Brian

          weijin added a comment -

          @Brian Geffon, yes, EventQueueExternal is used for events coming from other threads. In my opinion, the cancel action should be performed only when the event's continuation is called back. The cancel action is not thread safe; see I_Action.h.

          weijin added a comment -

          You need to find where this event is cancelled. If the cancel is not performed from within the continuation's callback, that's the bug.

          Brian Geffon added a comment -

          Thanks for the response, weijin. Perhaps I'm missing something, but there is currently a check for the event being cancelled in process_event for all events that do not have a timeout. If the event has some timeout on it, there won't be a check. Why would it be safe to put the cancel check in process_event only in that situation?

          http://svn.apache.org/viewvc/trafficserver/traffic/trunk/iocore/eventsystem/UnixEThread.cc?view=markup#l234

          Leif Hedstrom added a comment -

          Moving all unassigned bugs out to 3.1.2

          Leif Hedstrom made changes -
          Fix Version/s 3.1.2 [ 12317605 ]
          Fix Version/s 3.1.1 [ 12316496 ]
          John Plevyak added a comment -

          This patch shouldn't have any effect in correct code. Checking ->cancelled outside of the mutex is at best an optimization. In any case, before the continuation is used, the cancelled field has to be checked under the lock. Most likely the core problem is that someone is cancelling the event without holding the lock (a situation which subsumes the case of being in a callback, as weijin suggested). If this happens, there is a potential race where process_event is in progress, someone cancels the event (without the lock) and deletes the Continuation, resulting in a crash.

          However, this patch does reduce the chance of a race condition in the debug build. The core problem is that the macro MUTEX_LOCK_FOR is accessing HANDLER_NAME,
          which dereferences the continuation. This is bad. The macro needs to be fixed to delay dereferencing the Continuation until the lock is taken.

          John Plevyak added a comment -

          On a side note, this macro should not do the deref on non-debug builds.... but for some reason it looks like the case is TS_HAS_PURIFY.... what the heck is that all about?

          Leif Hedstrom added a comment -

          I saw a bunch of TS_HAS_PURIFY while doing all the memory allocation changes. Should we just nuke that sucker, now that we have jemalloc / tcmalloc support? I have no idea if the "purify" stuff actually works.

          Although in this case, no idea why we'd make a special case here for purify...

          Brian Geffon made changes -
          Comment [ FYI, apparently this patch was included in 3.0.2 but there was no reference in the change log that could get confusing down the road. ]
          John Plevyak added a comment -

          Let's nuke TS_HAS_PURIFY. If we want to make a valgrind target (e.g. something which would enable normal malloc), then that is an idea, but this macro is confusing.

          Leif Hedstrom added a comment -

          Sold.

          I did add a --disable-freelist configure option a while ago, which turns the freelist into malloc/free calls (I hope at least, unless I fucked it up). The thought was that we'd use this option for memory debugging, either with valgrind or e.g. tcmalloc.

          Leif Hedstrom made changes -
          Assignee Brian Geffon [ briang ]
          Alan M. Carroll made changes -
          Fix Version/s 3.1.3 [ 12317969 ]
          Fix Version/s 3.1.2 [ 12317605 ]
          Brian Geffon made changes -
          Comment [ So, I've began to look into this bug again. To try to determine where the action is being canceled I modified Action to add a const char * volatile cancelled_by; and then simply replaced any instance of ->cancel() to pass the name of the method doing the cancelling:
          ]
          Brian Geffon added a comment -

          So I've tracked down where the event is being cancelled:

          PluginVC::process_close() line 699:

          if (core_lock_retry_event) {
            core_lock_retry_event->cancel();
            core_lock_retry_event = NULL;
          }
          Brian Geffon made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 3.1.2 [ 12317605 ]
          Fix Version/s 3.1.3 [ 12317969 ]
          Resolution Fixed [ 1 ]
          Brian Geffon made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Brian Geffon added a comment -

          So it appears that the event being cancelled is an event callback related to a MUTEX being held. See PluginVC.cc:489

          Brian Geffon made changes -
          Comment [ nevermind, I lied. Reopening :/ ]
          Brian Geffon made changes -
          Comment [ So it turns out that this bug is fixed in TS-1074, I've verified this and am closing this bug. ]
          Brian Geffon made changes -
          Comment [ Ignore notifications about a comment talking about inclusion in 3.0.2, I was mistaken.
          ]
          Leif Hedstrom added a comment -

          Brian, should we keep this for 3.1.2, or move out to 3.1.3 ?

          Brian Geffon added a comment -

          I'll move it out to 3.1.3; I'm preparing my fix. Basically it's going to involve 1) blowing away the HANDLER_NAME macro for a quick fix, 2) identifying why PluginVC is cancelling an action without holding the lock, and 3) blowing away the TS_HAS_PURIFY crap, which is not a big deal but affects hundreds of files.

          Brian Geffon made changes -
          Fix Version/s 3.1.3 [ 12317969 ]
          Fix Version/s 3.1.2 [ 12317605 ]
          Brian Geffon added a comment -

          Attaching a new patch (ts937.full.patch) which removes the TS_HAS_PURIFY and HANDLER_NAME macros entirely. I have tested against trunk and it passes regression in both debug and non-debug builds.

          Brian Geffon made changes -
          Attachment ts937.full.patch [ 12515538 ]
          Igor Galić made changes -
          Description [edited: the GDB log in the description was wrapped in a {noformat} block; the text is otherwise unchanged from the issue description above]
          Igor Galić added a comment -

          ad ts937.full.patch: Now that I see them all in one place: How did this ever make sense?
          Looks good, +1

          What I don't get is what the crasher has to do with the PURIFY cleanup.

          Brian Geffon added a comment - edited

          @Igor: short answer: it doesn't. The crasher is resolved by removing the HANDLER_NAME macro. I suppose I'll move the TS_HAS_PURIFY removal into a new Jira ticket, so that we can potentially backport the removal of HANDLER_NAME, since it causes segfaults in debug builds. I'll do that and commit all this stuff tonight.

          Brian Geffon added a comment -

          I've created a new ticket, TS-1117 (Remove TS_HAS_PURIFY), and I'll close this as a duplicate of that, since removing TS_HAS_PURIFY indirectly fixes this bug as it also removes all usage of HANDLER_NAME.

          Brian Geffon made changes -
          Link This issue is part of TS-1117 [ TS-1117 ]
          Brian Geffon added a comment -

          Fixed in commit cd6eb8f62272aad10bd3ac49bd4c6a20d36b566f

          Brian Geffon made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Leif Hedstrom made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Brian Geffon
              Reporter:
              Brian Geffon
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue
