Traffic Server
TS-1240

Debug assert triggered in LogBuffer.cc:209

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.4
    • Fix Version/s: 3.1.4
    • Component/s: Logging
    • Labels:
      None

      Description

      From John:

      [May  1 09:08:44.746] Server {0x7ffff7fce800} NOTE: traffic server running
      FATAL: LogBuffer.cc:209: failed assert `m_unaligned_buffer`
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server - STACK TRACE: 
      /home/jplevyak/projects/ts/ts-2/lib/ts/.libs/libtsutil.so.3(ink_fatal+0xa3)[0x7ffff7bae4a5]
      /home/jplevyak/projects/ts/ts-2/lib/ts/.libs/libtsutil.so.3(_ink_assert+0x3c)[0x7ffff7bad47c]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN9LogBuffer14checkout_writeEPmm+0x35)[0x5d3a53]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN9LogObject15_checkout_writeEPmm+0x41)[0x5eef75]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN9LogObject3logEP9LogAccessPc+0x4cb)[0x5ef5b9]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN16LogObjectManager3logEP9LogAccess+0x4a)[0x5daab4]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN3Log6accessEP9LogAccess+0x235)[0x5d97f9]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN6HttpSM12update_statsEv+0x204)[0x579872]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN6HttpSM9kill_thisEv+0x31d)[0x579525]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN6HttpSM12main_handlerEiPv+0x337)[0x56cec1]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN12Continuation11handleEventEiPv+0x72)[0x4e2450]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN10HttpTunnel12main_handlerEiPv+0x14c)[0x5b24aa]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN12Continuation11handleEventEiPv+0x72)[0x4e2450]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server[0x6bb9d1]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server[0x6bbafa]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_Z15write_to_net_ioP10NetHandlerP18UnixNetVConnectionP7EThread+0x6fa)[0x6bcaaf]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_Z12write_to_netP10NetHandlerP18UnixNetVConnectionP14PollDescriptorP7EThread+0x7d)[0x6bc3b3]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x6e6)[0x6b8828]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN12Continuation11handleEventEiPv+0x72)[0x4e2450]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN7EThread13process_eventEP5Eventi+0x111)[0x6dde7f]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server(_ZN7EThread7executeEv+0x431)[0x6de42b]
      /a/home/jplevyak/projects/ts/ts-2/proxy/.libs/lt-traffic_server[0x6dd0bc]
      /lib64/libpthread.so.0(+0x7d90)[0x7ffff7676d90]
      /lib64/libc.so.6(clone+0x6d)[0x7ffff54f9f5d]
      
      1. logstuff.diff
        4 kB
        Leif Hedstrom

        Activity

        Leif Hedstrom added a comment -

        commit 3505de43f5615c0f5a7ae6b62d0b0673257261c5

        Leif Hedstrom added a comment -

        This forces some of the old behavior of deferring the deletes until later. I'm not committing this yet, until John has looked at it; he said he had some ideas for solving this with some pointer / atomic CAS magic.

        Bryan Call added a comment - - edited

        We are currently seeing asserts in production on 3.1.3 (before Leif's change) when the disks are busy. I have not tested ATS with Leif's fix in production.

        (gdb) bt
        #0 0x00000036c5832885 in raise () from /lib64/libc.so.6
        #1 0x00000036c5834065 in abort () from /lib64/libc.so.6
        #2 0x00002afdc6060e89 in ink_die_die_die (retval=3867) at ink_error.cc:43
        #3 0x00002afdc60610c5 in ink_fatal_va(int, const char *, typedef __va_list_tag __va_list_tag *) (return_code=1, message_format=<value optimized out>, ap=0x2afdc7e71710) at ink_error.cc:65
        #4 0x00002afdc6061218 in ink_fatal (return_code=<value optimized out>, message_format=<value optimized out>) at ink_error.cc:73
        #5 0x00002afdc605fa7f in _ink_assert (expression=<value optimized out>, file=<value optimized out>, line=<value optimized out>) at ink_assert.cc:38
        #6 0x00000000005a4fec in add_to_flush_queue (this=0x2bcab50, write_offset=0x2afdc7e71898, bytes_needed=336) at LogObject.h:97
        #7 LogObject::_checkout_write (this=0x2bcab50, write_offset=0x2afdc7e71898, bytes_needed=336) at LogObject.cc:455
        #8 0x00000000005a55b0 in LogObject::log (this=0x2bcab50, lad=0x2afdc7e71a80, text_entry=0x0) at LogObject.cc:580
        #9 0x000000000058c7d6 in log (lad=0x2afdc7e71a80) at LogObject.h:476
        #10 Log::access (lad=0x2afdc7e71a80) at Log.cc:1093
        #11 0x0000000000534f8b in HttpSM::update_stats (this=0x2afe433db980) at HttpSM.cc:6051
        #12 0x0000000000535c98 in HttpSM::kill_this (this=0x2afe433db980) at HttpSM.cc:6012
        #13 0x0000000000535f68 in HttpSM::main_handler (this=0x2afe433db980, event=2301, data=0x2afe433dd4c8) at HttpSM.cc:2453
        #14 0x0000000000567d2e in handleEvent (this=0x2afe433dd4c8, event=<value optimized out>, data=<value optimized out>) at ../../iocore/eventsystem/I_Continuation.h:146
        #15 HttpTunnel::main_handler (this=0x2afe433dd4c8, event=<value optimized out>, data=<value optimized out>) at HttpTunnel.cc:1468
        #16 0x0000000000678ba1 in handleEvent (event=<value optimized out>, nh=0x2afdc726a1c8, vc=0x2afdfc8db830) at ../../iocore/eventsystem/I_Continuation.h:146
        #17 write_signal_and_update (event=<value optimized out>, nh=0x2afdc726a1c8, vc=0x2afdfc8db830) at UnixNetVConnection.cc:153
        #18 write_signal_done (event=<value optimized out>, nh=0x2afdc726a1c8, vc=0x2afdfc8db830) at UnixNetVConnection.cc:180
        #19 0x000000000067a9e7 in write_to_net_io (nh=0x2afdc726a1c8, vc=0x2afdfc8db830, thread=<value optimized out>) at UnixNetVConnection.cc:479
        #20 0x000000000067339e in NetHandler::mainNetEvent (this=0x2afdc726a1c8, event=<value optimized out>, e=<value optimized out>) at UnixNet.cc:407
        #21 0x000000000069c7c4 in handleEvent (this=0x2afdc7267010, e=0x28b1f40, calling_code=5) at I_Continuation.h:146
        #22 EThread::process_event (this=0x2afdc7267010, e=0x28b1f40, calling_code=5) at UnixEThread.cc:142
        #23 0x000000000069d153 in EThread::execute (this=0x2afdc7267010) at UnixEThread.cc:264
        #24 0x000000000069b702 in spawn_thread_internal (a=0x28abc60) at Thread.cc:88
        #25 0x00000036c5c077f1 in start_thread () from /lib64/libpthread.so.0
        #26 0x00000036c58e592d in clone () from /lib64/libc.so.6

        Leif Hedstrom added a comment -

        Also, someone said they had seen the assert before my changes, but I have not been able to confirm this. John, can you back out my commit and verify it fixes it (I'm sure it dies)? There is a change in that commit, though, that fixes a serious segfault and needs to be preserved.

        Leif Hedstrom added a comment -

        It seemed like a clunky solution to the problem I experienced (segfaults). My fix cleared the segfault, but apparently the delayed delete was there for several reasons.

        We should restore the delayed delete if it fixes this other race. I still don't know why I couldn't reproduce the problem: I ran billions of requests at very high concurrency, in a debug build, and it would never trigger the assert using squid.blog. :/

        John Plevyak added a comment -

        What is the downside to restoring the delete delay buffer (was the memory usage too high)?

        Leif Hedstrom added a comment - - edited

        Bummer. The changes I made fixed the segfault, but apparently it's not
        enough. Should we just restore the delayed delete, or do you have a better solution in mind?

        John Plevyak added a comment -

        I think removing the delay buffer before deleting the LogBuffer has led to this race condition. The problem is that checkout_write can execute at the same time as the flush. The only thing that prevents a problem is that checkout_write will fail when it finds that the buffer is full. When a buffer becomes full there can still be other threads trying to do checkout_write, so the buffer must not be deleted immediately after the flush... instead it is kept around for a while, until all the other threads that might be doing checkout_write figure out that the buffer is full, back out, and reload the current buffer pointer (which is going to be different). Once all those threads have dropped their references to the LogBuffer, it is finally OK to delete it. This should probably be documented in grisly detail in the LogObject file. It is a bit of a pain, but it does make it possible to implement the critical path of logging completely without locks.

        Do you want me to generate a patch?

        John Plevyak added a comment -

        I can repro on my machine any time you like.

        Leif Hedstrom added a comment -

        I cannot reproduce this, so moving it out to v3.1.5 for now. If anyone can provide details on how to reproduce it (it's obviously only going to trigger in a debug build), please let me know.


          People

          • Assignee:
            Leif Hedstrom
            Reporter:
            Leif Hedstrom
          • Votes: 0
          • Watchers: 4
