Traffic Server
  1. Traffic Server
  2. TS-1219

Crash report: ink_freelist_new causing core dumps

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.4, 3.0.2
    • Fix Version/s: 5.2.0
    • Component/s: Core
    • Labels:

      Description

      Our TS based production boxes are seeing a couple of crashes per day with stack traces ending in

      #3 0x00000000004c85ea in signal_handler (sig=11) at signals.cc:225
      #4 <signal handler called>
      #5 0x0000003312414160 in ink_freelist_new (f=0x3312629ce0) at ink_queue.cc:326
      #6 0x000000331240d47a in alloc_void (this=0x7fff80173cd8, size=511, alignment=1) at Allocator.h:208
      #7 blk_alloc (this=0x7fff80173cd8, size=511, alignment=1) at Arena.cc:45
      #8 Arena::alloc (this=0x7fff80173cd8, size=511, alignment=1) at Arena.cc:118
      #9 0x0000000000569b93 in str_alloc (arena=<value optimized out>,
      url=0x31be6a8 "*************"...,
      len_in=<value optimized out>, len_out=0x7fff80173d20) at ../../lib/ts/Arena.h:123
      #10 LogUtils::escapify_url (arena=<value optimized out>,
      url=0x31be6a8 "*************"...,
      len_in=<value optimized out>, len_out=0x7fff80173d20) at LogUtils.cc:392
      #11 0x00000000005482e9 in LogAccessHttp::init (this=0x7fff80173cc0) at LogAccessHttp.cc:98
      #12 0x000000000054b436 in Log::access (lad=0x7fff80173cc0) at Log.cc:1055
      #13 0x00000000004f3085 in HttpSM::update_stats (this=0x2ae5480651e0) at HttpSM.cc:6044
      #14 0x00000000004f8918 in HttpSM::kill_this (this=0x2ae5480651e0) at HttpSM.cc:6005
      #15 0x00000000004f8c08 in HttpSM::main_handler (this=0x2ae5480651e0, event=2301, data=0x2ae548066ec8)
      at HttpSM.cc:2452

      The URL was valid, I just "anonymized" it. This is our environment

      $ uname -a
      Linux xxx.prod 2.6.32-131.4.1.el6.x86_64 #1 SMP Fri Jun 10 10:54:26 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

      $ file /usr/bin/traffic_server
      /usr/bin/traffic_server: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped

      There doesn't seem to be more useful information in the frame:

      (gdb) f 5
      #5 0x0000003312414160 in ink_freelist_new (f=0x3312629ce0) at ink_queue.cc:326
      326 FREELIST_VERSION(item) + 1);
      (gdb) list
      321 fastalloc_mem_in_use += f->chunk_size * f->type_size;
      322 #endif
      323
      324 } else {
      325 SET_FREELIST_POINTER_VERSION(next, *ADDRESS_OF_NEXT(TO_PTR(FREELIST_POINTER(item)), f->offset),
      326 FREELIST_VERSION(item) + 1);
      327 #if !defined(INK_USE_MUTEX_FOR_FREELISTS)
      328 result = ink_atomic_cas64((int64_t *) & f->head.data, item.data, next.data);
      329 #else
      330 f->head.data = next.data;
      (gdb) p next
      $2 = <value optimized out>
      (gdb) p f
      $3 = (InkFreeList *) 0x3312629ce0
      (gdb) p *f
      $4 = {head =

      {data = -8941773651046140890}

      , name = 0x331241e92f "ArenaBlock", type_size = 1024, chunk_size = 128,
      count = 519, allocated = 1536, offset = 0, alignment = 8, allocated_base = 0, count_base = 0}
      (gdb) p item
      $5 =

      {data = -8953314125091277786}

        Activity

        Hide
        Leif Hedstrom added a comment -

        Any chance you can test either 3.0.4 or better, trunk ? I don't know that this has been fixed, but I know there's been a number of fixes on the freelist code since 3.0.

        Show
        Leif Hedstrom added a comment - Any chance you can test either 3.0.4 or better, trunk ? I don't know that this has been fixed, but I know there's been a number of fixes on the freelist code since 3.0.
        Hide
        Manjesh Nilange added a comment -

        We are planning to move to a newer version. I'll try to fast-track it.

        Show
        Manjesh Nilange added a comment - We are planning to move to a newer version. I'll try to fast-track it.
        Hide
        Brian Geffon added a comment - - edited

        I have a possible theory for the crash, but I think jplevyak might be the only one who could say for sure. In 3.0.x inkfreelist_new follows the standard pattern which does a load followed by a CAS:

         
         int result = 0;
         do {
            INK_QUEUE_LD64(item, f->head);
            if (TO_PTR(FREELIST_POINTER(item)) == NULL) {
               ...
            } else {\n
               // crash happens on the following line
               SET_FREELIST_POINTER_VERSION(next, *ADDRESS_OF_NEXT(TO_PTR(FREELIST_POINTER(item)), f->offset),
                                           FREELIST_VERSION(item) + 1);
        
              result = ink_atomic_cas64((int64_t *) & f->head.data, item.data, next.data);
            }
         }
         while(result == 0)
        

        I believe this might be a race in the event that the head is freed between the time of the load to item and the SET_FREELIST_POINTER_VERSION, then we would potentially be dereferencing item, which is no longer valid if it's freed as part of a dequeue operation. Does anyone who understands the freelists think this might be the situation?

        Also, to support this idea, I see the following in the crash:

        (gdb) p f->head
        $17 =

        {data = -6044627300393456023}

        (gdb) p item
        $18 =

        {data = -6053071549694775703}

        (gdb)

        Which means that item != f->head meaning that head has changed since we first read it, and if it was changed as a result of a dequeue then the memory might have been freed explaining the crash.

        Show
        Brian Geffon added a comment - - edited I have a possible theory for the crash, but I think jplevyak might be the only one who could say for sure. In 3.0.x inkfreelist_new follows the standard pattern which does a load followed by a CAS: int result = 0; do { INK_QUEUE_LD64(item, f->head); if (TO_PTR(FREELIST_POINTER(item)) == NULL) { ... } else {\n // crash happens on the following line SET_FREELIST_POINTER_VERSION(next, *ADDRESS_OF_NEXT(TO_PTR(FREELIST_POINTER(item)), f->offset), FREELIST_VERSION(item) + 1); result = ink_atomic_cas64((int64_t *) & f->head.data, item.data, next.data); } } while(result == 0) I believe this might be a race in the event that the head is freed between the time of the load to item and the SET_FREELIST_POINTER_VERSION, then we would potentially be dereferencing item, which is no longer valid if it's freed as part of a dequeue operation. Does anyone who understands the freelists think this might be the situation? Also, to support this idea, I see the following in the crash: (gdb) p f->head $17 = {data = -6044627300393456023} (gdb) p item $18 = {data = -6053071549694775703} (gdb) Which means that item != f->head meaning that head has changed since we first read it, and if it was changed as a result of a dequeue then the memory might have been freed explaining the crash.
        Hide
        Manjesh Nilange added a comment -

        Issue is seen on 3.0.4 as well.

        Show
        Manjesh Nilange added a comment - Issue is seen on 3.0.4 as well.
        Hide
        Leif Hedstrom added a comment -

        Manjesh and Brian G.: Is this still an issue with the changes done to the current master for freelist?

        Show
        Leif Hedstrom added a comment - Manjesh and Brian G.: Is this still an issue with the changes done to the current master for freelist?
        Hide
        Leif Hedstrom added a comment -

        Moving back to v3.3.5 for now.

        Show
        Leif Hedstrom added a comment - Moving back to v3.3.5 for now.
        Hide
        Leif Hedstrom added a comment -

        Any updates on this? Still an issue?

        Show
        Leif Hedstrom added a comment - Any updates on this? Still an issue?
        Show
        Leif Hedstrom added a comment - Moving to 4.2.0 as per https://cwiki.apache.org/confluence/display/TS/New+Release+Processes
        Hide
        Phil Sorber added a comment -

        Brian Geffon,

        Is this still a problem? Was this fixed by the 128bit CAS patches?

        Show
        Phil Sorber added a comment - Brian Geffon , Is this still a problem? Was this fixed by the 128bit CAS patches?

          People

          • Assignee:
            Unassigned
            Reporter:
            Manjesh Nilange
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:

              Development