Traffic Server
  1. Traffic Server
  2. TS-1233

FATAL: HttpSM.cc:890: failed assert `0`

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: 3.1.3
    • Fix Version/s: None
    • Component/s: HTTP
    • Labels:
    • Environment:

      Linux hostname.domain.com 2.6.32-220.el6.x86_64 #1 SMP Wed Dec 7 10:41:06 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

      Description

      We have a 3.1.3 installation running against a light production load and are occasionally getting this failed assertion:

      FATAL: HttpSM.cc:890: failed assert `0`
      /usr/bin/traffic_server - STACK TRACE:
      /usr/lib64/trafficserver/libtsutil.so.3(ink_fatal+0x88)[0x2b3b69c88538]
      /usr/lib64/trafficserver/libtsutil.so.3(_ink_assert+0x1f)[0x2b3b69c86c6f]
      /usr/bin/traffic_server(_ZN6HttpSM28state_watch_for_client_abortEiPv+0x7d)[0x52ae0d]
      /usr/bin/traffic_server(_ZN6HttpSM12main_handlerEiPv+0xe8)[0x536008]
      /usr/bin/traffic_server(_ZN18UnixNetVConnection14readSignalDoneEiP10NetHandler+0x3d)[0x678e2d]
      /usr/bin/traffic_server(_ZN17SSLNetVConnection11net_read_ioEP10NetHandlerP7EThread+0x6ef)[0x66cabf]
      /usr/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x25a)[0x6737ba]
      /usr/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0xb4)[0x69cc34]
      /usr/bin/traffic_server(_ZN7EThread7executeEv+0x4c3)[0x69d5c3]
      /usr/bin/traffic_server[0x69bb72]
      /lib64/libpthread.so.0[0x3170e077f1]
      /lib64/libc.so.6(clone+0x6d)[0x3170ae570d]

      This causes a corruption of the disk cache every time it occurs, such that whenever the traffic server starts back up, it gets into a crash loop and does this over and over each time it tries to start up:

      NOTE: Traffic Server received Sig 11: Segmentation fault
      /usr/bin/traffic_server - STACK TRACE:
      /lib64/libpthread.so.0[0x3170e0f4a0]
      /usr/bin/traffic_server(_Z15dir_clear_rangellP3Vol+0x6b)[0x63370b]
      /usr/bin/traffic_server(_ZN3Vol24handle_recover_from_dataEiPv+0x531)[0x62c5d1]
      /usr/bin/traffic_server(_ZN19AIOCallbackInternal11io_completeEiPv+0x35)[0x62cfb5]
      /usr/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0xb4)[0x69cc34]
      /usr/bin/traffic_server(_ZN7EThread7executeEv+0x5ab)[0x69d6ab]
      /usr/bin/traffic_server[0x69bb72]
      /lib64/libpthread.so.0[0x3170e077f1]
      /lib64/libc.so.6(clone+0x6d)[0x3170ae570d]
      [Apr 29 17:39:57.149] Manager

      {0x7f4d3481c7e0} FATAL: [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
      [Apr 29 17:39:57.149] Manager {0x7f4d3481c7e0}

      FATAL: (last system error 104: Connection reset by peer)
      [Apr 29 17:39:57.149] Manager

      {0x7f4d3481c7e0} NOTE: [LocalManager::mgmtShutdown] Executing shutdown request.
      [Apr 29 17:39:57.149] Manager {0x7f4d3481c7e0}

      NOTE: [LocalManager::processShutdown] Executing process shutdown request.
      [Apr 29 17:39:57.149] Manager

      {0x7f4d3481c7e0} ERROR: [LocalManager::sendMgmtMsgToProcesses] Error writing message
      [Apr 29 17:39:57.149] Manager {0x7f4d3481c7e0}

      ERROR: (last system error 32: Broken pipe)

      The configuration for this setup is pretty minimal, and is mostly a single remap rule that chains together the remap_regex and conf_remap plugins to turn off auth for files ending in a particular extension.

      I will add any appropriate configs, logs, etc as needed to help troubleshoot this.

        Issue Links

          Activity

          Hide
          Michael Turner added a comment -

          Should also note:

          • All requests are via SSL
          • Use case is generally large files, 100mg - 8+gb
          Show
          Michael Turner added a comment - Should also note: All requests are via SSL Use case is generally large files, 100mg - 8+gb
          Hide
          Christopher Arnold added a comment -

          a quick highly naive peek at the code shows that this failed assertion assert(0) is happening in the HttpSM::state_watch_for_client_abort handler. Since these are large files there's a high likelihood for a lot of client's sending aborts at one of these points. Could there be a case where when a client aborts a slow download they are causing a FATAL to happen in the handler?

          Show
          Christopher Arnold added a comment - a quick highly naive peek at the code shows that this failed assertion assert(0) is happening in the HttpSM::state_watch_for_client_abort handler. Since these are large files there's a high likelihood for a lot of client's sending aborts at one of these points. Could there be a case where when a client aborts a slow download they are causing a FATAL to happen in the handler?
          Hide
          James Peach added a comment -

          For the assert case:

          Apr 30 14:13:18 seo-ats1 traffic_server[17797]:

          {0x2ae1c8e0d700}

          ERROR: state_watch_for_client_abort received unexpected event 102
          Apr 30 14:13:18 seo-ats1 traffic_server[17797]: FATAL: HttpSM.cc:891: failed assert `0`

          102 is VC_EVENT_READ_COMPLETE

          Show
          James Peach added a comment - For the assert case: Apr 30 14:13:18 seo-ats1 traffic_server [17797] : {0x2ae1c8e0d700} ERROR: state_watch_for_client_abort received unexpected event 102 Apr 30 14:13:18 seo-ats1 traffic_server [17797] : FATAL: HttpSM.cc:891: failed assert `0` 102 is VC_EVENT_READ_COMPLETE
          Hide
          James Peach added a comment -

          I looked at this harder, and the VC_EVENT_READ_COMPLETE event originates in the SSL state machine, here:

          220 SSLNetVConnection::net_read_io(NetHandler *nh, EThread *lthread)
          221 {
          ...
          274 // If there is nothing to do, disable connection
          275 int64_t ntodo = s->vio.ntodo();
          276 if (ntodo <= 0)

          { 277 read_disable(nh, this); 278 // Don't return early even if there's nothing. We still need 279 // to propagate events for zero-length reads. 280 readSignalDone(VC_EVENT_READ_COMPLETE, nh); 281 return; 282 }

          I added the VC_EVENT_READ_COMPLETE propagation in TS-841. However, state_watch_for_client_abort() is clearly not expecting this event (and nor is the tunnel, incidentally).

          Show
          James Peach added a comment - I looked at this harder, and the VC_EVENT_READ_COMPLETE event originates in the SSL state machine, here: 220 SSLNetVConnection::net_read_io(NetHandler *nh, EThread *lthread) 221 { ... 274 // If there is nothing to do, disable connection 275 int64_t ntodo = s->vio.ntodo(); 276 if (ntodo <= 0) { 277 read_disable(nh, this); 278 // Don't return early even if there's nothing. We still need 279 // to propagate events for zero-length reads. 280 readSignalDone(VC_EVENT_READ_COMPLETE, nh); 281 return; 282 } I added the VC_EVENT_READ_COMPLETE propagation in TS-841 . However, state_watch_for_client_abort() is clearly not expecting this event (and nor is the tunnel, incidentally).
          Hide
          James Peach added a comment -

          From production testing, simply ignoring VC_EVENT_READ_COMPLETE HttpSM::state_watch_for_client_abort() prevents the crash. So far, so ill effects from this.

          That said, I can't reproduce locally, so I don't know whether this is a dirty hack or a correct fix.

          In most places where we enter HttpSM::state_watch_for_client_abort(), we are doing a read for INT64_MAX bytes, so I expect that vio.ntodo() should not return <= 0 in this case.

          Show
          James Peach added a comment - From production testing, simply ignoring VC_EVENT_READ_COMPLETE HttpSM::state_watch_for_client_abort() prevents the crash. So far, so ill effects from this. That said, I can't reproduce locally, so I don't know whether this is a dirty hack or a correct fix. In most places where we enter HttpSM::state_watch_for_client_abort(), we are doing a read for INT64_MAX bytes, so I expect that vio.ntodo() should not return <= 0 in this case.
          Hide
          Leif Hedstrom added a comment -

          James: Can you take this bug please?

          Show
          Leif Hedstrom added a comment - James: Can you take this bug please?
          Hide
          James Peach added a comment -

          Need to get Mike a logging patch so we can watch this in prod and root cause.

          Show
          James Peach added a comment - Need to get Mike a logging patch so we can watch this in prod and root cause.
          Hide
          Leif Hedstrom added a comment -

          Moving out to 3.1.5 for now.

          Show
          Leif Hedstrom added a comment - Moving out to 3.1.5 for now.
          Hide
          Leif Hedstrom added a comment -

          I think this ended up being a duplicate of TS-1272, so closing this.

          Show
          Leif Hedstrom added a comment - I think this ended up being a duplicate of TS-1272 , so closing this.

            People

            • Assignee:
              Unassigned
              Reporter:
              Michael Turner
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development