Uploaded image for project: 'Traffic Server'
  1. Traffic Server
  2. TS-3386

Heartbeat failed with high load, trafficserver restarted

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • sometime
    • Performance
    • None

    Description

      I've been evaluating ATS for some days. I'm using it with mostly default settings, except I've lowered the number of connections to the backend, I have a raw storage of 500gb, and disabled ram cache.

      Working fine, then I wanted to stress it more. I've increased the test to 1000 concurrent requests, then the ATS worker has been restarted and thus lost the whole cache.

      /var/log/syslog:

      Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
      Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
      Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
      Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
      Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [Alarms::signalAlarm] Server Process was reset
      Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server Starting ---
      Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42)
      Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
      Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
      Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
      Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
      Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
      Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [LocalManager::sendMgmtMsgToProcesses] Error writing message
      Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  (last system error 32: Broken pipe)
      Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status signal [32985 256]
      Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, making sure traffic_server is dead
      Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
      Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
      Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting ---
      Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:05:19)
      Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
      Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server Starting ---
      Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42)
      Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
      Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
      Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
      Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
      Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
      Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset
      Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server Starting ---
      Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42)
      Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
      Feb 11 10:06:52 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:06:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
      Feb 11 10:07:02 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:07:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
      Feb 11 10:07:02 test-cache traffic_cop[32984]: killing server
      Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
      Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset
      Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: --- traffic_server Starting ---
      Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42)
      Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
      Feb 11 10:07:12 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:07:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
      Feb 11 10:07:22 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:07:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
      Feb 11 10:07:22 test-cache traffic_cop[32984]: killing server
      Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
      Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset
      Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: --- traffic_server Starting ---
      Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42)
      Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
      Feb 11 10:07:32 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
      Feb 11 10:07:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
      Feb 11 10:07:43 test-cache traffic_cop[32984]: server heartbeat succeeded
      

      manager.log:

      [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
      [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: [Alarms::signalAlarm] Server Process was reset
      [Feb 11 10:06:03.603] Manager {0x7f975c537720} NOTE: [LocalManager::startProxy] Launching ts process
      [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '11'
      [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: [Alarms::signalAlarm] Server Process born
      [Feb 11 10:06:22.616] Manager {0x7f975c537720} FATAL: [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
      [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: [LocalManager::mgmtShutdown] Executing shutdown request.
      [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: [LocalManager::processShutdown] Executing process shutdown request.
      [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR: [LocalManager::sendMgmtMsgToProcesses] Error writing message
      [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR:  (last system error 32: Broken pipe)
      [Feb 11 10:06:22.764] {0x7f2c94ded720} STATUS: opened /usr/local/var/log/trafficserver/manager.log
      [Feb 11 10:06:22.764] {0x7f2c94ded720} NOTE: updated diags config
      [Feb 11 10:06:22.768] Manager {0x7f2c94ded720} NOTE: [ClusterCom::ClusterCom] Node running on OS: 'Linux' Release: '3.2.0-4-amd64'
      [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: [LocalManager::listenForProxy] Listening on port: 8080
      [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: [TrafficManager] Setup complete
      [Feb 11 10:06:23.827] Manager {0x7f2c94ded720} NOTE: [LocalManager::startProxy] Launching ts process
      [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
      [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] Server Process born
      [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
      [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset
      [Feb 11 10:06:43.813] Manager {0x7f2c94ded720} NOTE: [LocalManager::startProxy] Launching ts process
      [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
      [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] Server Process born
      [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
      [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset
      [Feb 11 10:07:03.914] Manager {0x7f2c94ded720} NOTE: [LocalManager::startProxy] Launching ts process
      [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
      [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] Server Process born
      [Feb 11 10:07:22.988] Manager {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
      [Feb 11 10:07:22.989] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset
      [Feb 11 10:07:23.997] Manager {0x7f2c94ded720} NOTE: [LocalManager::startProxy] Launching ts process
      [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '13'
      [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] Server Process born
      

      Note the "(http test) received non-200 status(502)" in the syslog, so it's like the process didn't respond gracefully to the heartbeat.

      Not sure if it was a problem of the backend, or of the trafficserver worker. Either ways, how do I make trafficserver more robust against this condition? I never want to lose the cache in such cases where trafficserver is under high load. What could be the options in the config to tweak to achieve more stability?

      Thanks.

      Attachments

        Activity

          People

            Unassigned Unassigned
            lethalman Luca Bruno
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated: