Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
I've been evaluating ATS for some days. I'm using it with mostly default settings, except I've lowered the number of connections to the backend, I have a raw storage of 500gb, and disabled ram cache.
Working fine, then I wanted to stress it more. I've increased the test to 1000 concurrent requests, then the ATS worker has been restarted and thus lost the whole cache.
/var/log/syslog:
Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1] Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2] Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [Alarms::signalAlarm] Server Process was reset Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server Starting --- Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42) Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236) Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1] Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2] Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: [LocalManager::pollMgmtProcessServer] Error in read (errno: 104) Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [LocalManager::sendMgmtMsgToProcesses] Error writing message Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: (last system error 32: Broken pipe) Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status signal [32985 256] Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, making sure traffic_server is dead Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting --- Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:05:19) Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236) Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server Starting --- Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42) Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236) Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1] Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2] Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server Starting --- Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42) Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236) Feb 11 10:06:52 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:06:52 test-cache traffic_cop[32984]: server heartbeat failed [1] Feb 11 10:07:02 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:07:02 test-cache traffic_cop[32984]: server heartbeat failed [2] Feb 11 10:07:02 test-cache traffic_cop[32984]: killing server Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: --- traffic_server Starting --- Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42) Feb 11 10:07:04 test-cache traffic_server[59088]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236) Feb 11 10:07:12 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:07:12 test-cache traffic_cop[32984]: server heartbeat failed [1] Feb 11 10:07:22 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:07:22 test-cache traffic_cop[32984]: server heartbeat failed [2] Feb 11 10:07:22 test-cache traffic_cop[32984]: killing server Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed Feb 11 10:07:22 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: --- traffic_server Starting --- Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42) Feb 11 10:07:25 test-cache traffic_server[59099]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236) Feb 11 10:07:32 test-cache traffic_cop[32984]: (http test) received non-200 status(502) Feb 11 10:07:32 test-cache traffic_cop[32984]: server heartbeat failed [1] Feb 11 10:07:43 test-cache traffic_cop[32984]: server heartbeat succeeded
manager.log:
[Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed [Feb 11 10:06:02.593] Manager {0x7f975c537720} ERROR: [Alarms::signalAlarm] Server Process was reset [Feb 11 10:06:03.603] Manager {0x7f975c537720} NOTE: [LocalManager::startProxy] Launching ts process [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '11' [Feb 11 10:06:03.634] Manager {0x7f975c537720} NOTE: [Alarms::signalAlarm] Server Process born [Feb 11 10:06:22.616] Manager {0x7f975c537720} FATAL: [LocalManager::pollMgmtProcessServer] Error in read (errno: 104) [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: [LocalManager::mgmtShutdown] Executing shutdown request. [Feb 11 10:06:22.616] Manager {0x7f975c537720} NOTE: [LocalManager::processShutdown] Executing process shutdown request. [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR: [LocalManager::sendMgmtMsgToProcesses] Error writing message [Feb 11 10:06:22.622] Manager {0x7f975c537720} ERROR: (last system error 32: Broken pipe) [Feb 11 10:06:22.764] {0x7f2c94ded720} STATUS: opened /usr/local/var/log/trafficserver/manager.log [Feb 11 10:06:22.764] {0x7f2c94ded720} NOTE: updated diags config [Feb 11 10:06:22.768] Manager {0x7f2c94ded720} NOTE: [ClusterCom::ClusterCom] Node running on OS: 'Linux' Release: '3.2.0-4-amd64' [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: [LocalManager::listenForProxy] Listening on port: 8080 [Feb 11 10:06:22.769] Manager {0x7f2c94ded720} NOTE: [TrafficManager] Setup complete [Feb 11 10:06:23.827] Manager {0x7f2c94ded720} NOTE: [LocalManager::startProxy] Launching ts process [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '13' [Feb 11 10:06:23.890] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] Server Process born [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed [Feb 11 10:06:42.786] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset [Feb 11 10:06:43.813] Manager {0x7f2c94ded720} NOTE: [LocalManager::startProxy] Launching ts process [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '13' [Feb 11 10:06:43.879] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] Server Process born [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed [Feb 11 10:07:02.886] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset [Feb 11 10:07:03.914] Manager {0x7f2c94ded720} NOTE: [LocalManager::startProxy] Launching ts process [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '13' [Feb 11 10:07:03.927] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] Server Process born [Feb 11 10:07:22.988] Manager {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed [Feb 11 10:07:22.989] Manager {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset [Feb 11 10:07:23.997] Manager {0x7f2c94ded720} NOTE: [LocalManager::startProxy] Launching ts process [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '13' [Feb 11 10:07:24.046] Manager {0x7f2c94ded720} NOTE: [Alarms::signalAlarm] Server Process born
Note the "(http test) received non-200 status(502)" in the syslog, so it's like the process didn't respond gracefully to the heartbeat.
Not sure if it was a problem of the backend, or of the trafficserver worker. Either ways, how do I make trafficserver more robust against this condition? I never want to lose the cache in such cases where trafficserver is under high load. What could be the options in the config to tweak to achieve more stability?
Thanks.