[CLOUDSTACK-9857] CloudStack KVM Agent Self Fencing - improper systemd config - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 4.5.2
Fix Version/s: 4.10.1.0, 4.11.0.0
Component/s: KVM
Security Level: Public (Anyone can view this level - this is the default.)
Labels: None

Description

We had a database outage a few days ago and noticed that most of the CloudStack KVM agents killed themselves and never retried connecting. Moreover, we had Puppet, which was supposed to restart the cloudstack-agent daemon when it goes into the "failed" state, but apparently the service never reaches that state.

2017-03-30 04:07:50,720 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Request:Seq 1-1: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 111, [{"com.cloud.agent.api.ReadyCommand":{"_details":"com.cloud.utils.exception.CloudRuntimeException: DB Exception on: null","wait":0}}] }
2017-03-30 04:07:50,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Processing command: com.cloud.agent.api.ReadyCommand
2017-03-30 04:07:50,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Not ready to connect to mgt server: com.cloud.utils.exception.CloudRuntimeException: DB Exception on: null
2017-03-30 04:07:50,722 INFO [cloud.agent.Agent] (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
2017-03-30 04:07:50,723 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) Sending shutdown to management server

While the agent fenced itself for whatever logical reason it had, the systemd service for the agent did not exit properly.

Here is what the status of cloudstack-agent looks like:

[root@mqa6-kvm02 ~]# service cloudstack-agent status
● cloudstack-agent.service - SYSV: Cloud Agent
Loaded: loaded (/etc/rc.d/init.d/cloudstack-agent)
Active: active (exited) since Fri 2017-03-31 23:50:47 GMT; 12s ago
Docs: man:systemd-sysv-generator(8)
Process: 632 ExecStop=/etc/rc.d/init.d/cloudstack-agent stop (code=exited, status=0/SUCCESS)
Process: 654 ExecStart=/etc/rc.d/init.d/cloudstack-agent start (code=exited, status=0/SUCCESS)
Main PID: 441

Mar 31 23:50:47 mqa6-kvm02 systemd[1]: Starting SYSV: Cloud Agent...
Mar 31 23:50:47 mqa6-kvm02 cloudstack-agent[654]: Starting Cloud Agent:
Mar 31 23:50:47 mqa6-kvm02 systemd[1]: Started SYSV: Cloud Agent.
Mar 31 23:50:49 mqa6-kvm02 sudo[806]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/grep InitiatorName= /etc/iscsi/initiatorname.iscsi

The "Active: active (exited)" should be "Active: failed (Result: exit-code)".

Solution:

The fix is to add a pidfile header to /etc/init.d/cloudstack-agent

Like so:

  # chkconfig: 35 99 10
  # description: Cloud Agent
+ # pidfile: /var/run/cloudstack-agent.pid
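
To confirm the header is in place on a given host, something like the following could be used. This is only a minimal sketch: the function name has_pidfile_header is hypothetical (it is not part of CloudStack), and it assumes the header is written as a "# pidfile:" comment line followed by an absolute path, as in the snippet above.

```shell
# Hypothetical helper, not part of CloudStack: succeeds (exit 0) when the
# init script passed as $1 contains a "# pidfile: /path" header line.
has_pidfile_header() {
    grep -Eq '^#[[:space:]]*pidfile:[[:space:]]*/' "$1"
}

# Example call (path taken from this report):
#   has_pidfile_header /etc/init.d/cloudstack-agent || echo "pidfile header missing"
```

On systemd hosts the generated unit is only rebuilt when systemd reloads its configuration, so a "systemctl daemon-reload" is probably needed after editing the init script.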

After that, if the agent dies, systemd catches it properly and the status looks as expected:

[root@mqa6-kvm02 ~]# service cloudstack-agent status
● cloudstack-agent.service - SYSV: Cloud Agent
Loaded: loaded (/etc/rc.d/init.d/cloudstack-agent)
Active: failed (Result: exit-code) since Fri 2017-03-31 23:51:40 GMT; 7s ago
Docs: man:systemd-sysv-generator(8)
Process: 1124 ExecStop=/etc/rc.d/init.d/cloudstack-agent stop (code=exited, status=255)
Process: 949 ExecStart=/etc/rc.d/init.d/cloudstack-agent start (code=exited, status=0/SUCCESS)
Main PID: 975

With this change, other tools can properly inspect the state of the daemon and take action when it has failed, instead of seeing it stuck in the active (exited) state.
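
As an illustration of what such a tool might do, here is a rough sketch of a check that reads the status output and decides whether a restart is warranted. The function name agent_needs_restart is made up for this example, and it assumes the status text has an "Active:" line in the format shown in the outputs above.

```shell
# Hypothetical helper: reads service status text on stdin and succeeds
# (exit 0) only when the "Active:" line reports the state "failed".
agent_needs_restart() {
    awk '$1 == "Active:" { found = 1; state = $2; exit }
         END { exit !(found && state == "failed") }'
}

# Example call:
#   service cloudstack-agent status | agent_needs_restart && service cloudstack-agent restart
```

With the old init script this check never fires, because the status stays at "active (exited)" after the agent dies; with the pidfile header it sees "failed" and can act.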

Issue Links

links to

GitHub Pull Request #2024

GitHub Pull Request #2029

People

Assignee: Abhinandan Prateek

Reporter: Abhinandan Prateek

Votes: 0

Watchers: 3

Dates

Created: 03/Apr/17 03:12

Updated: 19/Dec/17 08:55

Resolved: 19/Dec/17 08:55