CLOUDSTACK-9857: CloudStack KVM Agent Self Fencing - improper systemd config


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.5.2
    • Fix Version/s: 4.10.1.0, 4.11.0.0
    • Component/s: KVM
    • Security Level: Public (Anyone can view this level - this is the default.)
    • Labels: None

    Description

      We had a database outage a few days ago and noticed that most of the CloudStack KVM agents committed suicide and never retried to connect. Moreover, we had Puppet set up to restart the cloudstack-agent daemon when it goes into the failed state, but apparently the daemon never enters the "failed" state.

      2017-03-30 04:07:50,720 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Request:Seq 1-1: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 111, [{"com.cloud.agent.api.ReadyCommand":{"_details":"com.cloud.utils.exception.CloudRuntimeException: DB Exception on: null","wait":0}}] }
      2017-03-30 04:07:50,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Processing command: com.cloud.agent.api.ReadyCommand
      2017-03-30 04:07:50,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Not ready to connect to mgt server: com.cloud.utils.exception.CloudRuntimeException: DB Exception on: null
      2017-03-30 04:07:50,722 INFO [cloud.agent.Agent] (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
      2017-03-30 04:07:50,723 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) Sending shutdown to management server

      Whatever reason the agent's logic had for fencing itself, systemd did not register the agent's exit properly.
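      One way to confirm that the agent process is really gone while systemd still reports the service as active is to check the process directly. The Python sketch below assumes a pidfile exists at /var/run/cloudstack-agent.pid (the path proposed in the fix further down); the helper names pid_is_alive and agent_is_running are hypothetical. It relies on os.kill with signal 0, which only checks whether the process exists and delivers nothing.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists; signal 0 only checks, it delivers nothing."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not a single PID
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the daemon has died
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def agent_is_running(pidfile: str) -> bool:
    """Read the PID recorded in a pidfile and report whether that process is alive."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # missing or unparsable pidfile: treat the agent as down
    return pid_is_alive(pid)
```

      For example, agent_is_running("/var/run/cloudstack-agent.pid") would return False after the self-fencing shown in the log, even while systemctl still says "active (exited)". Note that a recycled PID can make this check report a false positive.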

      Here is what the status of cloudstack-agent looks like:

      [root@mqa6-kvm02 ~]# service cloudstack-agent status
      ● cloudstack-agent.service - SYSV: Cloud Agent
      Loaded: loaded (/etc/rc.d/init.d/cloudstack-agent)
      Active: active (exited) since Fri 2017-03-31 23:50:47 GMT; 12s ago
      Docs: man:systemd-sysv-generator(8)
      Process: 632 ExecStop=/etc/rc.d/init.d/cloudstack-agent stop (code=exited, status=0/SUCCESS)
      Process: 654 ExecStart=/etc/rc.d/init.d/cloudstack-agent start (code=exited, status=0/SUCCESS)
      Main PID: 441

      Mar 31 23:50:47 mqa6-kvm02 systemd[1]: Starting SYSV: Cloud Agent...
      Mar 31 23:50:47 mqa6-kvm02 cloudstack-agent[654]: Starting Cloud Agent:
      Mar 31 23:50:47 mqa6-kvm02 systemd[1]: Started SYSV: Cloud Agent.
      Mar 31 23:50:49 mqa6-kvm02 sudo[806]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/grep InitiatorName= /etc/iscsi/initiatorname.iscsi

      The line "Active: active (exited)" should read "Active: failed (Result: exit-code)".
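      To show why a restart rule that only reacts to "failed" never fires here, the Python sketch below parses the "Active:" line of `systemctl status` output and applies that naive rule. The regular expression and the function names are hypothetical; the sample lines are taken from the status output in this report.

```python
import re

# Matches the "Active:" line of `systemctl status`, e.g.
# "Active: active (exited) since Fri 2017-03-31 23:50:47 GMT; 12s ago"
ACTIVE_RE = re.compile(r"Active:\s+(\S+)\s+\(([^)]*)\)")


def parse_active_line(line: str) -> tuple[str, str]:
    """Return (state, detail) from a `systemctl status` Active: line."""
    m = ACTIVE_RE.search(line)
    if m is None:
        raise ValueError(f"not an Active: line: {line!r}")
    return m.group(1), m.group(2)


def naive_monitor_sees_failure(line: str) -> bool:
    """A monitor that only restarts on 'failed', like the Puppet rule described in the report."""
    state, _ = parse_active_line(line)
    return state == "failed"
```

      With the pre-fix status line, parse_active_line returns ("active", "exited") and naive_monitor_sees_failure returns False, so no restart happens; the expected line yields ("failed", "Result: exit-code") and triggers the rule.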

      Solution:

      The fix is to add a pidfile header to /etc/init.d/cloudstack-agent.

      Like so:

       # chkconfig: 35 99 10
       # description: Cloud Agent
      +# pidfile: /var/run/cloudstack-agent.pid
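      Why the header matters: as we understand systemd-sysv-generator(8), it reads chkconfig-style "# pidfile:" comments, and when one is present it writes PIDFile= and RemainAfterExit=no into the generated unit; without one it writes RemainAfterExit=yes, which leaves the unit "active (exited)" once the start script returns. The Python sketch below is a simplified model of that decision, not the generator's actual code; the function names are hypothetical.

```python
import os


def parse_chkconfig_header(script_text: str) -> dict[str, str]:
    """Collect chkconfig-style 'key: value' pairs from '#' comment lines of an init script.

    Simplified: only the chkconfig, description and pidfile keys are recognised.
    """
    headers: dict[str, str] = {}
    for raw in script_text.splitlines():
        line = raw.strip()
        if not line.startswith("#"):
            continue
        body = line.lstrip("#").strip()
        key, sep, value = body.partition(":")
        key = key.strip().lower()
        if sep and key in ("chkconfig", "description", "pidfile"):
            headers[key] = value.strip()
    return headers


def generated_service_directives(headers: dict[str, str]) -> dict[str, str]:
    """Simplified model of the [Service] settings chosen for a SysV script."""
    directives = {"Type": "forking", "GuessMainPID": "no"}
    pidfile = headers.get("pidfile")
    if pidfile and os.path.isabs(pidfile):
        # With a PID file, systemd tracks the main process, so the unit
        # goes to "failed" when that process dies.
        directives["PIDFile"] = pidfile
        directives["RemainAfterExit"] = "no"
    else:
        # Without one, the unit remains "active (exited)" after the start script exits.
        directives["RemainAfterExit"] = "yes"
    return directives
```

      Feeding it the patched header above yields PIDFile=/var/run/cloudstack-agent.pid and RemainAfterExit=no; dropping the pidfile line flips RemainAfterExit back to yes.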

      After that change, if the agent dies, systemd catches it properly and the status looks as expected:

      [root@mqa6-kvm02 ~]# service cloudstack-agent status
      ● cloudstack-agent.service - SYSV: Cloud Agent
      Loaded: loaded (/etc/rc.d/init.d/cloudstack-agent)
      Active: failed (Result: exit-code) since Fri 2017-03-31 23:51:40 GMT; 7s ago
      Docs: man:systemd-sysv-generator(8)
      Process: 1124 ExecStop=/etc/rc.d/init.d/cloudstack-agent stop (code=exited, status=255)
      Process: 949 ExecStart=/etc/rc.d/init.d/cloudstack-agent start (code=exited, status=0/SUCCESS)
      Main PID: 975

      With this change, other tools can properly inspect the state of the daemon and take action when it fails, instead of it sitting in the "active (exited)" state.
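      As an illustration of such a tool, the Python sketch below reads the unit's ActiveState and SubState via `systemctl show -p ActiveState -p SubState` and restarts the unit when needed. The function names are hypothetical and this is not the Puppet setup from the report. It treats both "failed" and "active"/"exited" as unhealthy, on the assumption that cloudstack-agent should always have a running process; that second case also covers hosts that do not yet have the pidfile fix.

```python
import subprocess


def parse_show_output(text: str) -> dict[str, str]:
    """Parse `systemctl show -p ...` output, which is one KEY=VALUE pair per line."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def needs_restart(props: dict[str, str]) -> bool:
    """'failed' is the state the fix makes visible; 'active' + 'exited' is the
    misleading pre-fix state, flagged too because the agent should keep running."""
    active = props.get("ActiveState")
    sub = props.get("SubState")
    return active == "failed" or (active == "active" and sub == "exited")


def restart_if_needed(unit: str = "cloudstack-agent") -> bool:
    """Query systemd for the unit state and restart it if it looks dead."""
    out = subprocess.run(
        ["systemctl", "show", unit, "-p", "ActiveState", "-p", "SubState"],
        capture_output=True, text=True, check=True,
    ).stdout
    if needs_restart(parse_show_output(out)):
        subprocess.run(["systemctl", "restart", unit], check=True)
        return True
    return False
```

      Parsing rather than matching on `systemctl is-active` is deliberate: is-active prints "active" for an "active (exited)" unit, so it cannot tell the pre-fix state apart from a healthy one.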


          People

            Assignee: aprateek Abhinandan Prateek
            Reporter: aprateek Abhinandan Prateek
            Votes: 0
            Watchers: 3
