CloudStack / CLOUDSTACK-7853

Hosts that are temporarily Disconnected and fall behind on ping (PingTimeout) end up permanently in state Alert

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
    • None
    • None
    • Security Level: Public (Anyone can view this level - this is the default.)
    • None

    Description

      If for some reason (I have been unable to determine why, but my suspicion is that the management server is busy processing other agent requests and/or xapi is temporarily unavailable) a host that is Disconnected falls behind on ping (PingTimeout), it is transitioned to Alert and stays in that state permanently.

      INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents behind on ping: [421, 427, 425]
      DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
      DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
      DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name = xxxxxx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update count = 111; new update count = 112]

      ----/ next cycle / -----

      INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents behind on ping: [421, 427, 425]
      DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
      DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
      DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event PingTimeout for host 421, name=xxxxxx1, mangement server id is 345052370017
      ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception:
      com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event PingTimeout for host 421, mangement server id is 345052370017,Unable to transition to a new state from Alert via PingTimeout
      at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
      at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
      at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
      at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
      at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
      at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
      at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
      at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
      at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
      at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:701)

      I think the bug occurs because there is no valid state transition from Alert via PingTimeout, so once a host is in Alert the next PingTimeout throws an exception instead of moving the host to something recoverable.

      Status.java
      s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
      s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
      s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
      s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
      s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
      s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
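The gap can be reproduced with a small stand-alone model of the transition table. This is only a sketch: `AgentStatusFsmSketch`, `reportedTable` and `withAlertPingTimeout` are hypothetical names, not CloudStack classes, and the table holds just the Alert transitions listed above plus the Disconnected/PingTimeout transition seen in the log. The extra Alert/PingTimeout self-transition is one possible remedy and is an assumption; the report does not say how the issue was actually fixed.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-alone model; not the real com.cloud.host.Status class.
class AgentStatusFsmSketch {
    enum Status { Up, Connecting, Disconnected, Alert, Removed }
    enum Event { AgentConnected, Ping, Remove, ManagementServerDown,
                 AgentDisconnected, ShutdownRequested, PingTimeout }

    private final Map<Status, Map<Event, Status>> table = new EnumMap<>(Status.class);

    void addTransition(Status from, Event event, Status to) {
        table.computeIfAbsent(from, k -> new EnumMap<Event, Status>(Event.class))
             .put(event, to);
    }

    // Returns the next status, or throws if no transition is registered.
    Status getNextStatus(Status from, Event event) {
        Map<Event, Status> row = table.get(from);
        Status next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException(
                "Unable to transition to a new state from " + from + " via " + event);
        }
        return next;
    }

    // Transitions taken from the report: the Alert rows of Status.java plus
    // Disconnected + PingTimeout -> Alert, as observed in the log.
    static AgentStatusFsmSketch reportedTable() {
        AgentStatusFsmSketch fsm = new AgentStatusFsmSketch();
        fsm.addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);
        fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
        fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
        fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
        fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
        fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
        return fsm;
    }

    // Assumed remedy (not confirmed by the report): let PingTimeout keep the
    // host in Alert so the monitor task no longer throws on every cycle.
    static AgentStatusFsmSketch withAlertPingTimeout() {
        AgentStatusFsmSketch fsm = reportedTable();
        fsm.addTransition(Status.Alert, Event.PingTimeout, Status.Alert);
        return fsm;
    }

    public static void main(String[] args) {
        AgentStatusFsmSketch fsm = reportedTable();
        // First monitor cycle: the Disconnected host moves to Alert.
        Status s = fsm.getNextStatus(Status.Disconnected, Event.PingTimeout);
        System.out.println("first cycle: " + s);
        try {
            // Second monitor cycle: no Alert + PingTimeout row exists.
            fsm.getNextStatus(s, Event.PingTimeout);
        } catch (IllegalStateException e) {
            System.out.println("second cycle: " + e.getMessage());
        }
        // With the assumed extra transition the second cycle is a no-op.
        System.out.println("with extra transition: "
            + withAlertPingTimeout().getNextStatus(Status.Alert, Event.PingTimeout));
    }
}
```

In this model a host in Alert can still recover through Ping or AgentConnected, which matches the listed transitions; the extra row only stops the repeated exception.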

      As a workaround to get out of this situation, we put the cluster in Unmanaged, wait 10 minutes, and then put the cluster back in Managed.

            People

              Assignee: Unassigned
              Reporter: Joris van Lieshout (jvanlieshout@schubergphilis.com)
              Votes: 0
              Watchers: 4