Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Cannot Reproduce
Affects Version/s: 4.1.0, 4.1.1, 4.2.0, Future
Security Level: Public (Anyone can view this level - this is the default.)
Environment: I'm running ACS 4.1.1 with vSphere 5.1
Description
If the management server is restarted, then on startup it checks the agentState of the vSphere VMs; if it does not get a proper response, it marks them as Stopped.
As a result, some of my virtual instances were shut down.
While analyzing this issue further, here are my findings and the errors seen in the log.
2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-2-262-acs-docs-fc17
2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 1-1: Startup request from directly connected host: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 11, [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"poolSync":false,"vms":{"i-8-270-CLOUD411":
,"r-15-CLOUD41-OLD":
{"state":"Stopped"},"v-260-CLOUD411":
{"state":"Running"},"i-2-283-vmbld01l-ops-08":
{"state":"Running"},"i-2-104-ACS41VM":
{"state":"Running"},"--s-1-CLOUD41-OLD":
{"state":"Running"},"i-27-280-CLOUD411":
{"state":"Running"},"i-2-285-ossec01l-ops-08":
{"state":"Running"},"i-2-262-acs-docs-fc17":
{"state":"Stopped"},"i-24-265-test3":
{"state":"Running"},"cloud01l-ops-08.portal.webmd.com":
{"state":"Running"},"i-2-278-demo01t-ops-08":
{"state":"Running"},"s-63-CLOUD411":
{"state":"Running"},"r-66-CLOUD411":
{"state":"Running"},"i-2-281-acs-appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","hostDetails":
{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGlobal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing","dataCenter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc00q-ops-08.portal.webmd.com","name":"vmha62d-ops-08.portal.webmd.com","version":"4.1.1-SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5:99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:73:0b:c2","wait":0},{"totalSize":0,"poolInfo":
{"uuid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","host":"vmha62d-ops-08.portal.webmd.com","localPath":"","hostPath":"datastore-19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":140383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-adc5af770bf6","wait":0}] }
2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource StopCommand:
2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is already in stopped state
2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent-3:null) Seq 11-1418264581: Processing: { Ans: , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM i-2-262-acs-docs-fc17 is already in stopped state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-278-demo01t-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-281-acs-appliance Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-283-vmbld01l-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] }
2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent-20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
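The trace above boils down to one decision: the full sync reports agentState = Stopped for VMs it could not query properly, compareState() then sees cs state Running versus realState Stopped, and StopCommands go out. A minimal model of that decision (the class and method names here are hypothetical, not CloudStack's actual API):

```java
// Minimal model of the state-comparison decision visible in the log.
// "SyncDecision" and "decide" are illustrative names only; the real logic
// lives in VirtualMachineManagerImpl.compareState().
public class SyncDecision {
    public enum State { RUNNING, STOPPED, STARTING, STOPPING }

    // What the management server does when full sync compares its own
    // (database) state against the state reported by the hypervisor agent.
    public static String decide(State serverState, State agentState) {
        if (agentState == serverState) {
            return "no-op";
        }
        if (agentState == State.STOPPED) {
            if (serverState == State.RUNNING) {
                // The behavior reported in this ticket: force power-off,
                // then schedule a restart.
                return "force-stop-and-schedule-restart";
            } else if (serverState == State.STOPPING) {
                return "schedule-force-stop";
            } else if (serverState == State.STARTING) {
                return "schedule-restart";
            }
        }
        return "cleanup";
    }

    public static void main(String[] args) {
        // The VM was actually running, but the agent query failed and the
        // state came back as Stopped, so a forced power-off is issued.
        System.out.println(decide(State.RUNNING, State.STOPPED));
        // prints "force-stop-and-schedule-restart"
    }
}
```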
If for whatever reason agentState is reported as Stopped for vSphere, we should not force a power-off of the VM, nor do we have to restart it.
The code that causes the restart is in engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java,
lines 2365 - 2376.
In this snippet we check whether cs state == Running and agentState == Stopped.
We should not power off the VM if we cannot retrieve the state!
2233 protected Command compareState(long hostId, VMInstanceVO vm, final AgentVmInfo info, final boolean fullSync, boolean trackExternalChange) {
2234     State agentState = info.state;
2235     final State serverState = vm.getState();
2236     final String serverName = vm.getInstanceName();
2237
2238     Command command = null;
2239     s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
2240     if (s_logger.isDebugEnabled())
             // ... (lines 2241-2242 not included in the paste)
2243
2244     if (agentState == State.Error) {
2245         agentState = State.Stopped;
2246
2247         short alertType = AlertManager.ALERT_TYPE_USERVM;
2248         if (VirtualMachine.Type.DomainRouter.equals(vm.getType()))
                 // ... (lines 2249-2250 not included in the paste)
             else if (VirtualMachine.Type.ConsoleProxy.equals(vm.getType())) {
2251             alertType = AlertManager.ALERT_TYPE_CONSOLE_PROXY;
2252         } else if (VirtualMachine.Type.SecondaryStorageVm.equals(vm.getType())) {
2253             alertType = AlertManager.ALERT_TYPE_SSVM;
2254         }
2255
2256         HostPodVO podVO = _podDao.findById(vm.getPodIdToDeployIn());
2257         DataCenterVO dcVO = _dcDao.findById(vm.getDataCenterId());
2258         HostVO hostVO = _hostDao.findById(vm.getHostId());
2259
2260         String hostDesc = "name: " + hostVO.getName() + " (id:" + hostVO.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName();
2261         _alertMgr.sendAlert(alertType, vm.getDataCenterId(), vm.getPodIdToDeployIn(), "VM (name: " + vm.getInstanceName() + ", id: " + vm.getId() + ") stopped on host " +
2262             hostDesc + " due to storage failure",
2263             "Virtual Machine " + vm.getInstanceName() + " (id: " + vm.getId() + ") running on host [" + vm.getHostId() + "] stopped due to storage failure.");
2264     }
2265
2266     if (trackExternalChange) {
2267         if (serverState == State.Starting) {
2268             if (vm.getHostId() != null && vm.getHostId() != hostId)
                     // ... (lines 2269-2272 not included in the paste)
2273         }
2274         if (vm.getHostId() == null || hostId != vm.getHostId()) {
2275             try {
2276                 ItWorkVO workItem = _workDao.findByOutstandingWork(vm.getId(), State.Migrating);
2277                 if (workItem == null)
                         // ... (lines 2278-2279 not included in the paste)
2280             } catch (NoTransitionException e) {
2281             }
2282         }
2283     }
2284
2285     // during VM migration time, don't sync state will agent status update
2286     if (serverState == State.Migrating)
             // ... (lines 2287-2289 not included in the paste)
2290
2291     if (trackExternalChange) {
2292         if (serverState == State.Starting) {
2293             if (vm.getHostId() != null && vm.getHostId() != hostId)
                     // ... (lines 2294-2297 not included in the paste)
2298         }
2299
2300         if (serverState == State.Running) {
2301             try {
2302                 //
2303                 // we had a bug that sometimes VM may be at Running State
2304                 // but host_id is null, we will cover it here.
2305                 // means that when CloudStack DB lost of host information,
2306                 // we will heal it with the info reported from host
2307                 //
2308                 if (vm.getHostId() == null || hostId != vm.getHostId()) {
2309                     if (s_logger.isDebugEnabled())
                             // ... (lines 2310-2312 not included in the paste)
2313
2314                     stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
2315                 }
2316             } catch (NoTransitionException e)
                     // ... (lines 2317-2318 not included in the paste)
2319         }
2320     }
2321
2322     if (agentState == serverState) {
2323         if (s_logger.isDebugEnabled())
                 // ... (lines 2324-2325 not included in the paste)
2326         assert (agentState == State.Stopped || agentState == State.Running) : "If the states we send up is changed, this must be changed.";
2327         if (agentState == State.Running) {
2328             try
                     // ... (lines 2329-2330 not included in the paste)
                 catch (NoTransitionException e) {
2331                 s_logger.warn(e.getMessage());
2332             }
2333             // FIXME: What if someone comes in and sets it to stopping? Then
2334             // what?
2335             return null;
2336         }
2337
2338         s_logger.debug("State matches but the agent said stopped so let's send a cleanup command anyways.");
2339         return cleanup(vm);
2340     }
2341
2342     if (agentState == State.Shutdowned) {
2343         if (serverState == State.Running || serverState == State.Starting || serverState == State.Stopping) {
2344             try
                     // ... (lines 2345-2346 not included in the paste)
                 catch (AgentUnavailableException e) {
2347                 assert (false) : "How do we hit this with forced on?";
2348                 return null;
2349             } catch (OperationTimedoutException e) {
2350                 assert (false) : "How do we hit this with forced on?";
2351                 return null;
2352             } catch (ConcurrentOperationException e) {
2353                 assert (false) : "How do we hit this with forced on?";
2354                 return null;
2355             }
2356         } else {
2357             s_logger.debug("Sending cleanup to a shutdowned vm: " + vm.getInstanceName());
2358             command = cleanup(vm);
2359         }
2360     } else if (agentState == State.Stopped) {
2361         // This state means the VM on the agent was detected previously
2362         // and now is gone. This is slightly different than if the VM
2363         // was never completed but we still send down a Stop Command
2364         // to ensure there's cleanup.
2365         if (serverState == State.Running)
                 // ... (lines 2366-2369 not included in the paste)
             else if (serverState == State.Stopping) {
2370             _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop);
2371             s_logger.debug("Scheduling a check stop for VM in stopping mode: " + vm);
2372         } else if (serverState == State.Starting) {
2373             s_logger.debug("Ignoring VM in starting mode: " + vm.getInstanceName());
2374             _haMgr.scheduleRestart(vm, false);
2375         }
2376         command = cleanup(vm);
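One possible direction for a fix, sketched below with hypothetical names (this is not a patch against the actual code): distinguish "the agent could not retrieve a state" from a genuinely confirmed Stopped report, and skip the force-stop/restart path in the former case. The stateConfirmed flag is an assumption; a real fix would need to plumb query success/failure out of the vSphere resource layer.

```java
// Sketch of the proposed guard: only act on a Stopped report that the
// agent actually confirmed. "StoppedGuard", "onAgentReport", and
// "stateConfirmed" are illustrative names, not CloudStack's API.
public class StoppedGuard {
    public enum State { RUNNING, STOPPED }

    public static String onAgentReport(State serverState, State agentState, boolean stateConfirmed) {
        if (agentState == State.STOPPED && !stateConfirmed) {
            // We could not retrieve the real power state; do NOT power off
            // or restart the VM. Leave it alone and re-check later.
            return "skip-and-recheck";
        }
        if (serverState == State.RUNNING && agentState == State.STOPPED) {
            // A confirmed stop while the DB says Running: the current
            // force-stop/restart handling would still be appropriate here.
            return "force-stop-and-schedule-restart";
        }
        return "no-op";
    }

    public static void main(String[] args) {
        // Unconfirmed "Stopped" (e.g. the vCenter query failed) no longer
        // triggers a forced power-off.
        System.out.println(onAgentReport(State.RUNNING, State.STOPPED, false));
        // prints "skip-and-recheck"
    }
}
```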
I'd be happy to test the fix once one comes out.
Thanks
ilya