[CLOUDSTACK-4740] Some vSphere VMs are shutdown when ACS is restarted - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Cannot Reproduce
Affects Version/s: 4.1.0, 4.1.1, 4.2.0, Future
Fix Version/s: 4.3.0
Component/s: Management Server
Security Level: Public (Anyone can view this level - this is the default.)
Labels:
- management
- poweroff
Environment:
I'm running ACS 4.1.1 with vSphere 5.1

Description

If management server is restarted, when management server starts - it checks whether the agentState for vSphere VMs and if it does not get a proper response, it marks them as stopped.

As the result, some of my virtual instances would shutdown.

Attempting to analyze this issue further, here are my findings and errors seen in the log.

2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-2-262-acs-docs-fc17
2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 1-1: Startup request from directly connected host: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 11, [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"poolSync":false,"vms":{"i-8-270-CLOUD411":

{"state":"Running"}

,"r-15-CLOUD41-OLD":

{"state":"Stopped"}

,"v-260-CLOUD411":

{"state":"Running"}

,"i-2-283-vmbld01l-ops-08":

{"state":"Running"}

,"i-2-104-ACS41VM":

{"state":"Running"}

,"--s-1-CLOUD41-OLD":

{"state":"Running"}

,"i-27-280-CLOUD411":

{"state":"Running"}

,"i-2-285-ossec01l-ops-08":

{"state":"Running"}

,"i-2-262-acs-docs-fc17":

{"state":"Stopped"}

,"i-24-265-test3":

{"state":"Running"}

,"cloud01l-ops-08.portal.webmd.com":

{"state":"Running"}

,"i-2-278-demo01t-ops-08":

{"state":"Running"}

,"s-63-CLOUD411":

{"state":"Running"}

,"r-66-CLOUD411":

{"state":"Running"}

,"i-2-281-acs-appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","hostDetails":

{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGlobal","NativeHA":"true"}

,"hypervisorVersion":"5.0","type":"Routing","dataCenter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc00q-ops-08.portal.webmd.com","name":"vmha62d-ops-08.portal.webmd.com","version":"4.1.1-SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5:99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:73:0b:c2","wait":0},{"totalSize":0,"poolInfo":

{"uuid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","host":"vmha62d-ops-08.portal.webmd.com","localPath":"","hostPath":"datastore-19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":140383354880}

,"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-adc5af770bf6","wait":0}] }
2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100101, [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] }
2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource StopCommand:

{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}

2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host
2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource] (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is already in stopped state
2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent-3:null) Seq 11-1418264581: Processing: { Ans: , MgmtId: 345049078181, via: 11, Ver: v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM i-2-262-acs-docs-fc17 is already in stopped state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-278-demo01t-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-281-acs-appliance Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-283-vmbld01l-ops-08 Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] }
2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent-20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on host

If for whatever reason agentState is stopped for vSphere, we should not force power off the VM, neither do we have to restart it.

The code that causes the restart is in engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java,

lines 2365 - 2376

In this snippet we do the check if cs state == Running and agentState == Stopped

We should not power off the VM if we cannot retrieve the state!

2233 protected Command compareState(long hostId, VMInstanceVO vm, final AgentVmInfo info, final boolean fullSync, boolean trackExternalChange) {
2234 State agentState = info.state;
2235 final State serverState = vm.getState();
2236 final String serverName = vm.getInstanceName();
2237
2238 Command command = null;
2239 s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState);
2240 if (s_logger.isDebugEnabled())

{ 2241 s_logger.debug("VM " + serverName + ": cs state = " + serverState + " and realState = " + agentState); 2242 }

2243
2244 if (agentState == State.Error) {
2245 agentState = State.Stopped;
2246
2247 short alertType = AlertManager.ALERT_TYPE_USERVM;
2248 if (VirtualMachine.Type.DomainRouter.equals(vm.getType()))

{ 2249 alertType = AlertManager.ALERT_TYPE_DOMAIN_ROUTER; 2250 }

else if (VirtualMachine.Type.ConsoleProxy.equals(vm.getType()))

{ 2251 alertType = AlertManager.ALERT_TYPE_CONSOLE_PROXY; 2252 }

else if (VirtualMachine.Type.SecondaryStorageVm.equals(vm.getType()))

{ 2253 alertType = AlertManager.ALERT_TYPE_SSVM; 2254 }

2255
2256 HostPodVO podVO = _podDao.findById(vm.getPodIdToDeployIn());
2257 DataCenterVO dcVO = _dcDao.findById(vm.getDataCenterId());
2258 HostVO hostVO = _hostDao.findById(vm.getHostId());
2259
2260 String hostDesc = "name: " + hostVO.getName() + " (id:" + hostVO.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName();
2261 _alertMgr.sendAlert(alertType, vm.getDataCenterId(), vm.getPodIdToDeployIn(), "VM (name: " + vm.getInstanceName() + ", id: " + vm.getId() + ") stopped on host " +
2262 hostDesc + " due to storage failure",
2263 "Virtual Machine " + vm.getInstanceName() + " (id: " + vm.getId() + ") running on host [" + vm.getHostId() + "] stopped due to storage failure.");
2264 }
2265
2266 if (trackExternalChange) {
2267 if (serverState == State.Starting) {
2268 if (vm.getHostId() != null && vm.getHostId() != hostId)

{ 2269 s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId + 2270 ", skip status sync for vm: " + vm.getInstanceName()); 2271 return null; 2272 }

2273 }
2274 if (vm.getHostId() == null || hostId != vm.getHostId()) {
2275 try {
2276 ItWorkVO workItem = _workDao.findByOutstandingWork(vm.getId(), State.Migrating);
2277 if (workItem == null)

{ 2278 stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId); 2279 }

2280 } catch (NoTransitionException e)

{ 2281 }

2282 }
2283 }
2284
2285 // during VM migration time, don't sync state will agent status update
2286 if (serverState == State.Migrating)

{ 2287 s_logger.debug("Skipping vm in migrating state: " + vm); 2288 return null; 2289 }

2290
2291 if (trackExternalChange) {
2292 if (serverState == State.Starting) {
2293 if (vm.getHostId() != null && vm.getHostId() != hostId)

{ 2294 s_logger.info("CloudStack is starting VM on host " + vm.getHostId() + ", but status report comes from a different host " + hostId + 2295 ", skip status sync for vm: " + vm.getInstanceName()); 2296 return null; 2297 }

2298 }
2299
2300 if (serverState == State.Running) {
2301 try {
2302 //
2303 // we had a bug that sometimes VM may be at Running State
2304 // but host_id is null, we will cover it here.
2305 // means that when CloudStack DB lost of host information,
2306 // we will heal it with the info reported from host
2307 //
2308 if (vm.getHostId() == null || hostId != vm.getHostId()) {
2309 if (s_logger.isDebugEnabled())

{ 2310 s_logger.debug("detected host change when VM " + vm + " is at running state, VM could be live-migrated externally from host " + vm.getHostId() + 2311 " to host " + hostId); 2312 }

2313
2314 stateTransitTo(vm, VirtualMachine.Event.AgentReportMigrated, hostId);
2315 }
2316 } catch (NoTransitionException e)

{ 2317 s_logger.warn(e.getMessage()); 2318 }

2319 }
2320 }
2321
2322 if (agentState == serverState) {
2323 if (s_logger.isDebugEnabled())

{ 2324 s_logger.debug("Both states are " + agentState + " for " + vm); 2325 }

2326 assert (agentState == State.Stopped || agentState == State.Running) : "If the states we send up is changed, this must be changed.";
2327 if (agentState == State.Running) {
2328 try

{ 2329 stateTransitTo(vm, VirtualMachine.Event.AgentReportRunning, hostId); 2330 }

catch (NoTransitionException e)

{ 2331 s_logger.warn(e.getMessage()); 2332 }

2333 // FIXME: What if someone comes in and sets it to stopping? Then
2334 // what?
2335 return null;
2336 }
2337
2338 s_logger.debug("State matches but the agent said stopped so let's send a cleanup command anyways.");
2339 return cleanup(vm);
2340 }
2341
2342 if (agentState == State.Shutdowned) {
2343 if (serverState == State.Running || serverState == State.Starting || serverState == State.Stopping) {
2344 try

{ 2345 advanceStop(vm, true); 2346 }

catch (AgentUnavailableException e)

{ 2347 assert (false) : "How do we hit this with forced on?"; 2348 return null; 2349 }

catch (OperationTimedoutException e)

{ 2350 assert (false) : "How do we hit this with forced on?"; 2351 return null; 2352 }

catch (ConcurrentOperationException e)

{ 2353 assert (false) : "How do we hit this with forced on?"; 2354 return null; 2355 }

2356 } else

{ 2357 s_logger.debug("Sending cleanup to a shutdowned vm: " + vm.getInstanceName()); 2358 command = cleanup(vm); 2359 }

2360 } else if (agentState == State.Stopped) {
2361 // This state means the VM on the agent was detected previously
2362 // and now is gone. This is slightly different than if the VM
2363 // was never completed but we still send down a Stop Command
2364 // to ensure there's cleanup.
2365 if (serverState == State.Running)

{ 2366 // Our records showed that it should be running so let's restart 2367 // it. 2368 _haMgr.scheduleRestart(vm, false); 2369 }

else if (serverState == State.Stopping)

{ 2370 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop); 2371 s_logger.debug("Scheduling a check stop for VM in stopping mode: " + vm); 2372 }

else if (serverState == State.Starting)

{ 2373 s_logger.debug("Ignoring VM in starting mode: " + vm.getInstanceName()); 2374 _haMgr.scheduleRestart(vm, false); 2375 }

2376 command = cleanup(vm);

I'd be happy to test the fix if such is comes out.

Thanks
ilya

Some vSphere VMs are shutdown when ACS is restarted

Details

Description

Attachments

Activity

People

Dates