Details
- Type: Bug
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: 4.2.0
- Fix Version/s: None
- Security Level: Public (Anyone can view this level - this is the default.)
- Labels: None
- Environment: RHEL/CentOS 6.4 with KVM
Description
We have a group of 13 KVM servers added to a single cluster within CloudStack. All VMs use local hypervisor storage, with the exception of one that was configured to use NFS-based primary storage with a HA service offering.
An issue occurred with the SAN serving the NFS mount (the primary storage for the HA VM), and the mount was put into a read-only state. Shortly after, every host in the cluster rebooted and remained in a reboot loop until I put the primary storage into maintenance mode. These messages appeared in agent.log on each of the KVM hosts:
2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host
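The log lines above reflect the KVM HA heartbeat behavior: the agent's HA monitor periodically writes a heartbeat to the NFS primary storage, retries on timeout, and fences the host by rebooting it once the retries are exhausted. A minimal sketch of that cycle (the function names and the retry count of 5 are illustrative assumptions, not the actual KVMHAMonitor code):

```python
MAX_RETRIES = 5  # assumption for illustration; the real agent's retry budget may differ

def heartbeat_cycle(write_heartbeat, reboot_host, mount_path):
    """Sketch of the write-retry-fence cycle implied by the agent.log messages.

    write_heartbeat(mount_path) -> bool: attempt to write the heartbeat file,
        returning False on timeout (e.g. when the NFS mount has gone read-only).
    reboot_host(): fence this host by rebooting it.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        if write_heartbeat(mount_path):
            return "ok"
        print(f"write heartbeat failed: timeout, retry: {attempt}")
    # All retries exhausted: the monitor assumes the host has lost storage
    # and reboots it so HA can restart its VMs elsewhere.
    print("write heartbeat failed: timeout; reboot the host")
    reboot_host()
    return "rebooted"
```

With a read-only mount, `write_heartbeat` fails on every attempt, so each host in the cluster reaches the `reboot_host()` branch, matching the reboot loop described above.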
In essence, a single HA-enabled VM was able to bring down an entire KVM cluster that was hosting a number of VMs on local storage. It would seem that the fencing script needs to be improved to account for cases where both local and shared storage are used.
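One possible direction for the improvement suggested above: before rebooting, the fencing logic could check whether the host is actually running any VMs backed by the unreachable shared storage pool; a host serving only local-storage VMs would then survive a shared-storage outage. This is a hypothetical policy sketch, not the actual fencing script:

```python
def should_fence(running_vms, failed_pool_uuid):
    """Hypothetical fencing decision: reboot only if some running VM on this
    host depends on the storage pool that failed the heartbeat.

    running_vms: list of dicts like {"name": ..., "pool_uuid": ...}
    failed_pool_uuid: UUID of the pool whose heartbeat writes timed out.
    """
    return any(vm["pool_uuid"] == failed_pool_uuid for vm in running_vms)
```

Under this policy, a host running only local-storage VMs (no entry matching the failed pool) would log the heartbeat failure but skip the reboot, while a host actually hosting the HA VM on that pool would still be fenced.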