Details
- Type: Bug
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: 4.2.0
- Fix Version/s: None
- Security Level: Public (Anyone can view this level - this is the default.)
- Labels: None
- Environment: RHEL/CentOS 6.4 with KVM
Description
We have a group of 13 KVM servers added to a single cluster within CloudStack. All VMs use local hypervisor storage, with the exception of one that was configured to use NFS-based primary storage with a HA service offering.
An issue occurred with the SAN serving the NFS mount (the primary storage for the HA VM), and the mount was put into a read-only state. Shortly after, every host in the cluster rebooted and remained in a reboot loop until I put the primary storage into maintenance mode. These messages appeared in agent.log on each of the KVM hosts:
2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host
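The log lines above reflect the KVM HA heartbeat behavior: the agent's HA monitor periodically writes a heartbeat to the NFS primary storage, retries on timeout, and fences the host by rebooting it once the retries are exhausted. A minimal sketch of that cycle (the function names and the retry count of 5 are illustrative assumptions, not the actual KVMHAMonitor code):

```python
MAX_RETRIES = 5  # assumption for illustration; the real agent's retry budget may differ

def heartbeat_cycle(write_heartbeat, reboot_host, mount_path):
    """Sketch of the write-retry-fence cycle implied by the agent.log messages.

    write_heartbeat(mount_path) -> bool: attempt to write the heartbeat file,
        returning False on timeout (e.g. when the NFS mount has gone read-only).
    reboot_host(): fence this host by rebooting it.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        if write_heartbeat(mount_path):
            return "ok"
        print(f"write heartbeat failed: timeout, retry: {attempt}")
    # All retries exhausted: the monitor assumes the host has lost storage
    # and reboots it so HA can restart its VMs elsewhere.
    print("write heartbeat failed: timeout; reboot the host")
    reboot_host()
    return "rebooted"
```

With a read-only mount, `write_heartbeat` fails on every attempt, so each host in the cluster reaches the `reboot_host()` branch, matching the reboot loop described above.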
In essence, a single HA-enabled VM was able to bring down an entire KVM cluster that was hosting a number of VMs on local storage. It would seem that the fencing script needs to be improved to account for cases where both local and shared storage are used.
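One possible direction for the improvement suggested above: before rebooting, the fencing logic could check whether the host is actually running any VMs backed by the unreachable shared storage pool; a host serving only local-storage VMs would then survive a shared-storage outage. This is a hypothetical policy sketch, not the actual fencing script:

```python
def should_fence(running_vms, failed_pool_uuid):
    """Hypothetical fencing decision: reboot only if some running VM on this
    host depends on the storage pool that failed the heartbeat.

    running_vms: list of dicts like {"name": ..., "pool_uuid": ...}
    failed_pool_uuid: UUID of the pool whose heartbeat writes timed out.
    """
    return any(vm["pool_uuid"] == failed_pool_uuid for vm in running_vms)
```

Under this policy, a host running only local-storage VMs (no entry matching the failed pool) would log the heartbeat failure but skip the reboot, while a host actually hosting the HA VM on that pool would still be fenced.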