CLOUDSTACK-5859

[HA] Shared storage failure results in reboot loop; VMs with Local storage brought offline


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 4.2.0
    • Fix Version/s: None
    • Component/s: KVM
    • Security Level: Public (Anyone can view this level - this is the default.)
    • Labels: None
    • Environment: RHEL/CentOS 6.4 with KVM

    Description

      We have a group of 13 KVM servers added to a single cluster within CloudStack. All VMs use local hypervisor storage, with the exception of one that was configured to use NFS-based primary storage with an HA service offering.

      An issue occurred with the SAN serving the NFS mount (the primary storage for the HA VM), and the mount was put into a read-only state. Shortly after, every host in the cluster rebooted and remained in a reboot loop until I put the primary storage into maintenance mode. These messages appeared in agent.log on each of the KVM hosts:

      2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
      2014-01-12 02:40:20,953 WARN [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host
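
      The behavior behind these log lines can be sketched roughly as follows. This is a simplified illustration of the heartbeat-write-and-fence pattern, not the actual KVMHAMonitor source; the heartbeat path, interval, and retry budget are assumptions inferred from the log output above.

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.time.Instant;

      // Simplified sketch of the reported behavior: the agent periodically
      // writes a heartbeat file to each NFS primary storage pool and, after
      // a fixed number of failed writes, reboots the entire host.
      public class HeartbeatSketch {
          private static final int MAX_RETRIES = 5;      // assumed retry budget
          private static final Path HEARTBEAT_FILE =
                  Path.of("/mnt/primary/hb-kvm-host1");  // hypothetical path

          public static void main(String[] args) throws Exception {
              int retries = 0;
              while (true) {
                  try {
                      // If the NFS mount goes read-only or hangs, this fails.
                      Files.writeString(HEARTBEAT_FILE, Instant.now().toString());
                      retries = 0;
                  } catch (IOException e) {
                      retries++;
                      System.err.printf("write heartbeat failed: %s, retry: %d%n",
                              e.getMessage(), retries);
                      if (retries >= MAX_RETRIES) {
                          // The reported problem: the host reboots even when
                          // every VM it runs uses only local storage.
                          System.err.println("write heartbeat failed; reboot the host");
                          Runtime.getRuntime().exec(new String[] {"reboot"});
                          return;
                      }
                  }
                  Thread.sleep(60_000); // assumed heartbeat interval
              }
          }
      }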

      In essence, a single HA-enabled VM was able to bring down an entire KVM cluster that was hosting a number of VMs with local storage. The fencing logic should be improved to account for cases where both local and shared storage are used; one possible approach is sketched below.
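
      A possible refinement would be to fence the host only when it is actually running HA VMs whose volumes live on the failed shared pool. This is a hypothetical design sketch, not existing CloudStack code; the helper names (haVmsOnPool, rebootHost) are illustrative.

      import java.util.Set;

      // Hypothetical fencing decision that distinguishes hosts running
      // shared-storage HA VMs from hosts running only local-storage VMs.
      public class FencingSketch {
          /** Names of VMs on this host with volumes on the given shared pool. */
          static Set<String> haVmsOnPool(String poolUuid) {
              // A real agent would derive this from libvirt domain XML and
              // the pool's mount point; stubbed out here.
              return Set.of();
          }

          static void rebootHost() {
              System.err.println("fencing: rebooting the host");
          }

          static void onHeartbeatFailure(String poolUuid) {
              Set<String> affected = haVmsOnPool(poolUuid);
              if (affected.isEmpty()) {
                  // Only local-storage VMs are running: marking the pool as
                  // failed is enough, and a reboot would needlessly kill
                  // healthy local VMs across the whole cluster.
                  System.err.printf(
                          "pool %s unreachable, no HA VMs affected; skipping reboot%n",
                          poolUuid);
              } else {
                  // Shared-storage HA VMs could face split-brain corruption,
                  // so the existing behavior (fence the host) still applies.
                  rebootHost();
              }
          }
      }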



    People

    • Assignee: Unassigned
    • Reporter: Dave Garbus (dgarbus)
    • Votes: 1
    • Watchers: 4
