Uploaded image for project: 'CloudStack'
  1. CloudStack
  2. CLOUDSTACK-2918

In a scaled up environment, hosts fail to come up after Management server restart in clustered set up

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 4.2.0
    • 4.2.1
    • Management Server
    • Security Level: Public (Anyone can view this level - this is the default.)
    • None
    • Load Test environment with simulator

    Description

      Scaled setup with 20K simulated hosts and as many VMs, Basic Zone, Clustered setup with 3 management servers.

      After restarting one or all the Management servers, re balancing hosts takes very long time, and the rebalanced hosts don't come to Up state either.
      Following is the error message from logs when the MS was starting up:

      2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.2, msid: 206915885094132
      2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.66, msid: 206915885097283
      2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs
      2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.2, msidL: 206915885094132
      2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.66, msidL: 206915885097283
      2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Management server node 10.223.48.66 is up, send alert
      2013-05-31 00:00:22,539 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notifying management server join event took 178 ms
      2013-05-31 00:00:25,825 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 5583ms, profilerH
      eartbeatUpdate: Done. Duration: 187ms, profilerPeerScan: Done. Duration: 560ms, profilerAgentLB: Done. Duration: 4836ms
      2013-05-31 00:00:25,963 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Detected management node joined, id:3, nodeIP:10.223.48.130
      2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notify management server node join to listeners.
      2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.130, msid: 206915885093830
      2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs
      2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.130, msidL: 206915885093830
      2013-05-31 00:00:25,964 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notifying management server join event took 0 ms
      2013-05-31 00:00:27,501 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH
      eartbeatUpdate: Done. Duration: 135ms, profilerPeerScan: Done. Duration: 179ms, profilerAgentLB: Done. Duration: 1361ms
      2013-05-31 00:00:29,389 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1888ms, profilerH
      eartbeatUpdate: Done. Duration: 462ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1424ms
      2013-05-31 00:00:33,912 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1643ms, profilerH
      eartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1501ms
      2013-05-31 00:00:41,328 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH
      eartbeatUpdate: Done. Duration: 178ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1495ms
      2013-05-31 00:00:48,799 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Running Capacity Checker ...
      2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) recalculating system capacity
      2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing cpu/ram capacity update
      2013-05-31 00:00:48,883 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null) Found 0 routers to update status.
      2013-05-31 00:00:48,901 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null) Found 0 networks to update RvR status.
      2013-05-31 00:00:50,031 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1596ms, profilerHeartbeatUpdate: Done. Duration: 179ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1416ms
      2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:2 usedCpu: 500 reservedCpu: 0
      2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0
      2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0
      2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0
      2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0
      ...
      ...
      ...

      2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0
      2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0
      2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0
      2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0
      2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:5 usedMem: 0 reservedMem: 0
      2013-05-31 00:00:50,414 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:6 usedCpu: 0 reservedCpu: 0
      ...
      ...
      ...

      2013-05-31 00:03:05,249 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:20002 usedMem: 0 reservedMem: 0
      2013-05-31 00:03:05,249 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done executing cpu/ram capacity update
      2013-05-31 00:03:05,250 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing storage capacity update
      2013-05-31 00:03:05,890 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId - 20
      0, PodId 1
      2013-05-31 00:03:06,080 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId - 20
      1, PodId 1

      2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done executing storage capacity update
      2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing capacity updates for public ip and Vlans
      2013-05-31 00:04:33,488 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1565ms, profilerHeartbeatUpdate: Done. Duration: 122ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1441ms
      2013-05-31 00:04:35,070 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1582ms, profilerHeartbeatUpdate: Done. Duration: 172ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1408ms
      2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done capacity updates for public ip and Vlans
      2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing capacity updates for private ip
      2013-05-31 00:04:37,155 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2085ms, profilerHeartbeatUpdate: Done. Duration: 158ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1925ms

      ...
      ...
      ...

      2013-05-31 01:56:10,384 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2105ms, profilerHeartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1963ms
      2013-05-31 01:56:12,043 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1659ms, profilerHeartbeatUpdate: Done. Duration: 120ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1537ms
      2013-05-31 01:56:14,172 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2129ms, profilerHeartbeatUpdate: Done. Duration: 96ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 2031ms

      All 3 Management servers have the same above message and the hosts continue to be in disconnected state

      mysql> select count, status, type, mgmt_server_id from host group by mgmt_server_id, status, type;
      -------------------------------------------------------+

      count status type mgmt_server_id

      -------------------------------------------------------+

      1 Alert SecondaryStorage NULL
      17854 Disconnected Routing NULL
      1 Disconnected SecondaryStorageVM NULL
      53 Disconnected Routing 206915885093830
      1485 Disconnected Routing 206915885094132
      608 Disconnected Routing 206915885097283

      -------------------------------------------------------+
      6 rows in set (0.07 sec)

      Observed this issue even with lesser hosts ~5K

      Attachments

        Activity

          People

            koushikd Koushik Das
            sowmyak Sowmya Krishnan
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: