[CLOUDSTACK-2918] In a scaled up environment, hosts fail to come up after Management server restart in clustered set up - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.2.0
Fix Version/s: 4.2.1
Component/s: Management Server
Security Level: Public (Anyone can view this level - this is the default.)
Labels:
None
Environment:
Load Test environment with simulator

Description

Scaled setup with 20K simulated hosts and as many VMs, Basic Zone, Clustered setup with 3 management servers.

After restarting one or all the Management servers, re balancing hosts takes very long time, and the rebalanced hosts don't come to Up state either.
Following is the error message from logs when the MS was starting up:

2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.2, msid: 206915885094132
2013-05-31 00:00:22,361 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.66, msid: 206915885097283
2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs
2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.2, msidL: 206915885094132
2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.66, msidL: 206915885097283
2013-05-31 00:00:22,362 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Management server node 10.223.48.66 is up, send alert
2013-05-31 00:00:22,539 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notifying management server join event took 178 ms
2013-05-31 00:00:25,825 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 5583ms, profilerH
eartbeatUpdate: Done. Duration: 187ms, profilerPeerScan: Done. Duration: 560ms, profilerAgentLB: Done. Duration: 4836ms
2013-05-31 00:00:25,963 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Detected management node joined, id:3, nodeIP:10.223.48.130
2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notify management server node join to listeners.
2013-05-31 00:00:25,964 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Joining node, IP: 10.223.48.130, msid: 206915885093830
2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Receive cluster alert, EventArgs: com.cloud.cluster.ClusterNodeJoinEventArgs
2013-05-31 00:00:25,964 DEBUG [cloud.alert.ClusterAlertAdapter] (Cluster-Notification-1:null) Handle cluster node join alert, joined node: 10.223.48.130, msidL: 206915885093830
2013-05-31 00:00:25,964 WARN [cloud.cluster.ClusterManagerImpl] (Cluster-Notification-1:null) Notifying management server join event took 0 ms
2013-05-31 00:00:27,501 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH
eartbeatUpdate: Done. Duration: 135ms, profilerPeerScan: Done. Duration: 179ms, profilerAgentLB: Done. Duration: 1361ms
2013-05-31 00:00:29,389 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1888ms, profilerH
eartbeatUpdate: Done. Duration: 462ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1424ms
2013-05-31 00:00:33,912 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1643ms, profilerH
eartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1501ms
2013-05-31 00:00:41,328 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1675ms, profilerH
eartbeatUpdate: Done. Duration: 178ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1495ms
2013-05-31 00:00:48,799 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Running Capacity Checker ...
2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) recalculating system capacity
2013-05-31 00:00:48,805 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing cpu/ram capacity update
2013-05-31 00:00:48,883 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null) Found 0 routers to update status.
2013-05-31 00:00:48,901 DEBUG [network.router.VirtualNetworkApplianceManagerImpl] (RouterStatusMonitor-1:null) Found 0 networks to update RvR status.
2013-05-31 00:00:50,031 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1596ms, profilerHeartbeatUpdate: Done. Duration: 179ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1416ms
2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:2 usedCpu: 500 reservedCpu: 0
2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0
2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0
2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0
2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0
...
...
...

2013-05-31 00:00:50,388 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:2 usedMem: 268435456 reservedMem: 0
2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:3 usedCpu: 500 reservedCpu: 0
2013-05-31 00:00:50,397 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:3 usedMem: 1073741824 reservedMem: 0
2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:5 usedCpu: 0 reservedCpu: 0
2013-05-31 00:00:50,406 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:5 usedMem: 0 reservedMem: 0
2013-05-31 00:00:50,414 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate cpu capacity, host:6 usedCpu: 0 reservedCpu: 0
...
...
...

2013-05-31 00:03:05,249 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:null) No need to calibrate memory capacity, host:20002 usedMem: 0 reservedMem: 0
2013-05-31 00:03:05,249 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done executing cpu/ram capacity update
2013-05-31 00:03:05,250 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing storage capacity update
2013-05-31 00:03:05,890 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId - 20
0, PodId 1
2013-05-31 00:03:06,080 DEBUG [cloud.storage.StorageManagerImpl] (CapacityChecker:null) Successfully set Capacity - 1099511627776 for capacity type - 9 , DataCenterId - 1, HostOrPoolId - 20
1, PodId 1

2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done executing storage capacity update
2013-05-31 00:04:32,680 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing capacity updates for public ip and Vlans
2013-05-31 00:04:33,488 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1565ms, profilerHeartbeatUpdate: Done. Duration: 122ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1441ms
2013-05-31 00:04:35,070 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1582ms, profilerHeartbeatUpdate: Done. Duration: 172ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1408ms
2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Done capacity updates for public ip and Vlans
2013-05-31 00:04:35,206 DEBUG [cloud.alert.AlertManagerImpl] (CapacityChecker:null) Executing capacity updates for private ip
2013-05-31 00:04:37,155 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2085ms, profilerHeartbeatUpdate: Done. Duration: 158ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1925ms

...
...
...

2013-05-31 01:56:10,384 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2105ms, profilerHeartbeatUpdate: Done. Duration: 141ms, profilerPeerScan: Done. Duration: 1ms, profilerAgentLB: Done. Duration: 1963ms
2013-05-31 01:56:12,043 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 1659ms, profilerHeartbeatUpdate: Done. Duration: 120ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 1537ms
2013-05-31 01:56:14,172 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Management server heartbeat takes too long to finish. profiler: Done. Duration: 2129ms, profilerHeartbeatUpdate: Done. Duration: 96ms, profilerPeerScan: Done. Duration: 2ms, profilerAgentLB: Done. Duration: 2031ms

All 3 Management servers have the same above message and the hosts continue to be in disconnected state

mysql> select count, status, type, mgmt_server_id from host group by mgmt_server_id, status, type;
-------------------------------------------------------+

count

status

type

mgmt_server_id

-------------------------------------------------------+

1	Alert	SecondaryStorage	NULL
17854	Disconnected	Routing	NULL
1	Disconnected	SecondaryStorageVM	NULL
53	Disconnected	Routing	206915885093830
1485	Disconnected	Routing	206915885094132
608	Disconnected	Routing	206915885097283

-------------------------------------------------------+
6 rows in set (0.07 sec)

Observed this issue even with lesser hosts ~5K

In a scaled up environment, hosts fail to come up after Management server restart in clustered set up

Details

Description

Attachments

Activity

People

Dates