  1. CloudStack
  2. CLOUDSTACK-4371

[Performance Testing] Basic zone with 20K Hosts, management server restart leaves the hosts in disconnected state for very long time

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 4.2.0
    • Fix Version/s: 4.3.0
    • Component/s: Management Server
    • Security Level: Public (Anyone can view this level - this is the default.)
    • Environment: Basic zone, with over 20K simulator hosts

    Description

      Basic zone performance test bed:
      20K simulator hosts,
      3 Management servers
      1 host/cluster
      Local storage

      Java heap size: 12GB
      db.cloud.maxActive=2000
      direct.agent.load.size=1000
      agent.lb.enabled=true

      Deploy around 20K simulator hosts across 3 clustered management servers
      (no VMs deployed yet).

      After all hosts are deployed, stop all 3 management servers, then start them again one after another.

      Result
      =====

      Hosts do not reach the connected state even after 10 minutes. Around 2K of them go into the Alert state, while the rest remain in the Disconnected state.

      mysql> select count(*), status, resource_state, type, mgmt_server_id from host group by mgmt_server_id, status, type, resource_state;
      +----------+--------------+----------------+--------------------+----------------+
      | count(*) | status       | resource_state | type               | mgmt_server_id |
      +----------+--------------+----------------+--------------------+----------------+
      |     1946 | Alert        | Enabled        | Routing            | NULL           |
      |    18054 | Disconnected | Enabled        | Routing            | NULL           |
      |        1 | Disconnected | Enabled        | SecondaryStorageVM | NULL           |
      +----------+--------------+----------------+--------------------+----------------+
      3 rows in set (0.11 sec)
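
      The counts in the query output can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the rows have been copied by hand into tuples; `ROWS` and `summarize_hosts` are hypothetical names used for illustration and are not part of CloudStack.

```python
from collections import Counter

# Rows copied by hand from the query output above:
# (count, status, resource_state, type, mgmt_server_id)
ROWS = [
    (1946, "Alert", "Enabled", "Routing", None),
    (18054, "Disconnected", "Enabled", "Routing", None),
    (1, "Disconnected", "Enabled", "SecondaryStorageVM", None),
]


def summarize_hosts(rows, host_type="Routing"):
    """Return per-status host totals for the given host type."""
    totals = Counter()
    for count, status, _resource_state, type_, _mgmt_server_id in rows:
        if type_ == host_type:
            totals[status] += count
    return totals


if __name__ == "__main__":
    totals = summarize_hosts(ROWS)
    # Print the overall number of Routing hosts and the per-status split.
    print(sum(totals.values()), dict(totals))
```

      For the rows above, the Routing totals add up to 20000 hosts, every one of them in the Alert or Disconnected state and with mgmt_server_id set to NULL.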

      The management server (MS) logs show many storage pool exceptions while hosts try to connect:

      2013-08-16 05:49:25,592 DEBUG [agent.transport.Request] (AgentTaskPool-12:null) Seq 13-32440322: Sending { Cmd , MgmtId: 206915885094132, via: 13, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CleanupNetworkRulesCmd":{"interval":2028,"wait":0}}] }
      2013-08-16 05:49:25,592 DEBUG [agent.transport.Request] (AgentTaskPool-12:null) Seq 13-32440322: Executing: { Cmd , MgmtId: 206915885094132, via: 13, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CleanupNetworkRulesCmd":{"interval":2028,"wait":0}}] }
      2013-08-16 05:49:25,592 DEBUG [xen.discoverer.XcpServerDiscoverer] (AgentTaskPool-14:null) Not XenServer so moving on.
      2013-08-16 05:49:25,592 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-14:null) Sending Connect to listener: DeploymentPlanningManagerImpl_EnhancerByCloudStack_76f3d8e4
      2013-08-16 05:49:25,591 DEBUG [cloud.resource.AgentResourceBase] (ClusteredAgentManager Timer:null) Deserializing simulated agent on reconnect
      2013-08-16 05:49:25,594 INFO [network.security.SecurityGroupListener] (AgentTaskPool-12:null) Scheduled network rules cleanup, interval=2028
      2013-08-16 05:49:25,594 INFO [network.security.SecurityGroupListener] (AgentTaskPool-12:null) Received a host startup notification
      2013-08-16 05:49:25,595 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Sending Connect to listener: StoragePoolMonitor

      ...
      ...

      2013-08-16 05:49:25,761 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Sending Connect to listener: ClusteredVirtualMachineManagerImpl_EnhancerByCloudStack_b5459b7b
      2013-08-16 05:49:25,764 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-12:null) Found 0 VMs for host 13
      2013-08-16 05:49:25,765 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Sending Connect to listener: LocalStoragePoolListener
      2013-08-16 05:49:25,768 DEBUG [datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl] (AgentTaskPool-12:null) createPool Params @ scheme - Filesystem storageHost - 172.1.3.131 hostPath - /mnt/2a2463b4-4fd2-4ac7-ad3f-040a3046e478 port - -1
      2013-08-16 05:49:25,771 DEBUG [datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl] (AgentTaskPool-12:null) Another active pool with the same uuid already exists
      2013-08-16 05:49:25,772 WARN [cloud.storage.StorageManagerImpl] (AgentTaskPool-12:null) Unable to setup the local storage pool for Host[-13-Routing]
      com.cloud.utils.exception.CloudRuntimeException: Another active pool with the same uuid already exists
      at org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.initialize(CloudStackPrimaryDataStoreLifeCycleImpl.java:319)
      at com.cloud.storage.StorageManagerImpl.createLocalStorage(StorageManagerImpl.java:647)
      at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
      at com.cloud.storage.LocalStoragePoolListener.processConnect(LocalStoragePoolListener.java:86)
      at com.cloud.agent.manager.AgentManagerImpl.notifyMonitorsOfConnection(AgentManagerImpl.java:587)
      at com.cloud.agent.manager.AgentManagerImpl.handleDirectConnectAgent(AgentManagerImpl.java:1479)
      at com.cloud.resource.ResourceManagerImpl.createHostAndAgent(ResourceManagerImpl.java:1739)
      at com.cloud.resource.ResourceManagerImpl.createHostAndAgent(ResourceManagerImpl.java:1901)
      at com.cloud.agent.manager.AgentManagerImpl$SimulateStartTask.run(AgentManagerImpl.java:1130)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:679)
      2013-08-16 05:49:25,773 INFO [utils.exception.CSExceptionErrorCode] (AgentTaskPool-12:null) Could not find exception: com.cloud.exception.ConnectionException in error code list for exceptions
      2013-08-16 05:49:25,773 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Monitor LocalStoragePoolListener says there is an error in the connect process for 13 due to Unable to setup the local storage pool for Host[-13-Routing]
      2013-08-16 05:49:25,773 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-12:null) Host 13 is disconnecting with event AgentDisconnected
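
      To gauge how many connect attempts fail this way, the management server logs can be scanned for the "Monitor ... says there is an error in the connect process" warning. The following is a rough sketch, assuming the message format matches the excerpt above; the regular expression and the `count_connect_failures` helper are illustrative and not part of CloudStack.

```python
import re
from collections import Counter

# Pattern modeled on the WARN line in the excerpt above; other
# CloudStack versions may word this message differently.
MONITOR_ERROR = re.compile(
    r"Monitor (?P<monitor>\S+) says there is an error in the connect "
    r"process for (?P<host_id>\d+) due to (?P<reason>.*)"
)


def count_connect_failures(lines):
    """Count failed connect attempts per monitor (listener) name."""
    failures = Counter()
    for line in lines:
        match = MONITOR_ERROR.search(line)
        if match:
            failures[match.group("monitor")] += 1
    return failures
```

      The function accepts any iterable of log lines, for example a file object opened with `gzip.open(path, "rt")` on one of the attached logs. A count dominated by LocalStoragePoolListener would be consistent with the "Another active pool with the same uuid already exists" exception shown in the excerpt.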

      Attachments

        1. agenttaskpool_334.log
          14 kB
          Sowmya Krishnan
        2. ms1_restartfail.log.gz
          1.59 MB
          Sowmya Krishnan
        3. ms2_restartfail.log.gz
          1.57 MB
          Sowmya Krishnan
        4. ms3_restartfail.log.gz
          1.65 MB
          Sowmya Krishnan

          People

            Assignee: Koushik Das (koushikd)
            Reporter: Sowmya Krishnan (sowmyak)