Stratos / STRATOS-706

member terminate event should log reason

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: FUTURE
    • Component/s: Autoscaler
    • Labels:
      None

      Description

      When Stratos terminates a member, it must log the reason. Ideally the logging should be systematic enough that one can grep by severity, by member, by event type, or by some other useful category.
      The justification for this defect is that it will greatly improve debugging and troubleshooting. Without such logging it is very difficult to debug member terminations.
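
      One possible reading of "systematic enough to grep" is a fixed key=value layout on every event line. The following is only a minimal sketch, written in Python for brevity (Stratos itself is Java). The field names `event`, `member` and `reason`, and the helpers `format_event`, `parse_event` and `select`, are hypothetical and are not existing Stratos output or API.

```python
import re

# Hypothetical layout: "<SEVERITY> key=value key=value ...".
# Values must not contain spaces, so reasons are written as tokens.
FIELD = re.compile(r"(\w+)=(\S+)")


def format_event(severity, event, member, reason):
    """Render one log line with fixed, grep-friendly key=value fields."""
    return f"{severity} event={event} member={member} reason={reason}"


def parse_event(line):
    """Split a line produced by format_event back into a dict of fields."""
    severity, _, rest = line.partition(" ")
    fields = dict(FIELD.findall(rest))
    fields["severity"] = severity
    return fields


def select(lines, **wanted):
    """Keep only the lines whose fields match every wanted key/value."""
    return [
        line
        for line in lines
        if all(parse_event(line).get(k) == v for k, v in wanted.items())
    ]


if __name__ == "__main__":
    lines = [
        format_event("INFO", "member_terminated", "m-1", "heartbeat_lost"),
        format_event("INFO", "member_terminated", "m-2", "scale_down"),
        format_event("DEBUG", "member_spawned", "m-3", "min_check"),
    ]
    # Filter by event type, by member, or by severity.
    for line in select(lines, event="member_terminated"):
        print(line)
```

      With a layout like this, a plain `grep "member=m-1"` or `grep "event=member_terminated"` on the log file would give the same categorization without any tooling.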

      For example, consider this sequence in the stratos log file:

      ===================
      TID: [0] [STRATOS] [2014-07-15 09:58:48,654] DEBUG {org.apache.stratos.cloud.controller.impl.CloudControllerServiceImpl} - Received an instance spawn request : MemberContext [memberId=null, nodeId=null, clusterId=cisco-gilan-appmgr-1.cisco-gil, cartridgeType=null, privateIpAddress=null, publicIpAddress=null, allocatedIpAddress=null, initTime=1405418328649, lbClusterId=null, networkPartitionId=OAM1] {org.apache.stratos.cloud.controller.impl.CloudControllerServiceImpl}

      TID: [0] [STRATOS] [2014-07-15 09:58:48,654] DEBUG {org.apache.stratos.cloud.controller.impl.CloudControllerServiceImpl} - Payload: SERVICE_NAME=cisco-gilan-appmgr,HOST_NAME=cisco-gilan-appmgr-1.qmog.cisco.com,MULTITENANT=false,TENANT_ID=-1234,TENANT_RANGE=-1234,CARTRIDGE_ALIAS=cisco-gilan-appmgr-1,CLUSTER_ID=cisco-gilan-appmgr-1.cisco-gil,CARTRIDGE_KEY=o1jbiPPmPWBgyNVM,DEPLOYMENT=default,REPO_URL=null,PORTS=9482,PUPPET_IP=PUPPET_IP,PUPPET_HOSTNAME=PUPPET_HOSTNAME,PUPPET_ENV=PUPPET_ENV,HEARTBEAT_AUTHKEY=20c9629a87f53ecdb5278d2ddb5a9d42,TRUSTSTORE_PASSWORD=wso2carbon,CEP_PORT=7611,MONITORING_SERVER_SECURE_PORT=0,MB_PORT=61616,OPENSTACK_COMPUTE_DNS=10.58.10.82,MB_IP=octl-01.qmog.cisco.com,QSB_PUPPET_ENVIR=,CEP_IP=octl-01.qmog.cisco.com,VSM_USER=admin,VEM_IP=192.168.66.43,ENABLE_DATA_PUBLISHER=false,MONITORING_SERVER_ADMIN_PASSWORD=xxxx,MONITORING_SERVER_IP=octl-01.qmog.cisco.com,VEM_USER=ubuntu,VEM_PWD=ubuntu,COMMIT_ENABLED=false,MONITORING_SERVER_ADMIN_USERNAME=xxxx,CERT_TRUSTSTORE=/opt/apache-stratos-cartridge-agent/security/client-truststore.jks,VSM_PWD=Starent123!,VSM_IP=192.168.66.2,MONITORING_SERVER_PORT=0,APPMGR_GITREPO=ssh://jenapper@10.58.10.189/home/jenapper/code/eccentrica.git,MEMBER_ID=cisco-gilan-appmgr-1.cisco-gil7ef7327f-2bb2-4768-820f-d064de29aa59,LB_CLUSTER_ID=null,NETWORK_PARTITION_ID=OAM1,PARTITION_ID=RegionOne-AZ-1 {org.apache.stratos.cloud.controller.impl.CloudControllerServiceImpl}

      TID: [0] [STRATOS] [2014-07-15 09:58:55,888] INFO {org.apache.stratos.cloud.controller.impl.CloudControllerServiceImpl} - Member is terminated: MemberContext [memberId=cisco-gilan-appmgr-1.cisco-gil407f5bdc-aad2-4234-80fc-6cdf17be6192, nodeId=RegionOne/89433818-21ed-48d4-bd8f-c396ab30f6d2, clusterId=cisco-gilan-appmgr-1.cisco-gil, cartridgeType=cisco-gilan-appmgr, privateIpAddress=192.168.66.1, publicIpAddress=null, allocatedIpAddress=null, initTime=1405417410736, lbClusterId=null, networkPartitionId=OAM1] {org.apache.stratos.cloud.controller.impl.CloudControllerServiceImpl}

      ===================

      The problem is that Stratos gives no indication of why it is doing this [1]. Stratos should be enhanced so that the above message indicates why the member is being terminated (loss of heartbeats, timeout on port knocking, etc.). This is needed as Apache Stratos expands its user base.
      This issue has high priority because it affects the efficiency of troubleshooting and system stability.
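
      As an illustration of the requested behaviour, the terminate call could require a reason from its caller and include it in the single "Member is terminated" message. This is only a sketch in Python (Stratos itself is Java); `TerminationReason`, `terminate_member` and the logger name are hypothetical and do not correspond to existing Stratos classes or methods. The reason values are taken from the examples mentioned in this issue.

```python
import logging
from enum import Enum

# Hypothetical logger name, standing in for the Cloud Controller's logger.
log = logging.getLogger("cloud_controller_sketch")


class TerminationReason(Enum):
    """Reasons a caller may give; values mirror examples from this issue."""
    HEARTBEAT_LOST = "loss of heartbeats"
    PORT_CHECK_TIMEOUT = "timeout on port knocking"
    SCALE_DOWN = "scale-down"
    OBSOLETE_MEMBER = "obsolete member"


def terminate_member(member_id: str, reason: TerminationReason) -> str:
    """Log one terminate message that carries the caller's reason.

    The reason is mandatory, so the message is self-explanatory
    regardless of the log level configured in the calling component.
    """
    if not isinstance(reason, TerminationReason):
        raise TypeError("a TerminationReason is required")
    message = (
        f"Member is terminated: memberId={member_id}, reason={reason.value}"
    )
    log.info(message)
    # The actual instance teardown would happen here; omitted in this sketch.
    return message


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    terminate_member("example-member-1", TerminationReason.HEARTBEAT_LOST)
```

      Making the reason a required argument means a caller cannot terminate a member without stating why, which keeps the explanation next to the event instead of in a separate component's log.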

        Activity

        shahhaqu@cisco.com Shaheed Haque added a comment -

        That's what I thought.

        So now you see the problem: the extract I provided was the complete lifecycle of that instance, and there is nothing in the log to indicate "why" the termination happened. Either the log levels in all the callers need to be matched (which is itself sucky for both modularity and readability) or they should provide information along the lines I indicated so the single terminate message has the needed info.

        meppel Martin Eppel added a comment -

        I checked the code and found a few instances of VM termination:

        1. Scale-down
        • Logging (at info level) is provided (see scale.drl, "[scale-down] Trying to terminating an instace to scale down!").
        2. Termination of obsolete member
        • Logging is provided in the prelude to the obsolete member check, but one could argue it is missing in the execution part of the rule (compare scale.drl; see mincheck.drl, rule "Terminate Obsoleted Instances").
        • More information is available at debug level (RuleLog.debug).
        3. Unregistering of subscription
        • Logging seems to be missing when VMs (members) are terminated (CloudControllerServiceImpl.unregisterService).
        4. Failed dependency check (applies only to the local POC grouping branch)
        • Logging was missing when instances were terminated because of a failed dependency check; it has since been added.

        Are there other instances of VM termination that I might have missed?
        nirmal Nirmal Fernando added a comment -

        Hi Shaheed,

        All logs should be in the wso2carbon.log file of the single JVM.

        shahhaqu@cisco.com Shaheed Haque added a comment -

        Any update on this, or am I missing something obvious?

        shahhaqu@cisco.com Shaheed Haque added a comment -

        I am confused. I thought that with the single JVM packaging, all Stratos logs went to /opt/wso2/apache-stratos/repository/logs/wso2carbon.log (and its neighbours):

        root@octl-01:~# ls -l /opt/wso2/apache-stratos/repository/logs/
        total 364
        -rw-r--r-- 1 root root 6369 Jul 17 09:08 aggregate.log
        -rw-r--r-- 1 root root 0 Jul 17 09:03 audit.log
        -rw-r--r-- 1 root root 7751 Jul 17 09:05 http_access_2014-07-17.log
        -rw-r--r-- 1 root root 6805 Jul 17 09:02 patches.log
        -rw-r--r-- 1 root root 3779 Jul 17 09:03 tm.out
        -rw-r--r-- 1 root root 239236 Jul 17 09:07 wso2carbon.log
        -rw-r--r-- 1 root root 0 Jul 17 09:02 wso2carbon-trace-messages.log
        -rw-r--r-- 1 root root 101022 Jul 17 09:08 wso2-cep-trace.log

        Where are the autoscaler logs?

        Thanks, Shaheed

        meppel@cisco.com Martin Eppel (meppel) added a comment -

        Ok,

        Let us verify that the logs from the autoscaler satisfy the request for sufficient logging. If they do, I'll close this issue; otherwise I'll update the JIRA.
        Thanks

        Martin

        From: Udara Liyanage udara@wso2.com
        Sent: Wednesday, July 16, 2014 9:24 PM
        To: dev
        Cc: dev@stratos.incubator.apache.org
        Subject: Re: [jira] [Commented] (STRATOS-706) member terminate event should log reason

        Hi Martin,

        The job of the CC is to spawn/terminate instances. The AS is the one that decides when/what to start and when to terminate. So, as Nirmal said, have a look at the AS logs to find the reason for termination/spawning.

        On Thu, Jul 17, 2014 at 5:09 AM, Nirmal Fernando (JIRA) <jira@apache.org> wrote:

        [ https://issues.apache.org/jira/browse/STRATOS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064337#comment-14064337 ]

        Nirmal Fernando commented on STRATOS-706:
        -----------------------------------------

        On Thu, Jul 17, 2014 at 1:11 AM, Martin Eppel (JIRA) <jira@apache.org>

        The entire log extract you quoted is from the Cloud Controller. What the CC
        does is provide an API to terminate instances. The caller of this API, i.e.
        the auto-scaler, is the one that logs the reason for calling the CC to
        terminate instances. Did you check the auto-scaler logs?


        Best Regards,
        Nirmal

        Nirmal Fernando.
        PPMC Member & Committer of Apache Stratos,
        Senior Software Engineer, WSO2 Inc.

        Blog: http://nirmalfdo.blogspot.com/


        This message was sent by Atlassian JIRA
        (v6.2#6252)

        Udara Liyanage
        Software Engineer
        WSO2, Inc.: http://wso2.com
        lean. enterprise. middleware
        web: http://udaraliyanage.wordpress.com
        phone: +94 71 443 6897

        udaraliyanage Udara Liyanage added a comment -

        Hi Martin,

        The job of the CC is to spawn/terminate instances. The AS is the one that
        decides when/what to start and when to terminate. So, as Nirmal said, have a
        look at the AS logs to find the reason for termination/spawning.

        On Thu, Jul 17, 2014 at 5:09 AM, Nirmal Fernando (JIRA) <jira@apache.org>

        Udara Liyanage
        Software Engineer
        WSO2, Inc.: http://wso2.com
        lean. enterprise. middleware

        web: http://udaraliyanage.wordpress.com
        phone: +94 71 443 6897

        nirmal Nirmal Fernando added a comment -

        On Thu, Jul 17, 2014 at 1:11 AM, Martin Eppel (JIRA) <jira@apache.org>

        The entire log extract you quoted is from the Cloud Controller. What the CC
        does is provide an API to terminate instances. The caller of this API, i.e.
        the auto-scaler, is the one that logs the reason for calling the CC to
        terminate instances. Did you check the auto-scaler logs?


        Best Regards,
        Nirmal

        Nirmal Fernando.
        PPMC Member & Committer of Apache Stratos,
        Senior Software Engineer, WSO2 Inc.

        Blog: http://nirmalfdo.blogspot.com/

        snowch chris snow added a comment -

        I also think this sort of information should be made available in the Stratos Manager user interface, so that tenants without access to the logs can see what is going on with their cartridges.


          People

          • Assignee: Vishanth
          • Reporter: Martin Eppel
          • Votes: 0
          • Watchers: 6
