Whirr / WHIRR-167

Improve bootstrapping and configuration to be able to isolate and repair or evict failing nodes on EC2

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Amazon EC2

      Description

      The cluster startup process on Amazon EC2 instances is currently very unstable. As the number of nodes to start increases, the startup fails more often, but sometimes even a 2-3 node startup fails. We don't know how many instance startups are in progress on Amazon's side when a launch fails or succeeds. The only thing I see is that when I start around 10 nodes, the proportion of failing nodes is higher than with a smaller cluster, and it is not directly proportional to the number of nodes; the probability that some nodes fail appears to grow much faster than linearly.

      Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for RunNodesException and don't bail out if only a few " which indicates how unreliable the current startup process is. We should improve it.

      We could add a "max percent failure" property (per instance template), so that if the number of failures exceeds this value the whole cluster fails to launch and is shut down. For the master node the value would be 100%, but for datanodes it would be more like 75%. (Tom White also mentioned this in an email.)
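
      As a rough sketch of the check being proposed (illustrative only; the property name and final semantics are worked out in the comments below), a launch of a template group would pass when the surviving nodes still meet the configured threshold:

      // Hypothetical helper: does a template group with `requested` nodes, of which
      // `failed` did not come up, still satisfy its minimum-success percentage?
      public final class FailureThresholdSketch {
        static boolean groupPasses(int requested, int failed, int minSuccessPercent) {
          int started = requested - failed;
          return started * 100 >= requested * minSuccessPercent;
        }

        public static void main(String[] args) {
          System.out.println(groupPasses(1, 0, 100));   // master group: any failure is fatal
          System.out.println(groupPasses(10, 2, 75));   // 8 of 10 started, 80% >= 75% -> true
          System.out.println(groupPasses(10, 3, 75));   // 7 of 10 started, 70% <  75% -> false
        }
      }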

      Let's discuss whether there are any other requirements for this improvement.

      1. whirr.log
        35 kB
        Tibor Kiss
      2. whirr.log
        5 kB
        Tibor Kiss
      3. whirr-167-1.patch
        16 kB
        Tibor Kiss
      4. whirr-167-2.patch
        50 kB
        Tibor Kiss
      5. whirr-167-3.patch
        53 kB
        Tibor Kiss
      6. whirr-167-4.patch
        58 kB
        Tibor Kiss
      7. whirr-167-5.patch
        60 kB
        Tibor Kiss
      8. whirr-167-7.patch
        64 kB
        Tibor Kiss
      9. WHIRR-167-cleanup.patch
        3 kB
        Adrian Cole
      10. whirr-integrationtest.tar.gz
        5 kB
        Tibor Kiss

        Activity

        Tibor Kiss added a comment -

        I attached a whirr.log which captures the most common case: a node fails while the cluster is starting and ends up breaking the entire cluster.

        Tom White added a comment -

        This covers the requirements I think. If the namenode fails to start then we want to fail the whole cluster (in the future we may be able to do better than this, by starting a new namenode, but this is the behaviour we want first of all).

        Note that "max percent failure" property is per instance template (group of instances). For example, imagine that we have "1 nn+jt,10 dn+tt" with 100% max percent failure for nn+jt, and 75% for dn+tt. If the namenode fails then the whole cluster fails, whereas up to two datanodes can fail and the cluster will still run.

        In terms of testing, Mockito (which we're already using) can be used to simulate the relevant exceptions.

        Tibor Kiss added a comment - - edited

        I attached whirr-167-1.patch. I'm sure it is not the final one, but I would like to hear your opinions.

        I changed ClusterSpec and InstanceTemplate so that a minimum percentage of successfully started nodes can be specified per template.
        If nothing is specified, the minimum is 100%, so a value of
        whirr.instance-templates=1 jt+nn,4 dn+tt%60
        means that the "jt+nn" group passes only when 100% of its nodes start successfully and
        the "dn+tt" group passes only when at least 60% of its nodes start successfully.

        If any of the groups does not meet its minimum requirement, a retry phase is initiated in which the failed nodes of each group are replaced with new ones. That means that even a namenode startup problem does not necessarily mean the cluster is lost.
        Without retries, a namenode failure would break an entire cluster even when many dn+tt nodes started successfully. I think it is worth minimizing the chance of failing in this way, so I introduced a retry cycle.
        If there are failures only in dn+tt and the minimum limit is still met, the cluster starts with that reduced number of nodes and no retry is attempted.
        A retry cycle gives both groups a chance to grow back to their full size.

        At this moment I don't think more than one retry is worthwhile; the goal is just to recover from a few sporadic provisioning problems.
        My question is whether we should always retry once when there are insufficient nodes, or default to no retry and add an extra parameter to enable it. Initially I don't like the idea of adding more parameters.

        About failing nodes, there are two different cases:
        1. If the minimum required number of nodes cannot be satisfied even by the retry cycle, all of the lost nodes are left as they are; a full cluster destroy will remove them.
        2. If the required number of nodes is satisfied by the first round or by the retry, all failed nodes (from both the first round and the retry cycle) are destroyed automatically at the end of BootstrapClusterAction.doAction.

        I experienced some difficulties destroying the nodes. Initially I used the destroyNodesMatching(Predicate<NodeMetadata> filter) method, which terminates all of the enumerated nodes in parallel, but it also tries to delete the security group and placement group. I then had to use the simple destroyNode(String id), which deletes the nodes sequentially, and I cannot control the KeyPair deletion. In my opinion the jclouds library is missing convenient methods to revoke a subset of nodes without also propagating the KeyPair, SecurityGroup and PlacementGroup cleanup. Effectively I got stuck here and could not find an elegant solution that avoids that cleanup.
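
        As a minimal sketch, assuming jclouds' ComputeService/NodeMetadata and Guava's Predicate (the class and method names here are illustrative, not Whirr's actual code), the id-based filter used with destroyNodesMatching could look like this:

        import java.util.Set;
        import com.google.common.base.Predicate;
        import com.google.common.collect.ImmutableSet;
        import org.jclouds.compute.domain.NodeMetadata;

        public final class NodePredicatesSketch {
          // Matches any node whose id is in the given set of failed-node ids.
          public static Predicate<NodeMetadata> withIds(String... ids) {
            final Set<String> idSet = ImmutableSet.copyOf(ids);
            return new Predicate<NodeMetadata>() {
              @Override
              public boolean apply(NodeMetadata node) {
                return node != null && idSet.contains(node.getId());
              }
            };
          }
          // Usage (hypothetical): computeService.destroyNodesMatching(withIds(badIds));
        }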

        About simulating the retry with Mockito...
        Unfortunately Mockito cannot mock the static ComputeServiceContextBuilder.build(clusterSpec) method, so I could not write a JUnit test for the retry. I could only test the retry and the cleanup of bad nodes by temporarily hardcoding the exception and running a live integration test. If somebody has an idea how to mock all those static methods in BootstrapClusterAction, feel free to point me to a solution.

        Adrian Cole added a comment -

        jclouds only removes the security groups and placement groups when there are no nodes left in the tag. are you experiencing something different?

        Tibor Kiss added a comment - - edited

        I had to destroy two failing nodes and only the first one was terminated. Unfortunately I saw an exception complaining about the attempted deletion of the security group, and that exception broke my integration JUnit test. I'm not sure about the keypair deletions; I had the impression that it cleared all the keys for my cluster. I am repeating the test to capture the stack traces.

        Tibor Kiss added a comment - - edited

        Here it is. I repeated the simulation of 1+2 failing nodes on a 1+2 cluster, and I use
        computeService.destroyNodesMatching(withIds(badIds))
        where Predicate<NodeMetadata> withIds(String... ids) is the filter.

        On deleting the first failing node it breaks my integration test:

        org.apache.whirr.service.hadoop.integration.HadoopServiceTest Time elapsed: 0 sec <<< ERROR!
        org.jclouds.aws.AWSResponseException: request POST https://ec2.us-east-1.amazonaws.com/ HTTP/1.1 failed with code 400, error: AWSError{requestId='3fa71f7d-508f-4a95-aa9c-af0c3e060035', requestToken='null', code='InvalidGroup.InUse', message='There are active instances using security group 'jclouds#hadoopclustertest#us-east-1'', context='

        {Response=, Errors=}

        '}
        at org.jclouds.aws.handlers.ParseAWSErrorFromXmlContent.handleError(ParseAWSErrorFromXmlContent.java:80)
        at org.jclouds.http.handlers.DelegatingErrorHandler.handleError(DelegatingErrorHandler.java:70)
        at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.shouldContinue(BaseHttpCommandExecutorService.java:193)
        at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.call(BaseHttpCommandExecutorService.java:163)
        at org.jclouds.http.internal.BaseHttpCommandExecutorService$HttpResponseCallable.call(BaseHttpCommandExecutorService.java:132)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

        in jclouds-compute.log I have:
        2011-01-09 23:40:45,414 DEBUG [jclouds.compute] (main) >> destroying nodes matching(withIds([us-east-1/i-698f4a05]))
        2011-01-09 23:40:53,202 DEBUG [jclouds.compute] (user thread 13) >> destroying node(us-east-1/i-698f4a05)
        2011-01-09 23:41:24,186 DEBUG [jclouds.compute] (user thread 13) << destroyed node(us-east-1/i-698f4a05) success(false)
        2011-01-09 23:41:24,186 DEBUG [jclouds.compute] (main) << destroyed(1)
        2011-01-09 23:41:24,380 DEBUG [jclouds.compute] (main) >> deleting keyPair(jclouds#hadoopclustertest#us-east-1#40)
        2011-01-09 23:41:24,547 DEBUG [jclouds.compute] (main) << deleted keyPair(jclouds#hadoopclustertest#us-east-1#40)
        2011-01-09 23:41:24,547 DEBUG [jclouds.compute] (main) >> deleting keyPair(jclouds#hadoopclustertest#us-east-1#67)
        2011-01-09 23:41:24,716 DEBUG [jclouds.compute] (main) << deleted keyPair(jclouds#hadoopclustertest#us-east-1#67)
        2011-01-09 23:41:24,716 DEBUG [jclouds.compute] (main) >> deleting keyPair(jclouds#hadoopclustertest#us-east-1#63)
        2011-01-09 23:41:24,882 DEBUG [jclouds.compute] (main) << deleted keyPair(jclouds#hadoopclustertest#us-east-1#63)
        2011-01-09 23:41:24,883 DEBUG [jclouds.compute] (main) >> deleting keyPair(jclouds#hadoopclustertest#us-east-1#2)
        2011-01-09 23:41:25,055 DEBUG [jclouds.compute] (main) << deleted keyPair(jclouds#hadoopclustertest#us-east-1#2)
        2011-01-09 23:41:25,240 DEBUG [jclouds.compute] (main) >> deleting securityGroup(jclouds#hadoopclustertest#us-east-1)

        And it removes all 4 of my keypairs! I have 4 keypairs because I had to retry both groups.
        By the way, I hope there is no problem with a different keypair per group (the retry is what created the second keypair).

        Adrian Cole added a comment -

        I think the problem in your stacktrace probably applies to jclouds master. we have a place we can test this. Do you mind filing a bug? http://code.google.com/p/jclouds/issues/entry
        you may also want to watch this issue, which is very much related: http://code.google.com/p/jclouds/issues/detail?id=365

        wrt the keys: I wouldn't worry about them. The keys are only used to install your keypair, so deleting them doesn't matter.

        wrt stub testing: it isn't very easy to stub something that requires a concert of scripts. That said, here's what we use: https://github.com/jclouds/jclouds/blob/master/compute/src/test/java/org/jclouds/compute/StubComputeServiceIntegrationTest.java

        I hope this helps!

        Tibor Kiss added a comment -

        Hi Adrian,
        Thank you for your help.
        I filed an issue: http://code.google.com/p/jclouds/issues/detail?id=445&can=6

        Tom White added a comment -

        This looks like a good start.

        I would consider doing the retries from within the Callable. Replace the anonymous Callable class with a class that handles its own retries, then fails if the number of retries is exceeded. The code that handles the Future can then fail the entire cluster if it gets an ExecutionException. The way you have it at the moment the retries happen in lock step, whereas if they are independent they can happen concurrently. E.g. a master node fails quickly and is retried while the workers are all coming up. It would also allow for per-group retry strategies down the road, which might be useful.
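
        As a minimal sketch of this idea, using hypothetical names rather than Whirr's actual classes, each template group could run a Callable that retries on its own and throws once its retry budget is spent:

        import java.io.IOException;
        import java.util.HashSet;
        import java.util.Set;
        import java.util.concurrent.Callable;

        // Hypothetical collaborator that asks the cloud provider for a batch of instances
        // and returns the ids of the ones that actually came up.
        interface InstanceStarter {
          Set<String> startBatch(int count) throws Exception;
        }

        class StartGroupWithRetries implements Callable<Set<String>> {
          private final InstanceStarter starter;
          private final int required;
          private final int maxRetries;

          StartGroupWithRetries(InstanceStarter starter, int required, int maxRetries) {
            this.starter = starter;
            this.required = required;
            this.maxRetries = maxRetries;
          }

          @Override
          public Set<String> call() throws Exception {
            Set<String> started = new HashSet<String>(starter.startBatch(required));
            int attempt = 0;
            // Replace only the missing instances; other groups retry independently.
            while (started.size() < required && attempt++ < maxRetries) {
              started.addAll(starter.startBatch(required - started.size()));
            }
            if (started.size() < required) {
              throw new IOException("group still short after " + attempt + " retries");
            }
            return started;
          }
        }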

        On the template syntax, rather than overload the instance template language, how about having separate properties to specify the percentages? This would be easier to parse, and would allow other properties to be associated with a template.

        whirr.instance-templates=1 jt+nn,4 dn+tt
        whirr.instance-template.jt+nn.max-percent-failure=100
        whirr.instance-template.dn+tt.max-percent-failure=60
        

        Alternatively (or in addition) we could have whirr.instance-template.<role>.minimum-number-of-instances.

        > Unfortunately Mockito failed to mock the static ComputeServiceContextBuilder.build(clusterSpec) method

        Make it non-static?

        Tibor Kiss added a comment -

        Thanks Tom!
        I agree. I can do it over the weekend.

        Tibor Kiss added a comment -

        The following parameter construction

        whirr.instance-template.<role>.max-percent-failure=100

        can be implemented easily in ClusterSpec, but there is another class, org.apache.whirr.cli.command.AbstractClusterSpecCommand, which needs to parse the same set of options.
        However, OptionParser.accepts(String) cannot handle the variable .<role>. part, so these values would be silently ignored if passed as command line parameters to LaunchClusterCommand.run.

        Does somebody have an idea how to create an OptionSpec for this kind of variable parameter?

        Tom White added a comment -

        Can you use ClusterSpec#getInstanceTemplates() to get all the possible instance templates and generate the options from these?

        Tibor Kiss added a comment - - edited

        The OptionParser already initializes the OptionSpecs in the constructor of AbstractClusterSpecCommand, while the ClusterSpec is instantiated only after that, based on the collected OptionSpecs. As a result, ClusterSpec#getInstanceTemplates() cannot be used for generating the options.

        jOptSimple has to know in advance all the options that can be passed on the argument list.

        Something like the following can be constructed with the jOptSimple API:

        whirr.instance-templates=1 jt+nn,4 dn+tt
        whirr.instance-templates-max-percent-failures=100 jt+nn,60 dn+tt
        whirr.instance-templates-minimum-number-of-instances=1 jt+nn,3 dn+tt
        

        or just enumerating only the groups which are not strictly limited to 100% successful nodes:

        whirr.instance-templates=1 jt+nn,4 dn+tt
        whirr.instance-templates-max-percent-failures=60 dn+tt
        whirr.instance-templates-minimum-number-of-instances=3 dn+tt
        

        Commons Configuration supports hierarchical configuration, but jOptSimple does not. Exactly the same constraint applies as in the case of whirr.instance-templates: hierarchy can be expressed only on the value side.
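
        As a small sketch of this value-side approach, assuming only jopt-simple's public OptionParser/OptionSet API (the option names mirror the examples above; the wiring around them is hypothetical, not Whirr's code):

        import joptsimple.OptionParser;
        import joptsimple.OptionSet;

        public class InstanceTemplateOptionsSketch {
          public static void main(String[] args) {
            OptionParser parser = new OptionParser();
            // Fixed option names; the per-role information lives entirely in the value.
            parser.accepts("instance-templates").withRequiredArg();
            parser.accepts("instance-templates-max-percent-failures").withRequiredArg();
            parser.accepts("instance-templates-minimum-number-of-instances").withRequiredArg();

            OptionSet options = parser.parse(
                "--instance-templates", "1 jt+nn,4 dn+tt",
                "--instance-templates-max-percent-failures", "60 dn+tt");

            System.out.println(options.valueOf("instance-templates-max-percent-failures"));
          }
        }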

        I looked into the history of these constructs: Commons Configuration was introduced in WHIRR-75 and jOptSimple later in WHIRR-102.

        Tibor Kiss added a comment -

        I attached my second patch, which now has two levels of parallelization: there is a Callable per template group, each running a StartupProcess.
        The StartupProcess handles the retry mechanism and the cleanup of failed nodes; it also uses a Callable for starting instances and another for cleaning up the failed nodes.

        In order to write JUnit tests using Mockito, I had to extract some static constructs: for example, the creation of the ClusterActionHandler map is factored out of ScriptBasedClusterAction, and ComputeServiceContextBuilder got a new method so that a ComputeServiceContextFactory can be supplied from outside ScriptBasedClusterAction. This way it is possible to unit test the internal algorithm of BootstrapClusterAction.
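
        A minimal sketch of that testing seam, using only standard Mockito calls and a hypothetical factory interface (not the actual Whirr/jclouds signatures): once the context factory is passed in rather than reached through a static builder, it can be mocked in a plain unit test.

        import static org.mockito.Mockito.mock;
        import static org.mockito.Mockito.when;

        import org.jclouds.compute.ComputeServiceContext;

        public class BootstrapRetryTestSketch {
          // Hypothetical seam: whatever BootstrapClusterAction uses to obtain its context.
          interface ContextFactory {
            ComputeServiceContext create(Object clusterSpec);
          }

          public void example(Object clusterSpec) {
            ComputeServiceContext context = mock(ComputeServiceContext.class);
            ContextFactory factory = mock(ContextFactory.class);
            when(factory.create(clusterSpec)).thenReturn(context);
            // A BootstrapClusterAction built with this factory would now see the mocked
            // context, so the retry logic can be exercised without touching EC2.
          }
        }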

        Regarding the OptionSpec and hierarchical properties, I had to do something similar to instance-templates, in the sense that roles may appear only on the value side, as I mentioned in my previous comment.

        Please review it again.

        Tom White added a comment -

        > As a result, ClusterSpec#getInstanceTemplates() cannot be used for generating the options.

        Good point. In the absence of some kind of two phase argument parsing, the approach you've taken looks like the best option.

        I've skimmed the patch and it looks good so far. The unit testing is good. Have you had success running real clusters or the integration tests?

        Tibor Kiss added a comment -

        I made some minor changes in the JUnit tests; for example, I split one test into two methods (if the first cluster startup test had failed, the test might still not have failed overall, because of the @Test(expected = IOException.class) declaration).
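
        A small illustration of that point, with hypothetical test names rather than the actual Whirr tests: if a single method carries @Test(expected = IOException.class), an IOException thrown by the first, "should succeed" launch makes the test pass even though the failure scenario never ran, so each expectation is tied to its own method.

        import java.io.IOException;
        import org.junit.Test;

        public class BootstrapRetrySplitSketchTest {
          @Test
          public void launchSucceedsWhenFailuresStayUnderTheLimit() throws Exception {
            // start a cluster whose failures stay within the configured maximum
          }

          @Test(expected = IOException.class)
          public void launchFailsWhenTooManyNodesAreLost() throws Exception {
            // start a cluster whose failures exceed the limit; the expected IOException
            // now unambiguously comes from this scenario
            throw new IOException("placeholder so the sketch is self-contained");
          }
        }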

        I also did extensive testing with Hadoop, not just the integration test: we have effectively started to use it on a small 5-node cluster based on CDH3b3. (I also built an RPM package from Whirr and we are using it in our development environment.) I ran some data tests and checked that each node works well.

        The integration test of course runs the Apache distribution of Hadoop. I attached the logs from the integration tests because the HBase test failed for other reasons, not because of the BootstrapClusterAction change; the Cassandra, Hadoop and ZooKeeper tests run successfully.

        If you have an idea for the "kind of two phase argument parsing" mentioned, feel free to discuss it.

        Tom White added a comment -

        +1 This is looking good.

        • The code is swallowing NumberFormatExceptions in ClusterSpec, which is the right thing to do, but should have a comment, or a log statement.
        • The cases for NumberFormatException would benefit from unit tests.
        • Is ClusterSpec#parse(String... strings) still used? If not then we should remove it.
        • Please add some documentation for the new properties to src/site/confluence/configuration-guide.confluence.
        • Running "mvn checkstyle:checkstyle apache-rat:check" seems to produce some warnings. Can you fix these please.

        Thanks for running the integration tests.

        > the HBase test failed for some other reasons

        Yes, this is being tracked in WHIRR-201.

        > kind of two phase argument parsing

        I just meant that you'd have to parse the arguments once to pick up the templates (probably not using JOpt Simple), then again with a new options parser constructed with knowledge of the possible instance templates. But the current way is fine.

        Tibor Kiss added a comment -
        • The code is swallowing NumberFormatExceptions in ClusterSpec, which is the right thing to do, but should have a comment, or a log statement.
          • We don't need the NumberFormatException handling; I added a test for this to the unit tests and removed it from the main code.
        • The cases for NumberFormatException would benefit from unit tests.
          • Now the unit test verifies that the user is informed correctly if he/she writes a badly formatted value string.
        • Is ClusterSpec#parse(String... strings) still used? If not then we should remove it.
          • There is no ClusterSpec#parse; we just have these:
            org.apache.whirr.service.ClusterSpec.InstanceTemplate.parse(String...)
            org.apache.whirr.service.ClusterSpec.InstanceTemplate.parse(CompositeConfiguration)
            

            called from

            setInstanceTemplates(InstanceTemplate.parse(c));
            
        • Please add some documentation for the new properties to src/site/confluence/configuration-guide.confluence.
          • I added.
        • Running "mvn checkstyle:checkstyle apache-rat:check" seems to produce some warnings. Can you fix these please.
          • For me it does not signal anything wrong; I checked. If you still see warnings, could you point them out?

        Today I also started to use the patched version to create some clusters where I am running real jobs, and I saw two cases.
        Initially I set up

        whirr.instance-templates=1 jt+nn,8 dn+tt
        whirr.instance-templates-max-percent-failures=90% dn+tt
        

        It happened that 3 of the started nodes could not be reached by ssh (not even manually), so the retry mechanism kicked in, we exceeded the 20-instance limit and the cluster effectively could not start.
        Then later I reduced it to

        whirr.instance-templates=1 jt+nn,8 dn+tt
        whirr.instance-templates-max-percent-failures=60% dn+tt
        

        This run also exceeded the 20-instance limit, but only by one instance, and of course the cluster came online with 7 dn+tt nodes without any retry.
        This is exactly the behaviour I would like to have. (Never mind the 20-instance limit; normally we get that limit raised, but for these tests I deliberately kept it on this account.)

        Tibor Kiss added a comment -

        We also have to take care of the install and post-configure startup scripts, especially the install script, which runs within the scope of the BootstrapClusterAction retry mechanism: it currently does not check every step inside the script, so it may fail without returning a non-zero exit code, and the cluster cannot be recognised as badly started.
        For example, I maintain my own scripts, and I added a few lines that create a raid0 stripe-set volume from the 4 large ephemeral disks (on m1.xlarge), resulting in a 1.6 TB volume mounted on /mnt. Unfortunately mdadm sometimes gets busy or hits other side effects, so /mnt ends up not mounted. Problems of this kind may need to be reviewed in the future.

        In the post-configure phase a retry mechanism does not look right; instead the script has to take care of repairing itself.

        Tom White added a comment -

        Tibor, Sorry for not getting back to this sooner. The changes look good to me, +1. The patch unfortunately no longer applies, can you regenerate it, then we can get it committed. Thanks.

        I assume your previous comment about the install/configure scripts can be tackled separately? (By requiring the scripts' actions to be idempotent, for example.)

        Tibor Kiss added a comment -

        Of course Tom.
        In the next few hours I will regenerate the patch and test it.
        By the way, we are using this feature and it is working well for us.

        Regarding tackling the install/configure scripts separately: I agree with you.

        Tibor Kiss added a comment -

        I rebased from the trunk and rebuilt the patch for the current trunk.

        One caveat: I ran the integration test for a Hadoop cluster on EC2, but it fails with the same error as Andrei's failure in https://issues.apache.org/jira/browse/WHIRR-55?focusedCommentId=12978728&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12978728

        It seems there is another blocking issue to resolve before we are able to rerun the integration test.

        Anyway, here is the new patch.

        Andrei Savu added a comment -

        Tibor, thanks for the new patch. I have opened a new issue for fixing the failing integration tests: WHIRR-227. Really anxious to see this one committed to the trunk.

        Tibor Kiss added a comment -

        There is an issue (WHIRR-227) which delayed running the integration test, but since it is a simple one, I uploaded the correct scripts to my own S3 bucket and could then run the integration test. I hope that once WHIRR-227 is fixed (i.e. that simple script is updated on S3), you will also be able to run the integration tests with my current patch.

        This is the latest (rebuilt & retested) patch:
        whirr-167-5.patch

        Andrei Savu added a comment -

        Looks great and it's almost ready. I have noticed some minor issues:

        • MAX_STARTUP_RETRIES has a fixed value. Should we make this a config option?
        • The integration tests depend on ~/.ssh/id_rsa. You could use the ClusterSpec.withTemporaryKeys factory method to avoid this.

        Also, thanks for fixing the short role names I missed in WHIRR-199. I haven't run the integration tests, I'm waiting for WHIRR-227.

        Andrei Savu added a comment - - edited

        Tibor, unfortunately the patch no longer applies cleanly on the current trunk. Let me know when you've got some time to discuss how we can fix this. I'm looking forward to committing this to the trunk as soon as possible.

        Tibor Kiss added a comment -

        Today I managed to rebuild the patch again.
        Regarding MAX_STARTUP_RETRIES, a value higher than 1 is normally not useful in practice: if instance losses cannot be covered by one retry, the situation can be considered severe enough to require manual intervention. If somebody really wants to set

        whirr.instance-templates-max-percent-failures=100% dn+tt

        and is willing to pay for more than one retry, it can be an option, but definitely not a recommended one. Instead, a new feature to add additional nodes later would be more effective than retrying to make up for lost nodes. In my experience a failure to start an instance is usually a temporary situation; when that is the case and one retry is not enough, the reservation and startup problems usually disappear a few seconds or minutes later anyway.

        A value of 0 also makes sense, to switch off the retry mechanism completely. So I will make a small change and add a new parameter

        whirr.max-startup-retries=1

        with a default value of 1.

        I will also check the temporary keys.

        Tibor Kiss added a comment -

        Here is the newly rebuilt patch for the current trunk.

        It addresses the latest two requests, and I also attached the integration log. I ran the integration test for the Hadoop service only.

        Andrei Savu added a comment -

        Thanks Tibor! I took a quick look and it looks great! I'm now running the integration tests.

        Tom White added a comment -

        +1 Looks good to me.

        Andrei Savu added a comment -

        +1 I've been able to run all the unit and integration tests after fixing a typo from WHIRR-225 (rename installRunUrl to install_runurl).

        Andrei Savu added a comment -

        I've just committed this. Thanks Tibor! Thanks Tom for reviewing.

        Adrian Cole added a comment -

        This cleans up the code to use normal jclouds parallel operations and logs exceptions that were previously not logged.

        Andrei Savu added a comment -

        I've moved this patch to WHIRR-423 - I want to keep the project history clean.


          People

          • Assignee: Tibor Kiss
          • Reporter: Tibor Kiss
          • Votes: 0
          • Watchers: 1