Whirr / WHIRR-243

Allow running component tests in memory

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: core
    • Labels: None

      Description

      While jclouds now features the awesome BYON mode, it might be useful to be able to run component and system tests without jclouds.
      I tried the jclouds stub features, but unless I'm missing something (I might be) there is no way of stubbing a compute service that allows calling runOnNodesMatching.

      So my idea is to alter ComputeServiceContextBuilder a bit to account for a new provider ("test"), in which case an instance of a StubComputeServiceContext is returned (which features a StubComputeService internal class). This makes it possible to assert which nodes were started and in which order, which I think is useful.

      1. WHIRR-243.patch
        22 kB
        Andrei Savu
      2. WHIRR-243.patch
        25 kB
        David Alves
      3. 0001-WHIRR-243-Corrected-chkstyle-and-rat-problems.patch
        24 kB
        David Alves
      4. WHIRR-243.patch
        22 kB
        David Alves
      5. WHIRR-243-Mod-Delay-Init-Script-Return.log
        298 kB
        David Alves
      6. WHIRR-243-Mod-Delay-Init-Script-Return.patch
        22 kB
        David Alves
      7. WHIRR-243-LogOutput.txt
        62 kB
        David Alves
      8. WHIRR-243.patch
        19 kB
        David Alves
      9. WHIRR-243.patch
        16 kB
        Andrei Savu
      10. WHIRR-243.patch
        16 kB
        David Alves
      11. WHIRR-243.patch
        16 kB
        Adrian Cole
      12. WHIRR-243-only-LogDryRunModule.patch
        24 kB
        David Alves
      13. WHIRR-243.patch
        13 kB
        Adrian Cole
      14. WHIRR-243.patch
        10 kB
        Adrian Cole
      15. WHIRR-243.patch
        7 kB
        Adrian Cole
      16. WHIRR-243.patch
        11 kB
        David Alves

        Issue Links

          Activity

          David Alves added a comment -

          Adds the option of building a "test" (assigned statically) provider to the ComputeServiceContextBuilder as well as the StubComputeServiceContext/StubComputeService itself.

          The Stubs are not complete, i.e. they just have the features I needed so far.

          David Alves added a comment -

          Forgot to grant license

          Adrian Cole added a comment -

          I wouldn't recommend creating a different impl for testing, as it is redundant with the existing jclouds stub.

          It is possible to use the jclouds stub to test runScriptOnNodes type of behaviour. In fact, that's what it is used for in our test case:
          https://github.com/jclouds/jclouds/blob/master/compute/src/test/java/org/jclouds/compute/StubComputeServiceIntegrationTest.java

          The main idea is that the stub provider exposes the lifecycle functionality. If you want to test scripts, you have to supply expectation behaviour. This would be assigned via a Guice module you'd pass during createContext().

          For example:

            new ComputeServiceContextFactory().createContext("stub",
                ImmutableSet.<Module>of(new HBaseTestExpectationsModule()), new Properties());

          In this test module, you'd need to bind two things: SocketOpen and SshClient.Factory.

          I'd expect you'd like to just tell SocketOpen to always return true; SocketOpen is called directly before an SSH connection is attempted.

          To bind SocketOpen to always return true, add this to the configure() method of your module extending com.google.inject.AbstractModule:

            bind(SocketOpen.class).toInstance(new SocketOpen() {
              @Override
              public boolean apply(IPSocket arg0) {
                return true;
              }
            });

          The ssh factory is important. It is here that you can test your expectations properly, as we do in the above test. factory.create will be called for any script execution, whether it is during createNode or runScriptOnNodesMatching. Feel free to not use mocks and use some sort of hash-to-return map instead, but this is where to hook in. I'd recommend paying attention to order when it matters.

          At the end of the day, the only code you'd need to maintain would be the ssh expectations, which is the point in the first place, I believe. I hope this helps.
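          The recording idea Adrian describes can be sketched with plain collections. The class and method names below are hypothetical, not jclouds API: an SshClient.Factory stub would delegate its script execution to something like this, recording which scripts ran on which node (in order) and answering from a hash-to-return map instead of a real SSH session.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical expectation recorder: each script execution is logged per
// node (preserving order), and a canned response is returned instead of
// opening a real SSH connection.
public class ScriptExpectations {
  private final Map<String, List<String>> executedByNode = new LinkedHashMap<>();
  private final Map<String, String> cannedOutput = new LinkedHashMap<>();

  // Register the output to return when a given script runs.
  public void expect(String scriptName, String output) {
    cannedOutput.put(scriptName, output);
  }

  // An SshClient.Factory stub would delegate here from its exec() path.
  public String exec(String nodeId, String scriptName) {
    executedByNode.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(scriptName);
    return cannedOutput.getOrDefault(scriptName, "");
  }

  // Ordered list of scripts that ran on a node, for assertions in tests.
  public List<String> scriptsRunOn(String nodeId) {
    return executedByNode.getOrDefault(nodeId, List.of());
  }
}
```

          A test could then assert both which scripts ran on a node and in what order, which is the "pay attention to order" point above.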

          Adrian Cole added a comment -

          I think the point is running tests in memory and not against a machine.

          Adrian Cole added a comment -

          p.s. if you want to avoid passing an instance of a test expectation module (e.g. to avoid Whirr needing to expose a means to pass a module), you can also do this via the property "jclouds.modules"

          see http://code.google.com/p/jclouds/wiki/UsingGoogleAppEngine#Using_property-based_configuration for an example
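          A minimal sketch of that property-based route, assuming the "jclouds.modules" property takes module class names (the module class name below is hypothetical):

```java
import java.util.Properties;

// Sketch: instead of passing Guice module instances to createContext(),
// list module class names under the "jclouds.modules" property.
public class PropertyModules {
  public static Properties overrides() {
    Properties p = new Properties();
    // Hypothetical test-expectations module class name.
    p.setProperty("jclouds.modules", "org.example.TestExpectationsModule");
    return p;
  }
}
```

          These overrides would then be handed to the context factory, so Whirr itself never needs an API for passing module instances.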

          David Alves added a comment -

          Hi Adrian,

          I apologize if I have duplicated any of the features of jclouds. I confess jclouds has become really huge and I haven't had the time to dig into it as deep as I'd like.

          Anyway, do you think the patch is still useful as it is, or would you recommend a rewrite as per your suggestions?

          Adrian Cole added a comment -

          Your patch is certainly useful, so I'll make a fusion of the two.

          Adrian Cole added a comment -

          New patch using the existing jclouds stub impl.

          Adrian Cole added a comment -

          Tested the attached patch on HBase and Cassandra. David, logging at INFO was a really good idea; you're right, it is much easier to see the order of execution. Note that I did class lookup by name so that there's no hard dependency on test classes.

          David Alves added a comment - edited

          Adrian,

          I applied your patch, and even though I can now see in the logs that the nodes start in the expected order, it's hard to make assertions on this starting order. The previous patch maintained state on which nodes were started at which time (grouped into runs, one for each instance template). How can this be done now?

          Reading/writing cluster state is a feature I also require for WHIRR-214, and Tom suggested that this could be done by reading the "~/.whirr/cluster-name/instances" file. On the other hand, you suggested that cluster state could also be written to and read from the cluster itself. Maybe we should create a more permanent way to read/write cluster state that starts by using the instances file and can alternatively use the cluster itself, as you suggested (if the cluster is to be responsible for its own scaling up/down operations, state must be stored there).

          Adrian Cole added a comment -

          Here's the latest patch. Notice that we are now naming tasks, so regardless of whether we are running live or stubbed, the output should be more meaningful.

          Adrian Cole added a comment -

          I've added task naming, which should make the logs much more meaningful. Also, there's an existing jclouds-compute.log (from the logger jclouds.compute). Perhaps while in stub mode we should assign this to the console for more info?

          Glad to add more code, but I wanted your feedback first.

          Adrian Cole added a comment -

          Another version of the patch, which shows how to hook in and log things like the node name.

          Adrian Cole added a comment -

          Please note that there's an open jclouds task to incorporate anything we can to reduce this type of issue moving forward:
          http://code.google.com/p/jclouds/issues/detail?id=490

          David Alves added a comment -

          A simple patch touching only the LogDryRunModule class.

          David Alves added a comment -

          I've changed the LogDryRunModule class and added a DryRun static object that stores which scripts were executed on which nodes and in which order. The DryRun is filled by an interceptor on RunScriptOnNodes.Factory.create(). I know the create() method is called before script execution, but in a stub environment maybe we can assume the script will be executed.

          Open to suggestions on a better way to do it, but this solved my problem.
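          The DryRun idea can be illustrated with a minimal stand-in: a static, ordered log of (script, node) events, filled in wherever script execution is triggered, on which tests can then assert ordering. Class and method names here are illustrative, not the actual patch's API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the DryRun object: an ordered log of
// "Executed: <script> in <node>" events, appended to by an interceptor
// wherever a script execution is created.
public class DryRunLog {
  public static final List<String> EVENTS = new ArrayList<>();

  public static void record(String script, String node) {
    EVENTS.add("Executed: " + script + " in " + node);
  }

  // Example assertion: every bootstrap-* event precedes every configure-*
  // event, i.e. the phase barrier held.
  public static boolean bootstrapBeforeConfigure() {
    int lastBootstrap = -1;
    int firstConfigure = EVENTS.size();
    for (int i = 0; i < EVENTS.size(); i++) {
      if (EVENTS.get(i).startsWith("Executed: bootstrap")) {
        lastBootstrap = i;
      } else if (firstConfigure == EVENTS.size()
          && EVENTS.get(i).startsWith("Executed: configure")) {
        firstConfigure = i;
      }
    }
    return lastBootstrap < firstConfigure;
  }
}
```

          The event format matches the sample output shown below, so the same log can serve both for eyeballing and for assertions.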

          David Alves added a comment -

          Sample use of what can be done with the result:

          Executed: bootstrap-role3 in test-8e7
          Executed: bootstrap-role3 in test-2a5
          Executed: bootstrap-role4_role5 in test-9fe
          Executed: bootstrap-role4_role5 in test-ef8
          Executed: bootstrap-role4_role5 in test-a0d
          Executed: bootstrap-role1_role2 in test-644
          Executed: configure-role1_role2 in test-644
          Executed: configure-role3 in test-8e7
          Executed: configure-role3 in test-2a5
          Executed: configure-role4_role5 in test-a0d
          Executed: configure-role4_role5 in test-a0d
          Executed: configure-role4_role5 in test-9fe
          Executed: configure-role4_role5 in test-9fe
          Executed: configure-role4_role5 in test-ef8
          Executed: configure-role4_role5 in test-ef8

          Adrian Cole added a comment -

          Mainly polishing.

          David, is there anywhere else you'd like to take this?

          David Alves added a comment -

          Hey, nice polishing!

          No, it does what I require, i.e. it allows making assertions on which actions were performed on which nodes and in which order.

          Nice work, thanks!

          Adrian Cole added a comment -

          I'd like to have one person besides us try this out before committing.

          Andrei Savu added a comment -

          I will now take a look over this patch (WHIRR-243.patch submitted on 02/Mar/11 03:48). Should I also review WHIRR-243-only-LogDryRunModule.patch, or was it already merged?

          David Alves added a comment -

          Already merged. Latest and (hopefully) final version is WHIRR-243.patch.

          Tom White added a comment -

          I had a quick look at this and it looks fine generally. A couple of minor things that I would like to see fixed before commit: the indentation in LogDryRunModule should be two spaces, not three; and there shouldn't be any @author tags.

          Looking forward to seeing how this will be used.

          David Alves added a comment -

          Changed the patch as per Tom's recommendations (no @author tags, and two-space indentation).

          Tom White added a comment -

          Sorry for the slow response. I'm getting a NPE when running the unit tests:

          testDoActionRetriesSucceed(org.apache.whirr.cluster.actions.BootstrapClusterActionTest)  Time elapsed: 3.309 sec  <<< ERROR!
          java.lang.NullPointerException
                  at org.apache.whirr.cluster.actions.BootstrapClusterAction.doAction(BootstrapClusterAction.java:115)
                  at org.apache.whirr.cluster.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:74)
                  at org.apache.whirr.cluster.actions.BootstrapClusterActionTest.testDoActionRetriesSucceed(BootstrapClusterActionTest.java:136)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                  at java.lang.reflect.Method.invoke(Method.java:597)
                  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
                  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
                  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
                  at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
                  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
                  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
                  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
                  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
                  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
                  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
                  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
                  at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
                  at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:62)
                  at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:140)
          
          

          Also, the indentation in ComputeServiceContextBuilder is still 3 spaces.

          Andrei Savu added a comment -

          Updated patch for the current trunk.

          David Alves added a comment -

          Here's my first re-attempt.

          A couple of issues (the codebase has changed quite a lot since then):

          • Once upon a time there was a complete test, and other tests that used this (this was lost), so bear with me while I try to rebuild things.
          • I added a simple test that launches a stub cluster and tests that two scripts were executed on one (stub) node. The test is failing because the map returns two entries instead of one, because NodeMetadata's equals contract says that two nodes are different if their credentials are different. Is this correct?
          • Also, RunScriptOnNodeAsInitScriptUsingSsh used to have a name, which was nice for seeing which scripts were being run where, but apparently that is no longer the case.
          • Finally, aside from the main changes (to ComputeCache's wiring) there are two additional changes, to BootstrapClusterAction and ConfigureClusterAction, that name the tasks accordingly. This change was introduced by Adrian for some purpose. Do you remember what it was, Adrian?
          David Alves added a comment -

          Also, a nice test for this issue would be to check the problems that reignited interest in the issue in the first place (see the mailing list). If someone would clarify precisely what should be tested, I'd be glad to write the test.

          Andrei Savu added a comment -

          We recently discovered a critical bug in phase execution. The client-side code ensures that there is no overlap between the install & configure phases, but unfortunately this requirement is not enforced server side.

          As you know, we are pushing the install script as a user script and it's executed in the background. Sometimes (quite often on EC2) we see an overlap at runtime with the configure scripts.

          I'm not sure if you can write a test that shows that, without adding more synchronisation, the install & configure phases can actually run at the same time on the remote server.

          David Alves added a comment - edited

          If the problem is precisely that running a bootstrap script is reported as completed while in fact it is not, then you're right, but that is a jclouds-specific issue, right? (Meaning we shouldn't even need to test it.)

          Still I have a question, let's say I build a cluster as such:

          CompositeConfiguration config = new CompositeConfiguration();
          config.setProperty("whirr.provider", "stub");
          config.setProperty("whirr.cluster-name", "stub-test");
          config.setProperty("whirr.instance-templates", "10 noop");
          ClusterSpec clusterSpec = ClusterSpec.withTemporaryKeys(config);
          ClusterController controller = new ClusterController();
          controller.launchCluster(clusterSpec);

          Launching such a cluster should happen in parallel, right, with a barrier between phases?

          in pseudo-code:
          fork 10 threads
          each run bootstrap script
          join 10 threads
          fork 10 threads
          each run config script
          join 10 threads

          But when I launch the cluster I just introduced, I get a nicely interleaved output, where all the init scripts are executed by a SINGLE thread in a thread pool and all the config scripts are executed by main (as the attached log output illustrates). Is this supposed to happen?
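          The fork/join-with-barrier semantics in the pseudo-code above can be sketched with standard java.util.concurrent primitives; this is a model of the expected behaviour, not Whirr's actual launcher code. invokeAll() blocks until every task in a phase completes, which is exactly the barrier between the bootstrap and configure phases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Model of the expected phase execution: fork N tasks per phase,
// and use invokeAll() as the barrier between phases.
public class PhaseBarrier {
  public static List<String> run(int nodes) throws Exception {
    List<String> log = Collections.synchronizedList(new ArrayList<>());
    ExecutorService pool = Executors.newFixedThreadPool(nodes);
    try {
      // invokeAll blocks until ALL bootstrap tasks finish (the barrier) ...
      pool.invokeAll(phase(nodes, "bootstrap", log));
      // ... only then does the configure phase start.
      pool.invokeAll(phase(nodes, "configure", log));
    } finally {
      pool.shutdown();
    }
    return log;
  }

  private static List<Callable<Void>> phase(int nodes, String name, List<String> log) {
    List<Callable<Void>> tasks = new ArrayList<>();
    for (int i = 0; i < nodes; i++) {
      final int id = i;
      tasks.add(() -> {
        log.add(name + "-" + id);
        return null;
      });
    }
    return tasks;
  }
}
```

          With this model, no "configure" entry can ever appear in the log before the last "bootstrap" entry, which is the property the stub cluster test would want to assert.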

          Andrei Savu added a comment -

          We should have a barrier between phases (later I think we can eliminate that with a DAG of dependencies). Can you add this scenario as a failing test so that we can work on a fix? Thanks.

          David Alves added a comment -

          So I think the answer to my fork/join question is that this is currently done at the jclouds level (one thread per instance template on the Whirr side launches multiple, equal nodes in parallel via jclouds), right? So how it actually runs on the Whirr side is:

          fork 1 thread per InstanceTemplate
          each run bootstrap script
          join threads
          main runs config script for all ITs in sequence

          So I tried adding some complexity, booting 10 noop+noop3, 10 noop2+noop, 10 noop3+noop2, but the barrier seems to work nicely.

          The bootstrap is done by three different threads on the Whirr side, and only after these complete does the configuration run start.

          Andrei: are you sure this is not a bug in jclouds? Couldn't it be the case that some bootstrap script is reported done while in fact it isn't?

          Andrei Savu added a comment -

          I think the overlap we are seeing sometimes is generated by the fact that the service install scripts are still executing on the remote machine when entering the configure phase. As far as I know we don't track install script execution - we only worry about provisioning (seeing the machine running). I am also starting to think that cloudservers reports a machine as being available after user script execution, but this is not true for Amazon (Adrian, where can I find confirmation?).

          David Alves added a comment -

          As far as I know we don't track install script execution - we only worry about provisioning (seeing the machine running)

          AFAIK that is not the case. jclouds actually keeps polling the status of init scripts until it receives exit code 1. All init scripts must be done before going into the configuration phase.

          The slightly modified LogDryRunModule (which delays the emission of exitCode=1 until 10 calls have been made) and the attached log illustrate this behavior.

          Even with this modification the barrier still seems to work well, i.e., all init scripts are finished before executing the configuration ones.
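          A minimal model of that polling behaviour (hypothetical names; this mirrors the modified LogDryRunModule only in spirit, it is not the attached patch):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the polling described above: the caller keeps asking
// for the init script's status and proceeds only once an exit code is
// reported. Like the modified LogDryRunModule, the stub withholds the exit
// code until a fixed number of status calls have been made.
public class InitScriptPollSketch {

    public static final int POLLS_BEFORE_DONE = 10;
    public final AtomicInteger statusCalls = new AtomicInteger();

    // Returns null while the script is "still running", then the exit code.
    public Integer pollExitStatus() {
        return statusCalls.incrementAndGet() < POLLS_BEFORE_DONE ? null : 1;
    }

    // Block until the init script reports completion -- the condition that
    // must hold for every node before the configure phase may start.
    public int waitForCompletion() throws InterruptedException {
        Integer exit;
        while ((exit = pollExitStatus()) == null) {
            Thread.sleep(10); // back off between status calls
        }
        return exit;
    }

    public static void main(String[] args) throws Exception {
        InitScriptPollSketch script = new InitScriptPollSketch();
        System.out.println("exit=" + script.waitForCompletion()
                + " after " + script.statusCalls.get() + " status calls");
    }
}
```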

          Andrei Savu added a comment -

          David thanks for investigating this. Great work! So it seems like we are affected by something else and I have no idea what - we need more debugging info. Is this patch ready for review now?

          David Alves added a comment -

          Just need to modify the test so that it actually tests something meaningful and passes. Will reattach patch in a bit.

          Andrei Savu added a comment -

          Just got this from Adrian:

          Check the runscriptonnodeandblockuntilcomplete class. Does it enforce timeouts? There's a loop where it checks to see if the shell script is still running (might be a predicate). It is possible that this loop exits with a warning as opposed to an exception on timeout. Can you verify?

          I will take a look. David, can a test check that behaviour?

          David Alves added a comment -

          Regarding a test on RunScriptOnNodeAsInitScriptUsingSshAndBlockUntilComplete: if it's used directly by the ssh module it can't be tested. A stub cluster does not load the ssh module, as there is nothing to ssh into.

          Still, it might be a misleading name and the class may be part of the higher-level stack. Stub tests can demonstrate multiple calls on the status of an init script, so you might try to look for a timeout above that.

          David Alves added a comment -

          This patch provides a completed LogDryRunModule.

          Andrei, the test is incomplete (I have some other urgent matters today), but it shows where we should change the jclouds property and where we can test the correct behavior with LogDryRunModule.

          Regarding what the correct behavior is, IMHO what could be done is:

          • either the bootstrap fails completely when some init script times out, or
          • nodes that time out are shut down and removed from the pool of nodes that will be configured, or
          • we simply wait until all is done (increase the timeout)
          David Alves added a comment -

          Some additional features would be handy (like a local blobstore cache and a more complete execution trace), but maybe these could be included in other JIRAs?
          As is, it already helps with debugging by allowing you to visualize a complete execution trace and to retrieve some information on what was executed where.
          Please review...

          David Alves added a comment -

          Corrected style problems.

          Andrei Savu added a comment -

          Overall it looks good. I will refresh the patch tomorrow to match the current trunk and remove some of the code comments.

          Can you look into WHIRR-49? It would be great if we could release this in 0.7.0. I know that you have a busy schedule and I want you to know that your contributions are highly appreciated.

          Nit: please use git diff --no-prefix branch_name or something similar. We need to be able to apply the diff using the unix patch command.
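          To illustrate the nit, here is a minimal throwaway-repo demonstration (the file name is just an example): --no-prefix drops the a/ and b/ path prefixes, so plain patch -p0 can apply the diff.

```shell
# Throwaway repo demonstrating `git diff --no-prefix` (file name is illustrative).
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email dev@example.com && git config user.name dev
echo "hello" > README && git add README && git commit -qm "init"
echo "world" >> README
# Without --no-prefix the headers would read "--- a/README" / "+++ b/README";
# with it they read "--- README" / "+++ README", which `patch -p0` applies.
git diff --no-prefix
```

          A patch produced this way can be applied with `patch -p0 < WHIRR-243.patch`.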

          David Alves added a comment -

          Andrei: Is there still a need to test the timeout of init scripts / barrier enforcement?

          I will try and look into WHIRR-49 today.

          Nit: please use git diff --no-prefix branch_name or something similar. We need to be able to apply the diff using the unix patch command.

          will do!

          Andrei Savu added a comment -

          Is there still a need to test the timeout of init scripts / barrier enforcement?

          It would be useful to have a test confirming that everything is OK.

          David Alves added a comment -

          I made the final changes against the latest trunk; the Maven build is fine.

          Here's what I changed:

          • I renamed LogDryRunModule to DryRunModule (it does more than logging, since it captures script execution in order)
          • I moved it to src/main. This might be a bit controversial, but Noop is already there and this way there is no need to build a test jar to use DryRun in other modules.
          • Finished a test that makes sure that no setup scripts are executed after the first configure script.

          That should be all.

          PS: I'm new to git diff; I hope all is well. Please notify me if that is not the case.

          Andrei Savu added a comment -

          Looks good. I don't understand why we need to move DryRun outside test, though. (Noop is not intended to be used for testing.) Also, you should probably rename the test class to DryRunTest. I have been able to apply the latest patch without problems.

          David Alves added a comment -

          DryRun comes in handy when testing/developing new services; I'm using it to test a couple of handlers.
          If DryRun is in the tests, it's not available outside of whirr-core unless:

          • we build a test jar (seems a bit too much to add a dependency for a single class)
          • we place DryRun in the main jar...

          what do you think?

          Andrei Savu added a comment -

          I understand your reasons. Tom, Adrian, Karel what's your take on this?

          Karel Vervaeke added a comment -

          The argument about adding a dependency for a single class is a bit weak. The decision is about where the DryRun class belongs, architecturally.

          Creating a test-jar isn't that hard... see http://maven.apache.org/guides/mini/guide-attached-tests.html.

          You could also put DryRun in a separate module (called test-framework or test-util).
          Technically this is very similar to creating a test jar based on whirr-core/src/test, but dedicating a separate module to it carries the message that it is also intended for non-Whirr-internal use and that it is less subject to (API) changes than whirr-core/src/test.

          If there aren't any technical difficulties involved (classloading issues, maven being stubborn, ...), I prefer the test-jar solution.
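          For completeness, the test-jar approach from the linked guide amounts to one extra execution of the standard maven-jar-plugin in the module's pom.xml (a sketch of the documented configuration, not the project's actual pom):

```xml
<!-- Attach a jar of the module's test classes as an additional artifact. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>test-jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

          Consumers would then depend on the artifact with `<type>test-jar</type>`.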

          David Alves added a comment -

          I know creating a test jar is easy (I use it all the time). The thing is that it then must be published along with the regular jar wherever we publish (repo1 someday?), so creating a new deployment artifact for a public project, and for a single class, is something we might want to avoid.

          Then there is the design aspect of whether DryRunModule is really a test class. That, I think, is debatable and could go either way. For instance, StubProviders in jclouds lives in the jclouds-compute jar, not in a test jar.

          All this being said, all I would really like is for the class to be packaged so that it can be used outside whirr-core. I personally would choose the main jar (less hassle, fewer artifacts), but I will follow the majority.

          Karel Vervaeke added a comment -

          Having heard that argument, together with the fact that no one is ever going to use whirr-core-test without using whirr-core (main), I've reconsidered: we can go with main as far as I am concerned.

          Andrei Savu added a comment -

          +1

          Andrei Savu added a comment - edited

          Minor change: renamed LogDryRunModuleTest to DryRunModuleTest for consistency. All unit tests are passing. I'm going to commit this.

          Andrei Savu added a comment -

          Committed. Thanks guys!


            People

            • Assignee:
              David Alves
            • Reporter:
              David Alves