Whirr / WHIRR-693

Control order of actions with waves of whirr.instance-templates

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: core
    • Labels:
      None

      Description

      A cluster can be specified in "waves" by supporting an optional numeric suffix on the whirr.instance-templates property. An earlier wave provides running base-level services (like ZK or a master-master MySQL) to a later layer of services during its configuration phase. This enables storing cluster-wide information in ZooKeeper or creating databases beforehand.

      An example of a two-"wave" cluster (avoiding "phase", which already has a meaning):

      whirr.instance-templates.0=1 zookeeper, 2 zookeeper, 4 noop

      The .0 templates run like normal during "whirr launch-cluster" and then, during the same Whirr run, the .1 template is applied as a modification of the same cluster so that new host instances are not allocated:

      whirr.instance-templates.1=1 my-master, 2 my-gateway, 4 my-worker

      In the second wave, instance provisioning is inhibited, the instance-templates must have the same number of commas and same sequence of leading numbers, and a role is only allowed to appear in one wave.
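The three constraints above can be checked mechanically. The following is a minimal sketch, not code from Whirr or from the fork; the function names `parse_template` and `validate_waves` are hypothetical, and it assumes that "same number of commas and same sequence of leading numbers" means the per-group instance counts must match position by position.

```python
def parse_template(value):
    """Split '1 zookeeper, 2 zookeeper, 4 noop' into [(1, {'zookeeper'}), ...]."""
    groups = []
    for part in value.split(","):
        count, roles = part.strip().split(None, 1)
        groups.append((int(count), set(roles.strip().split("+"))))
    return groups


def validate_waves(waves):
    """Check a list of template strings, where the list index is the wave number.

    Raises ValueError when the waves disagree on the sequence of leading
    numbers or when a role appears in more than one wave.
    """
    parsed = [parse_template(wave) for wave in waves]
    base_counts = [count for count, _ in parsed[0]]
    wave_of_role = {}
    for index, groups in enumerate(parsed):
        counts = [count for count, _ in groups]
        if counts != base_counts:
            raise ValueError(
                f"wave {index} has counts {counts}, expected {base_counts}")
        for _, roles in groups:
            for role in roles:
                # Repeating a role inside one wave is fine; reuse across waves is not.
                if wave_of_role.setdefault(role, index) != index:
                    raise ValueError(
                        f"role {role!r} appears in waves "
                        f"{wave_of_role[role]} and {index}")
    return parsed
```

Calling `validate_waves` with the two templates from the example passes, while a second wave written as `1 my-master, 4 my-worker` would be rejected because its counts do not line up with the first wave.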

      Here is another example:

      whirr.instance-templates.0=1 mysql-master+zookeeper, 1 mysql-master+zookeeper,4 noop
      whirr.instance-templates.1=1 hadoop-namenode+hadoop-jobtracker, 1 hbase-master+hadoop-secondarynn,4 hadoop-tasktracker+hadoop-datanode+hbase-regionserver

      In the first wave, the two mysql-masters form a multi-master ensemble which keeps state information about the setup in the Whirr process (much like ZooKeeperCluster.getHosts(cluster) informs services/hbase of the quorum).

      In the second wave, nodes are not allocated (BootstrapClusterAction.doAction() is inhibited if instances already exist), but all phases for LaunchClusterCommand, including beforeBootstrap() and afterBootstrap() callbacks, are executed. If other whirr.cli.command.*Command are run, they would see a combined whirr.instance-templates that works like normal.
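The issue does not say how the combined whirr.instance-templates is built. One plausible reading, sketched here purely as an assumption, is that the role lists of all waves are joined position by position with the existing "+" syntax. The helper names `split_groups` and `combine_waves` are hypothetical.

```python
def split_groups(template):
    """Split '1 a+b, 4 c' into [('1', 'a+b'), ('4', 'c')]."""
    groups = []
    for part in template.split(","):
        count, roles = part.strip().split(None, 1)
        groups.append((count, roles.strip()))
    return groups


def combine_waves(waves):
    """Merge wave templates into one template string.

    Assumption: group N of every wave describes the same instances, so their
    roles are concatenated with '+'. The issue text does not confirm this.
    """
    per_wave = [split_groups(wave) for wave in waves]
    if len({len(groups) for groups in per_wave}) != 1:
        raise ValueError("all waves must have the same number of groups")
    combined = []
    for column in zip(*per_wave):
        counts = {count for count, _ in column}
        if len(counts) != 1:
            raise ValueError(f"instance counts differ across waves: {sorted(counts)}")
        roles = "+".join(roles for _, roles in column)
        combined.append(f"{column[0][0]} {roles}")
    return ", ".join(combined)
```

Under this reading, the first group of the second example would become `1 mysql-master+zookeeper+hadoop-namenode+hadoop-jobtracker`. Whether a placeholder role such as noop should be kept in the combined value is left open here.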

      Obviously, there would be no reason to limit this to 2 waves, but I do not expect more than 10 waves to be useful, so the pattern could enable a suffix of "\.[0-9]".

      Unlike WHIRR-221, which aims to specify a single global order for service startup, the wave format relies on the implicit synchronization barriers at phases already supported by Whirr, still runs the actions within each phase in parallel, and makes state generated by a wave available to later waves.

      Conceptually, this merely splits the whirr.instance-templates into waves of (bootstrap,install,configure,start). If no .[0-9] suffixes are present, then Whirr would behave just like normal.
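As an illustration of the suffix rule, the sketch below collects wave templates from a dictionary of properties using the proposed single-digit pattern and falls back to the plain property when no suffix is present. The function name `collect_waves` is hypothetical, and rejecting a mix of suffixed and unsuffixed keys is an assumption; the issue does not define that case.

```python
import re

# Matches whirr.instance-templates with an optional single-digit wave suffix.
WAVE_KEY = re.compile(r"^whirr\.instance-templates(?:\.([0-9]))?$")


def collect_waves(properties):
    """Return the template values ordered by wave number.

    With no suffixed keys, the plain property is returned as a single wave,
    which corresponds to the existing behaviour.
    """
    waves = {}
    plain = None
    for key, value in properties.items():
        match = WAVE_KEY.match(key)
        if match is None:
            continue
        if match.group(1) is None:
            plain = value
        else:
            waves[int(match.group(1))] = value
    if not waves:
        return [] if plain is None else [plain]
    if plain is not None:
        # Assumption: mixing both forms is treated as a configuration error.
        raise ValueError("use either suffixed or unsuffixed instance-templates")
    return [waves[number] for number in sorted(waves)]
```

A key such as `whirr.instance-templates.10` does not match the single-digit pattern and is ignored by this sketch, in line with the ten-wave limit suggested above.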

        Activity

        Paul Baclace added a comment -

        I have implemented this on a github.com fork. Existing tests that run by default are passing.

        See discussion on dev@whirr.apache.org on 20121210-20121211. The best alternative for supporting a running, multi-node service to exist before the configuration of a new service is to "run Whirr multiple times" (which requires coordinating security groups, cluster state storage, etc.)

        Roman Shaposhnik added a comment -

        This looks pretty useful for orchestration of puppet-based provisioning that I'm working on. Would definitely like to see this in the trunk.

        Steve Loughran added a comment -
        1. This could be good for more complex deployments, though there's a risk of workflow-related feature creep (see below).
        2. This could mark the time for moving beyond .properties files, as with multiple templates, overridden attributes and now sequences, the requirements of more configuration management languages crop up. Rather than propose a new one, have a look at JSON.
        3. There really need to be probes in Whirr to trigger starting one phase on the observed state of the previous set. That way, rather than just saying "run HDFS", the HDFS phase is considered to have successfully completed when a probe of the NN reports that the filesystem is up (hitting URLs and getting 200 responses is the single most relevant probe here).
        4. Teardown gets trickier as you'd have to go back in order from what state you got to, worry about timing etc. This is why a "say no to teardown" policy, relying on the PaaS infrastructure to kill your VMs, is the only rational tactic here.

        Having probes to define the barriers between phases (I'm using that term as it's the traditional one) makes it possible to define a phase transition not as "all the previously started action scripts finished", but as "the cluster was put in the state we needed for the next actions". URLs are the obvious choice.
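A probe-gated barrier of the kind described in this comment could look roughly like the sketch below. It is not Whirr code: `http_ok` and `run_stages` are hypothetical names, and the attempt count and delay are arbitrary defaults chosen for illustration.

```python
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: refused connections
        # and error statuses both count as "not ready yet".
        return False


def run_stages(stages, attempts=30, delay=2.0, sleep=time.sleep):
    """Run (name, action, probe) stages in order.

    A stage counts as complete only when its probe returns True; the next
    stage is not started before that. Raises RuntimeError if a probe keeps
    failing for the given number of attempts.
    """
    completed = []
    for name, action, probe in stages:
        action()
        for _ in range(attempts):
            if probe():
                break
            sleep(delay)
        else:
            raise RuntimeError(f"stage {name!r} did not reach the expected state")
        completed.append(name)
    return completed
```

A stage for HDFS might pair the start action with a probe such as `lambda: http_ok(namenode_status_url)`, where `namenode_status_url` stands for whatever status page the deployment exposes.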

        Tom White added a comment -

        This sounds like a useful addition. It looks like a compatible change - so old recipes will still work.

        Paul, are you willing to provide a patch for this?

        Andrew Bayer added a comment -

        Ping - this would be pretty nifty, but we need a patch submission from Paul. =)

        Tom White added a comment -

        Someone else could prepare the patch, as long as Paul is willing to grant it to the ASF (a statement to that effect on this JIRA would be sufficient).

        Roman Shaposhnik added a comment -

        @tom, I'd be game for working on the patch as long as we get Paul's permission.

        Paul Baclace added a comment - - edited

        I'm just now getting back to this work. First I will update the github fork I made and see how a patch from that looks.

        Steve: I agree that supporting "waves" brings up the need for distributed barrier synchronization, but this is a separable concern at the Whirr function/plugin level. In case anyone starts down the path of stronger and more structured semantics here, notice that when "/etc/init.d/foo start" returns, it is not guaranteed that foo is running; based on de facto semantics, foo could be ready, still loading, hung, or failed. The strong point of Whirr is that it does not require stronger semantics like that provided by upstart or osgi or juju, etc. That makes Whirr flexible, although there is still a big need for recipe testing.

        Ultimately, whether a service dependency is ready to handle a dependee is application-specific. In that spirit, the best approach is to provide Whirr function/plugins that can do something like "from master node, connect to mysql with user U to db D and count the number of rows in table T, retrying until timeout Tout". Such condition checkers can go a long way, as long as they can be easily parameterized to use details from the provisioning phase.
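The kind of parameterized condition checker described here can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for a MySQL connection, and `retry_until` and `table_has_rows` are hypothetical names, not Whirr functions.

```python
import sqlite3
import time


def retry_until(check, timeout, interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or timeout seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def table_has_rows(connect, table, minimum=1):
    """Build a check that passes once `table` holds at least `minimum` rows.

    `connect` is a zero-argument callable returning a DB-API style connection;
    sqlite3 stands in for MySQL here. `table` must be a trusted identifier
    because it is interpolated into the SQL text.
    """
    def check():
        try:
            connection = connect()
            try:
                (count,) = connection.execute(
                    f'SELECT COUNT(*) FROM "{table}"').fetchone()
            finally:
                connection.close()
        except sqlite3.Error:
            # A missing table or unreachable database means "not ready yet".
            return False
        return count >= minimum
    return check
```

The user, database, table and timeout from the example in this comment map onto the `connect` callable, the `table` argument and the `timeout` parameter, so the same checker can be reused with details gathered during provisioning.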

        Paul Baclace added a comment -

        I noticed that I have referred to "install" in this issue description as a first-class action, but it is not yet like that (see WHIRR-294). Instead, the practice is for services to call addStatement() to queue their resources/functions/install*.sh in beforeBootstrap(); this will eventually enable a switch to having ClusterActionHandlerSupport.ACTION_INSTALL.

        Roman Shaposhnik added a comment -

        Now with the talk of 0.8.2 and/or 0.9.0, I took the liberty of assigning this to them.

        Paul, any chance you can provide a patch so these release(s) could benefit from waves functionality?

        Andrew Bayer added a comment -

        Paul Baclace - any update on this? I'd really like to get this in for 0.8.2, but I'd also really like to get 0.8.2 out soon. If you've got a partial patch, I'm happy to work on getting it the rest of the way there.

        Roman Shaposhnik added a comment -

        Paul Baclace I'd be more than willing to pitch in for this!

        Andrew Bayer added a comment -

        ping - Paul, this is what I was referring to last night. =)

        David Zabner added a comment -

        Has there been any further movement on this? Would it be possible to get the latest work on this so that I can finish it up?


          People

          • Assignee:
            Unassigned
          • Reporter:
            Paul Baclace
          • Votes:
            0
          • Watchers:
            6
