Whirr / WHIRR-214

Add/Remove nodes to/from running clusters

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:

      Description

      I would like to be able to add a node to a running cluster.
      For example, if I have created a Hadoop, HBase, ZooKeeper cluster, I would like to be able to add region servers to HBase, ZooKeeper nodes to the quorum, and task nodes to Hadoop.
      Something akin to the functionality of the hbase-ec2 script, where if I launch a node against an already running cluster, it is configured to join the cluster.

      We should also provide hooks for removing nodes from the cluster. In many situations removing nodes is much harder than adding them; however, this issue should mostly deal with cloud management and with providing the hooks for addition/deletion. Separate issues should handle the add/remove process for each service, so it won't be a gargantuan task.

      1. zookeeper-repair-or-extend-cluster.patch
        42 kB
        Andrei Savu
      2. WHIRR-214-refactoring-1.patch
        36 kB
        Andrei Savu
      3. WHIRR-214-zookeeeper-only.patch
        11 kB
        Andrei Savu
      4. WHIRR-214-refactoring-1.patch
        36 kB
        Andrei Savu


          Activity

          varunshaji added a comment -

          Hello!

          What's the status of this issue? Is there any point in waiting for this feature? If possible, I am willing to devote some time to developing this feature.

          Andrew Bayer added a comment -

          Alright. I'll dive through the experiments and see what I can come up with from here.

          Andrei Savu added a comment -

          As far as I remember, besides the refactoring work (already committed), everything else is just experiments.

          Andrew Bayer added a comment -

          So are the existing patches something we can resurrect, or should we start over? I want to be sure that this can work with BYON as well - there is now the ability to programmatically add/remove BYON nodes to/from a cluster, and the work I've got sitting half-done for JCLOUDS-128 should help further.

          Adrian Cole added a comment -

          Nope. There's currently a rally to ship a long-overdue 0.8.0. I'm sure that folks (like me) would love to see this completed in 0.9.0.

          Otis Gospodnetic added a comment -

          This seems to be the most popular Whirr issue by far. Is anyone (actively) working on it?

          Rodrigo Nogueira added a comment -

          Hi guys,

          I am an undergraduate student of computer science from Portugal. I am currently doing research related to large-scale data processing.
          I've looked at the Whirr GSoC ideas and this one really caught my attention.
          This idea seems to have significant relevance to the Whirr project, and I really want to contribute to it through GSoC.

          The possibility of Whirr adding or removing nodes at runtime is very valuable: this way the user can manage their cluster and do what is best at a given time, in other words adapt the number of nodes to the level of computation required.
          I am very excited about this possibility.

          I am also looking for a mentor to help me over the coming months to do good work.

          Best regards.

          Rodrigo Nogueira

          Andrei Savu added a comment -

          I have committed the code refactoring patch. IMO some of the next steps are:

          • make sure that each role has scripts for install, configure, start, stop, cleanup
          • possibly define more events for each role, before/after {Start,Stop,Cleanup}

          What do you think? What steps should we take to make this happen?

          Tom White added a comment -

          +1 to the revised patch.

          Andrei Savu added a comment -

          Updated patch to be in sync with trunk. Integration tests are passing for zookeeper on aws-ec2 and cloudservers-us.

          Andrei Savu added a comment -

          I will remove addInstance() tomorrow morning. I like the idea of using "converge" - maybe later we can deprecate launch-cluster and use "converge" for everything.

          Tom White added a comment -

          The refactoring patch looks good, although I don't see the addInstance() method being used anywhere (even in the other patch).

          Would "repair" be better named "converge", since it covers both cases of repairing a cluster where some of the instances didn't come up, and extending a cluster to a larger size. We could add shrinking a cluster sometime, but for the moment it should throw an exception.

          The patch contains a default for the tarball URL, which would be better put in a properties file.
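
          (For illustration only: the kind of properties entry meant here would follow the whirr.<service>.tarball.url pattern used by other Whirr services; the exact key for ZooKeeper may differ, so treat this line as a sketch rather than the real default.)

              whirr.zookeeper.tarball.url=http://archive.apache.org/dist/zookeeper/zookeeper-3.3.3/zookeeper-3.3.3.tar.gz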

          Andrei Savu added a comment -

          I have split the previous patch into two new ones.

          WHIRR-214-refactoring-1.patch is something that we can commit if it looks good to you.

          WHIRR-214-zookeeper-only.patch adds a new command that can repair a Zookeeper cluster. I'm not sure if this is still working now.

          Andrei Savu added a comment -

          Attached a proof-of-concept implementation that adds a new command (repair-cluster) which can be used to repair or extend a ZooKeeper cluster with small downtime (reconfig + restart) and no data loss.
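
          (A hypothetical invocation of the new command, assuming it is wired into the CLI the same way as the existing commands; the --config option is taken from the current launch-cluster usage.)

              bin/whirr repair-cluster --config zookeeper.properties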

          Tom White added a comment -

          > I also agree with having some service support from the beginning but this should probably go on another issue, right?

          It's hard, if not impossible, to design a general interface without having a concrete implementation, so it's better to write a simple implementation as part of the initial JIRA, IMO.

          David Alves added a comment -

          > It's also important to clearly explain the behaviour to users, since there may be data loss scenarios.

          Important point. What we could do here is change either ClusterActionHandlerSupport or ClusterActionHandler to add new lifecycle stages that would execute before the Destroy stage. The rationale is that destroying a cluster is different from decommissioning nodes from a cluster. By making these changes and throwing UnsupportedOperationException by default, services would be forced to handle these stages explicitly in order for decommissioning to work.

          I also agree with having some service support from the beginning but this should probably go on another issue, right?
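
          (A minimal sketch of the lifecycle-stage idea above. The class and method names here are hypothetical, not part of Whirr's current ClusterActionHandler API; the point is only that the default implementation refuses to decommission, so each service has to opt in explicitly.)

              // Hypothetical sketch only -- these hooks do not exist in Whirr today.
              public abstract class DecommissionAwareHandler {

                /** Called before instances are removed from a running cluster. */
                public void beforeDecommission(Object event /* stand-in for a ClusterActionEvent */) {
                  throw new UnsupportedOperationException(
                      "this service does not support removing nodes yet");
                }

                /** Called after the cloud provider has terminated the instances. */
                public void afterDecommission(Object event) {
                  // no-op by default; services can clean up state here
                }
              }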

          Tom White added a comment -

          Fine by me as long as there is some service integration from the beginning (even if it is rudimentary, and e.g. just destroys the running nodes) to prove the abstraction. It's also important to clearly explain the behaviour to users, since there may be data loss scenarios.

          David Alves added a comment -

          My suggestion would be to make the focus of this issue the non-service-specific part of expansion and contraction, and to define hooks that specific services can use to handle adding and removing nodes from a running cluster, which, as Tom said, can be tricky in a number of cases.

          Tom White added a comment -

          We should handle both, but not necessarily in this JIRA. In Hadoop, for example, contraction is harder than expansion since you have to manage the decommissioning process.

          Andrei Savu added a comment -

          I believe we should handle both of them. A scalability monitor will need to be able to expand but also to contract the cluster as the performance requirements and the amount of data / requests change.

          David Alves added a comment - edited

          Should this issue take care only of expansion, or also of contraction? It seems that, at least from a cluster management point of view (i.e., not looking into specific services), the issue could easily take care of both.

          David Alves added a comment - edited

          Got it. Service creates it and DestroyInstanceCommand updates it.
          Thanks

          Tom White added a comment -

          > Is there a way to rebuild instance templates based on the current cluster states (instead of configuration)

          When a cluster is started it writes the list of instances (including roles) to ~/.whirr/<cluster-name>/instances. You could use this to get the cluster status. Also, when the cluster is changed this file should be updated. (It might be nice to make this pluggable, but this probably belongs to another issue.)
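
          (A minimal sketch of reading that file to recover the running roles. The column layout assumed here -- instance id followed by a comma-separated role list and then the addresses -- is an assumption; verify it against the file your Whirr version actually writes.)

              import java.io.IOException;
              import java.nio.file.Files;
              import java.nio.file.Path;
              import java.nio.file.Paths;
              import java.util.List;

              public class InstancesFileReader {
                public static void main(String[] args) throws IOException {
                  // "my-cluster" is a placeholder for the real cluster name
                  Path instances = Paths.get(System.getProperty("user.home"),
                      ".whirr", "my-cluster", "instances");
                  List<String> lines = Files.readAllLines(instances);
                  for (String line : lines) {
                    String[] fields = line.split("\\s+");
                    if (fields.length < 2) {
                      continue; // skip anything that doesn't match the assumed layout
                    }
                    System.out.println("instance " + fields[0] + " runs roles " + fields[1]);
                  }
                }
              }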

          David Alves added a comment -

          Thanks Adrian, no worries. I think I got it from HadoopNameNodeClusterActionHandler.beforeConfigure(ClusterActionEvent), which writes the hadoop config files to tmp/.

          I'll ask for further assistance if I get stuck.

          Adrian Cole added a comment -

          No, but depending on how patient you are, I could create an example for this.

          David Alves added a comment -

          Makes a lot of sense.

          Is persisting data on the nodes to keep cluster state out-of-the-box functionality in jclouds? If so, can you point me to any examples?

          Adrian Cole added a comment -

          One way to resize a cluster based on a current one is to persist the "template" of the cluster onto one of the machines (or all of the machines), e.g. serialize the template to JSON and copy it to disk. When resizing the cluster, pick any member and deserialize the template from that location.

          ex. put(".template", Payloads.newStringPayload(json.asJson(template)))

          make sense?
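
          (A rough sketch of that idea, using Gson for the JSON step; the template type and field names are stand-ins, and in Whirr/jclouds the copy-to-node step would go through the SSH support rather than the local file used here.)

              import com.google.gson.Gson;
              import java.nio.charset.StandardCharsets;
              import java.nio.file.Files;
              import java.nio.file.Paths;

              public class TemplateSnapshot {
                // illustrative stand-in for the real cluster template type
                static class ClusterTemplate {
                  String instanceTemplates = "1 zookeeper,3 hadoop-datanode+hadoop-tasktracker";
                  String provider = "aws-ec2";
                }

                public static void main(String[] args) throws Exception {
                  Gson gson = new Gson();
                  String json = gson.toJson(new ClusterTemplate());
                  // In practice this would be pushed to a node, e.g. via jclouds SSH:
                  //   sshClient.put("/etc/whirr/.template", Payloads.newStringPayload(json));
                  Files.write(Paths.get(".template"), json.getBytes(StandardCharsets.UTF_8));
                  ClusterTemplate restored = gson.fromJson(
                      new String(Files.readAllBytes(Paths.get(".template")), StandardCharsets.UTF_8),
                      ClusterTemplate.class);
                  System.out.println("restored templates: " + restored.instanceTemplates);
                }
              }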

          David Alves added a comment -

          Some questions:
          Cluster expansion/contraction is supposed to be a new command, right (e.g. ModifyClusterCommand)?
          As I see it, it could be something like:
          whirr modify-cluster --instance-templates "20 dn+tt"
          (where instance-templates was previously 10 dn+tt)

          My question is:
          What are the "old" instance templates from which we are calculating the difference? Is it the current cluster instance templates or the configured instance templates?

          Is there a way to rebuild instance templates based on the current cluster state (instead of the configuration)? From what I gather, in ListClusterCommand the only thing that is retrieved is current node metadata directly from jclouds (node ids, addresses, etc., not the actual services running on the nodes).
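
          (A rough sketch of the "figure out the difference" step being discussed: compare the desired counts per role set with what is currently running and decide how many nodes to add or remove. The parsing and names are illustrative only.)

              import java.util.LinkedHashMap;
              import java.util.Map;

              public class TemplateDiff {
                /** Parses "20 dn+tt,1 nn+jt" into {dn+tt=20, nn+jt=1}. */
                static Map<String, Integer> parse(String templates) {
                  Map<String, Integer> counts = new LinkedHashMap<>();
                  for (String part : templates.split(",")) {
                    String[] tokens = part.trim().split("\\s+", 2);
                    counts.put(tokens[1], Integer.parseInt(tokens[0]));
                  }
                  return counts;
                }

                public static void main(String[] args) {
                  Map<String, Integer> running = parse("10 dn+tt,1 nn+jt");
                  Map<String, Integer> desired = parse("20 dn+tt,1 nn+jt");
                  desired.forEach((roles, want) -> {
                    int have = running.getOrDefault(roles, 0);
                    if (want > have) {
                      System.out.println("start " + (want - have) + " nodes with roles " + roles);
                    } else if (want < have) {
                      System.out.println("decommission " + (have - want) + " nodes with roles " + roles);
                    }
                  });
                }
              }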

          Tom White added a comment -

          > A general solution would be to just have the user change the instance teamplates and then have whirr figure out the difference.

          Indeed. To clarify a little further, I think we're talking about changing the cardinality (e.g. "10 dn+tt" to "20 dn+tt") rather than the cardinality and the groups (e.g. "10 dn+tt" to "20 dn+tt+hbase-regionserver"). In the latter case, Whirr would just start 20 new nodes.
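
          (As a concrete illustration of the cardinality-only case, using the whirr.instance-templates property; the role names are just examples.)

              # before
              whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,10 hadoop-datanode+hadoop-tasktracker
              # after: only the datanode/tasktracker count changes
              whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,20 hadoop-datanode+hadoop-tasktracker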

          Elliott Clark added a comment -

          Thinking about it more, a general solution would be to just have the user change the instance templates and then have Whirr figure out the difference.
          Spinning up a new cluster is just the difference between 0 nodes and the templates.
          Adding to a cluster is the difference between what's currently running and the template.

          Like Tom said, this is idempotent, so it could be part of a workflow.

          Tom White added a comment -

          This could be implemented as an "add n nodes" command (like the one in the older scripts), or as an "expand cluster to n nodes" command (Pallet takes this approach). The latter is idempotent, which might be nice.


            People

            • Assignee: Andrew Bayer
            • Reporter: Elliott Clark
            • Votes: 18
            • Watchers: 24
