WHIRR-63

Support EC2 Cluster Compute Groups for Hadoop

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: service/hadoop
    • Labels:
      None

      Description

      We should support the new EC2 cluster compute groups which have high bandwidth between nodes in the cluster. See http://docs.amazonwebservices.com/AWSEC2/latest/DeveloperGuide/index.html?using_cluster_computing.html

      1. WHIRR-63.patch
        6 kB
        Andrew Bayer

          Activity

          Andrei Savu added a comment -

          Does jclouds support EC2 placement groups so that we can implement this one?

          Adrian Cole added a comment -

          If you choose templateBuilder.fastest() with the aws-ec2 provider, a cluster compute instance will be used and a placement group will be set up automatically.

          If you want to specify the placement group manually, use AWSEC2TemplateOptions.placementGroup.

          Andrei Savu added a comment -

          Tom, would it be enough if we exposed the "fastest" template selection strategy as a configuration property?

          Tom White added a comment -

          Yes, that would do it.

          Evan Pollan added a comment (edited) -

          cc1.4xlarge instance price was just halved. Even explicitly specifying us-east-1/ami-e4a7558d (the only non-RHEL image suggested by AWS for cluster compute instances – see WHIRR-148) gives the following error:

          Exception in thread "main" java.util.NoSuchElementException: imageId(us-east-1/ami-e4a7558d) not found
          	at org.jclouds.compute.domain.internal.TemplateBuilderImpl.build(TemplateBuilderImpl.java:567)
          	at org.apache.whirr.actions.BootstrapClusterAction.buildTemplate(BootstrapClusterAction.java:168)
          	at org.apache.whirr.actions.BootstrapClusterAction.doAction(BootstrapClusterAction.java:114)
          	at org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:80)
          	at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:106)
          	at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:62)
          	at org.apache.whirr.cli.Main.run(Main.java:64)
          	at org.apache.whirr.cli.Main.main(Main.java:97)
          
          Andrei Savu added a comment -

          An AMI that can be used for cluster compute instances in jclouds should match the queries below (see providers/aws-ec2/src/main/java/org/jclouds/aws/ec2/AWSEC2PropertiesBuilder.java):

                // amazon, alestic, canonical, and rightscale
                properties.setProperty(PROPERTY_EC2_AMI_QUERY,
                    "owner-id=137112412989,063491364108,099720109477,411009282317;state=available;image-type=machine");
                // amis that work with the cluster instances
                properties.setProperty(PROPERTY_EC2_CC_REGIONS, Region.US_EAST_1);
                properties.setProperty(PROPERTY_EC2_CC_AMI_QUERY,
                    "virtualization-type=hvm;architecture=x86_64;owner-id=137112412989,099720109477;hypervisor=xen;state=available;image-type=machine;root-device-type=ebs");
          

          I suggest you try a 64-bit AMI provided by either Amazon or Canonical.
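
          To make the filter format above easier to read, here is a minimal sketch that splits such a query string into its individual filters. It only illustrates the "key=value1,value2;key=value" shape of the strings quoted above; the class name AmiQuerySketch and its parse method are hypothetical helpers, not jclouds code, and jclouds may process the property differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper only; this is NOT how jclouds itself handles the property.
final class AmiQuerySketch {

    /** Splits "key=v1,v2;key2=v3" into an ordered map of filter name -> accepted values. */
    static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> filters = new LinkedHashMap<>();
        for (String clause : query.split(";")) {
            if (clause.isEmpty()) {
                continue; // tolerate an empty query or stray separators
            }
            int eq = clause.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Clause without '=': " + clause);
            }
            String key = clause.substring(0, eq);
            String[] values = clause.substring(eq + 1).split(",");
            filters.put(key, new ArrayList<>(Arrays.asList(values)));
        }
        return filters;
    }

    public static void main(String[] args) {
        // The cluster compute query string quoted in the comment above.
        String ccQuery = "virtualization-type=hvm;architecture=x86_64;"
                + "owner-id=137112412989,099720109477;hypervisor=xen;"
                + "state=available;image-type=machine;root-device-type=ebs";
        parse(ccQuery).forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

          Read this way, the cluster compute query narrows the default query to HVM, x86_64, EBS-backed images from two of the four owner IDs, which is why an AMI outside that set is reported as "not found".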

          Adrian Cole added a comment -

          You can override these properties, but it looks like SLES comes with different pricing, so we shouldn't make this the default, right?
          http://aws.amazon.com/suse/

          Right now, you can use Ubuntu or Amazon Linux with cluster compute instances by default.

          Evan Pollan added a comment -

          The only AMI returned by a cluster compute search was the one I tried above. I tried a search for HVM AMIs, and the only hit I got was an Amazon AMI (ami-0da96764).

          I got further, but still wasn't able to get a cluster up and running:

          <<authenticated>> woke to: net.schmizz.sshj.userauth.UserAuthException: publickey auth failed
          _...previous line repeated 15 times in total..._
          Dying because - java.net.SocketTimeoutException: Read timed out
          _...previous line repeated 10 times in total..._
          <<authenticated>> woke to: net.schmizz.sshj.userauth.UserAuthException: publickey auth failed
          Dying because - java.net.SocketTimeoutException: Read timed out
          _...lots of lines redacted..._
          Dying because - java.net.SocketTimeoutException: Read timed out
          <<authenticated>> woke to: net.schmizz.sshj.userauth.UserAuthException: publickey auth failed
          << (ec2-user@75.101.226.57:22) error acquiring SSHClient(ec2-user@75.101.226.57:22): Exhausted available authentication methods
          net.schmizz.sshj.userauth.UserAuthException: Exhausted available authentication methods
          	at net.schmizz.sshj.userauth.UserAuthImpl.authenticate(UserAuthImpl.java:114)
          	at net.schmizz.sshj.SSHClient.auth(SSHClient.java:204)
          	at net.schmizz.sshj.SSHClient.authPublickey(SSHClient.java:304)
          	at net.schmizz.sshj.SSHClient.authPublickey(SSHClient.java:323)
          	at org.jclouds.sshj.SshjSshClient$1.create(SshjSshClient.java:183)
          	at org.jclouds.sshj.SshjSshClient$1.create(SshjSshClient.java:155)
          	at org.jclouds.sshj.SshjSshClient.acquire(SshjSshClient.java:204)
          	at org.jclouds.sshj.SshjSshClient.connect(SshjSshClient.java:229)
          	at org.jclouds.compute.callables.RunScriptOnNodeAsInitScriptUsingSsh.call(RunScriptOnNodeAsInitScriptUsingSsh.java:107)
          	at org.jclouds.compute.strategy.RunScriptOnNodeAndAddToGoodMapOrPutExceptionIntoBadMap.call(RunScriptOnNodeAndAddToGoodMapOrPutExceptionIntoBadMap.java:69)
          	at org.jclouds.compute.strategy.RunScriptOnNodeAndAddToGoodMapOrPutExceptionIntoBadMap.call(RunScriptOnNodeAndAddToGoodMapOrPutExceptionIntoBadMap.java:44)
          	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
          	at java.lang.Thread.run(Thread.java:636)
          Caused by: net.schmizz.sshj.userauth.UserAuthException: publickey auth failed
          	at net.schmizz.sshj.userauth.UserAuthImpl.handle(UserAuthImpl.java:157)
          	at net.schmizz.sshj.transport.TransportImpl.handle(TransportImpl.java:474)
          	at net.schmizz.sshj.transport.Decoder.decode(Decoder.java:127)
          	at net.schmizz.sshj.transport.Decoder.received(Decoder.java:195)
          	at net.schmizz.sshj.transport.Reader.run(Reader.java:72)
          << problem applying options to node(us-east-1/i-3bceb758): 
          

          Is there a Hadoop-friendly AMI that's known to work on cluster compute nodes?

          Sean Zhang added a comment -

          As Adrian suggested:

          "If you choose templateBuilder.fastest() using the aws-ec2 provider, a cluster compute instance will be used and a placement group will be set up automatically.

          If you want to specify the placement group manually, use AWSEC2TemplateOptions.placementGroup."

          Exposing the 'fastest' function could be nice, but it is not the solution to this problem. I think we have to call placementGroup explicitly, but placementGroup is not exposed in the TemplateOptions interface. Any suggestions?

          Adrian Cole added a comment -

          Per the note above, AWSEC2TemplateOptions is what you want. It is a subclass of TemplateOptions.
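
          The subclass relationship can be sketched as follows. This is only an illustration of the pattern: GenericOptionsSketch and AwsOptionsSketch are hypothetical stand-ins, not the real jclouds classes, and the real AWSEC2TemplateOptions has many more options. The point is that a provider-specific setter becomes reachable once generic options are narrowed to the provider-specific subtype.

```java
// Hypothetical stand-in for a generic options type; NOT the jclouds TemplateOptions class.
class GenericOptionsSketch {
    String describe() {
        return "generic options";
    }
}

// Hypothetical stand-in for a provider-specific subtype; NOT the jclouds AWSEC2TemplateOptions class.
final class AwsOptionsSketch extends GenericOptionsSketch {
    private String placementGroup; // null means no placement group was requested

    AwsOptionsSketch placementGroup(String name) {
        this.placementGroup = name;
        return this;
    }

    String getPlacementGroup() {
        return placementGroup;
    }

    @Override
    String describe() {
        return "aws options, placementGroup=" + placementGroup;
    }
}

final class PlacementGroupCastSketch {

    /** Sets the placement group only when the options are the provider-specific subtype. */
    static boolean applyPlacementGroup(GenericOptionsSketch options, String group) {
        if (options instanceof AwsOptionsSketch) {
            ((AwsOptionsSketch) options).placementGroup(group);
            return true;
        }
        return false; // other providers have no such option
    }

    public static void main(String[] args) {
        GenericOptionsSketch options = new AwsOptionsSketch();
        boolean applied = applyPlacementGroup(options, "foo");
        System.out.println(applied + ": " + options.describe());
    }
}
```

          Guarding the cast with instanceof keeps the calling code usable with other providers, where the placement group setting simply does not apply.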

          Andrew Bayer added a comment -

          So, making sure I understand this, we'd create a placement group first and then add that placement group to the TemplateOptions, akin to how we do spot pricing currently?

          Andrew Bayer added a comment -

          Hrm - looks like I'm hitting a similar problem to one Sean reported on the mailing list. I can't get any HVM AMIs to actually start via jclouds. I'll keep digging, but this is, unsurprisingly, annoying.

          Sean Zhang added a comment -

          Creating the placement group is the first step. Then you need to make sure you claim the instance storage that you are entitled to: if you launch the instances through the API, you need to specify the -b option. See http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/InstanceStorage.html

          I hope that I can help you to avoid some of the pitfalls. I ended up doing most of the stuff manually. I only did my testing once, so not too bad. Good luck.

          Andrew Bayer added a comment -

          Yeah, I've verified it: at this point it's not possible to create an HVM instance, at least not a Linux one. See http://code.google.com/p/jclouds/issues/detail?id=1024

          Andrew Bayer added a comment -

          I've got a possible fix on the jclouds side, but that'll require moving to jclouds 1.5, which means WHIRR-593 needs to be put to bed first.

          Andrew Bayer added a comment -

          This is a preliminary patch - it depends on Whirr going to jclouds 1.5, and jclouds 1.5 itself fixing http://code.google.com/p/jclouds/issues/detail?id=1024 (which may be done shortly anyway), but with those in place, it works. If you specify whirr.hardware-id=cc1.4xlarge and a valid HVM AMI (which admittedly is a bit hairy at the moment - there's a Precise HVM AMI, but no Lucid HVM AMI), it'll automatically create a placement group and put all the nodes from the created cluster in it. And if you specify whirr.aws-ec2-placement-group=foo, it'll look for an existing placement group named "foo" and put the nodes in that.
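
          A minimal sketch of the two configurations described above, expressed as java.util.Properties. Only whirr.hardware-id and whirr.aws-ec2-placement-group come from this comment; whirr.provider and whirr.image-id are assumed here from general Whirr configuration, the image ID is a placeholder rather than a real HVM AMI, and the class ClusterComputeConfigSketch is a hypothetical helper, not part of the patch.

```java
import java.util.Properties;
import java.util.TreeSet;

// Illustrative helper only; not part of WHIRR-63.patch.
final class ClusterComputeConfigSketch {

    /**
     * Builds cluster compute settings. Pass null for placementGroup to leave the
     * property unset, which per the comment above means a group is created automatically.
     */
    static Properties build(String hvmImageId, String placementGroup) {
        Properties props = new Properties();
        props.setProperty("whirr.provider", "aws-ec2");       // assumption: not stated in this thread
        props.setProperty("whirr.hardware-id", "cc1.4xlarge"); // from the comment above
        props.setProperty("whirr.image-id", hvmImageId);       // assumption: must be a valid HVM AMI
        if (placementGroup != null) {
            // from the comment above: reuse an existing placement group with this name
            props.setProperty("whirr.aws-ec2-placement-group", placementGroup);
        }
        return props;
    }

    public static void main(String[] args) {
        // "us-east-1/ami-XXXXXXXX" is a placeholder, not a real image ID.
        Properties props = build("us-east-1/ami-XXXXXXXX", "foo");
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            System.out.println(name + "=" + props.getProperty(name));
        }
    }
}
```

          The printed name=value lines have the same shape as a Whirr properties file, so the output can serve as a starting point for one once a real HVM image ID is filled in.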

          Andrei Savu added a comment -

          Looks good to me. +1 to commit as soon as we upgrade to jclouds 1.5.0-beta.7

          Andrew Bayer added a comment -

          Committed.


            People

            • Assignee: Andrew Bayer
            • Reporter: Tom White
            • Votes: 2
            • Watchers: 6
