Hadoop Common: HADOOP-2410

Make EC2 cluster nodes more independent of each other

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.16.1
    • Fix Version/s: 0.17.0
    • Component/s: contrib/cloud
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      The command "hadoop-ec2 run" has been replaced by "hadoop-ec2 launch-cluster <group> <number of instances>", and "hadoop-ec2 start-hadoop" has been removed since Hadoop is started on instance start up. See http://wiki.apache.org/hadoop/AmazonEC2 for details.

      Description

      The cluster start-up scripts currently wait for each node to start before appointing a master (to run the namenode and jobtracker), copying private keys to all the nodes, and writing the master's private IP address to the hadoop-site.xml file (which is then copied to the slaves via rsync). Only once all of this is done is Hadoop started on the cluster (from the master). This can fail if any node fails to come up, which can happen since EC2 doesn't guarantee you get a cluster of the size you ask for (I've seen this happen).

      The process would be more robust if each node was told the address of the master as user metadata and then started its own daemons. This is complicated by the fact that the public DNS alias of the master resolves to a public IP address so cannot be used by EC2 nodes (see http://docs.amazonwebservices.com/AWSEC2/2007-08-29/DeveloperGuide/instance-addressing.html). Instead we need to use a trick (http://developer.amazonwebservices.com/connect/message.jspa?messageID=71126#71126) to find the private IP, and what's more we need to attempt to resolve the private IP in a loop until it is available since the DNS will only be set up after the master has started.

      This change will also mean the private key doesn't need to be copied to each node, which can be slow and has dubious security. Configuration can be handled using the mechanism described in HADOOP-2409.
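      The per-node start-up flow proposed above (master address passed as user metadata, each node polling DNS until the master's record appears) could be sketched roughly as follows. This is an illustration, not the actual scripts: the helper name, retry counts, and key-less design are assumptions; the metadata URL is the standard EC2 instance-metadata endpoint.

```shell
# Sketch of per-node boot logic: wait for the master's DNS record,
# since DNS is only set up after the master instance has started.
resolve_with_retry() {
  host=$1
  tries=${2:-60}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # getent consults the resolver; print the first address found.
    ip=$(getent hosts "$host" | awk '{ print $1; exit }')
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    sleep 5
    i=$((i + 1))
  done
  return 1
}

# On boot, each node would fetch the master's hostname from its EC2
# user metadata and wait for it to resolve before starting daemons:
#   MASTER_HOST=$(curl -s http://169.254.169.254/latest/user-data)
#   MASTER_IP=$(resolve_with_retry "$MASTER_HOST") || exit 1
```

With this shape, no node needs the private key of any other, and a slave that comes up late (or is added later) simply joins once the master resolves.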

      1. concurrent-clusters.patch
        36 kB
        Chris K Wensel
      2. concurrent-clusters-2.patch
        36 kB
        Chris K Wensel
      3. concurrent-clusters-3.patch
        36 kB
        Chris K Wensel
      4. ec2.tgz
        7 kB
        Chris K Wensel

        Activity

        Tom White created issue -
        Tom White made changes -
        Field Original Value New Value
        Component/s contrib/ec2 [ 12311822 ]
        Tom White added a comment -

        Another motivation for this change is to make it more straightforward to add nodes on the fly to an existing cluster. The only change needed would be an extra parameter in the launch script to indicate whether to start a master node - if set to true instance 0 would start as the master (as it does at the moment), otherwise all the new instances would connect to an existing master.

        Tom White added a comment -

        Amazon has just released a new feature, EC2 Availability Zones (http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347&categoryID=112), which give some control over instance placement. This would be useful for adding nodes on the fly to an existing cluster to ensure that nodes are in the same zone.

        Also, it is now possible to use the public IP address of EC2 nodes from within the EC2 cluster (contrary to the comment in the description above). However, this will incur data transfer costs, which can be avoided by using the private IP address. See http://docs.amazonwebservices.com/AWSEC2/2008-02-01/DeveloperGuide/instance-addressing.html.

        Chris K Wensel added a comment -

        Here is a first cut at supporting multiple concurrent clusters, the image instance sizes, zone availability, and ganglia.

        Chris K Wensel made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.16.1 [ 12312927 ]
        Chris K Wensel added a comment -

        Patch

        Chris K Wensel made changes -
        Attachment concurrent-clusters.patch [ 12378761 ]
        Chris K Wensel added a comment -

        This patch represents a fair number of changes and will need accompanying documentation.

        The typical use case is this:

        > hadoop-ec2 launch-cluster my-group 5
        > hadoop-ec2 push my-group path/to/some.jar
        > hadoop-ec2 login my-group
        > hadoop-ec2 terminate-cluster my-group

        In another window (after launch-cluster), this is quite useful, and works well with FoxyProxy:
        > hadoop-ec2 proxy my-group

        There are still some rough edges I think.

        > hadoop-ec2
        Usage: hadoop-ec2 COMMAND
        where COMMAND is one of:
          list                                 list all running Hadoop EC2 clusters
          launch-cluster <group> <num slaves>  launch a cluster of Hadoop EC2 instances - launch-master then launch-slaves
          launch-master <group>                launch or find a cluster master
          launch-slaves <group> <num slaves>   launch the cluster slaves
          terminate-cluster                    terminate all Hadoop EC2 instances
          login <group|instance id>            login to the master node of the Hadoop EC2 cluster
          screen <group|instance id>           start or attach 'screen' on the master node of the Hadoop EC2 cluster
          proxy <group|instance id>            start a socks proxy on localhost:6666 (use w/foxyproxy)
          push <group> <file>                  scp a file to the master node of the Hadoop EC2 cluster
          <shell cmd> <group|instance id>      execute any command remotely on the master
          create-image                         create a Hadoop AMI

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12378761/concurrent-clusters.patch
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included -1. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        patch -1. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2084/console

        This message is automatically generated.

        Chris K Wensel added a comment -

        submitted with correct paths

        Chris K Wensel made changes -
        Attachment concurrent-clusters.patch [ 12378766 ]
        Chris K Wensel made changes -
        Attachment concurrent-clusters.patch [ 12378766 ]
        Chris K Wensel made changes -
        Attachment concurrent-clusters.patch [ 12378761 ]
        Chris K Wensel added a comment -

        correct path offset

        Chris K Wensel made changes -
        Attachment concurrent-clusters.patch [ 12378767 ]
        Chris K Wensel added a comment -

        removed DFS_WRITE_RETRIES as it's not really necessary with the new kernels.

        Chris K Wensel made changes -
        Attachment concurrent-clusters-2.patch [ 12378798 ]
        Chris K Wensel made changes -
        Attachment ec2.zip [ 12378799 ]
        Chris K Wensel made changes -
        Attachment ec2.zip [ 12378799 ]
        Chris K Wensel added a comment -

        a tar of all relevant files.

        Chris K Wensel made changes -
        Attachment ec2.tgz [ 12378800 ]
        Chris K Wensel made changes -
        Comment [ A zip archive of the scripts so people don't have to futz with patching trunk. ]
        Tom White added a comment -

        Chris,

        These changes look great. Thanks! I'll try them out next week.

        Chris K Wensel added a comment -

        this version checks both current releases and archives for the hadoop distro

        Chris K Wensel made changes -
        Attachment concurrent-clusters-3.patch [ 12379340 ]
        Tom White added a comment -

        I've just committed this. Thanks Chris!

        I tried out the new scripts and they worked fine. I changed the version of Hadoop in the env file to be 0.17.0 so that it picks up the new AMI when it is created (after 0.17.0 is released). I also changed the version of Java to 1.6.0_05.

        Chris, could you update the documentation on the wiki page with the changes please? It would be worth keeping the instructions for the older scripts around on the same page.

        Tom White made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hadoop Flags [Incompatible change, Reviewed]
        Release Note The command "hadoop-ec2 run" has been replaced by "hadoop-ec2 launch-cluster <group> <number of instances>", and "hadoop-ec2 start-hadoop" has been removed since Hadoop is started on instance start up. See http://wiki.apache.org/hadoop/AmazonEC2 for details.
        Assignee Chris K Wensel [ cwensel ]
        Fix Version/s 0.17.0 [ 12312913 ]
        Hudson added a comment -

        Integrated in Hadoop-trunk #451 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/451/ )
        Nate Carlson added a comment -

        One change I've had to make is to add the memory option for the processes the cluster launches to the hadoop-site.xml that gets generated. This would probably be a good thing to make configurable for the end user.

        I also manually installed nph-proxy on the master node, with HTTP authentication, which makes it much easier to get around the slave nodes.

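        For illustration, the kind of setting Nate describes would look something like the following in the generated hadoop-site.xml. The property name and heap value here are assumptions of this sketch, not taken from the patch:

```xml
<!-- Hypothetical example: cap the heap of per-task child JVMs.
     The actual property the patch would expose may differ. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx550m</value>
</property>
```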
        Chris K Wensel added a comment -

        Good idea re configurable memory for the tracker and datanode services, though I find the defaults fine (so far). I tend to pass my child VM options in per job, since they vary. Still, it's a good idea to provide the option.

        Note that:

        hadoop-ec2 proxy <cluster-name>

        starts a local SOCKS tunnel. Used with the FoxyProxy Firefox plugin, it lets you browse your cluster.

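        Conceptually, such a proxy command boils down to an SSH dynamic port forward. The sketch below builds the invocation; the key path, user, and default port are assumptions, not the script's actual values:

```shell
# Build (but don't run) the ssh command for a local SOCKS proxy into
# the cluster master; -D opens a dynamic forward, -N skips a shell.
socks_proxy_cmd() {
  master=$1
  port=${2:-6666}
  echo "ssh -i ~/.ssh/id_rsa-ec2-keypair -D $port -N root@$master"
}

# Example (prints the command; run its output to start the tunnel,
# then point FoxyProxy at localhost:6666):
socks_proxy_cmd ec2-67-202-xx-xx.compute-1.amazonaws.com
```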
        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

  People

  • Assignee:
    Chris K Wensel
  • Reporter:
    Tom White
  • Votes: 0
  • Watchers: 1