Hadoop Common
HADOOP-884

Create scripts to run Hadoop on Amazon EC2

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.11.0
    • Component/s: scripts
    • Labels:
      None

      Description

      It is already possible to run Hadoop on Amazon EC2 (http://wiki.apache.org/lucene-hadoop/AmazonEC2), however it is a rather involved, largely manual process. By writing scripts to automate (as far as is possible) image creation and cluster launch it will make it much easier to use Hadoop on EC2.

      1. hadoop-ec2-v1.tar.gz
        3 kB
        Tom White
      2. hadoop-884.patch
        11 kB
        Tom White

        Issue Links

          Activity

          Tom White added a comment -

          I've attached a collection of scripts for this feature. They are still rough around the edges and not ready for inclusion yet (indeed, they should probably be separate from the Hadoop distribution), but the scripts work for me on Mac OS X and Ubuntu. I've added instructions to the wiki at http://wiki.apache.org/lucene-hadoop/AmazonEC2.

          There are lots of improvements that could be made:

          • Create a Hadoop AMI that uses a parameterized launch to set cluster size and master hostname. See http://docs.amazonwebservices.com/AmazonEC2/dg/2006-10-01/AESDG-chapter-instancedata.html. Such an instance would modify the Hadoop config files on startup to reflect cluster size and master hostname.
          • Setting up DNS is a pain. We could either automate the DNS configuration using DynDNS's web service (https://www.dyndns.com/developers/specs/syntax.html), or do away with having to set up DNS altogether.
          • Create a public Hadoop AMI (for each Hadoop version) so people don't need to build their own. See http://developer.amazonwebservices.com/connect/entry.jspa?entryID=530&ref=featured.
          • Adapt `run-hadoop-cluster` to take the jar containing the MapReduce job as a parameter.
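          The parameterized-launch idea above could be sketched roughly as follows. This is a hypothetical illustration only: it assumes user data of the form `MASTER_HOST=...,CLUSTER_SIZE=...`, and the variable names and parsing are not taken from the actual scripts.

```shell
#!/bin/sh
# Hypothetical startup sketch: read launch parameters from the EC2
# instance-data service and use them to rewrite the Hadoop config.
# MASTER_HOST, CLUSTER_SIZE, and the "key=value,key=value" user-data
# format are assumptions for illustration.

METADATA_URL="http://169.254.169.254/2006-10-01/user-data"

# Parse "MASTER_HOST=master.example.com,CLUSTER_SIZE=4" into variables.
parse_user_data() {
  user_data="$1"
  MASTER_HOST=$(echo "$user_data" | tr ',' '\n' | sed -n 's/^MASTER_HOST=//p')
  CLUSTER_SIZE=$(echo "$user_data" | tr ',' '\n' | sed -n 's/^CLUSTER_SIZE=//p')
}

# On a real instance the startup script would do something like:
#   parse_user_data "$(wget -q -O - "$METADATA_URL")"
# and then substitute the values into hadoop-site.xml before starting
# the daemons.
```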
          James P. White added a comment -

          I'm quite sure the solution to the DNS problem is Zeroconf.

          http://www.ifcx.org/wiki/LocalNetworking.html

          http://zeroconf.org/

          Amazon is already using it for the parameterized launch. That's where the funny "169.254.169.254" address comes from.

          http://docs.amazonwebservices.com/AmazonEC2/dg/2006-10-01/TechnicalFAQ.html#d0e14061

          There are several ways that this can be approached. The one that would help the most people would be to make Hadoop Zeroconf-aware (slaves using service discovery to find the master), but probably the place to start is to just enhance these EC2 scripts.

          Doug Cutting added a comment -

          I don't think these should go in the normal bin/ directory, but I think including them in the distribution tarfile might be good. They could perhaps go in contrib/ec2/bin/?

          Tom White added a comment -

          Yes, contrib/ec2/bin/ sounds like the right place.

          Doug Cutting added a comment -

          Please mark this as "Patch Available" when you feel these scripts are ready for inclusion. Hopefully they'll make the 0.11 release in two weeks.

          Lee Faris added a comment -

          I was thinking more along the lines of calling the EC2 web service directly via Java. The command line tools are thin wrappers around the web service.

          Tom White added a comment -

          I agree that long term it would be more efficient to call the EC2 web service via Java, and these scripts could be the basis for this. At the moment, I'm focusing on getting the scripts working smoothly.

          Tom White added a comment -

          The attached patch includes the Hadoop EC2 scripts in contrib/ec2/bin. I think they are ready for inclusion in the main distribution now.

          I have extended the scripts since the version in the tar.gz file by making them more robust: they no longer have to be unpacked and invoked from the user's home directory. More significantly, I have used a parameterized launch to set cluster size and master hostname. Previously, you had to build an image for a particular cluster size and hostname - now you can build one image and choose the cluster size and host name at launch time. (This is a step towards shared Hadoop images.)

          As for the other improvements, I will create new Jira issues for them, since the basic scripts are in a working state (although I would love feedback if anyone tries them out).

          James - thank you for the suggestion about Zeroconf. I've not had any experience with it, so any help would be appreciated.

          Hadoop QA added a comment -

          +1, because http://issues.apache.org/jira/secure/attachment/12349853/hadoop-884.patch applied and successfully tested against trunk revision r501182.

          Doug Cutting added a comment -

          I just committed this. Thanks, Tom!

          A couple of future improvements to ponder:

          • Perhaps the env file shouldn't be in Subversion; instead, a template should be checked in and copied into place. That way we don't risk checking in an edited version.
          • A bit of documentation, perhaps just a README, should ideally be bundled with this.
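          The template suggestion above could work along these lines. The file names and helper function here are hypothetical, not from the committed patch:

```shell
#!/bin/sh
# Sketch of the template approach: keep only the template under version
# control and copy it into place on first use, so a locally edited env
# file is never checked in by accident. File names are illustrative.

init_env_file() {
  template="$1"   # e.g. contrib/ec2/bin/hadoop-ec2-env.sh.template
  target="$2"     # e.g. contrib/ec2/bin/hadoop-ec2-env.sh
  if [ ! -f "$target" ]; then
    cp "$template" "$target"
    echo "Created $target from template; edit it with your AWS settings."
  fi
}
```

An existing target is left untouched, so local edits survive updates to the template.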

            People

            • Assignee: Tom White
            • Reporter: Tom White
            • Votes: 0
            • Watchers: 1
