SPARK-2396: Spark EC2 scripts fail when trying to log in to EC2 instances

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: EC2
    • Environment: Windows 8, Cygwin and command prompt, Python 2.7

    Description

      I cannot seem to successfully start up a Spark EC2 cluster using the spark-ec2 script.

      I'm using variations on the following command:
      ./spark-ec2 --instance-type=m1.small --region=us-west-1 --spot-price=0.05 --spark-version=1.0.0 -k my-key-name -i my-key-name.pem -s 1 launch spark-test-cluster

      The script always allocates the EC2 instances without much trouble, but it can never seem to complete the SSH step that installs Spark on the cluster. It always complains about my SSH key. If I then try to log in manually with my SSH key, like this:

      ssh -i my-key-name.pem root@<insert ip of my instance here>

      it fails. However, if I log in to the AWS console, click on my instance, and select "Connect", it displays the instructions for SSHing into my instance (which are no different from the ssh command above). After viewing them, if I rerun the same ssh command, I'm able to log in.
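
      My guess is that the instances simply aren't accepting SSH connections yet when the script first tries, and that the manual retry only works because enough time has passed by then. A small polling loop along these lines (a hypothetical helper, not code from spark_ec2.py; the timing values are made up) would confirm whether it's purely a timing issue:

      import subprocess
      import time

      def wait_for_ssh(host, key_file, user="root", tries=10, delay=30):
          """Return True once the host accepts an SSH connection, False if it never does."""
          for _ in range(tries):
              ret = subprocess.call([
                  "ssh", "-i", key_file,
                  "-o", "StrictHostKeyChecking=no",
                  "-o", "ConnectTimeout=10",
                  "%s@%s" % (user, host), "true",
              ])
              if ret == 0:
                  return True
              time.sleep(delay)
          return False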

      Next, if I try to rerun the spark-ec2 command from above (replacing "launch" with "start"), the script logs in and starts installing Spark. However, it eventually errors out with the following output:

      Cloning into 'spark-ec2'...
      remote: Counting objects: 1465, done.
      remote: Compressing objects: 100% (697/697), done.
      remote: Total 1465 (delta 485), reused 1465 (delta 485)
      Receiving objects: 100% (1465/1465), 228.51 KiB | 287 KiB/s, done.
      Resolving deltas: 100% (485/485), done.
      Connection to ec2-<my-clusters-ip>.us-west-1.compute.amazonaws.com closed.
      Searching for existing cluster spark-test-cluster...
      Found 1 master(s), 1 slaves
      Starting slaves...
      Starting master...
      Waiting for instances to start up...
      Waiting 120 more seconds...
      Deploying files to master...
      Traceback (most recent call last):
        File "./spark_ec2.py", line 823, in <module>
          main()
        File "./spark_ec2.py", line 815, in main
          real_main()
        File "./spark_ec2.py", line 806, in real_main
          setup_cluster(conn, master_nodes, slave_nodes, opts, False)
        File "./spark_ec2.py", line 450, in setup_cluster
          deploy_files(conn, "deploy.generic", opts, master_nodes, slave_nodes, modules)
        File "./spark_ec2.py", line 593, in deploy_files
          subprocess.check_call(command)
        File "E:\windows_programs\Python27\lib\subprocess.py", line 535, in check_call
          retcode = call(*popenargs, **kwargs)
        File "E:\windows_programs\Python27\lib\subprocess.py", line 522, in call
          return Popen(*popenargs, **kwargs).wait()
        File "E:\windows_programs\Python27\lib\subprocess.py", line 710, in __init__
          errread, errwrite)
        File "E:\windows_programs\Python27\lib\subprocess.py", line 958, in _execute_child
          startupinfo)
      WindowsError: [Error 2] The system cannot find the file specified
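
      If I'm reading the traceback right, the WindowsError is raised by Popen itself, which means Windows couldn't find the program the script tried to execute, not any file on the cluster. deploy_files shells out to an external command (rsync, as far as I can tell from the 1.0.0 script), and rsync normally isn't on the PATH in a plain Windows command prompt. A guard along these lines (a hypothetical sketch, not part of spark_ec2.py; find_executable is in the Python 2.7 standard library) would at least make the failure readable:

      import subprocess
      from distutils.spawn import find_executable

      def checked_call(command):
          """Like subprocess.check_call, but fail clearly if the program is missing."""
          if find_executable(command[0]) is None:
              raise RuntimeError("required program not found on PATH: %s" % command[0])
          subprocess.check_call(command)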

      So, in short, am I missing something or is this a bug? Any help would be appreciated.

      Other notes:
      - I've tried both the us-west-1 and us-east-1 regions.
      - I've tried several different instance types.
      - I've tried playing with the permissions on the SSH key (600, 400, etc.), but to no avail.

          People

            Assignee: Unassigned
            Reporter: Stephen M. Hopper (enraged_ginger)
            Votes: 0
            Watchers: 3
