  Spark / SPARK-2396

Spark EC2 scripts fail when trying to log in to EC2 instances


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: EC2
    • Labels:
    • Environment:

      Windows 8, Cygwin and command prompt, Python 2.7

      Description

      I cannot seem to successfully start up a Spark EC2 cluster using the spark-ec2 script.

      I'm using variations on the following command:
      ./spark-ec2 --instance-type=m1.small --region=us-west-1 --spot-price=0.05 --spark-version=1.0.0 -k my-key-name -i my-key-name.pem -s 1 launch spark-test-cluster

      The script always allocates the EC2 instances without much trouble, but it never completes the SSH step that installs Spark on the cluster; it always complains about my SSH key. If I try to log in directly with the key, like this:

      ssh -i my-key-name.pem root@<insert ip of my instance here>

      it fails. However, if I log in to the AWS console, click on my instance, and select "connect", it displays the instructions for SSHing into my instance (which are identical to the ssh command above). After that, rerunning the same SSH command succeeds and I'm able to log in.
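      This fail-then-succeed pattern is consistent with the instance simply not accepting SSH connections yet right after launch; the spark-ec2 script copes with that by retrying the connection. A minimal sketch of that retry idea (the function name and timings below are my own, not taken from spark_ec2.py):

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=30):
    """Run `cmd`, retrying a few times; return True once it exits 0.

    Freshly launched EC2 instances can refuse SSH connections for a
    minute or two while they boot, so a single failure is not fatal.
    """
    for attempt in range(1, attempts + 1):
        if subprocess.call(cmd) == 0:
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False


# Example (hypothetical key and host names):
# run_with_retries(["ssh", "-i", "my-key-name.pem",
#                   "root@ec2-xx-xx-xx-xx.us-west-1.compute.amazonaws.com",
#                   "true"])
```

      Waiting a minute or two after "launch" completes, or simply retrying, often gets past this stage.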

      Next, if I try to rerun the spark-ec2 command from above (replacing "launch" with "start"), the script logs in and starts installing Spark. However, it eventually errors out with the following output:

      Cloning into 'spark-ec2'...
      remote: Counting objects: 1465, done.
      remote: Compressing objects: 100% (697/697), done.
      remote: Total 1465 (delta 485), reused 1465 (delta 485)
      Receiving objects: 100% (1465/1465), 228.51 KiB | 287 KiB/s, done.
      Resolving deltas: 100% (485/485), done.
      Connection to ec2-<my-clusters-ip>.us-west-1.compute.amazonaws.com closed.
      Searching for existing cluster spark-test-cluster...
      Found 1 master(s), 1 slaves
      Starting slaves...
      Starting master...
      Waiting for instances to start up...
      Waiting 120 more seconds...
      Deploying files to master...
      Traceback (most recent call last):
        File "./spark_ec2.py", line 823, in <module>
          main()
        File "./spark_ec2.py", line 815, in main
          real_main()
        File "./spark_ec2.py", line 806, in real_main
          setup_cluster(conn, master_nodes, slave_nodes, opts, False)
        File "./spark_ec2.py", line 450, in setup_cluster
          deploy_files(conn, "deploy.generic", opts, master_nodes, slave_nodes, modules)
        File "./spark_ec2.py", line 593, in deploy_files
          subprocess.check_call(command)
        File "E:\windows_programs\Python27\lib\subprocess.py", line 535, in check_call
          retcode = call(*popenargs, **kwargs)
        File "E:\windows_programs\Python27\lib\subprocess.py", line 522, in call
          return Popen(*popenargs, **kwargs).wait()
        File "E:\windows_programs\Python27\lib\subprocess.py", line 710, in __init__
          errread, errwrite)
        File "E:\windows_programs\Python27\lib\subprocess.py", line 958, in _execute_child
          startupinfo)
      WindowsError: [Error 2] The system cannot find the file specified
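      A WindowsError [Error 2] raised from subprocess.check_call usually means the external program Python tried to launch could not be found on PATH, not anything cluster-side; deploy_files shells out to an external copy tool (rsync, as far as I can tell), which a non-Cygwin Python on Windows may not see. A hedged pre-flight check along these lines could confirm that (the helper name and exact tool list are my assumptions; note that shutil.which needs Python 3.3+, while distutils.spawn.find_executable is the rough Python 2.7 equivalent):

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


# Tools the deploy step is assumed to invoke:
missing = missing_tools(["ssh", "rsync"])
if missing:
    # On Windows this absence surfaces as:
    # WindowsError: [Error 2] The system cannot find the file specified
    print("Missing from PATH: " + ", ".join(missing))
```

      If rsync turns out to be the missing command, installing it in Cygwin and running the script from a Cygwin shell (so Python inherits Cygwin's PATH) would be the thing to try.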

      So, in short, am I missing something or is this a bug? Any help would be appreciated.

      Other notes:
      - I've tried both the us-west-1 and us-east-1 regions.
      - I've tried several different instance types.
      - I've tried adjusting the permissions on the SSH key (600, 400, etc.), but to no avail.

            People

             • Assignee:
               Unassigned
             • Reporter:
               enraged_ginger (Stephen M. Hopper)
