Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-749

spark-ec2 fails to detect cluster after ssh error during launch

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.8.1
    • EC2
    • None

    Description

      I tried to launch an EC2 cluster using Patrick's new version of the EC2 script, running this command

       ./spark-ec2 -i ~/.ssh/id_rsa -s 10 -t m1.large --spot-price 0.25 --region us-west-1 launch test_cluster
      

      This launched a cluster, but ssh failed because I didn't configure things properly

      Setting up security groups...
      Searching for existing cluster test_cluster...
      Spark AMI: ami-61ffd024
      Launching instances...
      Requesting 10 slaves as spot instances with price $0.250
      Waiting for spot instances to be granted...
      0 of 10 slaves granted, waiting longer
      ...
      All 10 slaves granted
      Launched master in us-west-1a, regid = r-790f2320
      Waiting for instances to start up...
      Waiting 120 more seconds...
      Copying SSH key /Users/joshrosen/.ssh/id_rsa to master...
      Warning: Permanently added 'ec2-54-241-179-230.us-west-1.compute.amazonaws.com,54.241.179.230' (RSA) to the list of known hosts.
      Permission denied (publickey).
      

      When I tried to shut this cluster down, the EC2 script claims that it couldn't find a cluster. I see the same message when using the 0.7 version of spark-ec2.

      If I view the EC2 administration console, it tells me that I have a number of machines running in us-west-1 with the right security groups.

      Although the script finds the active instance requests, it can't find any security group names for those requests, so it fails to identify those machines as belonging to our spark-ec2 cluster.

      I think the culprit is the line

      group_names = [g.name for g in res.groups]
      

      in get_existing_cluster().

      If I replace this with

      group_names = list(set(g.name for g in i.groups for i in res.instances))
      

      then it finds the group names and detects the cluster.

      Does anyone know what's going on here? I'm not familiar enough with Boto to explain why the Instance would have groups attribute while its Request's groups are empty.

      Attachments

        Activity

          People

            ilikerps Aaron Davidson
            joshrosen Josh Rosen
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: