WHIRR-268

Whirr hangs when the file '$HOME/.ssh/known_hosts' includes an obsolete identifier for a host's IP address.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.5.0
    • Component/s: core
    • Labels: None

      Description

      My properties file is as follows:

      $ cat cluster.properties 
      whirr.cluster-name=mycluster
      whirr.instance-templates=1 jt+nn,10 dn+tt
      whirr.provider=ec2
      whirr.identity=XXXXXXXXXXXXXXXXXXXX
      whirr.credential=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
      whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
      whirr.location-id=us-east-1d
      #whirr.hardware-id=m1.small
      whirr.hardware-id=c1.medium
      whirr.service-name=hadoop
      # for m1.small
      #whirr.image-id=us-east-1/ami-2caa5845
      whirr.image-id=us-east-1/ami-7000f019
      
      $ whirr/bin/whirr launch-cluster --config cluster.properties
      Bootstrapping cluster
      Configuring template
      Starting 10 node(s) with roles [tt, dn]
      Configuring template
      Starting 1 node(s) with roles [jt, nn]
      Nodes started: [[id=us-east-1/i-ba63a7d5, providerId=i-ba63a7d5, tag=mycluster, name=null, 
      location=[id=us-east-1a, scope=ZONE, description=us-east-1a, parent=us-east-1], uri=null, 
      imageId=us-east-1/ami-7000f019, os=[name=null, family=ubuntu, version=10.04, arch=paravirtual, 
      is64Bit=false, description=ubuntu-images-us/ubuntu-lucid-10.04-i386-server-
      20110201.1.manifest.xml], userMetadata={}, state=RUNNING, privateAddresses=[10.245.106.99], 
      publicAddresses=[184.72.166.132], hardware=[id=c1.medium, providerId=c1.medium, name=c1.medium, 
      processors=[[cores=2.0, speed=2.5]], ram=1740, volumes=[[id=null, type=LOCAL, size=10.0, 
      device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=340.0, 
      device=/dev/sda2, durable=false, isBootDevice=false]], supportsImage=Not(is64Bit())]]]
      

      As you can see from the message above, Whirr is trying to start a host whose IP address is '10.245.106.99', but it hangs and never starts the Hadoop service. So I tried to log in to the host '10.245.106.99' via ssh.

      hadoop@domU-12-31-39-00-A5-21:~$ ssh ubuntu@10.245.106.99
      @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
      @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
      Someone could be eavesdropping on you right now (man-in-the-middle attack)!
      It is also possible that the RSA host key has just been changed.
      The fingerprint for the RSA key sent by the remote host is
      b1:62:ad:fd:3f:a7:29:df:7f:0c:91:ca:ed:66:8e:3a.
      Please contact your system administrator.
      Add correct host key in /home/hadoop/.ssh/known_hosts to get rid of this message.
      Offending key in /home/hadoop/.ssh/known_hosts:8
      RSA host key for 10.245.106.99 has changed and you have requested strict checking.
      Host key verification failed.
      

      I suspect that Whirr hangs whenever '$HOME/.ssh/known_hosts' contains such an obsolete entry. Although this case may occur rarely, anyone who launches many instances on EC2 is likely to run into it eventually. Whirr needs to prevent the hang caused by obsolete ssh host identifiers.
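
      For reference, a stale known_hosts entry for a single address can be removed by hand with OpenSSH's ssh-keygen; this is only a manual workaround, not a fix in Whirr itself:

      $ ssh-keygen -R 10.245.106.99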

      Attachments

      1. WHIRR-268.patch
        3 kB
        Adrian Cole
      2. WHIRR-268.patch
        0.7 kB
        Andrei Savu

        Activity

        Tom White added a comment -

        I've seen this occasionally and have had to edit known_hosts manually. Is there a better way to deal with this? At the very least we should put guidance in the docs.

        Adrian Cole added a comment -

        I'm guessing from the log output that this is coming from the ssh proxy, which uses command-line ssh.

        We should be able to work around this by adding the following to the ssh args:
        -o StrictHostKeyChecking=no

        http://www.symantec.com/connect/articles/ssh-host-key-protection
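
        For illustration, here is what that option looks like on a plain OpenSSH command line, using the address from the report; a sketch, not the exact invocation Whirr generates:

        $ ssh -o StrictHostKeyChecking=no ubuntu@10.245.106.99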

        Tom White added a comment -

        Even when StrictHostKeyChecking is set to 'no' the keys are added to known_hosts - the problem is that nothing removes them, so there is a risk of conflicts if you get the same host for another cluster.

        Adrian Cole added a comment -

        Ahh, sorry. I see what you mean. This is from within the cluster, and not the user's laptop?

        Maybe we can address this in configure()?

        Tom White added a comment -

        I think the hanging may be coincidental since jclouds doesn't use known_hosts AFAIK. But it is a problem with the Hadoop proxy, which is run from the user's laptop, or if the user tries to ssh into a machine in the cluster.

        Hyunsik Choi added a comment - edited

        I found an ad-hoc solution: add the ssh options '-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no'. Another way is to modify the global ssh configuration file (/etc/ssh/ssh_config) or the user-specific configuration file (${HOME}/.ssh/config) with the following lines:

        Host 10.*.*.*
        StrictHostKeyChecking no
        UserKnownHostsFile=/dev/null

        How about this way?
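
        For example, combining both options in a single command line (using the reporter's host for illustration):

        $ ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ubuntu@10.245.106.99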

        Tom White added a comment -

        Looks like the best solution I've seen.

        Adrian Cole added a comment -

        Let's do this for all private ranges?
        // 24-bit block (/8 prefix, 1 class A network): 10.0.0.0 – 10.255.255.255
        // 20-bit block (/12 prefix, 16 class B networks): 172.16.0.0 – 172.31.255.255
        // 16-bit block (/16 prefix, 256 class C networks): 192.168.0.0 – 192.168.255.255

        Hyunsik Choi added a comment -

        @Adrian That's a good idea. I know how to cover both the class A and class C private ranges; they would be as follows:

        Host 10.*.*.*
        StrictHostKeyChecking no
        UserKnownHostsFile=/dev/null
        
        Host 192.168.*.*
        StrictHostKeyChecking no
        UserKnownHostsFile=/dev/null
        

        However, I cannot find a way to express the class B private range. According to the ssh_config man page, OpenSSH seems to support only '*' and '?' in host patterns.

        Does anyone know how?
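
        One possibility, since a Host line accepts a whitespace-separated list of patterns, is to enumerate the sixteen /16 blocks of that range by hand; a sketch only, which may differ from what the attached patches do:

        Host 172.16.* 172.17.* 172.18.* 172.19.* 172.20.* 172.21.* 172.22.* 172.23.* 172.24.* 172.25.* 172.26.* 172.27.* 172.28.* 172.29.* 172.30.* 172.31.*
        StrictHostKeyChecking no
        UserKnownHostsFile=/dev/null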

        Andrei Savu added a comment -

        Trivial patch as discussed.

        Adrian Cole added a comment -

        This patch includes the prior changes and then adds the /etc/ssh/ssh_config rules.

        Tested on cloudservers-us.

        Andrei Savu added a comment -

        Adrian, I don't understand why we need to update ssh_config (global ssh client settings) on the nodes. We are only seeing this problem on the client running Whirr and nodes always start with an empty known_hosts file. By doing this we make nodes vulnerable to man-in-the-middle attacks inside the cloud provider network. The first patch should be enough to solve the reported issue.

        Adrian Cole added a comment -

        No problem using the previous patch. I thought that Hadoop itself depends on ssh inter-connectivity inside the cloud. Just ignore or remove the patch I sent.

        Andrei Savu added a comment -

        I've just committed the first version of the patch. Thanks guys!


          People

          • Assignee: Andrei Savu
          • Reporter: Hyunsik Choi
          • Votes: 0
          • Watchers: 1
