Whirr / WHIRR-410

Review automatic image selection

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels: None

      Description

      While testing WHIRR-400 I noticed that the ZooKeeper integration tests fail on aws-ec2 with the automatically selected AMI, but they work as expected with the Amazon Linux AMI. The tests also work as expected on cloudservers-us. This makes me think the failure is not related to our code changes and that we should look for an external factor as the root cause.

      As part of this issue we should think about how to improve the automatic AMI selection mechanism in order to make it more robust and less likely to fail due to AMI upgrades and other external changes.

      Attachments

      1. jclouds-ssh.log
        184 kB
        Andrei Savu
      2. WHIRR-410.patch
        0.7 kB
        Andrei Savu
      3. WHIRR-410-workaround.patch
        3 kB
        Andrei Savu

        Activity

        Adrian Cole added a comment -

        Not sure it is what you want, but imageVersion will lock this in without having to hard-code an image per region. E.g. imageVersion=20110930 is the one that failed today, built on 9/30/2011.
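
        A minimal sketch of that suggestion, assuming a jclouds ComputeService already built for aws-ec2; the wrapper class and method are illustrative, the TemplateBuilder calls are the real API:

        import org.jclouds.compute.ComputeService;
        import org.jclouds.compute.domain.OsFamily;
        import org.jclouds.compute.domain.Template;

        // Sketch only: lock the auto-selected image to a known build so that
        // AMI re-publishes cannot silently change what the tests run on.
        public class PinnedImageExample {
          public static Template pinnedTemplate(ComputeService computeService) {
            return computeService.templateBuilder()
                .osFamily(OsFamily.UBUNTU)
                .osVersionMatches("10.04")
                .imageVersionMatches("20110930")  // the build that failed today
                .build();
          }
        }
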

        Andrei Savu added a comment -

        So then why is this failing now when it wasn't failing a few days ago?

        Adrian Cole added a comment -

        <shrug> We didn't log which image passed testing before. The date on the image is user-defined, not a system date. It is possible that it wasn't published on 9/30 but rather later. Until we have data showing which image ids were tested, it is all guesses.

        Andrei Savu added a comment -

        I will investigate more tomorrow to find some facts and some sort of explanation.

        Andrei Savu added a comment -

        I've done some testing on the 0.6.0 branch. The ZooKeeper integration tests pass as expected with both jclouds 1.1.1 and 1.2.1. This makes me think that something else is broken in trunk. I will tweak TemplateBuilderStrategy a bit and get back with more feedback.

        Andrei Savu added a comment -

        I think the automatic AMI selection works a bit differently in 1.2.1. With the following change the integration tests work as expected:

        --- core/src/main/java/org/apache/whirr/service/jclouds/TemplateBuilderStrategy.java
        +++ core/src/main/java/org/apache/whirr/service/jclouds/TemplateBuilderStrategy.java
        @@ -33,6 +33,7 @@ public class TemplateBuilderStrategy {
             if (clusterSpec.getImageId() != null) {
               templateBuilder.imageId(clusterSpec.getImageId());
             } else {
        +      templateBuilder.os64Bit(true);
               templateBuilder.osFamily(OsFamily.UBUNTU);
               templateBuilder.osVersionMatches("10.04");
        
        Andrei Savu added a comment - edited

        Attached patch. I will look through some of the code changes committed in jclouds between 1.1.1 and 1.2.1 relevant to this change in behaviour.

        Andrei Savu added a comment -

        I still don't understand why I was unable to make it work before by specifying the image-id and the location-id of an Ubuntu 10.04 AMI published by Canonical.
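
        For reference, a sketch of the explicit-selection properties in question; the AMI id below is a placeholder, not the image that was tested:

        # hedged sketch - placeholder AMI id, substitute a real Canonical one
        whirr.location-id=us-east-1
        whirr.image-id=us-east-1/ami-xxxxxxxx
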

        Andrei Savu added a comment -

        I'm running checks using the ZooKeeper integration tests.

        With the os64Bit(true) call in place I get two t1.micro machines running AMI ebs/ubuntu-images/ubuntu-lucid-10.04-amd64-server-20110930 (ami-ad36fbc4), and the integration test works as expected.

        Without the os64Bit(true) call I get two t1.micro machines running AMI ebs/ubuntu-images/ubuntu-lucid-10.04-i386-server-20110930 (ami-ab36fbc2) in us-east-1c, and ssh connectivity issues:

        2011-10-25 21:34:54,166 WARN  [jclouds.ssh] (user thread 1) << (ubuntu:rsa[fingerprint(75:ae:b7:5a:27:be:a2:4e:3f:a8:65:f0:08:51:2c:7e),sha1(09:45:59:83:4f:02:72:10:e6:6c:10:05:2f:21:51:1c:4d:95:21:5e)]@50.16.38.124:22) error acquiring Session(): Timeout expired
        net.schmizz.sshj.connection.ConnectionException: Timeout expired
        	at net.schmizz.sshj.connection.ConnectionException$1.chain(ConnectionException.java:32)
        	at net.schmizz.sshj.connection.ConnectionException$1.chain(ConnectionException.java:26)
        	at net.schmizz.concurrent.Promise.retrieve(Promise.java:139)
        	at net.schmizz.concurrent.Event.await(Event.java:103)
        	at net.schmizz.sshj.connection.channel.direct.AbstractDirectChannel.open(AbstractDirectChannel.java:68)
        	at net.schmizz.sshj.SSHClient.startSession(SSHClient.java:643)
        	at org.jclouds.sshj.SshjSshClient$4.create(SshjSshClient.java:447)
        	at org.jclouds.sshj.SshjSshClient$4.create(SshjSshClient.java:434)
        	at org.jclouds.sshj.SshjSshClient.acquire(SshjSshClient.java:220)
        	at org.jclouds.sshj.SshjSshClient$ExecConnection.create(SshjSshClient.java:477)
        	at org.jclouds.sshj.SshjSshClient$ExecConnection.create(SshjSshClient.java:460)
        	at org.jclouds.sshj.SshjSshClient.acquire(SshjSshClient.java:220)
        	at org.jclouds.sshj.SshjSshClient.exec(SshjSshClient.java:496)
        	at org.jclouds.compute.callables.SudoAwareInitManager.runCommand(SudoAwareInitManager.java:103)
        	at org.jclouds.compute.callables.SudoAwareInitManager.runAction(SudoAwareInitManager.java:85)
        	at org.jclouds.compute.callables.BlockUntilInitScriptStatusIsZeroThenReturnOutput$1.apply(BlockUntilInitScriptStatusIsZeroThenReturnOutput.java:79)
        	at org.jclouds.compute.callables.BlockUntilInitScriptStatusIsZeroThenReturnOutput$1.apply(BlockUntilInitScriptStatusIsZeroThenReturnOutput.java:75)
        	at org.jclouds.predicates.RetryablePredicate.apply(RetryablePredicate.java:70)
        	at org.jclouds.compute.callables.BlockUntilInitScriptStatusIsZeroThenReturnOutput$3.run(BlockUntilInitScriptStatusIsZeroThenReturnOutput.java:113)
        	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        	at java.lang.Thread.run(Thread.java:680)
        Caused by: java.util.concurrent.TimeoutException: Timeout expired
        	... 23 more
        

        Do we have something in the sshkey install script that fails on 32bit machines?

        Andrei Savu added a comment -

        My conclusion for tonight: there is nothing wrong with the automatic selection mechanism and jclouds 1.2.1 works fine. Until now we haven't used 32bit AMIs on t1.micro and I suspect that the user / sshkey setup script has a problem - or maybe something else is not working as expected in Whirr.

        Adrian Cole added a comment -

        Well, we'd have to isolate the issue wrt the 32bit OS dist. This isn't the first Ubuntu dist that's had ssh connectivity issues (e.g. the first Natty cluster compute one did as well). Clearly there's something screwy on that image, and it is possibly worth isolating (whether by getting a copy of the sshd logs, etc.). What makes it worth it is that screwy images do exist, and the time to repair the configuration is likely less than the time spent troubleshooting this every so often and then saying "use another image".

        That said, I think the jclouds templateBuilder can be a bit smarter wrt 64bit. Currently, it doesn't sort on 64bit unless you've explicitly specified it. So, instead, the images sort lexicographically; since amd64 is before i386, we end up preferring 32bit AMIs (at least those with this naming convention). This isn't ideal. I suspect this was in place to ensure that m1.small can be selected by default, as it is a 32bit machine. However, it is no longer the cheapest, and only in aws-ec2 is m1.small 32bit (e.g. in eucalyptus, etc. it is 64bit). My thought is that we could try changing jclouds TemplateBuilderImpl to prefer 64bit (softly), just making sure it doesn't artificially fail when only a 32bit image is available. This can be tested by playing with the "stub" computeService so that it only has 32bit images and seeing if the expressions hold (e.g. StubComputeServiceAdapter line 131).
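
        An illustrative sketch, not the actual jclouds code, of an image ordering along those lines:

        import com.google.common.collect.Ordering;
        import com.google.common.primitives.Booleans;
        import org.jclouds.compute.domain.Image;

        // Sketch only: 64bit images sort last, so a "pick the highest"
        // selection prefers them, while a provider that only has 32bit
        // images still yields a valid result instead of failing artificially.
        public class Prefer64BitOrdering extends Ordering<Image> {
          @Override
          public int compare(Image left, Image right) {
            return Booleans.compare(left.getOperatingSystem().is64Bit(),
                                    right.getOperatingSystem().is64Bit());
          }
        }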

        Andrei Savu added a comment -

        It looks like this is no longer a problem. Integration tests are working as expected in trunk.

        Andrei Savu added a comment -

        I was wrong. This is still happening for ami-ab36fbc2; with the Chef tests I had used a larger instance and probably a 64bit OS.

        Alex Heneveld added a comment -

        I'm currently seeing ERROR messages intermittently, but they seem to be benign (the Chef tests pass), using the default 32-bit Ubuntu ami-ab36fbc2 in us-east on small instances. (I think this should be classed as a bug in jclouds, described more below, but not a serious one.)

        Part of me suspects there are other issues with these Ubuntu images, because I've seen more worrying failures (timeouts a couple of minutes into setup), but I can't isolate them and they might be my own fault (in some cases I know they are: Whirr picking up active servers from previous test runs).

        As for the jclouds default image selection strategy ("give me some decent standard recent Linux"), I like it, and the querying is really useful when I care. To guarantee image reliability my preference would be for jclouds to maintain and publish a list of favourite images for popular OS/arch combos at popular providers+locations, and to support looking these up at runtime (also, maybe caching in ~/.jclouds could be faster than the current query strategy?). It would be interesting to know if jclouds users would like that.

        Benign ERROR Messages

        The worrying but benign messages I'm seeing are below.

        Based on auth.log on the server, I think these are due to sshj connection attempts while the server has a race: sshd has started but the host key is not yet set up (auth.log shows "error: Could not load host key: /etc/ssh/ssh_host_rsa_key", and the same for dsa), causing sshd to break the transport and sshj to complain loudly. These should probably be logged at <=INFO in jclouds, suppressing sshj's eagerness to log ERROR.

        terminal
        2011-10-26 11:54:14,606 ERROR [net.schmizz.sshj.transport.TransportImpl] (reader) Dying because - net.schmizz.sshj.transport.TransportException: Broken transport; encountered EOF
        2011-10-26 11:54:14,607 ERROR [net.schmizz.concurrent.Promise] (user thread 3) <<kex done>> woke to: net.schmizz.sshj.transport.TransportException: Broken transport; encountered EOF
        2011-10-26 11:54:14,632 WARN  [jclouds.ssh] (user thread 3) << (ubuntu:rsa[fingerprint(e8:73:76:c8:b7:3a:22:dc:64:f1:10:60:c9:bc:a9:2d),sha1(ca:30:55:b6:ba:34:83:78:9d:15:2c:a1:c9:20:4e:4f:37:4a:8a:d4)]@107.22.32.196:22) error acquiring SSHClient(timeout=60000): Broken transport; encountered EOF
        net.schmizz.sshj.transport.TransportException: Broken transport; encountered EOF
             at net.schmizz.sshj.transport.Reader.run(Reader.java:70)
        ...
        

        The above doesn't end up in any of the test-data/ logs (using default whirr test logging), but I do see:

        jclouds.log
        2011-10-26 11:54:14,634 DEBUG [org.jclouds.http.handlers.BackoffLimitedRetryHandler] (user thread 3) Retry 1/7: delaying for 200 ms: (ubuntu:rsa[fingerprint(e8:73:76:c8:b7:3a:22:dc:64:f1:10:60:c9:bc:a9:2d),sha1(ca:30:55:b6:ba:34:83:78:9d:15:2c:a1:c9:20:4e:4f:37:4a:8a:d4)]@107.22.32.196:22) error acquiring SSHClient(timeout=60000): Broken transport; encountered EOF
        2011-10-26 11:54:15,099 DEBUG [org.jclouds.http.handlers.BackoffLimitedRetryHandler] (user thread 3) Retry 2/7: delaying for 800 ms: (ubuntu:rsa[fingerprint(e8:73:76:c8:b7:3a:22:dc:64:f1:10:60:c9:bc:a9:2d),sha1(ca:30:55:b6:ba:34:83:78:9d:15:2c:a1:c9:20:4e:4f:37:4a:8a:d4)]@107.22.32.196:22) error acquiring SSHClient(timeout=60000): Broken transport; encountered EOF
        2011-10-26 11:54:16,184 DEBUG [org.jclouds.http.handlers.BackoffLimitedRetryHandler] (user thread 3) Retry 3/7: delaying for 1800 ms: (ubuntu:rsa[fingerprint(e8:73:76:c8:b7:3a:22:dc:64:f1:10:60:c9:bc:a9:2d),sha1(ca:30:55:b6:ba:34:83:78:9d:15:2c:a1:c9:20:4e:4f:37:4a:8a:d4)]@107.22.32.196:22) error acquiring SSHClient(timeout=60000): Broken transport; encountered EOF
        2011-10-26 11:54:18,343 DEBUG [org.jclouds.http.handlers.BackoffLimitedRetryHandler] (user thread 3) Retry 4/7: delaying for 2000 ms: (ubuntu:rsa[fingerprint(e8:73:76:c8:b7:3a:22:dc:64:f1:10:60:c9:bc:a9:2d),sha1(ca:30:55:b6:ba:34:83:78:9d:15:2c:a1:c9:20:4e:4f:37:4a:8a:d4)]@107.22.32.196:22) error acquiring SSHClient(timeout=60000): Broken transport; encountered EOF
        

        We could probably do with cleaner logging in jclouds, as there's no clue what sshj is doing (e.g. machine status, script, or script status check); that would be fantastically useful.

        (Will raise this issue at jclouds once we're clear what is jclouds, what is whirr, and what is ubuntu/amazon.)

        David Alves added a comment -

        I can confirm Alex's errors are benign and common. They happened to me too when running itests against ec2 without specifying anything, which means the default m1.medium image.
        The tests all passed though (and I double-checked that they couldn't pass if what was supposed to happen didn't actually happen).

        Alex Heneveld added a comment -

        Have created jclouds issue for the benign-but-scary log messages: http://code.google.com/p/jclouds/issues/detail?id=738

        Andrei Savu added a comment -

        I think we should start by logging only ERROR level for jclouds.ssh.

        Adrian Cole added a comment -

        also: net.schmizz: level=warn
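
        Taken together, a sketch of the proposed log4j settings; the logger names are the ones appearing in the logs above:

        # quiet the retried ssh noise; real failures still surface
        log4j.logger.jclouds.ssh=ERROR
        log4j.logger.net.schmizz=WARN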

        Alex Heneveld added a comment -

        In light of:

        • logging fixes upstream in jclouds
        • support for explicit image IDs in whirr
        • no recent reported issues with amazon ubuntu

        Can this be closed?

        Andrei Savu added a comment -

        Checking now if this is still an issue on Amazon.

        Andrei Savu added a comment -

        This is still a problem when using t1.micro with a 32bit OS, but I think I've found the root cause: we are flooding the ssh server with too many status update requests for the init script (executed on bootstrap) - check the attached log file.

        Is there a way to rate limit this? (i.e. how often jclouds checks whether a script has finished executing)

        Andrei Savu added a comment -

        I'm going to try a possible workaround: setting jclouds.user-threads = 1
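
        As a one-line property sketch (jclouds.user-threads is the standard jclouds constant; whether it helps here is exactly what is being tested):

        # cap the pool that runs the per-node ssh polling
        jclouds.user-threads=1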

        Andrei Savu added a comment -

        The class we need to look at is org.jclouds.compute.callables.BlockUntilInitScriptStatusIsZeroThenReturnOutput - we need a way to specify the time between status checks.
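
        An illustrative sketch of the missing knob; the property name is hypothetical, jclouds 1.2.x does not expose one:

        // Sketch only: a configurable pause between init-script status polls,
        // which is what BlockUntilInitScriptStatusIsZeroThenReturnOutput lacks.
        public final class StatusPoller {
          public interface ScriptStatusCheck { boolean isFinished(); }

          public static void awaitCompletion(ScriptStatusCheck check)
              throws InterruptedException {
            // "poll.interval.ms" is a made-up property name for illustration
            long intervalMs = Long.getLong("poll.interval.ms", 500L);
            while (!check.isFinished()) {
              Thread.sleep(intervalMs);  // rate-limit the ssh status requests
            }
          }
        }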

        Andrei Savu added a comment -

        I can't get this to work; I tried 1, 2, 3 and 4 as values for jclouds.user-threads. I guess this is too small. We should address this in jclouds for 1.3.0: http://code.google.com/p/jclouds/issues/detail?id=746

        As a workaround I'm going to try using a larger machine and report back.

        Andrei Savu added a comment -

        Seems to be working fine with m1.small. The log file shows that jclouds is hammering the remote machine with status update requests every 1/2 seconds. I will comment out whirr.hardware-min-ram=512 from all integration tests and create a patch.
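
        A sketch of what the integration-test configs look like after that change; whirr.hardware-id is shown as an assumed explicit alternative:

        # whirr.hardware-min-ram=512   (commented out so m1.small gets picked)
        whirr.hardware-id=m1.small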

        Andrei Savu added a comment -

        Attached patch as described above. I think we should commit this and close the issue. I don't see this as a release-blocking issue because anyone doing something serious with Hadoop in EC2 is not going to use t1.micro.

        What do you think?

        David Alves added a comment -

        Nice catch, Andrei! It took some real digging.

        I think we should commit this and close the issue. I don't see this as a release-blocking issue because anyone doing something serious with Hadoop in EC2 is not going to use t1.micro.

        +1, I totally agree

        Andrei Savu added a comment -

        Committed workaround. We should have a permanent fix in 0.8.0 as we update to jclouds 1.3.0. Thanks guys for looking into this!

        Alex Heneveld added a comment -

        +1, good call to skip micro here for now & let's make sure jclouds 1.3.0 plays nicely with micros (to the extent possible!)


          People

          • Assignee: Andrei Savu
          • Reporter: Andrei Savu
          • Votes: 0
          • Watchers: 0
