Whirr
WHIRR-414

Whirr can exit with a non-zero return code yet leave unterminated (orphaned) host instances

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: core
    • Labels:
      None
    • Environment:

      EC2, commandline whirr

      Description

      Whirr can fail to completely start a cluster, and it indicates this with a non-zero return code. In many (currently intermittent) partial-failure scenarios, resources remain active (EC2 machine instances, in my experience) that are not cleaned up.

      The log contains "IOException: Too many instance failed while bootstrapping!" when I have seen orphaned nodes.

      A non-zero return code should guarantee that all resources are cleaned up. Without this post-condition, these failures require manual inspection and cleanup to stop useless expenses (which is why I marked this bug critical; it needs to be addressed for any kind of cron-job-triggered Whirr).
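
      For illustration, the post-condition being asked for is what a cron-driven client would otherwise have to enforce by hand. Below is a minimal sketch against the ClusterController API (the recipe file name is a placeholder, not from this issue):

          import org.apache.commons.configuration.PropertiesConfiguration;
          import org.apache.whirr.Cluster;
          import org.apache.whirr.ClusterController;
          import org.apache.whirr.ClusterSpec;

          public class LaunchOrCleanUp {
            public static void main(String[] args) throws Exception {
              // "whirr.properties" is a placeholder for your own recipe file.
              ClusterSpec spec = ClusterSpec.withTemporaryKeys(
                  new PropertiesConfiguration("whirr.properties"));
              ClusterController controller = new ClusterController();
              try {
                Cluster cluster = controller.launchCluster(spec);
                System.out.println("Started " + cluster.getInstances().size() + " instances");
              } catch (Exception e) {
                // The desired post-condition: a failed launch must not leave
                // instances running, so tear everything down before exiting
                // with a non-zero code.
                controller.destroyCluster(spec);
                System.exit(1);
              }
            }
          }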

      1. WHIRR-414-ignore-missing-instances-file.patch
        3 kB
        Andrei Savu
      2. WHIRR-414-ignore-missing-instances-file.patch
        3 kB
        Andrei Savu
      3. WHIRR-414.patch
        6 kB
        Andrei Savu
      4. WHIRR-414.patch
        8 kB
        Andrei Savu
      5. WHIRR-414.patch
        7 kB
        David Alves
      6. WHIRR-414.patch
        7 kB
        David Alves
      7. WHIRR-414.patch
        2 kB
        Andrei Savu

        Activity

        Andrei Savu added a comment -

        Committed. Thanks David for reviewing!

        David Alves added a comment -

        +1 fair enough (+1 on the patch too).

        Andrei Savu added a comment -

        Because I think that sometimes we want the method to throw an exception rather than doing a check for an empty cluster.

        David Alves added a comment -

        Andrei, sorry for not looking into this sooner.

        Looks good. Small nit: why not make load() return an empty cluster whenever there is nothing to load, instead of adding tryLoadOrEmpty()?
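
        To make the trade-off concrete, here is a hypothetical sketch of the two styles (names and signatures are illustrative, not the committed ClusterStateStore code):

            import java.io.IOException;
            import org.apache.whirr.Cluster;

            public abstract class ClusterStateStore {

              /** Style 1: fail loudly when no state has been saved. */
              public abstract Cluster load() throws IOException;

              /** Style 2: map "no saved state" to an empty cluster. */
              public Cluster tryLoadOrEmpty() {
                try {
                  return load();
                } catch (IOException e) {
                  return emptyCluster(); // assumed helper: a Cluster with no instances
                }
              }

              /** Assumed helper for this sketch. */
              protected abstract Cluster emptyCluster();
            }

        Keeping both methods lets callers that treat missing state as an error call load() directly, which is the point Andrei makes in his reply above.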

        Andrei Savu added a comment -

        Made the log message more informative.

        Andrei Savu added a comment -

        Return an empty Cluster instance if the instances file is missing. We will no longer execute destroy scripts, but this should be expected on install/configure failure.

        Andrei Savu added a comment -

        While working on WHIRR-416 I have realised that if we get a failure in launchCluster, the instances file does not exist. I'm reopening the issue to attach a patch that handles that scenario.

        Andrei Savu added a comment -

        Committed. Thanks guys!

        Andrei Savu added a comment -

        Removed whitespace changes and fixed the typo. Tom, I will take your "looks good" as a +1.

        Tom White added a comment -

        Looks good.

        Nits: lots of whitespace changes; "Wether" should be "Whether".

        Andrei Savu added a comment -

        Moved error handling code down the stack. Please review!

        David Alves added a comment -

        +1 The idea crossed my mind, but I didn't want to deviate from the initial patch.

        Andrei Savu added a comment -

        I like the idea, but I think we should add this behaviour inside ClusterController.launchCluster so that when it throws an exception you can be sure there are no machines left running. Whirr is also a library, and I don't think clients should have to duplicate this logic in their own code. I will update the patch as needed now.
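
        In outline, the behaviour described here might look like the following inside ClusterController (a hedged sketch, not the committed patch; doLaunch stands in for the existing launch logic):

            // Sketch only: if launch fails part-way, tear everything down
            // before propagating the failure to the caller.
            public Cluster launchCluster(ClusterSpec spec)
                throws IOException, InterruptedException {
              try {
                return doLaunch(spec); // stand-in for the existing launch code
              } catch (Throwable t) {
                // Guarantee for library clients: if launchCluster throws,
                // no machines are left running.
                destroyCluster(spec);
                throw new IOException("Unable to start the cluster; all instances terminated", t);
              }
            }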

        David Alves added a comment -

        checkstyle trouble

        David Alves added a comment -

        Andrei, sorry, I pushed the wrong button (wrt assign-to-me), oops...
        Posting a new patch (checkstyle trouble).

        Andrei Savu added a comment -

        David, thanks for taking the initiative to update the patch. I was just going to do something similar. I will review now.

        David Alves added a comment -

        Patch that does what was agreed:

        By default, it destroys the dangling instances when launch fails for any reason.

        It is possible to circumvent this by setting whirr.terminate-all-on-launch-failure to false.

        Updated the site to reflect the new property.
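
        As a library client, opting out might look like this (a sketch assuming the property name above and the Commons Configuration API that Whirr uses for cluster specs):

            import org.apache.commons.configuration.Configuration;
            import org.apache.commons.configuration.PropertiesConfiguration;
            import org.apache.whirr.ClusterSpec;

            // "whirr.properties" is a placeholder for your own recipe file.
            Configuration conf = new PropertiesConfiguration("whirr.properties");
            conf.setProperty("whirr.terminate-all-on-launch-failure", "false"); // default: true
            ClusterSpec spec = ClusterSpec.withTemporaryKeys(conf);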

        David Alves added a comment -

        "terminate or destroy is a better word choice than kill."

        I was using the terms interchangeably, but you are right.

        "Indeed, there must be a way to inhibit server cleanup for diagnostic reasons. I agree that default should be to clean up."

        Good, we agree; I'll post a patch soon.

        Paul Baclace added a comment -

        Indeed, there must be a way to inhibit server cleanup for diagnostic reasons. I agree that the default should be to clean up.

        "terminate" or "destroy" is a better word choice than "kill".

        David Alves added a comment -

        Something simple such as a property should suffice at this point, like whirr.kill-all-on-launch-failure, true by default.

        I can have a stab at it if you are busy, Andrei (even if we decide to just plainly kill everything, I can update the patch).

        David Alves added a comment - edited

        I'm not saying that killing all machines couldn't be a possible (and even the default) behavior.

        I'm just saying that it should be configurable; I can easily see cases where not killing all machines would be advantageous (transient provider errors, testing, development). For instance, in testing/development/debugging you might want to log into the machines to see what went wrong; if you have idempotent bootstrap/configure you might be able to add machines without wasting those that did not fail to start; or if the machines failed in the configure phase you might decide to use them for some other purpose (since you are paying for them).

        Paul Baclace added a comment -

        Another way to see this: orphaned servers from a failed command are like a memory leak, except it is leaking entire machines.

        Paul Baclace added a comment -

        The contract of a non-zero return code is clear: the program failed. For automation purposes it is essential that this logic works correctly.

        Solving the problem of false-positive DOA (dead on arrival) servers is another matter.

        For me, emitting a message about possible orphans really means sending an alert and waking me up from much-needed sleep. (This is what I currently do, and it is not so nice.)

        David Alves added a comment -

        The patch no longer applies cleanly (after WHIRR-419).

        Other than that the approach seems fine, although some time in the future I would like to see some choice. I mean, if we try to launch a cluster and it is unsuccessful for some reason, do we always want to kill all the machines that did start?

        Couldn't we alternatively inform the user that there are dangling instances that need to be shut down manually?

        I'd like to see more opinions on the matter. Anyone?

        Andrei Savu added a comment -

        David, this should be easy to review - ~1 minute. What do you think?

        Andrei Savu added a comment -

        Paul, what do you think?

        Andrei Savu added a comment -

        Is this approach good enough for now? Do we want something more advanced?

        Andrei Savu added a comment -

        With this patch plus the default behaviour, we handle cases 1 and 2 as described by David. Let me know what you think.

        Andrei Savu added a comment -

        Attached a patch that destroys all nodes if any exception is raised inside ClusterController.launchCluster.

        David Alves added a comment -

        +1

        I think that, beyond the fact that orphaned instances remain, the user should be allowed to try again without discarding the instances that did boot (this happens ever more frequently for larger clusters).

        A possible solution would be to implement (and let the user choose) some policies, like:
        1 (Default) - If the cluster cannot start, make sure everything is terminated.
        2 - Whirr detects that not enough instances come alive within a given time, kills the slow ones, and boots new ones. Bootstrap goes on for N attempts.
        3 - Do nothing; if the cluster fails to boot, let the user handle it.

        What do you think?


          People

          • Assignee:
            Andrei Savu
            Reporter:
            Paul Baclace
          • Votes:
            0
            Watchers:
            0
