Libcloud / LIBCLOUD-157

Deployment script retries are brain-dead

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None

      Description

      In common/base, NodeDriver._run_deployment_script has the following retry wrapper:

      tries = 0
      while tries < max_tries:
          try:
              node = task.run(node, ssh_client)
          except Exception:
              tries += 1
              if tries >= max_tries:
                  raise LibcloudError(value='Failed after %d tries'
                                      % (max_tries), driver=self)
          else:
              ssh_client.close()
              return node

      The except Exception swallows all errors, making debugging very hard.

      Furthermore, max_tries is effectively hard-coded in deploy_node():

      self._run_deployment_script(task=kwargs['deploy'],
                                  node=node,
                                  ssh_client=ssh_client,
                                  max_tries=3)

      This forces anyone who wants to control retries to write their own deploy_node().

      Suggestions:

      • At a minimum, log or warn about the error that is caught in the retry loop.
      • Better yet, make the catch more fine-grained, so that errors we know are not retryable fail immediately.
      • Consider making the default value of max_tries 1.
      • Make max_tries controllable from deploy_node().
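
      The suggestions above could be combined roughly as follows. This is a minimal sketch, not Libcloud's actual code: run_deployment_script, DeploymentError and the retryable parameter are hypothetical names, and treating socket.error as the only retryable error is an assumption (a real driver would probably also need to consider paramiko's own exceptions).

```python
import logging
import socket

logger = logging.getLogger(__name__)


class DeploymentError(Exception):
    """Hypothetical stand-in for LibcloudError, so the sketch has no dependencies."""


def run_deployment_script(task, node, ssh_client, max_tries=3,
                          retryable=(socket.error,)):
    """Run task.run(node, ssh_client), retrying only on `retryable` errors.

    Any other exception propagates immediately instead of being swallowed.
    """
    for attempt in range(1, max_tries + 1):
        try:
            node = task.run(node, ssh_client)
        except retryable as exc:
            # Log every caught error so failures stay visible while retrying.
            logger.warning('Deployment attempt %d/%d failed: %r',
                           attempt, max_tries, exc)
            if attempt == max_tries:
                # Chain the last error so the original traceback is kept.
                raise DeploymentError(
                    'Failed after %d tries' % max_tries) from exc
        else:
            ssh_client.close()
            return node
```

      A deploy_node() that accepted max_tries from its caller could simply pass it through to a helper like this instead of hard-coding 3.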

        Activity

        Mark Nottingham created issue -
        Tomaz Muraus added a comment -

        I agree that debugging deployment issues is currently pretty hard. I ran into this problem myself, so I recently made a change: setting LIBCLOUD_DEBUG=<file obj> now also turns on paramiko debug mode, so you at least see paramiko's debug messages.

        In any case, I like suggestions #2 and #4. As for #3, I think max_tries=1 is too low, because in many cases the node is returned in the response but the actual server hasn't fully started yet (the SSH server is not yet listening).

        In cases like this paramiko raises a socket timeout error, and with max_tries=1 the deployment would fail.
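
        To illustrate that trade-off, here is a small simulation. It is only a sketch: SlowBootTask and deploy are hypothetical names, and the task fakes a node whose SSH server is not yet listening by raising socket.timeout on its first calls.

```python
import socket


class SlowBootTask:
    """Hypothetical task: the first `failures` calls time out, as if the
    SSH server on a freshly created node were not listening yet."""

    def __init__(self, failures):
        self.failures = failures

    def run(self, node, ssh_client):
        if self.failures > 0:
            self.failures -= 1
            raise socket.timeout('timed out')
        return node


def deploy(task, node, max_tries):
    """Try the task up to max_tries times, re-raising the last timeout."""
    for attempt in range(1, max_tries + 1):
        try:
            return task.run(node, ssh_client=None)
        except socket.timeout:
            if attempt == max_tries:
                raise
```

        With a single early timeout, deploy(SlowBootTask(1), node, max_tries=3) succeeds on the second attempt, while the same call with max_tries=1 re-raises the timeout.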

        Tomaz Muraus made changes -
        Field     Original Value  New Value
        Assignee                  Tomaz Muraus [ kami ]

          People

          • Assignee: Tomaz Muraus
          • Reporter: Mark Nottingham
          • Votes: 0
          • Watchers: 0
