Ambari / AMBARI-15446

Auto-retry on failure during RU/EU

Details

    • Type: Story
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: ambari-server
    • Labels: None

    Description

      When a failure occurs during a Rolling Upgrade or Express Upgrade (RU/EU) and the task transitions to HOLDING_FAILED or HOLDING_TIMEDOUT, Ambari should automatically retry it for up to x minutes. This is useful when a host goes down while Ambari is running a task on it.
      ambari.properties will have one new parameter, e.g.:
      stack-upgrade.max_retry_timeout_mins=15 (by default, will not be present)
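
A minimal sketch of how such a property could be read, assuming plain java.util.Properties rather than Ambari's own configuration class. The class and method names are hypothetical. Treating a missing, blank, or non-positive value as "auto-retry disabled" is also an assumption, based on the note that the property is absent by default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.OptionalInt;
import java.util.Properties;

// Hypothetical helper; not part of the Ambari code base.
public class RetryTimeoutConfig {
    public static final String KEY = "stack-upgrade.max_retry_timeout_mins";

    // Returns the retry window in minutes, or empty when auto-retry is disabled.
    // Assumption: absent, blank, or non-positive values all mean "disabled".
    // A non-numeric value makes Integer.parseInt throw NumberFormatException.
    public static OptionalInt maxRetryTimeoutMins(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return OptionalInt.empty();
        }
        int mins = Integer.parseInt(raw.trim());
        return mins > 0 ? OptionalInt.of(mins) : OptionalInt.empty();
    }

    public static void main(String[] args) throws IOException {
        Properties configured = new Properties();
        configured.load(new StringReader(KEY + "=15\n"));

        // Property present: the retry window is enabled.
        System.out.println(maxRetryTimeoutMins(configured).isPresent());
        // Property absent (the default): auto-retry stays off.
        System.out.println(maxRetryTimeoutMins(new Properties()).isPresent());
    }
}
```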

      If Ambari Server is restarted, it should be able to recover.

      Today, Action Scheduler increases the attempt_count whenever a task is retried, but it requires resetting the start_time to -1. Because of this, we cannot rely on the start_time property to know when to time out after several retries.
      For the implementation, we will add another thread to Ambari that monitors failed tasks, only during an active RU/EU, and changes their status back to PENDING so that Action Scheduler can reschedule them.
      Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are blocking, so no other stages are allowed to proceed.
      To know when a task was first started, we will add a new column to the host_role_command table called original_start_time.
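
The decision the monitoring thread has to make can be sketched as follows. This is an illustration under stated assumptions, not the code from the attached patch: the Task class is a hypothetical, trimmed-down stand-in for a host_role_command row, and the Status enum lists only a few states. The point it shows is that the retry window is measured from original_start_time, because start_time is reset to -1 on every retry.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names other than the statuses and column names are assumptions.
public class RetryDecision {
    public enum Status { PENDING, IN_PROGRESS, COMPLETED, HOLDING_FAILED, HOLDING_TIMEDOUT }

    // Simplified stand-in for one row of the host_role_command table.
    public static final class Task {
        public Status status;
        public long startTime;          // reset to -1 each time the task is retried
        public long originalStartTime;  // set once, when the task first starts (epoch millis)

        public Task(Status status, long startTime, long originalStartTime) {
            this.status = status;
            this.startTime = startTime;
            this.originalStartTime = originalStartTime;
        }
    }

    // A task is retried only while it is in a holding state and the time since
    // its first start is still inside the configured window.
    public static boolean shouldRetry(Task t, long nowMillis, int maxRetryTimeoutMins) {
        boolean holding = t.status == Status.HOLDING_FAILED
                || t.status == Status.HOLDING_TIMEDOUT;
        if (!holding || t.originalStartTime <= 0) {
            return false;
        }
        long elapsed = nowMillis - t.originalStartTime;
        return elapsed < TimeUnit.MINUTES.toMillis(maxRetryTimeoutMins);
    }

    // Hands the task back to Action Scheduler: PENDING status, start_time cleared.
    // original_start_time is deliberately left untouched.
    public static void markForRetry(Task t) {
        t.status = Status.PENDING;
        t.startTime = -1L;
    }

    public static void main(String[] args) {
        long firstStart = 1_000_000L;
        Task task = new Task(Status.HOLDING_FAILED, firstStart, firstStart);

        long fiveMinutesLater = firstStart + TimeUnit.MINUTES.toMillis(5);
        if (shouldRetry(task, fiveMinutesLater, 15)) {
            markForRetry(task);
        }
        System.out.println(task.status + " startTime=" + task.startTime);
    }
}
```

Keeping the eligibility check separate from the state change makes the window logic easy to test without a database or a running scheduler.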

      For the agents, we need to ensure that they always write out a response. On its first heartbeat, an agent should send the status of its last command so that Ambari knows the command failed and can retry it.

      Attachments

        1. AMBARI-15446.trunk.addendum.patch
          6 kB
          Alejandro Fernandez
        2. AMBARI-15446.trunk.patch
          47 kB
          Alejandro Fernandez

            People

              Assignee: afernandez Alejandro Fernandez
              Reporter: afernandez Alejandro Fernandez