[AMBARI-15446] Auto-retry on failure during RU/EU - ASF JIRA

Details

Type: Story
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 2.4.0
Component/s: ambari-server
Labels: None

Description

When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED or HOLDING_TIMEDOUT, we want Ambari to retry it automatically for up to x minutes. This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have one new parameter, e.g.:
stack-upgrade.max_retry_timeout_mins=15 (by default, will not be present)
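
A minimal sketch of how the new parameter might be read. The class and method names are hypothetical and are not Ambari's actual Configuration code. Treating an absent, blank, non-numeric, or negative value as "auto-retry disabled" (0) is an assumption of this sketch; the description only says the property is not present by default.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not part of Ambari. */
final class RetryTimeoutConfig {
    // Property name taken from the issue description.
    static final String MAX_RETRY_TIMEOUT_MINS_KEY =
        "stack-upgrade.max_retry_timeout_mins";

    /**
     * Returns the retry window in minutes, or 0 when the property is absent,
     * blank, non-numeric, or negative (assumed here to mean "disabled").
     */
    static int maxRetryTimeoutMins(Properties props) {
        String raw = props.getProperty(MAX_RETRY_TIMEOUT_MINS_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return 0;
        }
        try {
            return Math.max(0, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return 0;
        }
    }
}
```

A caller would load ambari.properties into a `java.util.Properties` object and pass it to `maxRetryTimeoutMins`; a result of 0 would mean the retry monitor never resets a task.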

If Ambari Server is restarted, it should be able to recover.

Today, Action Scheduler increases the attempt_count whenever a task is retried, but it requires resetting the start_time to -1. Because of this, we cannot rely on the start_time property to know when to time out after several retries.
For the implementation, we will add another thread to Ambari that monitors failed tasks, only while an RU/EU is active, and changes their status back to PENDING so that Action Scheduler can reschedule them.
Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are blocking, so no other stages are allowed to proceed.
To know when a task was first started, we will add a new property called original_start_time to the host_role_command table.
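
The check the monitoring thread would perform can be sketched roughly as follows. Everything here is hypothetical: `Task` is a stand-in for a host_role_command row, not Ambari's entity class, and the choice to reset start_time to -1 in the same step as the status change is an assumption based on the rescheduling requirement mentioned above. The key point is that the retry window is measured from original_start_time, which is never reset.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the retry decision; not Ambari's implementation. */
final class RetrySketch {
    enum Status { PENDING, IN_PROGRESS, HOLDING_FAILED, HOLDING_TIMEDOUT, COMPLETED }

    /** Stand-in for a host_role_command row (illustrative fields only). */
    static final class Task {
        Status status;
        long startTime;               // reset to -1 when the task is rescheduled
        final long originalStartTime; // epoch millis of the first attempt; never reset

        Task(Status status, long startTime, long originalStartTime) {
            this.status = status;
            this.startTime = startTime;
            this.originalStartTime = originalStartTime;
        }
    }

    /** True if the task is holding and still inside the retry window. */
    static boolean shouldRetry(Task task, long nowMillis, int maxRetryTimeoutMins) {
        if (maxRetryTimeoutMins <= 0) {
            return false; // assumed to mean auto-retry is disabled
        }
        if (task.status != Status.HOLDING_FAILED && task.status != Status.HOLDING_TIMEDOUT) {
            return false;
        }
        if (task.originalStartTime < 0) {
            return false; // task never started, so there is no window to measure
        }
        long elapsedMillis = nowMillis - task.originalStartTime;
        return elapsedMillis < TimeUnit.MINUTES.toMillis(maxRetryTimeoutMins);
    }

    /** Moves an eligible task back to PENDING so a scheduler could pick it up. */
    static boolean retryIfEligible(Task task, long nowMillis, int maxRetryTimeoutMins) {
        if (!shouldRetry(task, nowMillis, maxRetryTimeoutMins)) {
            return false;
        }
        task.status = Status.PENDING;
        task.startTime = -1L; // originalStartTime is intentionally left untouched
        return true;
    }
}
```

A real monitor would also have to confirm that an RU/EU is in progress before calling `retryIfEligible`, and would persist the change rather than mutate an in-memory object.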

For the agents, we need to ensure that they always write out a response. On its first heartbeat, an agent should send the status of its last command so that Ambari knows the command failed and can retry it.

Attachments

AMBARI-15446.trunk.addendum.patch (06/Apr/16 20:23, 6 kB, Alejandro Fernandez)
AMBARI-15446.trunk.patch (21/Mar/16 19:04, 47 kB, Alejandro Fernandez)

Issue Links

links to: Review Board patch

People

Assignee: Alejandro Fernandez
Reporter: Alejandro Fernandez
Votes: 0
Watchers: 3

Dates

Created: 16/Mar/16 20:20
Updated: 06/Apr/16 23:38
Resolved: 21/Mar/16 21:00