[SAMZA-2266] Introduce a backoff when there are repeated failures for host-affinity allocations - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.3
Component/s: None
Labels: None

Description

The issue is that allocations for dead containers are retried, and retried again on each subsequent failure, within a very small window of time (under 1 minute).

NodeManagers (NMs) have been observed to take about 2 minutes to report themselves as unhealthy to the ResourceManager (RM).

If a job has host-affinity enabled, the retries therefore allocate containers on the same unhealthy host multiple times, and the repeated failures eventually kill the application.

This ticket is to evaluate the feasibility of, and possibly implement, a fix that introduces a time backoff on retries of container allocation on the same host, so that we eventually get a different host once the unhealthy NM's status has been updated.
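
As a rough illustration of the backoff idea, the sketch below tracks consecutive allocation failures per host and doubles the wait before that host may be requested again. It is not Samza's implementation: the class name `HostBackoff`, its methods, and the 30-second base and 300-second cap are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HostBackoff:
    """Hypothetical per-host exponential backoff tracker (illustrative only)."""

    base_delay_s: float = 30.0   # assumed first delay after a failure
    max_delay_s: float = 300.0   # assumed upper bound on the delay
    _failures: Dict[str, int] = field(default_factory=dict)
    _last_failure_at: Dict[str, float] = field(default_factory=dict)

    def record_failure(self, host: str, now: float) -> None:
        """Note one more consecutive failure on `host` at time `now` (seconds)."""
        self._failures[host] = self._failures.get(host, 0) + 1
        self._last_failure_at[host] = now

    def record_success(self, host: str) -> None:
        """Clear the failure history once a container runs successfully."""
        self._failures.pop(host, None)
        self._last_failure_at.pop(host, None)

    def delay_for(self, host: str) -> float:
        """Delay required before retrying on `host`: base * 2^(failures - 1), capped."""
        failures = self._failures.get(host, 0)
        if failures == 0:
            return 0.0
        return min(self.base_delay_s * 2 ** (failures - 1), self.max_delay_s)

    def can_retry(self, host: str, now: float) -> bool:
        """True if enough time has passed since the last failure on `host`."""
        if self._failures.get(host, 0) == 0:
            return True
        return now - self._last_failure_at[host] >= self.delay_for(host)
```

With these assumed values, the third consecutive failure on a host yields a 120-second delay, which is roughly the time an NM was observed to need before reporting itself unhealthy.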

We may also want to look into abandoning host-affinity on the 8th attempt at restarting a container, so that we don't kill the entire job.
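
The fallback could look something like the following sketch, which requests the previous host for early restart attempts and drops the host preference from a configurable attempt onward. The function name, the `ANY_HOST` sentinel as used here, and the parameter names are hypothetical; the default threshold of 8 simply mirrors the wording above.

```python
ANY_HOST = "ANY_HOST"  # hypothetical sentinel meaning "no host preference"


def preferred_host_for_attempt(last_host: str,
                               attempt: int,
                               abandon_affinity_at: int = 8) -> str:
    """Pick the host to request for a given restart attempt (1-based).

    Attempts before `abandon_affinity_at` keep host-affinity and ask for
    `last_host`; from that attempt onward any host is acceptable.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if attempt >= abandon_affinity_at:
        return ANY_HOST
    return last_host
```

For example, `preferred_host_for_attempt("host-a", 7)` still asks for `host-a`, while attempt 8 returns the `ANY_HOST` sentinel.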