[SAMZA-835] Certain Errors in AM don't cause retry of failed AM containers - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

Currently, a Samza job can fail for several reasons:
1. Container failures: successive container failures occurring within a certain time window, or containers exceeding their resource requests (for example, memory over-utilization).
2. AM failures: the AM being unable to spawn a container because an NM was unreachable, a Yarn exception when the AM tries to execute a container on an NM, NM token expiration, etc.

When type (2) failures occur, Yarn does not restart the AM. Most of these failures can be resolved by retrying the AM attempt on a different host.

Reason: Currently, we explicitly unregister the AM from the RM when the AM shuts down, irrespective of the final app status. This causes Yarn to assume that the AM finished successfully (removing the AM from the RM state transition monitoring).

When a job starts, the state is UNDEFINED. We then set the state to SUCCESS or FAILURE depending on the events we receive from the RM.

When we end the job (possibly because of (1) or (2)), the key is to not call unregister when the state is UNDEFINED. This ensures that the AM attempt is retried.
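
The proposed shutdown rule could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Samza AM code or the Yarn client API: the `JobStatus` enum, the `RmClient` interface, and the `AppMaster` class are hypothetical names that model just the state described above and the single unregister call.

```java
import java.util.ArrayList;
import java.util.List;

public class AmShutdownSketch {
    /** Mirrors the three states named in the description (not the Yarn enum). */
    enum JobStatus { UNDEFINED, SUCCESS, FAILURE }

    /** Hypothetical stand-in for the RM client; only unregister is modeled. */
    interface RmClient {
        void unregister(JobStatus finalStatus);
    }

    /** Hypothetical AM wrapper that tracks the job state. */
    static final class AppMaster {
        private final RmClient rmClient;
        // A job starts in the UNDEFINED state.
        private JobStatus status = JobStatus.UNDEFINED;

        AppMaster(RmClient rmClient) {
            this.rmClient = rmClient;
        }

        /** Called when an event from the RM decides the outcome. */
        void setStatus(JobStatus newStatus) {
            this.status = newStatus;
        }

        /**
         * Shuts the AM down. Unregisters from the RM only when the state
         * is no longer UNDEFINED, so that an AM that died without a
         * decided outcome stays eligible for a retry.
         *
         * @return true if unregister was called, false if it was skipped
         */
        boolean shutdown() {
            if (status == JobStatus.UNDEFINED) {
                return false; // skip unregister; leave the retry decision to Yarn
            }
            rmClient.unregister(status);
            return true;
        }
    }

    public static void main(String[] args) {
        // Record unregister calls instead of talking to a real RM.
        List<JobStatus> unregisterCalls = new ArrayList<>();
        AppMaster am = new AppMaster(unregisterCalls::add);

        System.out.println("unregistered while UNDEFINED: " + am.shutdown());
        am.setStatus(JobStatus.SUCCESS);
        System.out.println("unregistered after SUCCESS: " + am.shutdown());
        System.out.println("unregister calls: " + unregisterCalls);
    }
}
```

In this sketch the only behavioral change from the current approach is the early return in `shutdown()`: an AM that ends while the state is still UNDEFINED never tells the RM that it finished.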

People

Assignee: Jagadish

Reporter: Jagadish

Votes: 0

Watchers: 3

Dates

Created: 08/Dec/15 09:18

Updated: 08/Dec/15 09:18