[SAMZA-867] Fix job restart/shutdown in the event of a node outage. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

A number of jobs failed or restarted when we lost a couple of hosts in the cluster.
The theory is that the AppMaster detects the failed container before YARN detects
the missing NodeManager (NM), so it tries to run the container on the same host
again but does not handle the resulting NM connection errors properly. Switching from a synchronous NM client model to an asynchronous one is expected to help, but this needs discussion.
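
To make the suspected failure mode concrete, here is a minimal sketch of asynchronous launch handling with fallback to another host. It is not the attached patch and does not use the Samza or YARN APIs: the class AsyncLaunchSketch, the methods startContainer and launchWithFallback, and the deadHosts set are all hypothetical stand-ins, and the NM call is simulated.

```java
import java.net.ConnectException;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical sketch: none of these names come from Samza or YARN.
class AsyncLaunchSketch {

    // Simulated async NM call. An unreachable node completes the future
    // exceptionally instead of throwing on the caller's thread.
    static CompletableFuture<String> startContainer(String host, Set<String> deadHosts) {
        return CompletableFuture.supplyAsync(() -> {
            if (deadHosts.contains(host)) {
                throw new CompletionException(
                        new ConnectException("NM unreachable on " + host));
            }
            return host; // the host the container started on
        });
    }

    // Try candidates in order; a connection error moves on to the next host
    // rather than failing the whole job.
    static CompletableFuture<String> launchWithFallback(List<String> candidates,
                                                        Set<String> deadHosts) {
        if (candidates.isEmpty()) {
            CompletableFuture<String> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                    new IllegalStateException("no reachable host"));
            return failed;
        }
        String host = candidates.get(0);
        List<String> rest = candidates.subList(1, candidates.size());
        return startContainer(host, deadHosts)
                .<CompletableFuture<String>>handle((started, error) -> {
                    if (error == null) {
                        return CompletableFuture.completedFuture(started);
                    }
                    // Launch failed on this host: retry on the remaining ones.
                    return launchWithFallback(rest, deadHosts);
                })
                .thenCompose(next -> next);
    }
}
```

With the preferred host listed first, a launch that hits a dead NM resolves to the next candidate instead of surfacing the connection error. Whether the real fix should retry elsewhere or wait for YARN to report the lost node is the open question raised above.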

Attachments

SAMZA-867.patch      09/Feb/16 01:17    50 kB    Jake Maes
SAMZA-867_4.patch    24/Feb/16 02:15    66 kB    Jake Maes
SAMZA-867_3.patch    19/Feb/16 01:40    66 kB    Jake Maes
SAMZA-867_2.patch    10/Feb/16 16:00    51 kB    Jake Maes

Issue Links

is cloned by: SAMZA-893 Fix a bug with host affinity request expiration introduced in SAMZA-867 (Resolved)

People

Assignee: Jake Maes

Reporter: Jake Maes

Votes: 0

Watchers: 3

Dates

Created: 01/Feb/16 23:49

Updated: 14/Mar/16 03:36

Resolved: 08/Mar/16 00:54