[TEZ-3718] Better handling of 'bad' nodes - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

At the moment, when a node is marked bad, the default behaviour is only to stop scheduling new tasks on that node.
The alternative, enabled via config, is to retroactively kill every task which ran on the node, which causes far too many unnecessary re-runs.

Proposing the following changes:
1. KILL fragments which are currently in the RUNNING state on that node, instead of relying on a timeout, which leads to the attempt being marked as FAILED after the timeout interval.
2. Keep track of these failed nodes, and use this as input to the failure heuristics. Normally, a source task is only marked as bad after multiple consumers report failures against it. If a single consumer reports a failure against a source which ran on a bad node, consider the source bad and re-schedule it immediately. (Otherwise failures can take a while to propagate, and jobs get a lot slower.)
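
The heuristic in item 2 could be sketched roughly as below. This is only an illustration of the proposal, not code from any of the attached patches: the class `BadNodeHeuristic`, its method names, and the configurable report threshold are all hypothetical and do not correspond to existing Tez classes or settings.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the proposed heuristic; not part of the Tez code base. */
public class BadNodeHeuristic {
    // Assumed threshold: consumer reports needed before re-running a source on a healthy node.
    private final int reportsNeededOnHealthyNode;
    // Nodes that have been marked bad so far.
    private final Set<String> badNodes = new HashSet<>();
    // Which node each source attempt ran on.
    private final Map<String, String> nodeOfSource = new HashMap<>();
    // Number of consumer failure reports received per source attempt.
    private final Map<String, Integer> failureReports = new HashMap<>();

    public BadNodeHeuristic(int reportsNeededOnHealthyNode) {
        this.reportsNeededOnHealthyNode = reportsNeededOnHealthyNode;
    }

    /** Remember where a source attempt ran, so later reports can be checked against bad nodes. */
    public void recordSourceRan(String sourceAttemptId, String nodeId) {
        nodeOfSource.put(sourceAttemptId, nodeId);
    }

    /** Keep track of a node that has been marked bad. */
    public void markNodeBad(String nodeId) {
        badNodes.add(nodeId);
    }

    /**
     * Record one consumer failure report against a source attempt.
     * Returns true if the source should now be considered bad and re-scheduled.
     */
    public boolean reportFetchFailure(String sourceAttemptId) {
        int reports = failureReports.merge(sourceAttemptId, 1, Integer::sum);
        String node = nodeOfSource.get(sourceAttemptId);
        boolean ranOnBadNode = node != null && badNodes.contains(node);
        // A single report is enough when the source ran on a bad node.
        int needed = ranOnBadNode ? 1 : reportsNeededOnHealthyNode;
        return reports >= needed;
    }
}
```

With a threshold of 3, a source that ran on a bad node would be re-scheduled on the first consumer report, while a source on a healthy node would still need three reports.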

jlowe - I think you've looked at this in the past. Any thoughts or suggestions?
What I'm seeing is that retroactive failures take a long time to be applied and to restart sources which ran on a bad node. Also, running tasks are being counted as FAILURES instead of KILLS.

Attachments

TEZ-3718.1.patch, 31/May/17 00:29, 27 kB, Zhiyuan Yang
TEZ-3718.2.patch, 27/Jun/17 10:25, 54 kB, Zhiyuan Yang
TEZ-3718.3.patch, 05/Jul/17 18:16, 38 kB, Zhiyuan Yang
TEZ-3718.4.patch, 10/Oct/17 22:39, 33 kB, Zhiyuan Yang

People

Assignee: Zhiyuan Yang

Reporter: Siddharth Seth

Votes: 0

Watchers: 7

Dates

Created: 09/May/17 20:18

Updated: 19/Mar/19 16:14